Sampling from databases using B+Trees

Dimuthu Makawita, Kian Lee Tan, Huan Liu

Research output: Contribution to journal › Article › peer-review

Abstract

Sampling techniques are becoming increasingly important for very large databases. However, the problem of obtaining a random sample from index structures has not received much attention. In this paper, we examine sampling techniques for the B+ tree. Because the fanout varies from node to node, a random walk through the index structure does not produce a representative sample of the data set. We propose a new technique, called B+ Tree based Weighted Random Sampling (BTWRS), that adjusts the inclusion probabilities of records so that more records are extracted from leaves along paths with higher fanouts. We evaluated our method extensively, and the results show that BTWRS improves on existing schemes in both the quality of the samples obtained and the efficiency of the sampling process. The proposed method can be readily adopted in existing commercial systems.
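
The bias mentioned in the abstract can be made concrete with a small calculation. The following is a minimal sketch, not the paper's BTWRS algorithm: the toy tree, the function names, and the correction used (choosing each child with probability proportional to the number of records beneath it, which assumes subtree counts are available) are all illustrative assumptions. It computes the exact probability that a single root-to-leaf walk selects each record.

```python
from fractions import Fraction

# Hypothetical toy index (an assumption for illustration only):
# an internal node is a list of children, a leaf is a tuple of records.
TREE = [
    [("a", "b"), ("c", "d"), ("e", "f")],  # internal node with fanout 3
    [("g",)],                               # internal node with fanout 1
]

def is_leaf(node):
    return isinstance(node, tuple)

def subtree_size(node):
    """Number of records stored beneath this node."""
    if is_leaf(node):
        return len(node)
    return sum(subtree_size(child) for child in node)

def inclusion_probabilities(node, weighted, p=Fraction(1)):
    """Return {record: probability that one root-to-leaf walk selects it}.

    weighted=False: each child is chosen uniformly (plain random walk).
    weighted=True:  each child is chosen in proportion to its subtree size
                    (a textbook correction, assumed here for comparison).
    """
    if is_leaf(node):
        return {record: p / len(node) for record in node}
    probs = {}
    total = subtree_size(node)
    for child in node:
        if weighted:
            step = Fraction(subtree_size(child), total)
        else:
            step = Fraction(1, len(node))
        probs.update(inclusion_probabilities(child, weighted, p * step))
    return probs

naive = inclusion_probabilities(TREE, weighted=False)
corrected = inclusion_probabilities(TREE, weighted=True)

for record in sorted(naive):
    print(record, naive[record], corrected[record])
```

In this toy tree the plain walk selects record `g` with probability 1/2 and every other record with probability 1/12, so records on the low-fanout path are heavily over-represented. With size-proportional choices every record has probability 1/7.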

Original language: English (US)
Pages (from-to): 359-377
Number of pages: 19
Journal: Intelligent Data Analysis
Volume: 6
Issue number: 4
State: Published - 2002

Keywords

  • B+ Tree
  • quality of samples
  • weighted random sampling

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
