Sampling from databases using B+Trees

Dimuthu Makawita, Kian Lee Tan, Huan Liu

Research output: Contribution to journal › Article

Abstract

Sampling techniques are becoming increasingly important for very large databases. However, the problem of obtaining a random sample from index structures has not received much attention. In this paper, we examine sampling techniques for the B+ tree. Because the fanout of each node varies, a random walk through the index structure does not produce a representative sample of the data set. We propose a new technique, called B+ Tree based Weighted Random Sampling (BTWRS), that adjusts the inclusion probabilities of records so that more records are extracted from leaves along paths with higher fanouts. We evaluated our method extensively, and the results show that BTWRS improves on existing schemes in both the quality of the samples obtained and the efficiency of the sampling process. The proposed method can be readily adopted in existing commercial systems.
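The bias the abstract describes can be illustrated on a toy tree. The sketch below is not BTWRS, which the abstract does not specify. It computes the exact probability with which a uniform random walk reaches each record, then applies a classical acceptance/rejection correction as a point of comparison. `TREE`, `MAX_FANOUT`, and both function names are hypothetical and chosen only for this illustration.

```python
import random

# Hypothetical toy index: internal nodes are lists of children,
# leaves are tuples of record ids. All leaves sit at the same depth.
TREE = [
    [("a", "b"), ("c", "d", "e", "f")],        # internal node, fanout 2
    [("g",), ("h", "i"), ("j", "k", "l")],     # internal node, fanout 3
]
MAX_FANOUT = 4  # assumed upper bound on entries per node in this toy tree


def walk_probabilities(node, prob=1.0):
    """Exact probability that a uniform random walk ends at each record."""
    share = prob / len(node)          # each entry of this node is equally likely
    if isinstance(node, tuple):       # leaf: one record is picked uniformly
        return {rec: share for rec in node}
    result = {}
    for child in node:
        result.update(walk_probabilities(child, share))
    return result


def sample_uniform(root, rng):
    """Acceptance/rejection: repeat random walks until one is accepted.

    A walk is accepted with probability prod(fanout / MAX_FANOUT) over the
    nodes it visits, which cancels the 1 / fanout bias picked up per level.
    """
    while True:
        node, accept = root, 1.0
        while True:
            accept *= len(node) / MAX_FANOUT
            child = rng.choice(node)
            if isinstance(node, tuple):   # reached a leaf: child is a record
                record = child
                break
            node = child
        if rng.random() < accept:
            return record


if __name__ == "__main__":
    for rec, p in sorted(walk_probabilities(TREE).items()):
        print(rec, round(p, 4))
    rng = random.Random(0)
    print([sample_uniform(TREE, rng) for _ in range(10)])
```

In this toy tree the plain walk reaches record `g` with probability 1/6 but record `c` with probability 1/16, so records under low-fanout paths are over-represented. The correction makes every record equally likely (1/64 per attempt here), at the cost of rejected walks: only 12/64 of attempts are accepted, which is the kind of quality-versus-efficiency trade-off the abstract refers to.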

Original language: English (US)
Pages (from-to): 359-377
Number of pages: 19
Journal: Intelligent Data Analysis
Volume: 6
Issue number: 4
State: Published - 2002

Keywords

  • B Tree
  • quality of samples
  • weighted random sampling

ASJC Scopus subject areas

  • Artificial Intelligence
  • Theoretical Computer Science
  • Computer Vision and Pattern Recognition

Cite this

Makawita, D., Tan, K. L., & Liu, H. (2002). Sampling from databases using B+Trees. Intelligent Data Analysis, 6(4), 359-377.

@article{56fe4940996b4bd29de5e4720e91e24f,
title = "Sampling from databases using B+Trees",
abstract = "Sampling techniques are becoming increasingly important for very large databases. However, the problem of obtaining a random sample from index structures has not received much attention. In this paper, we examine sampling techniques for the B+ tree. Because the fanout of each node varies, a random walk through the index structure does not produce a representative sample of the data set. We propose a new technique, called B+ Tree based Weighted Random Sampling (BTWRS), that adjusts the inclusion probabilities of records so that more records are extracted from leaves along paths with higher fanouts. We evaluated our method extensively, and the results show that BTWRS improves on existing schemes in both the quality of the samples obtained and the efficiency of the sampling process. The proposed method can be readily adopted in existing commercial systems.",
keywords = "B Tree, quality of samples, weighted random sampling",
author = "Dimuthu Makawita and Tan, {Kian Lee} and Huan Liu",
year = "2002",
language = "English (US)",
volume = "6",
pages = "359--377",
journal = "Intelligent Data Analysis",
issn = "1088-467X",
publisher = "IOS Press",
number = "4",

}

TY - JOUR
T1 - Sampling from databases using B+Trees
AU - Makawita, Dimuthu
AU - Tan, Kian Lee
AU - Liu, Huan
PY - 2002
Y1 - 2002
N2 - Sampling techniques are becoming increasingly important for very large databases. However, the problem of obtaining a random sample from index structures has not received much attention. In this paper, we examine sampling techniques for the B+ tree. Because the fanout of each node varies, a random walk through the index structure does not produce a representative sample of the data set. We propose a new technique, called B+ Tree based Weighted Random Sampling (BTWRS), that adjusts the inclusion probabilities of records so that more records are extracted from leaves along paths with higher fanouts. We evaluated our method extensively, and the results show that BTWRS improves on existing schemes in both the quality of the samples obtained and the efficiency of the sampling process. The proposed method can be readily adopted in existing commercial systems.
KW - B Tree
KW - quality of samples
KW - weighted random sampling
UR - http://www.scopus.com/inward/record.url?scp=84883661893&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84883661893&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:84883661893
VL - 6
SP - 359
EP - 377
JO - Intelligent Data Analysis
JF - Intelligent Data Analysis
SN - 1088-467X
IS - 4
ER -