Toward multidatabase mining: Identifying relevant databases

Huan Liu; Hongjun Lu; Jun Yao

doi:10.1109/69.940731

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

40 Scopus citations

Abstract

Various tools and systems for knowledge discovery and data mining are developed and available for applications. However, when we are immersed in heaps of databases, an immediate question is where we should start mining. It is not true that the more databases, the better for data mining. It is only true when the databases involved are relevant to a task at hand. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step for multidatabase mining is to identify databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks with an objective of finding patterns or regularities about certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application.

Original language	English (US)
Pages (from-to)	541-553
Number of pages	13
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	13
Issue number	4
DOIs	https://doi.org/10.1109/69.940731
State	Published - Jul 2001

Keywords

Data mining
Multiple databases
Query
Relevance measure

ASJC Scopus subject areas

Information Systems
Computer Science Applications
Computational Theory and Mathematics

Cite this

@article{acb67a293fd1488280f09dc49186c538,
title = "Toward multidatabase mining: Identifying relevant databases",
abstract = "Various tools and systems for knowledge discovery and data mining are developed and available for applications. However, when we are immersed in heaps of databases, an immediate question is where we should start mining. It is not true that the more databases, the better for data mining. It is only true when the databases involved are relevant to a task at hand. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step for multidatabase mining is to identify databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks with an objective of finding patterns or regularities about certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application.",
keywords = "Data mining, Multiple databases, Query, Relevance measure",
author = "Huan Liu and Hongjun Lu and Jun Yao",
note = "Funding Information: The NSERC (Natural Science and Engineering Research Council of Canada) research grant database is used. It contains eight tables. We briefly explain the largest table, NSERC table. The table contains information on the amount of awards with 14 attributes as follows: Id, Sysid, Sortname, Dept, Organization-id, Fyr, Compyr, Award, Grant-code, Ctee, Install, Acd3, Discipline-code, and Cnt. To simplify the task of finding meaningful databases, we transform the problem of",
year = "2001",
month = jul,
doi = "10.1109/69.940731",
language = "English (US)",
volume = "13",
pages = "541--553",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "4",
}

TY - JOUR
T1 - Toward multidatabase mining
T2 - Identifying relevant databases
AU - Liu, Huan
AU - Lu, Hongjun
AU - Yao, Jun
N1 - Funding Information: The NSERC (Natural Science and Engineering Research Council of Canada) research grant database is used. It contains eight tables. We briefly explain the largest table, NSERC table. The table contains information on the amount of awards with 14 attributes as follows: Id, Sysid, Sortname, Dept, Organization-id, Fyr, Compyr, Award, Grant-code, Ctee, Install, Acd3, Discipline-code, and Cnt. To simplify the task of finding meaningful databases, we transform the problem of
PY - 2001/7
Y1 - 2001/7
N2 - Various tools and systems for knowledge discovery and data mining are developed and available for applications. However, when we are immersed in heaps of databases, an immediate question is where we should start mining. It is not true that the more databases, the better for data mining. It is only true when the databases involved are relevant to a task at hand. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step for multidatabase mining is to identify databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks with an objective of finding patterns or regularities about certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application.
AB - Various tools and systems for knowledge discovery and data mining are developed and available for applications. However, when we are immersed in heaps of databases, an immediate question is where we should start mining. It is not true that the more databases, the better for data mining. It is only true when the databases involved are relevant to a task at hand. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step for multidatabase mining is to identify databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks with an objective of finding patterns or regularities about certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application.
KW - Data mining
KW - Multiple databases
KW - Query
KW - Relevance measure
UR - http://www.scopus.com/inward/record.url?scp=0035388008&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0035388008&partnerID=8YFLogxK
U2 - 10.1109/69.940731
DO - 10.1109/69.940731
M3 - Article
AN - SCOPUS:0035388008
SN - 1041-4347
VL - 13
SP - 541
EP - 553
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 4
ER -
