Abstract
Various tools and systems for knowledge discovery and data mining have been developed and are available for applications. However, when we are immersed in heaps of databases, an immediate question is where we should start mining. It is not true that the more databases, the better for data mining; this holds only when the databases involved are relevant to the task at hand. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step for multidatabase mining is to identify the databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks whose objective is to find patterns or regularities about certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application.
Original language | English (US) |
---|---|
Pages (from-to) | 541-553 |
Number of pages | 13 |
Journal | IEEE Transactions on Knowledge and Data Engineering |
Volume | 13 |
Issue number | 4 |
State | Published - Jul 2001 |
Keywords
- Data mining
- Multiple databases
- Query
- Relevance measure
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications
- Computational Theory and Mathematics