TY - JOUR
T1 - A supervised clustering and classification algorithm for mining data with mixed variables
AU - Li, Xiangyang
AU - Ye, Nong
N1 - Funding Information: Manuscript received April 27, 2004; revised July 31, 2004. This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant F49620-99-1-001. This paper was recommended by Associate Editor J. Miller.
PY - 2006/2
Y1 - 2006/2
AB - This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms.
KW - Classification
KW - Clustering
KW - Computer intrusion detection
KW - Dissimilarity measures
UR - http://www.scopus.com/inward/record.url?scp=33244468138&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33244468138&partnerID=8YFLogxK
U2 - 10.1109/TSMCA.2005.853501
DO - 10.1109/TSMCA.2005.853501
M3 - Article
AN - SCOPUS:33244468138
VL - 36
SP - 396
EP - 406
JO - IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans
JF - IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans
SN - 1083-4427
IS - 2
ER -