A supervised clustering and classification algorithm for mining data with mixed variables

Xiangyang Li, Nong Ye

Research output: Contribution to journal › Article

33 Citations (Scopus)

Abstract

This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms.
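The two ways of handling mixed variables described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract does not give the actual distance measures, so the Euclidean distance, the mismatch count, the additive weighting (`w_nominal`), and the one-hot conversion below are all assumptions chosen for illustration, and the function names are hypothetical. The sketch compares two records; in the paper's setting the same idea would apply to cluster representatives.

```python
import math

def separate_then_combine(a, b, numeric_idx, nominal_idx, w_nominal=1.0):
    """First approach (assumed form): measure numeric and nominal fields
    separately, then combine the two distances into one value."""
    # Euclidean distance over the numeric fields only.
    d_num = math.sqrt(sum((a[i] - b[i]) ** 2 for i in numeric_idx))
    # Simple mismatch count over the nominal fields only.
    d_nom = sum(1 for i in nominal_idx if a[i] != b[i])
    # Weighted sum is an assumption; the abstract does not state the rule.
    return d_num + w_nominal * d_nom

def to_numeric(record, numeric_idx, nominal_levels):
    """Second approach, step 1 (assumed form): convert each nominal field
    into 0/1 indicator values, one per known level."""
    vec = [float(record[i]) for i in numeric_idx]
    for i, levels in nominal_levels.items():
        vec.extend(1.0 if record[i] == level else 0.0 for level in levels)
    return vec

def convert_then_measure(a, b, numeric_idx, nominal_levels):
    """Second approach, step 2: one Euclidean distance over all variables."""
    va = to_numeric(a, numeric_idx, nominal_levels)
    vb = to_numeric(b, numeric_idx, nominal_levels)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

# Two hypothetical records: two numeric fields and one nominal field.
a = (1.0, 2.0, "tcp")
b = (4.0, 6.0, "udp")
levels = {2: ["tcp", "udp", "icmp"]}  # known values of the nominal field

d1 = separate_then_combine(a, b, numeric_idx=[0, 1], nominal_idx=[2])
d2 = convert_then_measure(a, b, numeric_idx=[0, 1], nominal_levels=levels)
print(d1, d2)
```

In this sketch the two approaches give different values for the same pair of records, because a mismatch on a nominal field contributes linearly in the first case and under the square root in the second.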

Original language: English (US)
Pages (from-to): 396-406
Number of pages: 11
Journal: IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans
Volume: 36
Issue number: 2
DOI: 10.1109/TSMCA.2005.853501
State: Published - Feb 2006

Fingerprint

Classification Algorithm
Clustering Algorithm
Data mining
Numerics
Categorical or nominal
Distance Measure
Incremental Learning
Data Classification
Performance Test
Scalability
Computational complexity
Clustering
Prediction

Keywords

  • Classification
  • Clustering
  • Computer intrusion detection
  • Dissimilarity measures

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Human-Computer Interaction
  • Theoretical Computer Science
  • Computational Theory and Mathematics

Cite this

@article{2f89111320f142f482680cb87e864525,
title = "A supervised clustering and classification algorithm for mining data with mixed variables",
abstract = "This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms.",
keywords = "Classification, Clustering, Computer intrusion detection, Dissimilarity measures",
author = "Xiangyang Li and Nong Ye",
year = "2006",
month = "2",
doi = "10.1109/TSMCA.2005.853501",
language = "English (US)",
volume = "36",
pages = "396--406",
journal = "IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans",
issn = "1083-4427",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "2",
}

TY - JOUR
T1 - A supervised clustering and classification algorithm for mining data with mixed variables
AU - Li, Xiangyang
AU - Ye, Nong
PY - 2006/2
Y1 - 2006/2
AB - This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms.

KW - Classification
KW - Clustering
KW - Computer intrusion detection
KW - Dissimilarity measures
UR - http://www.scopus.com/inward/record.url?scp=33244468138&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33244468138&partnerID=8YFLogxK
U2 - 10.1109/TSMCA.2005.853501
DO - 10.1109/TSMCA.2005.853501
M3 - Article
AN - SCOPUS:33244468138
VL - 36
SP - 396
EP - 406
JO - IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans
JF - IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans
SN - 1083-4427
IS - 2
ER -