Abstract

Clustering is an important aspect of data mining, but clustering high-dimensional mixed-attribute data in a scalable fashion remains a challenging problem. In this paper, we propose CRAFTER, a tree-ensemble clustering algorithm for static datasets, to tackle this problem. CRAFTER handles categorical and numeric attributes simultaneously and scales well with both the dimensionality and the size of datasets. It leverages the advantages of a tree ensemble to handle mixed attributes and high dimensionality, and uses class probability estimates to identify the representative data points for clustering. Through a series of experiments on both synthetic and real datasets, we demonstrate that CRAFTER is superior to Random Forest Clustering (RFC), an existing tree-based clustering method, in terms of both clustering quality and computational cost.

Original language: English (US)
Journal: IEEE Transactions on Knowledge and Data Engineering
State: Accepted/In press - Feb 16 2018

Keywords

  • Categorical Attribute
  • Clustering
  • Clustering algorithms
  • Complexity theory
  • Computational efficiency
  • Data mining
  • Ensemble Method
  • High Dimensionality
  • Mixed Attributes
  • Partitioning algorithms
  • Radio frequency
  • Random Forest
  • Static Datasets
  • Training

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics
