Abstract

Clustering is an important aspect of data mining, yet clustering high-dimensional, mixed-attribute data in a scalable fashion remains a challenging problem. In this paper, we propose CRAFTER, a tree-ensemble clustering algorithm for static datasets, to tackle this problem. CRAFTER handles categorical and numeric attributes simultaneously and scales well with both the dimensionality and the size of the dataset. It leverages the advantages of a tree ensemble to handle mixed attributes and high dimensionality, and it uses class probability estimates to identify the representative data points for clustering. Through a series of experiments on both synthetic and real datasets, we demonstrate that CRAFTER is superior to Random Forest Clustering (RFC), an existing tree-based clustering method, in terms of both clustering quality and computational cost.

Original language: English (US)
Journal: IEEE Transactions on Knowledge and Data Engineering
DOI: 10.1109/TKDE.2018.2807444
State: Accepted/In press - Feb 16 2018

Keywords

  • Categorical Attribute
  • Clustering
  • Clustering algorithms
  • Complexity theory
  • Computational efficiency
  • Data mining
  • Ensemble Method
  • High Dimensionality
  • Mixed Attributes
  • Partitioning algorithms
  • Radio frequency
  • Random Forest
  • Static Datasets
  • Training

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

@article{007de97e8d614d408b94b217e04a2988,
title = "{CRAFTER}: A Tree-ensemble Clustering Algorithm for Static Datasets with Mixed Attributes and High Dimensionality",
abstract = "Clustering is an important aspect of data mining, yet clustering high-dimensional, mixed-attribute data in a scalable fashion remains a challenging problem. In this paper, we propose CRAFTER, a tree-ensemble clustering algorithm for static datasets, to tackle this problem. CRAFTER handles categorical and numeric attributes simultaneously and scales well with both the dimensionality and the size of the dataset. It leverages the advantages of a tree ensemble to handle mixed attributes and high dimensionality, and it uses class probability estimates to identify the representative data points for clustering. Through a series of experiments on both synthetic and real datasets, we demonstrate that CRAFTER is superior to Random Forest Clustering (RFC), an existing tree-based clustering method, in terms of both clustering quality and computational cost.",
keywords = "Categorical Attribute, Clustering, Clustering algorithms, Complexity theory, Computational efficiency, Data mining, Ensemble Method, High Dimensionality, Mixed Attributes, Partitioning algorithms, Radio frequency, Random Forest, Static Datasets, Training",
author = "Sangdi Lin and Bahareh Azarnoush and George Runger",
year = "2018",
month = "2",
day = "16",
doi = "10.1109/TKDE.2018.2807444",
language = "English (US)",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - CRAFTER

T2 - a Tree-ensemble Clustering Algorithm for Static Datasets with Mixed Attributes and High Dimensionality

AU - Lin, Sangdi

AU - Azarnoush, Bahareh

AU - Runger, George

PY - 2018/2/16

Y1 - 2018/2/16

AB - Clustering is an important aspect of data mining, yet clustering high-dimensional, mixed-attribute data in a scalable fashion remains a challenging problem. In this paper, we propose CRAFTER, a tree-ensemble clustering algorithm for static datasets, to tackle this problem. CRAFTER handles categorical and numeric attributes simultaneously and scales well with both the dimensionality and the size of the dataset. It leverages the advantages of a tree ensemble to handle mixed attributes and high dimensionality, and it uses class probability estimates to identify the representative data points for clustering. Through a series of experiments on both synthetic and real datasets, we demonstrate that CRAFTER is superior to Random Forest Clustering (RFC), an existing tree-based clustering method, in terms of both clustering quality and computational cost.

KW - Categorical Attribute

KW - Clustering

KW - Clustering algorithms

KW - Complexity theory

KW - Computational efficiency

KW - Data mining

KW - Ensemble Method

KW - High Dimensionality

KW - Mixed Attributes

KW - Partitioning algorithms

KW - Radio frequency

KW - Random Forest

KW - Static Datasets

KW - Training

UR - http://www.scopus.com/inward/record.url?scp=85042184574&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85042184574&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2018.2807444

DO - 10.1109/TKDE.2018.2807444

M3 - Article

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

ER -