Feature selection for clustering - A filter solution

Manoranjan Dash, Kiseok Choi, Peter Scheuermann, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

245 Citations (Scopus)

Abstract

Processing applications with a large number of dimensions has been a challenge to the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing step for removing noisy features. The literature offers only a few methods for feature selection for clustering, and almost all of them are 'wrapper' techniques that require a clustering algorithm to evaluate candidate feature subsets. The wrapper approach is largely unsuitable in real-world applications because of its heavy reliance on clustering algorithms that require parameters such as the number of clusters, and because of the lack of suitable clustering criteria for evaluating clusterings in different subspaces. In this paper we propose a 'filter' method that is independent of any clustering algorithm. The proposed method is based on the observation that data with clusters has a very different point-to-point distance histogram than data without clusters. Building on this observation, we propose an entropy measure that is low if the data has distinct clusters and high otherwise. The entropy measure is well suited to selecting the most important subset of features because it is invariant with the number of dimensions and is affected only by the quality of clustering. Extensive performance evaluation on synthetic, benchmark, and real datasets shows its effectiveness.
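The entropy idea described in the abstract can be sketched in a few lines. The formulation below is an illustrative reconstruction, not necessarily the paper's exact measure: it applies a binary-entropy-style score to normalized pairwise distances, which is low when the distance histogram is bimodal (distinct clusters) and high when it is concentrated in the middle (no structure). The helper `rank_features` is hypothetical and shows only the simplest possible use of the score; the paper pairs the measure with a feature-subset search.

```python
import numpy as np

def distance_entropy(X, eps=1e-12):
    """Entropy of normalized pairwise distances.

    Intuition from the abstract: clustered data has small within-cluster
    and large between-cluster distances, so normalized distances pile up
    near 0 and 1 and the entropy is low; structureless data concentrates
    distances in the middle, so the entropy is high.
    """
    # All pairwise Euclidean distances (upper triangle, no self-pairs).
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    d = d[np.triu_indices(len(X), k=1)]
    # Normalize to [0, 1]; clip away from the log singularities at 0 and 1.
    d = np.clip(d / d.max(), eps, 1.0 - eps)
    # Binary-entropy-style score: 0 at d in {0, 1}, log(2) at d = 0.5.
    return float(-np.mean(d * np.log(d) + (1.0 - d) * np.log(1.0 - d)))

def rank_features(X):
    """Rank single features by the entropy each yields on its own
    (lower entropy = stronger cluster structure). A hypothetical helper,
    not the subset search used in the paper."""
    return sorted(range(X.shape[1]), key=lambda j: distance_entropy(X[:, [j]]))
```

For example, with one feature carrying two tight clusters and one feature of uniform noise, `rank_features` puts the clustered feature first, because its bimodal distance histogram gives a lower entropy than the noise feature's.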

Original language: English (US)
Title of host publication: Proceedings - IEEE International Conference on Data Mining, ICDM
Pages: 115-122
Number of pages: 8
State: Published - 2002
Event: 2nd IEEE International Conference on Data Mining, ICDM '02 - Maebashi, Japan
Duration: Dec 9, 2002 - Dec 12, 2002

Other

Other: 2nd IEEE International Conference on Data Mining, ICDM '02
Country: Japan
City: Maebashi
Period: 12/9/02 - 12/12/02

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Dash, M., Choi, K., Scheuermann, P., & Liu, H. (2002). Feature selection for clustering - A filter solution. In Proceedings - IEEE International Conference on Data Mining, ICDM (pp. 115-122).

@inproceedings{1c7bfc5bde4a436789b93fea86583ac6,
title = "Feature selection for clustering - A filter solution",
author = "Manoranjan Dash and Kiseok Choi and Peter Scheuermann and Huan Liu",
year = "2002",
language = "English (US)",
isbn = "0769517544",
pages = "115--122",
booktitle = "Proceedings - IEEE International Conference on Data Mining, ICDM",

}
