Feature selection for clustering - A filter solution

Manoranjan Dash; Kiseok Choi; Peter Scheuermann; Huan Liu

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

336 Scopus citations

Abstract

Processing applications with a large number of dimensions has been a challenge to the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing method to remove noisy features. In the literature there are only a few methods proposed for feature selection for clustering, and almost all of them are 'wrapper' techniques that require a clustering algorithm to evaluate the candidate feature subsets. The wrapper approach is largely unsuitable in real-world applications due to its heavy reliance on clustering algorithms that require parameters such as the number of clusters, and due to the lack of suitable clustering criteria to evaluate clustering in different subspaces. In this paper we propose a 'filter' method that is independent of any clustering algorithm. The proposed method is based on the observation that data with clusters has a very different point-to-point distance histogram than data without clusters. Using this observation, we propose an entropy measure that is low if the data has distinct clusters and high otherwise. The entropy measure is suitable for selecting the most important subset of features because it is invariant to the number of dimensions and is affected only by the quality of clustering. Extensive performance evaluation over synthetic, benchmark, and real datasets shows its effectiveness.
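The idea of a distance-based entropy can be illustrated with a minimal Python sketch. This is not the paper's exact formula, which the abstract does not give: it assumes a simple stand-in, the binary entropy of pairwise Euclidean distances normalized to [0, 1], averaged over all point pairs. The function name `distance_entropy`, the normalization by the maximum distance, and the synthetic data are all hypothetical choices made for illustration. Clustered data yields distances near 0 (within a cluster) or near 1 (between clusters), so the score is low; unstructured data yields many mid-range distances, so the score is higher.

```python
import numpy as np


def distance_entropy(X, eps=1e-12):
    """Illustrative entropy score over pairwise distances (an assumed
    stand-in, not the measure defined in the paper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle, no diagonal).
    i, j = np.triu_indices(n, k=1)
    d = dist[i, j]
    if d.max() == 0:
        return 0.0  # all points identical: no spread at all
    # Normalize to [0, 1] so the score does not grow with dimensionality.
    d = np.clip(d / d.max(), eps, 1 - eps)
    # Binary entropy: near 0 for d close to 0 or 1, maximal at d = 0.5.
    h = -(d * np.log2(d) + (1 - d) * np.log2(1 - d))
    return float(h.mean())


rng = np.random.default_rng(0)
# Two tight clusters in features 0 and 1 (synthetic, for illustration).
clustered = np.vstack([
    rng.normal(0.0, 0.1, (50, 2)),
    rng.normal(10.0, 0.1, (50, 2)),
])
# Feature 2 is uniform noise carrying no cluster structure.
noise = rng.uniform(0.0, 1.0, (100, 1))
X = np.hstack([clustered, noise])

e_clustered = distance_entropy(X[:, [0, 1]])
e_noise = distance_entropy(X[:, [2]])
print(f"entropy on clustered features: {e_clustered:.3f}")
print(f"entropy on noise feature:      {e_noise:.3f}")
```

Under these assumptions, a filter-style selector would score candidate feature subsets with such a measure and prefer the subsets with the lowest entropy, without running any clustering algorithm.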

Original language	English (US)
Title of host publication	Proceedings - 2002 IEEE International Conference on Data Mining, ICDM 2002
Pages	115-122
Number of pages	8
State	Published - 2002
Event	2nd IEEE International Conference on Data Mining, ICDM '02 - Maebashi, Japan
Duration	Dec 9 2002 → Dec 12 2002

Publication series

Name	Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print)	1550-4786

Other

Other	2nd IEEE International Conference on Data Mining, ICDM '02
Country/Territory	Japan
City	Maebashi
Period	12/9/02 → 12/12/02

ASJC Scopus subject areas

General Engineering

Cite this

@inproceedings{1c7bfc5bde4a436789b93fea86583ac6,

title = "Feature selection for clustering - A filter solution",

abstract = "Processing applications with a large number of dimensions has been a challenge to the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing method to remove noisy features. In the literature there are only a few methods proposed for feature selection for clustering, and almost all of them are 'wrapper' techniques that require a clustering algorithm to evaluate the candidate feature subsets. The wrapper approach is largely unsuitable in real-world applications due to its heavy reliance on clustering algorithms that require parameters such as the number of clusters, and due to the lack of suitable clustering criteria to evaluate clustering in different subspaces. In this paper we propose a 'filter' method that is independent of any clustering algorithm. The proposed method is based on the observation that data with clusters has a very different point-to-point distance histogram than data without clusters. Using this observation, we propose an entropy measure that is low if the data has distinct clusters and high otherwise. The entropy measure is suitable for selecting the most important subset of features because it is invariant to the number of dimensions and is affected only by the quality of clustering. Extensive performance evaluation over synthetic, benchmark, and real datasets shows its effectiveness.",

author = "Manoranjan Dash and Kiseok Choi and Peter Scheuermann and Huan Liu",

year = "2002",

language = "English (US)",

isbn = "0769517544",

series = "Proceedings - IEEE International Conference on Data Mining, ICDM",

pages = "115--122",

booktitle = "Proceedings - 2002 IEEE International Conference on Data Mining, ICDM 2002",

}

TY - GEN

T1 - Feature selection for clustering - A filter solution

AU - Dash, Manoranjan

AU - Choi, Kiseok

AU - Scheuermann, Peter

AU - Liu, Huan

PY - 2002

Y1 - 2002

AB - Processing applications with a large number of dimensions has been a challenge to the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing method to remove noisy features. In the literature there are only a few methods proposed for feature selection for clustering, and almost all of them are 'wrapper' techniques that require a clustering algorithm to evaluate the candidate feature subsets. The wrapper approach is largely unsuitable in real-world applications due to its heavy reliance on clustering algorithms that require parameters such as the number of clusters, and due to the lack of suitable clustering criteria to evaluate clustering in different subspaces. In this paper we propose a 'filter' method that is independent of any clustering algorithm. The proposed method is based on the observation that data with clusters has a very different point-to-point distance histogram than data without clusters. Using this observation, we propose an entropy measure that is low if the data has distinct clusters and high otherwise. The entropy measure is suitable for selecting the most important subset of features because it is invariant to the number of dimensions and is affected only by the quality of clustering. Extensive performance evaluation over synthetic, benchmark, and real datasets shows its effectiveness.

UR - http://www.scopus.com/inward/record.url?scp=78149289039&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78149289039&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:78149289039

SN - 0769517544

SN - 9780769517544

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 115

EP - 122

BT - Proceedings - 2002 IEEE International Conference on Data Mining, ICDM 2002

T2 - 2nd IEEE International Conference on Data Mining, ICDM '02

Y2 - 9 December 2002 through 12 December 2002

ER -
