Feature subset selection bias for classification learning

Surendra K. Singhi; Huan Liu

doi:10.1145/1143844.1143951

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

37 Scopus citations

Abstract

Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset for both selection and learning can result in so-called feature subset selection bias. This bias is thought to exacerbate overfitting and degrade classification performance. In current practice, however, separate datasets are seldom used for selection and learning, because splitting the training data between feature selection and classifier learning reduces the amount of data available for either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study, through various experiments, the factors that affect selection bias and how the bias impacts classification learning. The research aims to illustrate and explain why the bias may not harm classification as much as it is expected to harm regression.
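The effect described in the abstract can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not the authors' formalization or experimental setup: every feature is pure noise, features are ranked by absolute Pearson correlation with the label, and the function names (`abs_corr_with_label`, `selection_bias_demo`) and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np


def abs_corr_with_label(X, y):
    """Absolute Pearson correlation between each column of X and the label y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom


def selection_bias_demo(n_samples=50, n_features=1000, k=10, seed=0):
    """Compare the apparent relevance of selected features on the data used
    for selection with their relevance on an independent sample.

    All features are noise, so no feature is truly related to the label.
    """
    rng = np.random.default_rng(seed)

    # Sample used for feature selection: noise features, balanced 0/1 labels.
    X_sel = rng.standard_normal((n_samples, n_features))
    y_sel = rng.permutation(np.arange(n_samples) % 2).astype(float)

    # Independent sample drawn the same way (assumed stand-in for held-out data).
    X_new = rng.standard_normal((n_samples, n_features))
    y_new = rng.permutation(np.arange(n_samples) % 2).astype(float)

    # Rank features on the selection sample and keep the k highest scores.
    scores_sel = abs_corr_with_label(X_sel, y_sel)
    top_k = np.argsort(scores_sel)[-k:]

    # Score the same k features on the independent sample.
    scores_new = abs_corr_with_label(X_new, y_new)
    return scores_sel[top_k].mean(), scores_new[top_k].mean()


if __name__ == "__main__":
    same, fresh = selection_bias_demo()
    print(f"mean |corr| of selected features on selection data: {same:.3f}")
    print(f"mean |corr| of same features on independent data:   {fresh:.3f}")
```

Because every feature in this sketch is noise, any apparent relevance of the selected features on the selection data is produced by the selection step itself, and it should largely disappear on the independent sample. The sketch measures only the inflated relevance estimate; it does not show how much that inflation affects a trained classifier, which is the question the paper studies.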

Original language	English (US)
Title of host publication	ACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006
Pages	849-856
Number of pages	8
DOIs	https://doi.org/10.1145/1143844.1143951
State	Published - 2006
Event	23rd International Conference on Machine Learning, ICML 2006 - Pittsburgh, PA, United States
Duration	Jun 25 2006 → Jun 29 2006

Publication series

Name	ACM International Conference Proceeding Series
Volume	148

Other

Other	23rd International Conference on Machine Learning, ICML 2006
Country/Territory	United States
City	Pittsburgh, PA
Period	6/25/06 → 6/29/06

ASJC Scopus subject areas

Software
Human-Computer Interaction
Computer Vision and Pattern Recognition
Computer Networks and Communications

Cite this

Feature subset selection bias for classification learning. / Singhi, Surendra K.; Liu, Huan.
ACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006. 2006. p. 849-856 (ACM International Conference Proceeding Series; Vol. 148).

Singhi, SK & Liu, H 2006, Feature subset selection bias for classification learning. in ACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006. ACM International Conference Proceeding Series, vol. 148, pp. 849-856, 23rd International Conference on Machine Learning, ICML 2006, Pittsburgh, PA, United States, 6/25/06. https://doi.org/10.1145/1143844.1143951

@inproceedings{4c90c7c9de814e9db6cf191bc15ac9c5,
  title = "Feature subset selection bias for classification learning",
  abstract = "Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset in both selection and learning can result in so-called feature subset selection bias. This bias putatively can exacerbate data over-fitting and negatively affect classification performance. However, in current practice separate datasets are seldom employed for selection and learning, because dividing the training data into two datasets for feature selection and classifier learning respectively reduces the amount of data that can be used in either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study factors that affect selection bias, as well as how the bias impacts classification learning via various experiments. This research endeavors to provide illustration and explanation why the bias may not cause negative impact in classification as much as expected in regression.",
  author = "Singhi, {Surendra K.} and Huan Liu",
  year = "2006",
  doi = "10.1145/1143844.1143951",
  language = "English (US)",
  isbn = "1595933832",
  series = "ACM International Conference Proceeding Series",
  pages = "849--856",
  booktitle = "ACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006",
  note = "23rd International Conference on Machine Learning, ICML 2006 ; Conference date: 25-06-2006 Through 29-06-2006",
}

TY  - GEN
T1  - Feature subset selection bias for classification learning
AU  - Singhi, Surendra K.
AU  - Liu, Huan
PY  - 2006
Y1  - 2006
N2  - Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset in both selection and learning can result in so-called feature subset selection bias. This bias putatively can exacerbate data over-fitting and negatively affect classification performance. However, in current practice separate datasets are seldom employed for selection and learning, because dividing the training data into two datasets for feature selection and classifier learning respectively reduces the amount of data that can be used in either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study factors that affect selection bias, as well as how the bias impacts classification learning via various experiments. This research endeavors to provide illustration and explanation why the bias may not cause negative impact in classification as much as expected in regression.
AB  - Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset in both selection and learning can result in so-called feature subset selection bias. This bias putatively can exacerbate data over-fitting and negatively affect classification performance. However, in current practice separate datasets are seldom employed for selection and learning, because dividing the training data into two datasets for feature selection and classifier learning respectively reduces the amount of data that can be used in either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study factors that affect selection bias, as well as how the bias impacts classification learning via various experiments. This research endeavors to provide illustration and explanation why the bias may not cause negative impact in classification as much as expected in regression.
UR  - http://www.scopus.com/inward/record.url?scp=34250694929&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=34250694929&partnerID=8YFLogxK
U2  - 10.1145/1143844.1143951
DO  - 10.1145/1143844.1143951
M3  - Conference contribution
AN  - SCOPUS:34250694929
SN  - 1595933832
SN  - 9781595933836
T3  - ACM International Conference Proceeding Series
SP  - 849
EP  - 856
BT  - ACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006
T2  - 23rd International Conference on Machine Learning, ICML 2006
Y2  - 25 June 2006 through 29 June 2006
ER  - 
