Feature selection strategy in text classification

Pui Cheong Gabriel Fung; Fred Morstatter; Huan Liu

doi:10.1007/978-3-642-20841-6_3

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

Traditionally, the best number of features is determined by the so-called "rule of thumb", or by using a separate validation dataset. We can neither find any explanation why these lead to the best number nor do we have any formal feature selection model to obtain this number. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest scores approach will turn many documents into zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual objective optimization problem, and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate our proposed framework is effective.
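
The zero-length effect described in the abstract can be illustrated with a small sketch. This is not the authors' framework: it assumes document frequency as a stand-in feature score, a made-up four-document corpus, and hypothetical helper names (`top_k_features`, `project`). It only shows how keeping the globally highest-scoring terms can leave a document with no features at all.

```python
from collections import Counter


def top_k_features(docs, k):
    """Rank terms by document frequency (an assumed stand-in score) and keep the top k."""
    df = Counter(term for doc in docs for term in set(doc))
    # Sort by score descending, then alphabetically so ties break deterministically.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:k])


def project(docs, vocab):
    """Drop every term that is not in the selected vocabulary."""
    return [[t for t in doc if t in vocab] for doc in docs]


# Hypothetical toy corpus: three finance documents and one sports document.
docs = [
    ["stock", "market", "price"],
    ["stock", "price", "trade"],
    ["market", "stock", "bank"],
    ["goalkeeper", "penalty"],
]

vocab = top_k_features(docs, k=2)
reduced = project(docs, vocab)

# Indices of documents that lost every term after selection.
empty = [i for i, d in enumerate(reduced) if not d]
print("selected features:", sorted(vocab))
print("zero-length documents:", empty)
```

In this toy corpus the sports document shares no term with the two selected features, so it becomes empty and could not contribute to training. Selecting features per document, as the abstract proposes, is one way to avoid that outcome.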

Original language	English (US)
Title of host publication	Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings
Publisher	Springer Verlag
Pages	26-37
Number of pages	12
Edition	PART 1
ISBN (Print)	9783642208409
DOIs	https://doi.org/10.1007/978-3-642-20841-6_3
State	Published - 2011

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number	PART 1
Volume	6634 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Keywords

Feature Ranking
Feature Selection
Selection Strategy
Text Classification

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-642-20841-6_3

Cite this

Fung, P. C. G., Morstatter, F., & Liu, H. (2011). Feature selection strategy in text classification. In Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings (PART 1 ed., pp. 26-37). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6634 LNAI, No. PART 1). Springer Verlag. https://doi.org/10.1007/978-3-642-20841-6_3

Feature selection strategy in text classification. / Fung, Pui Cheong Gabriel; Morstatter, Fred; Liu, Huan.
Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings. PART 1. ed. Springer Verlag, 2011. p. 26-37 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6634 LNAI, No. PART 1).

Fung, PCG, Morstatter, F & Liu, H 2011, Feature selection strategy in text classification. in Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings. PART 1 edn, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), no. PART 1, vol. 6634 LNAI, Springer Verlag, pp. 26-37. https://doi.org/10.1007/978-3-642-20841-6_3

Fung PCG, Morstatter F, Liu H. Feature selection strategy in text classification. In Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings. PART 1 ed. Springer Verlag. 2011. p. 26-37. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 1). doi: 10.1007/978-3-642-20841-6_3

Fung, Pui Cheong Gabriel ; Morstatter, Fred ; Liu, Huan. / Feature selection strategy in text classification. Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings. PART 1. ed. Springer Verlag, 2011. pp. 26-37 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 1).

@inproceedings{fd0014412ea34332b0db3355905b80ee,
  title = "Feature selection strategy in text classification",
  abstract = "Traditionally, the best number of features is determined by the so-called {"}rule of thumb{"}, or by using a separate validation dataset. We can neither find any explanation why these lead to the best number nor do we have any formal feature selection model to obtain this number. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest scores approach will turn many documents into zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual objective optimization problem, and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate our proposed framework is effective.",
  keywords = "Feature Ranking, Feature Selection, Selection Strategy, Text Classification",
  author = "Fung, {Pui Cheong Gabriel} and Fred Morstatter and Huan Liu",
  year = "2011",
  doi = "10.1007/978-3-642-20841-6_3",
  language = "English (US)",
  isbn = "9783642208409",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  publisher = "Springer Verlag",
  number = "PART 1",
  pages = "26--37",
  booktitle = "Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings",
  edition = "PART 1",
}

TY  - GEN
T1  - Feature selection strategy in text classification
AU  - Fung, Pui Cheong Gabriel
AU  - Morstatter, Fred
AU  - Liu, Huan
PY  - 2011
Y1  - 2011
N2  - Traditionally, the best number of features is determined by the so-called "rule of thumb", or by using a separate validation dataset. We can neither find any explanation why these lead to the best number nor do we have any formal feature selection model to obtain this number. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest scores approach will turn many documents into zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual objective optimization problem, and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate our proposed framework is effective.
AB  - Traditionally, the best number of features is determined by the so-called "rule of thumb", or by using a separate validation dataset. We can neither find any explanation why these lead to the best number nor do we have any formal feature selection model to obtain this number. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest scores approach will turn many documents into zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual objective optimization problem, and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate our proposed framework is effective.
KW  - Feature Ranking
KW  - Feature Selection
KW  - Selection Strategy
KW  - Text Classification
UR  - http://www.scopus.com/inward/record.url?scp=79957959320&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=79957959320&partnerID=8YFLogxK
U2  - 10.1007/978-3-642-20841-6_3
DO  - 10.1007/978-3-642-20841-6_3
M3  - Conference contribution
AN  - SCOPUS:79957959320
SN  - 9783642208409
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 26
EP  - 37
BT  - Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings
PB  - Springer Verlag
ER  - 
