Abstract

Traditionally, the best number of features is determined by a so-called "rule of thumb" or by using a separate validation dataset. We can find no explanation of why these approaches lead to the best number, nor do we have any formal feature selection model for obtaining it. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest-scores approach reduces many documents to zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual-objective optimization problem and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate that our proposed framework is effective.
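The zero-length effect described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy corpus is hypothetical, and document frequency is used only as a stand-in scoring function, not as the scoring or selection method of the paper. The sketch keeps the k highest-scoring terms across the whole corpus and counts how many documents are left with no terms at all.

```python
from collections import Counter


def document_frequency_scores(docs):
    """Score each term by the number of documents containing it.

    Document frequency is a stand-in scorer chosen for illustration only.
    """
    scores = Counter()
    for doc in docs:
        scores.update(set(doc))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return set(ranked[:k])


def project(docs, vocabulary):
    """Drop every term that is not in the selected vocabulary."""
    return [[term for term in doc if term in vocabulary] for doc in docs]


def count_empty(docs):
    """Count documents that have no terms left after selection."""
    return sum(1 for doc in docs if not doc)


# Hypothetical toy corpus: three finance documents and two off-topic ones.
docs = [
    ["stock", "market", "price"],
    ["stock", "price", "trade"],
    ["market", "trade", "stock"],
    ["goalkeeper", "penalty"],
    ["violin", "concerto"],
]

scores = document_frequency_scores(docs)
vocab = select_top_k(scores, 3)
reduced = project(docs, vocab)
empty = count_empty(reduced)
print(f"selected terms: {sorted(vocab)}")
print(f"documents reduced to zero length: {empty} of {len(docs)}")
```

In this toy setting the globally top-ranked terms all come from the dominant topic, so the two off-topic documents lose every term and could no longer contribute to training. That is the motivation the abstract gives for choosing the number of features per document rather than applying one global cutoff.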

Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 26-37
Number of pages: 12
Volume: 6634 LNAI
Edition: PART 1
DOIs: https://doi.org/10.1007/978-3-642-20841-6-3
State: Published - 2011
Event: 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2011 - Shenzhen, China
Duration: May 24 2011 to May 27 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 6634 LNAI
ISSN (Print): 03029743
ISSN (Electronic): 16113349

Other

Other: 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2011
Country: China
City: Shenzhen
Period: 5/24/11 to 5/27/11

Keywords

  • Feature Ranking
  • Feature Selection
  • Selection Strategy
  • Text Classification

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Fung, P. C. G., Morstatter, F., & Liu, H. (2011). Feature selection strategy in text classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (PART 1 ed., Vol. 6634 LNAI, pp. 26-37). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6634 LNAI, No. PART 1). https://doi.org/10.1007/978-3-642-20841-6-3

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

@inproceedings{fd0014412ea34332b0db3355905b80ee,
title = "Feature selection strategy in text classification",
keywords = "Feature Ranking, Feature Selection, Selection Strategy, Text Classification",
author = "Fung, {Pui Cheong Gabriel} and Fred Morstatter and Huan Liu",
year = "2011",
doi = "10.1007/978-3-642-20841-6-3",
language = "English (US)",
isbn = "9783642208409",
volume = "6634 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
number = "PART 1",
pages = "26--37",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
edition = "PART 1",

}
