Feature selection strategy in text classification

Pui Cheong Gabriel Fung, Fred Morstatter, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

Traditionally, the best number of features is determined by a so-called "rule of thumb" or by using a separate validation dataset. We can find neither an explanation of why these approaches lead to the best number nor a formal feature selection model for obtaining it. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest-scores approach reduces many documents to zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual-objective optimization problem and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate that our proposed framework is effective.
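The zero-length effect described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the use of document frequency as a stand-in feature score, and the helper names `global_top_k` and `per_document_top_m`. The per-document variant is a simple contrast case, not the dual-objective formulation proposed in the paper.

```python
from collections import Counter

# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda"],
    ["lunch", "tomorrow"],
]

# Stand-in feature score: document frequency of each term.
# Any feature ranking metric could be substituted here.
scores = Counter(term for doc in docs for term in set(doc))


def global_top_k(docs, scores, k):
    """Keep only the k highest-scoring features across the whole corpus."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    keep = set(ranked[:k])
    return [[t for t in doc if t in keep] for doc in docs]


def per_document_top_m(docs, scores, m):
    """Keep the m highest-scoring features within each document."""
    return [sorted(doc, key=lambda t: (-scores[t], t))[:m] for doc in docs]


def count_empty(docs):
    """Count documents left with no features after selection."""
    return sum(1 for doc in docs if not doc)


global_docs = global_top_k(docs, scores, k=2)
local_docs = per_document_top_m(docs, scores, m=2)

print("empty after global top-k:", count_empty(global_docs))
print("empty after per-document top-m:", count_empty(local_docs))
```

In this toy setting the global cut keeps only "cheap" and "offer", so the last two documents lose all of their terms, while the per-document cut leaves every document non-empty.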

Original language: English (US)
Title of host publication: Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Proceedings
Publisher: Springer Verlag
Pages: 26-37
Number of pages: 12
Edition: PART 1
ISBN (Print): 9783642208409
State: Published - 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 6634 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • Feature Ranking
  • Feature Selection
  • Selection Strategy
  • Text Classification

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
