Bias analysis in text classification for highly skewed data

Lei Tang, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

27 Citations (Scopus)

Abstract

Feature selection is often applied to high-dimensional data as a preprocessing step in text classification. When dealing with highly skewed data, we observe that typical feature selection metrics such as information gain and chi-squared are biased toward selecting features for the minor class, whereas bi-normal separation can select features for both the minor and major classes. In this work, we investigate how these feature selection metrics affect the performance of frequently used classifiers such as Decision Trees, Naïve Bayes, and Support Vector Machines via bias analysis for highly skewed data. We consider three types of bias: metric bias, class bias, and classifier bias. Extensive experiments are designed to understand how these biases can be employed together efficiently to achieve good classification performance. We report our findings and present recommended approaches to text classification based on bias analysis and the empirical study.
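The abstract names three feature selection metrics, but this record does not give their formulas. The following is a minimal sketch, assuming the commonly used definitions over a 2x2 term/class contingency table; it is not the authors' code. The function names, the example counts, and the clipping constant `eps` are all assumptions made for illustration. Bi-normal separation is taken here as the absolute difference between the inverse standard normal CDF of the true positive rate and of the false positive rate.

```python
import math
from statistics import NormalDist

# Contingency counts for one term, with the minor class treated as "positive":
#   tp: positive docs containing the term   fp: negative docs containing it
#   fn: positive docs without the term      tn: negative docs without it

def entropy(*probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(tp, fp, fn, tn):
    """Class entropy minus class entropy conditioned on term presence."""
    n = tp + fp + fn + tn
    h_class = entropy((tp + fn) / n, (fp + tn) / n)
    h_cond = 0.0
    for a, b in ((tp, fp), (fn, tn)):  # term present, then term absent
        total = a + b
        if total:
            h_cond += total / n * entropy(a / total, b / total)
    return h_class - h_cond

def chi_squared(tp, fp, fn, tn):
    """Chi-squared statistic of the 2x2 table (no continuity correction)."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0

def bi_normal_separation(tp, fp, fn, tn, eps=0.0005):
    """|F^-1(tpr) - F^-1(fpr)|, F being the standard normal CDF.

    Rates are clipped to [eps, 1 - eps] so the inverse CDF stays finite;
    the value of eps is an assumption of this sketch.
    """
    def clip(rate):
        return min(max(rate, eps), 1 - eps)
    tpr = clip(tp / (tp + fn))
    fpr = clip(fp / (fp + tn))
    inv = NormalDist().inv_cdf
    return abs(inv(tpr) - inv(fpr))

if __name__ == "__main__":
    # Hypothetical skewed corpus: 50 minor-class docs, 950 major-class docs.
    terms = {
        "indicates minor class": (40, 95, 10, 855),   # tpr 0.8, fpr 0.1
        "indicates major class": (5, 760, 45, 190),   # tpr 0.1, fpr 0.8
    }
    for name, counts in terms.items():
        print(name,
              round(information_gain(*counts), 4),
              round(chi_squared(*counts), 2),
              round(bi_normal_separation(*counts), 4))
```

Scoring the two hypothetical terms side by side is one way to see how differently the metrics rank a feature that signals the minor class and one that signals the major class when the class sizes are this uneven.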

Original language: English (US)
Title of host publication: Proceedings - IEEE International Conference on Data Mining, ICDM
Pages: 781-784
Number of pages: 4
DOI: https://doi.org/10.1109/ICDM.2005.34
State: Published - 2005
Event: 5th IEEE International Conference on Data Mining, ICDM 2005 - Houston, TX, United States
Duration: Nov 27 2005 to Nov 30 2005

Other

Other: 5th IEEE International Conference on Data Mining, ICDM 2005
Country: United States
City: Houston, TX
Period: 11/27/05 to 11/30/05

Fingerprint

Feature extraction
Classifiers
Decision trees
Support vector machines
Experiments

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Tang, L., & Liu, H. (2005). Bias analysis in text classification for highly skewed data. In Proceedings - IEEE International Conference on Data Mining, ICDM (pp. 781-784). [1565781] https://doi.org/10.1109/ICDM.2005.34

@inproceedings{03ed544fe5ed489bbbd5f98a9e03dab0,
title = "Bias analysis in text classification for highly skewed data",
author = "Lei Tang and Huan Liu",
year = "2005",
doi = "10.1109/ICDM.2005.34",
language = "English (US)",
isbn = "0769522785",
pages = "781--784",
booktitle = "Proceedings - IEEE International Conference on Data Mining, ICDM",

}

TY - GEN

T1 - Bias analysis in text classification for highly skewed data

AU - Tang, Lei

AU - Liu, Huan

PY - 2005

Y1 - 2005

N2 - Feature selection is often applied to high-dimensional data as a preprocessing step in text classification. When dealing with highly skewed data, we observe that typical feature selection metrics like information gain or chi-squared are biased toward selecting features for the minor class, and the metric of bi-normal separation can select features for both minor and major classes. In this work, we investigate how these feature selection metrics impact on the performance of frequently used classifiers such as Decision Trees, Naïve Bayes, and Support Vector Machines via bias analysis for highly skewed data. Three types of biases are metric bias, class bias, and classifier bias. Extensive experiments are designed to understand how these biases can be employed in concert and efficiently to achieve good classification performance. We report our findings and present recommended approaches to text classification based on bias analysis and the empirical study.

UR - http://www.scopus.com/inward/record.url?scp=34548548958&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34548548958&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2005.34

DO - 10.1109/ICDM.2005.34

M3 - Conference contribution

AN - SCOPUS:34548548958

SN - 0769522785

SN - 9780769522784

SP - 781

EP - 784

BT - Proceedings - IEEE International Conference on Data Mining, ICDM

ER -