Pre-processing of high-dimensional categorical predictors in classification settings

Eugene Tuv, George Runger

Research output: Contribution to journal › Article › peer-review

4 Scopus citations


Models in industrial applications can encounter categorical predictors with a very large number of categories (hundreds or thousands). An example is the lot identifier of a product in semiconductor manufacturing. Such variables pose a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of the values of such variables, so that values in the same partition have similar statistical properties with respect to the categorical response. Such partitions (interesting in themselves) can then be used as input to standard learning algorithms, such as decision trees and support vector machines. The proposed approach introduces a data transformation on derived sparse frequency tables. Applying even the simplest non-hierarchical metric clustering method to the transformed coordinates yields a significant improvement in both the speed and the quality of the partition compared with currently used methods.
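The general idea of grouping category values by their relationship to the response can be illustrated with a minimal sketch. The abstract does not specify the proposed data transformation, so the sketch below substitutes a simple stand-in: each category's row of the category-by-class frequency table is converted to class proportions, and a plain k-means (one example of a non-hierarchical metric clustering method) is run on those rows. The function names `class_profiles`, `kmeans`, and `group_categories`, and the lot/label values, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def class_profiles(categories, labels):
    """Build a category-by-class frequency table and turn each row into
    class proportions. ASSUMPTION: this row normalization is a stand-in,
    not the transformation proposed in the paper."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for cat, lab in zip(categories, labels):
        counts[cat][lab] += 1
    profiles = {}
    for cat, ctr in counts.items():
        total = sum(ctr.values())
        profiles[cat] = [ctr[c] / total for c in classes]
    return profiles


def _sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda j: _sq_dist(pt, centers[j]))
            for pt in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for j in range(k):
            members = [pt for pt, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def group_categories(categories, labels, k, seed=0):
    """Map each category value to a cluster index based on its class profile."""
    profiles = class_profiles(categories, labels)
    names = sorted(profiles)
    assign = kmeans([profiles[n] for n in names], k, seed=seed)
    return dict(zip(names, assign))


# Hypothetical data: four lots, two with mostly "pass" and two with mostly "fail".
lots = ["lot_a"] * 10 + ["lot_b"] * 10 + ["lot_c"] * 10 + ["lot_d"] * 10
outcome = (["pass"] * 9 + ["fail"] + ["pass"] * 8 + ["fail"] * 2
           + ["pass"] + ["fail"] * 9 + ["pass"] * 2 + ["fail"] * 8)
groups = group_categories(lots, outcome, k=2)
print(groups)
```

The resulting mapping from category value to group index could replace the original high-cardinality column before a standard classifier is fitted, which is the use of the partitions that the abstract describes.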

Original language: English (US)
Pages (from-to): 419-429
Number of pages: 11
Journal: Applied Artificial Intelligence
Issue number: 5-6
State: Published - May 1 2003

ASJC Scopus subject areas

  • Artificial Intelligence

