Pre-processing of high-dimensional categorical predictors in classification settings

Eugene Tuv; George Runger

doi:10.1080/713827172

Industrial, Systems and Operations Engineering

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of a product in semiconductor manufacturing. Such variables represent a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of the values of such variables, where the values in each partition have similar statistical properties with respect to the categorical response. Such partitions (interesting in themselves) can then be used as input to standard learning algorithms, such as decision trees and support vector machines. The proposed approach introduces a data transformation on derived sparse frequency tables. Applying even the simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement in both speed and partition quality compared with currently used methods.
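The general idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's transformation, so this example substitutes a simple assumed one: each category is mapped to its vector of class proportions taken from the category-by-class frequency table. Plain k-means stands in for the "simplest non-hierarchical metric clustering method". All function names and the toy data are hypothetical and are not taken from the article.

```python
from collections import Counter, defaultdict


def frequency_table(categories, labels):
    """Count class labels per category: {category: Counter({label: n})}."""
    table = defaultdict(Counter)
    for cat, lab in zip(categories, labels):
        table[cat][lab] += 1
    return table


def row_proportions(table, classes):
    """Assumed transformation (not the paper's): class proportions per category."""
    coords = {}
    for cat, counts in table.items():
        total = sum(counts.values())
        coords[cat] = [counts[c] / total for c in classes]
    return coords


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    names = sorted(points)
    centers = [points[names[0]]]
    while len(centers) < k:
        # Pick the point farthest from all centers chosen so far.
        far = max(names, key=lambda n: min(sq_dist(points[n], c) for c in centers))
        centers.append(points[far])
    assign = {}
    for _ in range(iters):
        # Assignment step: nearest center for every category.
        assign = {
            n: min(range(k), key=lambda j: sq_dist(points[n], centers[j]))
            for n in names
        }
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [points[n] for n in names if assign[n] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy data: four lots, a pass/fail response, four observations per lot.
lots = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4
labels = (
    ["pass"] * 4
    + ["pass"] * 3 + ["fail"]
    + ["fail"] * 3 + ["pass"]
    + ["fail"] * 4
)

classes = sorted(set(labels))
coords = row_proportions(frequency_table(lots, labels), classes)
groups = kmeans(coords, k=2)
```

Here `groups` maps each lot to a cluster index. The grouped variable, with two levels instead of four, could then replace the raw lot identifier as an input to a downstream classifier.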

Original language	English (US)
Pages (from-to)	419-429
Number of pages	11
Journal	Applied Artificial Intelligence
Volume	17
Issue number	5-6
DOIs	https://doi.org/10.1080/713827172
State	Published - May 1 2003

ASJC Scopus subject areas

Artificial Intelligence

Cite this

@article{7f5d47447728459ea30df5df9061140f,

title = "Pre-processing of high-dimensional categorical predictors in classification settings",

abstract = "Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of product in semiconductor manufacturing. Such variables represent a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of values for such variables that have similar statistical properties in terms of categorical response. Such partitions (interesting by itself) can be used then as an input to standard learning algorithms, such as decision trees, support vector machines, etc. The proposed approach introduces a data transformation on derived sparse frequency tables. Application of even simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement both in speed and quality of partition in comparison to currently used methods.",

author = "Eugene Tuv and George Runger",

year = "2003",

month = may,

day = "1",

doi = "10.1080/713827172",

language = "English (US)",

volume = "17",

pages = "419--429",

journal = "Applied Artificial Intelligence",

issn = "0883-9514",

publisher = "Taylor and Francis Ltd.",

number = "5-6",

}

TY - JOUR

T1 - Pre-processing of high-dimensional categorical predictors in classification settings

AU - Tuv, Eugene

AU - Runger, George

PY - 2003/5/1

Y1 - 2003/5/1

N2 - Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of product in semiconductor manufacturing. Such variables represent a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of values for such variables that have similar statistical properties in terms of categorical response. Such partitions (interesting by itself) can be used then as an input to standard learning algorithms, such as decision trees, support vector machines, etc. The proposed approach introduces a data transformation on derived sparse frequency tables. Application of even simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement both in speed and quality of partition in comparison to currently used methods.

AB - Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of product in semiconductor manufacturing. Such variables represent a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of values for such variables that have similar statistical properties in terms of categorical response. Such partitions (interesting by itself) can be used then as an input to standard learning algorithms, such as decision trees, support vector machines, etc. The proposed approach introduces a data transformation on derived sparse frequency tables. Application of even simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement both in speed and quality of partition in comparison to currently used methods.

UR - http://www.scopus.com/inward/record.url?scp=0242544039&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0242544039&partnerID=8YFLogxK

U2 - 10.1080/713827172

DO - 10.1080/713827172

M3 - Article

AN - SCOPUS:0242544039

SN - 0883-9514

VL - 17

SP - 419

EP - 429

JO - Applied Artificial Intelligence

JF - Applied Artificial Intelligence

IS - 5-6

ER -
