Pre-processing of high-dimensional categorical predictors in classification settings

Eugene Tuv, George Runger

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of a product in semiconductor manufacturing. Such variables pose a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of the values of such variables, where the values in each partition have similar statistical properties with respect to the categorical response. Such partitions (interesting in themselves) can then be used as input to standard learning algorithms, such as decision trees and support vector machines. The proposed approach introduces a data transformation on derived sparse frequency tables. Applying even the simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement in both speed and partition quality compared with currently used methods.
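The pipeline the abstract outlines (frequency table, transformation, metric clustering) can be illustrated with a minimal sketch. The abstract does not say which transformation the paper uses, so this example substitutes smoothed per-level class proportions as a stand-in, and uses plain k-means as one simple non-hierarchical metric method. All function names, the `alpha` smoothing parameter, and the toy lot data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def frequency_table(levels, responses):
    """Count response classes for each level of the categorical predictor."""
    table = defaultdict(Counter)
    for level, cls in zip(levels, responses):
        table[level][cls] += 1
    return table


def to_coordinates(table, classes, alpha=1.0):
    """ASSUMED transform (not the paper's): smoothed class proportions per level."""
    coords = {}
    for level, counts in table.items():
        total = sum(counts.values()) + alpha * len(classes)
        coords[level] = [(counts[c] + alpha) / total for c in classes]
    return coords


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means over {level: coordinates}; deterministic start."""
    keys = sorted(points)
    ordered = sorted(keys, key=lambda key: points[key])
    # Initial centers: k points spread evenly across the sorted coordinates.
    centers = [
        list(points[ordered[i * (len(ordered) - 1) // max(k - 1, 1)]])
        for i in range(k)
    ]
    assign = {}
    for _ in range(iters):
        new_assign = {
            key: min(
                range(k),
                key=lambda j: sum(
                    (a - b) ** 2 for a, b in zip(points[key], centers[j])
                ),
            )
            for key in keys
        }
        if new_assign == assign:  # converged: no level changed cluster
            break
        assign = new_assign
        for j in range(k):
            members = [points[key] for key in keys if assign[key] == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy data: four hypothetical lot identifiers with pass/fail outcomes.
lots = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4
outcome = (
    ["pass", "pass", "pass", "fail"]
    + ["pass"] * 4
    + ["fail", "fail", "fail", "pass"]
    + ["fail"] * 4
)

table = frequency_table(lots, outcome)
coords = to_coordinates(table, classes=["fail", "pass"])
groups = kmeans(coords, k=2)
print(groups)
```

On this toy data the sketch places lots A and B in one group and lots C and D in the other. The group label could then replace the raw lot identifier as the input to a downstream classifier.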

Original language: English (US)
Pages (from-to): 419-429
Number of pages: 11
Journal: Applied Artificial Intelligence
Volume: 17
Issue number: 5-6
State: Published - May 2003

Fingerprint

Decision trees
Learning algorithms
Industrial applications
Support vector machines
Semiconductor materials
Processing

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
  • Artificial Intelligence

Cite this

Pre-processing of high-dimensional categorical predictors in classification settings. / Tuv, Eugene; Runger, George.

In: Applied Artificial Intelligence, Vol. 17, No. 5-6, 05.2003, p. 419-429.


@article{7f5d47447728459ea30df5df9061140f,
title = "Pre-processing of high-dimensional categorical predictors in classification settings",
abstract = "Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of product in semiconductor manufacturing. Such variables represent a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of values for such variables that have similar statistical properties in terms of categorical response. Such partitions (interesting by itself) can be used then as an input to standard learning algorithms, such as decision trees, support vector machines, etc. The proposed approach introduces a data transformation on derived sparse frequency tables. Application of even simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement both in speed and quality of partition in comparison to currently used methods.",
author = "Eugene Tuv and George Runger",
year = "2003",
month = "5",
language = "English (US)",
volume = "17",
pages = "419--429",
journal = "Applied Artificial Intelligence",
issn = "0883-9514",
publisher = "Taylor and Francis Ltd.",
number = "5-6",

}

TY - JOUR

T1 - Pre-processing of high-dimensional categorical predictors in classification settings

AU - Tuv, Eugene

AU - Runger, George

PY - 2003/5

Y1 - 2003/5

N2 - Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of product in semiconductor manufacturing. Such variables represent a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of values for such variables that have similar statistical properties in terms of categorical response. Such partitions (interesting by itself) can be used then as an input to standard learning algorithms, such as decision trees, support vector machines, etc. The proposed approach introduces a data transformation on derived sparse frequency tables. Application of even simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement both in speed and quality of partition in comparison to currently used methods.

AB - Models in industrial applications can encounter categorical predictors with a large number of categories (hundreds or thousands). An example is the lot identifier of product in semiconductor manufacturing. Such variables represent a serious problem for practically all modern classification techniques. The goal is an efficient, computationally fast way to discover a small number of natural partitions of values for such variables that have similar statistical properties in terms of categorical response. Such partitions (interesting by itself) can be used then as an input to standard learning algorithms, such as decision trees, support vector machines, etc. The proposed approach introduces a data transformation on derived sparse frequency tables. Application of even simplest non-hierarchical metric clustering method to the transformed coordinates shows significant improvement both in speed and quality of partition in comparison to currently used methods.

UR - http://www.scopus.com/inward/record.url?scp=0242544039&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0242544039&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:0242544039

VL - 17

SP - 419

EP - 429

JO - Applied Artificial Intelligence

JF - Applied Artificial Intelligence

SN - 0883-9514

IS - 5-6

ER -