Discretization: An enabling technique

Huan Liu; Farhad Hussain; Chew Lim Tan; Manoranjan Dash

doi:10.1023/A:1016304305535

Research output: Contribution to journal › Article › peer-review

774 Scopus citations

Abstract

Discrete values have important roles in data mining and knowledge discovery. They are about intervals of numbers which are more concise to represent and specify, easier to use and comprehend as they are closer to a knowledge-level representation than continuous values. Many studies show induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms found in the literature require discrete features. All these prompt researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. There are numerous discretization methods available in the literature. It is time for us to examine these seemingly different methods for discretization and find out how different they really are, what are the key components of a discretization process, how we can improve the current level of research for new development as well as the use of existing methods. This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy. Contributions of this paper are an abstract description summarizing existing discretization methods, a hierarchical framework to categorize the existing methods and pave the way for further development, concise discussions of representative discretization methods, extensive experiments and their analysis, and some guidelines as to how to choose a discretization method under various circumstances. We also identify some issues yet to solve and future research for discretization.

Original language	English (US)
Pages (from-to)	393-423
Number of pages	31
Journal	Data Mining and Knowledge Discovery
Volume	6
Issue number	4
DOIs	https://doi.org/10.1023/A:1016304305535
State	Published - 2002
Externally published	Yes

Keywords

Classification
Continuous feature
Data mining
Discretization

ASJC Scopus subject areas

Information Systems
Computer Science Applications
Computer Networks and Communications

Cite this

@article{be84d269a19a46b7a61bd4562a6e644a,
  title = "Discretization: An enabling technique",
  abstract = "Discrete values have important roles in data mining and knowledge discovery. They are about intervals of numbers which are more concise to represent and specify, easier to use and comprehend as they are closer to a knowledge-level representation than continuous values. Many studies show induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms found in the literature require discrete features. All these prompt researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. There are numerous discretization methods available in the literature. It is time for us to examine these seemingly different methods for discretization and find out how different they really are, what are the key components of a discretization process, how we can improve the current level of research for new development as well as the use of existing methods. This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy. Contributions of this paper are an abstract description summarizing existing discretization methods, a hierarchical framework to categorize the existing methods and pave the way for further development, concise discussions of representative discretization methods, extensive experiments and their analysis, and some guidelines as to how to choose a discretization method under various circumstances. We also identify some issues yet to solve and future research for discretization.",
  keywords = "Classification, Continuous feature, Data mining, Discretization",
  author = "Huan Liu and Farhad Hussain and Tan, {Chew Lim} and Manoranjan Dash",
  year = "2002",
  doi = "10.1023/A:1016304305535",
  language = "English (US)",
  volume = "6",
  pages = "393--423",
  journal = "Data Mining and Knowledge Discovery",
  issn = "1384-5810",
  publisher = "Springer Netherlands",
  number = "4",
}

TY - JOUR
T1 - Discretization
T2 - An enabling technique
AU - Liu, Huan
AU - Hussain, Farhad
AU - Tan, Chew Lim
AU - Dash, Manoranjan
PY - 2002
Y1 - 2002
N2 - Discrete values have important roles in data mining and knowledge discovery. They are about intervals of numbers which are more concise to represent and specify, easier to use and comprehend as they are closer to a knowledge-level representation than continuous values. Many studies show induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms found in the literature require discrete features. All these prompt researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. There are numerous discretization methods available in the literature. It is time for us to examine these seemingly different methods for discretization and find out how different they really are, what are the key components of a discretization process, how we can improve the current level of research for new development as well as the use of existing methods. This paper aims at a systematic study of discretization methods with their history of development, effect on classification, and trade-off between speed and accuracy. Contributions of this paper are an abstract description summarizing existing discretization methods, a hierarchical framework to categorize the existing methods and pave the way for further development, concise discussions of representative discretization methods, extensive experiments and their analysis, and some guidelines as to how to choose a discretization method under various circumstances. We also identify some issues yet to solve and future research for discretization.
KW - Classification
KW - Continuous feature
KW - Data mining
KW - Discretization
UR - http://www.scopus.com/inward/record.url?scp=0141688369&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0141688369&partnerID=8YFLogxK
U2 - 10.1023/A:1016304305535
DO - 10.1023/A:1016304305535
M3 - Article
AN - SCOPUS:0141688369
SN - 1384-5810
VL - 6
SP - 393
EP - 423
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 4
ER -
