Abstract

Discretizing continuous attributes is necessary before association rule mining or before using several inductive learning algorithms with a heterogeneous data space. This data preprocessing step should be carried out with minimum information loss; that is, the mutual information between attributes on the one hand, and between attributes and the class labels on the other, should not be destroyed. This paper introduces a novel supervised, global, and dynamic discretization algorithm called RFDisc (Random Forests Discretizer). It derives its ability to conserve the data properties from the Random Forests learning algorithm. RFDisc is simple, relatively fast, and automatically learns the number of bins into which each continuous attribute is to be discretized. Empirical results indicate that the accuracies of classification algorithms such as CART on several data sets are comparable before and after discretization with RFDisc. Furthermore, C5.0 achieves the highest classification accuracy on data discretized with RFDisc when compared with other well-known discretization algorithms.
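
As a rough illustration of the information-loss criterion mentioned in the abstract, the sketch below estimates the mutual information between a discretized attribute and the class labels for two candidate binnings. It is not the RFDisc algorithm, which the abstract does not describe in detail; the toy data, the cut points, and the helper names `discretize` and `mutual_information` are assumptions made purely for illustration.

```python
import math
from bisect import bisect_right
from collections import Counter


def discretize(values, cut_points):
    """Map each continuous value to the index of the bin it falls into."""
    cuts = sorted(cut_points)
    return [bisect_right(cuts, v) for v in values]


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        # p(x, y) / (p(x) * p(y)) rewritten in terms of raw counts
        mi += p_joint * math.log2(p_joint * n * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: one continuous attribute and a binary class label.
values = [0.1, 0.4, 0.5, 0.9, 1.2, 1.6, 2.0, 2.3]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

# A cut point that separates the two classes keeps the label information.
good_bins = discretize(values, [1.0])
# A poorly placed cut point mixes the classes and loses information.
poor_bins = discretize(values, [0.45])

mi_good = mutual_information(good_bins, labels)
mi_poor = mutual_information(poor_bins, labels)
print(f"MI with cut at 1.0:  {mi_good:.3f} bits")
print(f"MI with cut at 0.45: {mi_poor:.3f} bits")
```

In this toy case the first binning retains more of the attribute-class mutual information than the second, which is the kind of property a supervised discretizer is meant to preserve.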

Original language: English (US)
Title of host publication: 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009
Pages: 211-217
Number of pages: 7
DOI: https://doi.org/10.1109/AICCSA.2009.5069327
State: Published - 2009
Event: 7th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA-2009 - Rabat, Morocco
Duration: May 10 2009 → May 13 2009

Other

Other: 7th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA-2009
Country: Morocco
City: Rabat
Period: 5/10/09 → 5/13/09

Fingerprint

Learning algorithms
Association rules
Bins
Labels

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Electrical and Electronic Engineering

Cite this

Berrado, A., & Runger, G. (2009). Supervised multivariate discretization in mixed data with random forests. In 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009 (pp. 211-217). [5069327] https://doi.org/10.1109/AICCSA.2009.5069327

Supervised multivariate discretization in mixed data with random forests. / Berrado, Abdelaziz; Runger, George.

2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009. 2009. p. 211-217 5069327.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Berrado, A & Runger, G 2009, Supervised multivariate discretization in mixed data with random forests. in 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009., 5069327, pp. 211-217, 7th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA-2009, Rabat, Morocco, 5/10/09. https://doi.org/10.1109/AICCSA.2009.5069327
Berrado A, Runger G. Supervised multivariate discretization in mixed data with random forests. In 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009. 2009. p. 211-217. 5069327 https://doi.org/10.1109/AICCSA.2009.5069327
Berrado, Abdelaziz ; Runger, George. / Supervised multivariate discretization in mixed data with random forests. 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009. 2009. pp. 211-217
@inproceedings{15ebd641dc1b429e9044f28b979cec61,
title = "Supervised multivariate discretization in mixed data with random forests",
abstract = "Discretizing continuous attributes is necessary before association rules mining or using several inductive learning algorithms with a heterogeneous data space. This data preprocessing step should be carried out with a minimum information loss; that is the mutual information between attributes on the one hand and between attributes and the class labels on the other should not be destroyed. This paper introduces a novel supervised, global and dynamic discretization algorithm, called RFDisc (Random Forests Discretizer). It derives its ability in conserving the data properties from the Random Forests learning algorithm. RFDisc is simple, relatively fast and learns automatically the number of bins into which each continuous attribute is to be discretized. Empirical results indicate that the accuracies of classification algorithms such as CART when used with several data sets are comparable before and after discretization using RFDisc. Furthermore, C5.0 achieves the highest classification accuracy with data discretized with RFDisc when compared with other well known discretization algorithms.",
author = "Abdelaziz Berrado and George Runger",
year = "2009",
doi = "10.1109/AICCSA.2009.5069327",
language = "English (US)",
isbn = "9781424438068",
pages = "211--217",
booktitle = "2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009",

}

TY - GEN

T1 - Supervised multivariate discretization in mixed data with random forests

AU - Berrado, Abdelaziz

AU - Runger, George

PY - 2009

Y1 - 2009

AB - Discretizing continuous attributes is necessary before association rules mining or using several inductive learning algorithms with a heterogeneous data space. This data preprocessing step should be carried out with a minimum information loss; that is the mutual information between attributes on the one hand and between attributes and the class labels on the other should not be destroyed. This paper introduces a novel supervised, global and dynamic discretization algorithm, called RFDisc (Random Forests Discretizer). It derives its ability in conserving the data properties from the Random Forests learning algorithm. RFDisc is simple, relatively fast and learns automatically the number of bins into which each continuous attribute is to be discretized. Empirical results indicate that the accuracies of classification algorithms such as CART when used with several data sets are comparable before and after discretization using RFDisc. Furthermore, C5.0 achieves the highest classification accuracy with data discretized with RFDisc when compared with other well known discretization algorithms.

UR - http://www.scopus.com/inward/record.url?scp=70349906552&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70349906552&partnerID=8YFLogxK

U2 - 10.1109/AICCSA.2009.5069327

DO - 10.1109/AICCSA.2009.5069327

M3 - Conference contribution

SN - 9781424438068

SP - 211

EP - 217

BT - 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009

ER -