TY - GEN
T1 - Supervised multivariate discretization in mixed data with random forests
AU - Berrado, Abdelaziz
AU - Runger, George
PY - 2009
Y1 - 2009
N2 - Discretizing continuous attributes is necessary before association rule mining or before using several inductive learning algorithms on a heterogeneous data space. This preprocessing step should be carried out with minimal information loss; that is, the mutual information among attributes, and between attributes and the class labels, should be preserved. This paper introduces a novel supervised, global, and dynamic discretization algorithm called RFDisc (Random Forests Discretizer). It derives its ability to conserve the properties of the data from the Random Forests learning algorithm. RFDisc is simple, relatively fast, and automatically learns the number of bins into which each continuous attribute is discretized. Empirical results on several data sets indicate that the accuracies of classification algorithms such as CART are comparable before and after discretization with RFDisc. Furthermore, C5.0 achieves the highest classification accuracy on data discretized with RFDisc when compared with other well-known discretization algorithms.
UR - http://www.scopus.com/inward/record.url?scp=70349906552&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70349906552&partnerID=8YFLogxK
U2 - 10.1109/AICCSA.2009.5069327
DO - 10.1109/AICCSA.2009.5069327
M3 - Conference contribution
AN - SCOPUS:70349906552
SN - 9781424438068
T3 - 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009
SP - 211
EP - 217
BT - 2009 IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2009
T2 - 7th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA-2009
Y2 - 10 May 2009 through 13 May 2009
ER -