Modelling classification performance for large data sets: An empirical study

Baohua Gu; Feifang Hu; Huan Liu

doi:10.1007/3-540-47714-4_29


Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Scopus citations

Abstract

For many learning algorithms, accuracy increases with the size of the training data, forming the well-known learning curve. A learning curve can usually be fitted by interpolating or extrapolating some points on it with a specified model. The fitted curve can then be used to predict the maximum achievable learning accuracy or to estimate the amount of data needed to reach a target accuracy, both of which are especially meaningful for data mining on large data sets. Although several models of learning curves have been proposed, most have not been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. By using all available data for learning, we fit a full-length learning curve; by using a small portion of the data, we fit a part-length learning curve. The models are then compared on two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve predicts learning accuracy at the full length. Experimental results show that the power law (y = a - b*x^(-c)) is the best of the six models on both criteria, for both algorithms and all the data sets. These results support the applicability of learning curves to data mining.
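
To make the power-law model concrete, the following is a minimal sketch of fitting y = a - b*x^(-c) to a part-length learning curve and extrapolating to the full length. It is not the authors' fitting procedure: the grid scan over c, the function names, and the data points are all assumptions made for illustration, and the accuracies are synthetic values generated from a known curve rather than results from the paper.

```python
# Sketch only: fit y = a - b * x**(-c) to (training size, accuracy) points.
# For a fixed c the model is linear in z = x**(-c), so a and b have a
# closed-form least-squares solution; c itself is found by a grid scan.
# The grid scan is an illustrative choice, not the method used in the paper.

def fit_power_law(sizes, accuracies, c_grid=None):
    """Return (a, b, c) minimising the squared error of y = a - b * x**(-c)."""
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 301)]  # candidate c in 0.01 .. 3.00
    n = len(sizes)
    y_mean = sum(accuracies) / n
    best = None
    for c in c_grid:
        z = [x ** (-c) for x in sizes]
        z_mean = sum(z) / n
        szz = sum((zi - z_mean) ** 2 for zi in z)
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, accuracies))
        slope = szy / szz              # slope of y against z equals -b
        a = y_mean - slope * z_mean    # intercept is the asymptotic accuracy a
        b = -slope
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, accuracies))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(x, a, b, c):
    """Accuracy predicted by the fitted curve at training size x."""
    return a - b * x ** (-c)


def size_needed(target, a, b, c):
    """Training size at which the fitted curve reaches the target accuracy."""
    if target >= a:
        raise ValueError("target is at or above the fitted maximum accuracy a")
    return (b / (a - target)) ** (1 / c)


# Synthetic part-length curve (made-up values, generated from a=0.9, b=0.8, c=0.5).
part_sizes = [100, 200, 400, 800, 1600, 3200]
part_acc = [0.9 - 0.8 * x ** (-0.5) for x in part_sizes]

a, b, c = fit_power_law(part_sizes, part_acc)
full_length = 100_000  # hypothetical size of the full data set
print("estimated maximum accuracy:", round(a, 4))
print("predicted accuracy at full length:", round(predict(full_length, a, b, c), 4))
print("size needed for 0.88 accuracy:", round(size_needed(0.88, a, b, c)))
```

The parameter a is the curve's asymptote, so it plays the role of the maximum achievable accuracy mentioned above, and `size_needed` inverts the curve to estimate how much data a target accuracy requires. With real, noisy accuracy measurements a general nonlinear least-squares routine would be a reasonable replacement for the grid scan.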

Original language	English (US)
Title of host publication	Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings
Editors	X. Sean Wang, Ge Yu, Hongjun Lu
Publisher	Springer Verlag
Pages	317-328
Number of pages	12
ISBN (Print)	9783540477143
DOIs	https://doi.org/10.1007/3-540-47714-4_29
State	Published - 2001
Event	2nd International Conference on Web-Age Information Management, WAIM 2001 - Xi’an, China Duration: Jul 9 2001 → Jul 11 2001

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	2118
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	2nd International Conference on Web-Age Information Management, WAIM 2001
Country/Territory	China
City	Xi’an
Period	7/9/01 → 7/11/01

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/3-540-47714-4_29

Cite this

Gu, B., Hu, F., & Liu, H. (2001). Modelling classification performance for large data sets: An empirical study. In X. S. Wang, G. Yu, & H. Lu (Eds.), Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings (pp. 317-328). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2118). Springer Verlag. https://doi.org/10.1007/3-540-47714-4_29

Modelling classification performance for large data sets: An empirical study. / Gu, Baohua; Hu, Feifang; Liu, Huan.
Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings. ed. / X. Sean Wang; Ge Yu; Hongjun Lu. Springer Verlag, 2001. p. 317-328 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2118).


Gu, B, Hu, F & Liu, H 2001, Modelling classification performance for large data sets: An empirical study. in XS Wang, G Yu & H Lu (eds), Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 2118, Springer Verlag, pp. 317-328, 2nd International Conference on Web-Age Information Management, WAIM 2001, Xi’an, China, 7/9/01. https://doi.org/10.1007/3-540-47714-4_29

Gu B, Hu F, Liu H. Modelling classification performance for large data sets: An empirical study. In Wang XS, Yu G, Lu H, editors, Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings. Springer Verlag. 2001. p. 317-328. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/3-540-47714-4_29

Gu, Baohua ; Hu, Feifang ; Liu, Huan. / Modelling classification performance for large data sets : An empirical study. Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings. editor / X. Sean Wang ; Ge Yu ; Hongjun Lu. Springer Verlag, 2001. pp. 317-328 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{bf579f8a922a42a6979042fe612796d5,

title = "Modelling classification performance for large data sets: An empirical study",

abstract = "For many learning algorithms, their learning accuracy will increase as the size of training data increases, forming the well-known learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The obtained learning curve can then be used to predict the maximum achievable learning accuracy or to estimate the amount of data needed to achieve an expected learning accuracy, both of which will be especially meaningful to data mining on large data sets. Although some models have been proposed to model learning curves, most of them do not test their applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. By using all available data for learning, we fit a full-length learning curve; by using a small portion of the data, we fit a part-length learning curve. The models are then compared in terms of two performances: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve can predict learning accuracy at the full length. Experimental results show that the power law (y = a - b*x^(-c)) is the best among the six models in both the performances for the two algorithms and all the data sets. These results support the applicability of learning curves to data mining.",

author = "Baohua Gu and Feifang Hu and Huan Liu",

note = "Publisher Copyright: {\textcopyright} Springer-Verlag Berlin Heidelberg 2001.; 2nd International Conference on Web-Age Information Management, WAIM 2001 ; Conference date: 09-07-2001 Through 11-07-2001",

year = "2001",

doi = "10.1007/3-540-47714-4_29",

language = "English (US)",

isbn = "9783540477143",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "317--328",

editor = "Wang, {X. Sean} and Ge Yu and Hongjun Lu",

booktitle = "Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings",

}

TY - GEN

T1 - Modelling classification performance for large data sets

T2 - 2nd International Conference on Web-Age Information Management, WAIM 2001

AU - Gu, Baohua

AU - Hu, Feifang

AU - Liu, Huan

PY - 2001

Y1 - 2001

N2 - For many learning algorithms, their learning accuracy will increase as the size of training data increases, forming the well-known learning curve. Usually a learning curve can be fitted by interpolating or extrapolating some points on it with a specified model. The obtained learning curve can then be used to predict the maximum achievable learning accuracy or to estimate the amount of data needed to achieve an expected learning accuracy, both of which will be especially meaningful to data mining on large data sets. Although some models have been proposed to model learning curves, most of them do not test their applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. By using all available data for learning, we fit a full-length learning curve; by using a small portion of the data, we fit a part-length learning curve. The models are then compared in terms of two performances: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve can predict learning accuracy at the full length. Experimental results show that the power law (y = a - b*x^(-c)) is the best among the six models in both the performances for the two algorithms and all the data sets. These results support the applicability of learning curves to data mining.


UR - http://www.scopus.com/inward/record.url?scp=84974711038&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84974711038&partnerID=8YFLogxK

U2 - 10.1007/3-540-47714-4_29

DO - 10.1007/3-540-47714-4_29

M3 - Conference contribution

AN - SCOPUS:84974711038

SN - 9783540477143

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 317

EP - 328

BT - Advances in Web-Age Information Management - 2nd International Conference, WAIM 2001, Proceedings

A2 - Wang, X. Sean

A2 - Yu, Ge

A2 - Lu, Hongjun

PB - Springer Verlag

Y2 - 9 July 2001 through 11 July 2001

ER -
