TY - JOUR
T1 - Comparing count-based and band-based indices of word frequency
T2 - Implications for active vocabulary research and pedagogical applications
AU - Crossley, Scott A.
AU - Cobb, Tom
AU - McNamara, Danielle
N1 - Funding Information: This research was supported in part by the Institute for Education Sciences (IES R305A080589 and IES R305G20018-02). Ideas expressed in this material are those of the authors and do not necessarily reflect the views of the IES. The authors would also like to thank the anonymous reviewers and the editors and staff of System for their support. Lastly, the authors would like to thank Scott Jarvis and Michael Daller for inviting them to the colloquium "The validity of vocabulary measures" at the 2011 American Association for Applied Linguistics conference, from which the ideas in this paper derive.
PY - 2013/12
Y1 - 2013/12
N2 - In assessments of second language (L2) writing, quality of lexis typically claims more variance than other factors, and the most readily operationalized measure of lexical quality is word frequency. This study compares two methods of automatically assessing word frequency in learner productions. The first method, a band-based method, involves lexical frequency profiling, a procedure that first groups individual words into families and then sorts these into corpus-based frequency bands. The second method, a count-based method, assigns a normalized corpus frequency count to each individual word form used, yielding an average count for a text. Both band-based and count-based methods were used to analyze 100 L2 learner and 30 native speaker freewrites that had been classified according to proficiency level (i.e., native speakers and beginning, intermediate, and advanced L2 learners). Machine learning algorithms were used to classify the texts into their respective proficiency levels. Count-based word frequency indices accurately classified 58% of the texts, while band-based indices reported accuracies that were between 10% and 22% lower than those of the count-based indices.
AB - In assessments of second language (L2) writing, quality of lexis typically claims more variance than other factors, and the most readily operationalized measure of lexical quality is word frequency. This study compares two methods of automatically assessing word frequency in learner productions. The first method, a band-based method, involves lexical frequency profiling, a procedure that first groups individual words into families and then sorts these into corpus-based frequency bands. The second method, a count-based method, assigns a normalized corpus frequency count to each individual word form used, yielding an average count for a text. Both band-based and count-based methods were used to analyze 100 L2 learner and 30 native speaker freewrites that had been classified according to proficiency level (i.e., native speakers and beginning, intermediate, and advanced L2 learners). Machine learning algorithms were used to classify the texts into their respective proficiency levels. Count-based word frequency indices accurately classified 58% of the texts, while band-based indices reported accuracies that were between 10% and 22% lower than those of the count-based indices.
KW - Active and passive lexical proficiency
KW - Band-based frequency measures
KW - Computational linguistics
KW - Count-based frequency measures
KW - Frequency analysis
KW - Frequency lists
KW - Learner corpora
KW - Lexical sophistication
UR - http://www.scopus.com/inward/record.url?scp=84886793009&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84886793009&partnerID=8YFLogxK
U2 - 10.1016/j.system.2013.08.002
DO - 10.1016/j.system.2013.08.002
M3 - Article
AN - SCOPUS:84886793009
VL - 41
SP - 965
EP - 981
JO - System
JF - System
SN - 0346-251X
IS - 4
ER -