Comparing count-based and band-based indices of word frequency

Implications for active vocabulary research and pedagogical applications

Scott A. Crossley, Tom Cobb, Danielle McNamara

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

In assessments of second language (L2) writing, quality of lexis typically claims more variance than other factors, and the most readily operationalized measure of lexical quality is word frequency. This study compares two methods of automatically assessing word frequency in learner productions. The first method, a band-based method, involves lexical frequency profiling, a procedure that first groups individual words into families and then sorts these into corpus-based frequency bands. The second method, a count-based method, assigns a normalized corpus frequency count to each individual word form used, yielding an average count for a text. Both band and count-based methods were used to analyze 100 L2 learner and 30 native speaker freewrites that had been classified according to proficiency level (i.e., native speakers and beginning, intermediate and advanced L2 learners). Machine learning algorithms were used to classify the texts into their respective proficiency levels with results indicating that count-based word frequency indices accurately classified 58% of the texts while band-based indices reported accuracies that were between 10% and 22% lower than count-based indices.
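The contrast between the two methods can be illustrated with a minimal sketch. It is not the authors' tooling: the frequency counts, word-family mapping, band assignments, and function names below are hypothetical toy values chosen for illustration, and skipping off-list words in the count-based average is an assumption rather than something the abstract states.

```python
from collections import Counter

# Hypothetical toy resources (assumptions, not data from the study).
# Normalized corpus counts (per million words) for individual word forms.
FREQ_PER_MILLION = {
    "the": 60000.0, "go": 900.0, "goes": 150.0,
    "went": 400.0, "analyze": 20.0, "analysis": 45.0,
}
# Word form -> family headword; forms not listed are their own headword.
FAMILY_OF = {"goes": "go", "went": "go", "analysis": "analyze"}
# Family headword -> frequency band (1 = most frequent band).
FAMILY_BAND = {"the": 1, "go": 1, "analyze": 3}

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in stripped if word]


def band_profile(tokens):
    """Band-based: map each token to its family, then to a band.

    Returns the proportion of tokens per band; band 0 means off-list.
    """
    if not tokens:
        return {}
    bands = Counter(FAMILY_BAND.get(FAMILY_OF.get(t, t), 0) for t in tokens)
    return {band: n / len(tokens) for band, n in sorted(bands.items())}


def mean_frequency(tokens):
    """Count-based: average the normalized count of each word form.

    Word forms missing from the frequency list are skipped (an assumption).
    """
    counts = [FREQ_PER_MILLION[t] for t in tokens if t in FREQ_PER_MILLION]
    return sum(counts) / len(counts) if counts else 0.0


tokens = tokenize("The analysis went on.")
profile = band_profile(tokens)
average = mean_frequency(tokens)
```

In this toy example "went" and "go" fall into the same band because they share a family, while the count-based index treats them as separate word forms with different counts.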

Original language: English (US)
Pages (from-to): 965-981
Number of pages: 17
Journal: System
Volume: 41
Issue number: 4
DOI: 10.1016/j.system.2013.08.002
State: Published - Dec 2013


Keywords

  • Active and passive lexical proficiency
  • Band-based frequency measures
  • Computational linguistics
  • Count-based frequency measures
  • Frequency analysis
  • Frequency lists
  • Learner corpora
  • Lexical sophistication

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language

Cite this

Comparing count-based and band-based indices of word frequency: Implications for active vocabulary research and pedagogical applications. / Crossley, Scott A.; Cobb, Tom; McNamara, Danielle.

In: System, Vol. 41, No. 4, 12.2013, p. 965-981.


@article{4e90d09f9d884bcdbd4998ab3b10daf0,
title = "Comparing count-based and band-based indices of word frequency: Implications for active vocabulary research and pedagogical applications",
abstract = "In assessments of second language (L2) writing, quality of lexis typically claims more variance than other factors, and the most readily operationalized measure of lexical quality is word frequency. This study compares two methods of automatically assessing word frequency in learner productions. The first method, a band-based method, involves lexical frequency profiling, a procedure that first groups individual words into families and then sorts these into corpus-based frequency bands. The second method, a count-based method, assigns a normalized corpus frequency count to each individual word form used, yielding an average count for a text. Both band and count-based methods were used to analyze 100 L2 learner and 30 native speaker freewrites that had been classified according to proficiency level (i.e., native speakers and beginning, intermediate and advanced L2 learners). Machine learning algorithms were used to classify the texts into their respective proficiency levels with results indicating that count-based word frequency indices accurately classified 58{\%} of the texts while band-based indices reported accuracies that were between 10{\%} and 22{\%} lower than count-based indices.",
keywords = "Active and passive lexical proficiency, Band-based frequency measures, Computational linguistics, Count-based frequency measures, Frequency analysis, Frequency lists, Learner corpora, Lexical sophistication",
author = "Crossley, {Scott A.} and Cobb, Tom and McNamara, Danielle",
year = "2013",
month = "12",
doi = "10.1016/j.system.2013.08.002",
language = "English (US)",
volume = "41",
pages = "965--981",
journal = "System",
issn = "0346-251X",
publisher = "Elsevier Limited",
number = "4",
}

TY - JOUR
T1 - Comparing count-based and band-based indices of word frequency
T2 - Implications for active vocabulary research and pedagogical applications
AU - Crossley, Scott A.
AU - Cobb, Tom
AU - McNamara, Danielle
PY - 2013/12
Y1 - 2013/12

N2 - In assessments of second language (L2) writing, quality of lexis typically claims more variance than other factors, and the most readily operationalized measure of lexical quality is word frequency. This study compares two methods of automatically assessing word frequency in learner productions. The first method, a band-based method, involves lexical frequency profiling, a procedure that first groups individual words into families and then sorts these into corpus-based frequency bands. The second method, a count-based method, assigns a normalized corpus frequency count to each individual word form used, yielding an average count for a text. Both band and count-based methods were used to analyze 100 L2 learner and 30 native speaker freewrites that had been classified according to proficiency level (i.e., native speakers and beginning, intermediate and advanced L2 learners). Machine learning algorithms were used to classify the texts into their respective proficiency levels with results indicating that count-based word frequency indices accurately classified 58% of the texts while band-based indices reported accuracies that were between 10% and 22% lower than count-based indices.

AB - In assessments of second language (L2) writing, quality of lexis typically claims more variance than other factors, and the most readily operationalized measure of lexical quality is word frequency. This study compares two methods of automatically assessing word frequency in learner productions. The first method, a band-based method, involves lexical frequency profiling, a procedure that first groups individual words into families and then sorts these into corpus-based frequency bands. The second method, a count-based method, assigns a normalized corpus frequency count to each individual word form used, yielding an average count for a text. Both band and count-based methods were used to analyze 100 L2 learner and 30 native speaker freewrites that had been classified according to proficiency level (i.e., native speakers and beginning, intermediate and advanced L2 learners). Machine learning algorithms were used to classify the texts into their respective proficiency levels with results indicating that count-based word frequency indices accurately classified 58% of the texts while band-based indices reported accuracies that were between 10% and 22% lower than count-based indices.

KW - Active and passive lexical proficiency
KW - Band-based frequency measures
KW - Computational linguistics
KW - Count-based frequency measures
KW - Frequency analysis
KW - Frequency lists
KW - Learner corpora
KW - Lexical sophistication
UR - http://www.scopus.com/inward/record.url?scp=84886793009&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84886793009&partnerID=8YFLogxK
U2 - 10.1016/j.system.2013.08.002
DO - 10.1016/j.system.2013.08.002
M3 - Article
VL - 41
SP - 965
EP - 981
JO - System
JF - System
SN - 0346-251X
IS - 4
ER -