TY - JOUR
T1 - Age of Exposure 2.0
T2 - Estimating word complexity using iterative models of word embeddings
AU - Botarleanu, Robert Mihai
AU - Dascalu, Mihai
AU - Watanabe, Micah
AU - Crossley, Scott Andrew
AU - McNamara, Danielle S.
N1 - Funding Information:
This research was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS – UEFISCDI, project number TE 70 PN-III-P1-1.1-TE-2019-2209, ATES – “Automated Text Evaluation and Simplification,” the Institute of Education Sciences (R305A180144 and R305A180261), and the Office of Naval Research (N00014-17-1-2300; N00014-20-1-2623). The opinions expressed are those of the authors and do not represent views of the IES or ONR. We would also like to thank Prof. Peter Foltz for providing the Word Maturity indices that were used as a baseline in this paper.
Publisher Copyright:
© 2022, The Psychonomic Society, Inc.
PY - 2022/12
Y1 - 2022/12
N2 - Age of acquisition (AoA) is a measure of word complexity which refers to the age at which a word is typically learned. AoA measures have shown strong correlations with reading comprehension, lexical decision times, and writing quality. AoA scores based on both adult and child data have limitations that allow for error in measurement, and increase the cost and effort to produce. In this paper, we introduce Age of Exposure (AoE) version 2, a proxy for human exposure to new vocabulary terms that expands AoA word lists through training regressors to predict AoA scores. Word2vec word embeddings are trained on cumulatively increasing corpora of texts, word exposure trajectories are generated by aligning the word2vec vector spaces, and features of words are derived for modeling AoA scores. Our prediction models achieve low errors (from 13% with a corresponding R² of .35 up to 7% with an R² of .74), can be uniformly applied to different AoA word lists, and generalize to the entire vocabulary of a language. Our method benefits from using existing readability indices to define the order of texts in the corpora, while the performed analyses confirm that the generated AoA scores accurately predicted the difficulty of texts (R² of .84, surpassing related previous work). Further, we provide evidence of the internal reliability of our word trajectory features, demonstrate the effectiveness of the word trajectory features when contrasted with simple lexical features, and show that the exclusion of features that rely on external resources does not significantly impact performance.
AB - Age of acquisition (AoA) is a measure of word complexity which refers to the age at which a word is typically learned. AoA measures have shown strong correlations with reading comprehension, lexical decision times, and writing quality. AoA scores based on both adult and child data have limitations that allow for error in measurement, and increase the cost and effort to produce. In this paper, we introduce Age of Exposure (AoE) version 2, a proxy for human exposure to new vocabulary terms that expands AoA word lists through training regressors to predict AoA scores. Word2vec word embeddings are trained on cumulatively increasing corpora of texts, word exposure trajectories are generated by aligning the word2vec vector spaces, and features of words are derived for modeling AoA scores. Our prediction models achieve low errors (from 13% with a corresponding R² of .35 up to 7% with an R² of .74), can be uniformly applied to different AoA word lists, and generalize to the entire vocabulary of a language. Our method benefits from using existing readability indices to define the order of texts in the corpora, while the performed analyses confirm that the generated AoA scores accurately predicted the difficulty of texts (R² of .84, surpassing related previous work). Further, we provide evidence of the internal reliability of our word trajectory features, demonstrate the effectiveness of the word trajectory features when contrasted with simple lexical features, and show that the exclusion of features that rely on external resources does not significantly impact performance.
KW - Age of acquisition
KW - Age of exposure
KW - Word embeddings
KW - Word exposure
UR - http://www.scopus.com/inward/record.url?scp=85124734837&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85124734837&partnerID=8YFLogxK
U2 - 10.3758/s13428-022-01797-5
DO - 10.3758/s13428-022-01797-5
M3 - Article
C2 - 35167112
AN - SCOPUS:85124734837
SN - 1554-351X
VL - 54
SP - 3015
EP - 3042
JO - Behavior Research Methods
JF - Behavior Research Methods
IS - 6
ER -