How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese

Jerid Francom, Amy LaCross, Adam Ussishkin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can serve as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges facing researchers who develop resources for languages with sparsely available data, and we identify a key empirical link between corpus and psycholinguistic research as a tool for evaluating corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack wide access to diverse language data.
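As a rough illustration of the frequency-versus-familiarity comparison described in the abstract, the sketch below computes a Spearman rank correlation between log corpus frequency and mean familiarity ratings. The abstract does not name the three statistical methods the authors used, so rank correlation is an assumption made here for illustration only. The word labels, counts, and ratings are hypothetical placeholders, not data from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: placeholder items with invented corpus counts and
# invented mean familiarity ratings (e.g. on a 1-7 scale).
corpus_counts = {"w1": 1200, "w2": 340, "w3": 15, "w4": 2, "w5": 75}
familiarity = {"w1": 6.8, "w2": 6.1, "w3": 3.9, "w4": 4.2, "w5": 5.0}

words = sorted(corpus_counts)
log_freq = [math.log10(corpus_counts[w] + 1) for w in words]
ratings = [familiarity[w] for w in words]

rho = spearman(log_freq, ratings)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = 0.90"
```

Under this reading, a corpus whose frequency counts track speakers' familiarity judgments more closely (a higher rho) would count as more representative. A rank-based measure is used in the sketch because raw frequency counts are heavily skewed.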

Original language: English (US)
Title of host publication: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
Editors: Daniel Tapias, Irene Russo, Olivier Hamon, Stelios Piperidis, Nicoletta Calzolari, Khalid Choukri, Joseph Mariani, Helene Mazo, Bente Maegaard, Jan Odijk, Mike Rosner
Publisher: European Language Resources Association (ELRA)
Pages: 421-427
Number of pages: 7
ISBN (Electronic): 2951740867, 9782951740860
State: Published - Jan 1 2010
Externally published: Yes
Event: 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta
Duration: May 17 2010 to May 23 2010

Fingerprint

Maltese
psycholinguistics
language
evaluation
resources
statistical method
rating
Representativeness
Specialized Corpora
linguistics
lack

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

Cite this

Francom, J., LaCross, A., & Ussishkin, A. (2010). How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese. In D. Tapias, I. Russo, O. Hamon, S. Piperidis, N. Calzolari, K. Choukri, J. Mariani, H. Mazo, B. Maegaard, J. Odijk, ... M. Rosner (Eds.), Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 (pp. 421-427). European Language Resources Association (ELRA).

@inproceedings{79c4119d5b394c0ebf1acf2853099f83,
title = "How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese",
abstract = "In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack a wide access to diverse language data.",
author = "Jerid Francom and Amy LaCross and Adam Ussishkin",
year = "2010",
month = "1",
day = "1",
language = "English (US)",
pages = "421--427",
editor = "Daniel Tapias and Irene Russo and Olivier Hamon and Stelios Piperidis and Nicoletta Calzolari and Khalid Choukri and Joseph Mariani and Helene Mazo and Bente Maegaard and Jan Odijk and Mike Rosner",
booktitle = "Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese

AU - Francom, Jerid

AU - LaCross, Amy

AU - Ussishkin, Adam

PY - 2010/1/1

Y1 - 2010/1/1

AB - In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack a wide access to diverse language data.

UR - http://www.scopus.com/inward/record.url?scp=84944679519&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84944679519&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84944679519

SP - 421

EP - 427

BT - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010

A2 - Tapias, Daniel

A2 - Russo, Irene

A2 - Hamon, Olivier

A2 - Piperidis, Stelios

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Mariani, Joseph

A2 - Mazo, Helene

A2 - Maegaard, Bente

A2 - Odijk, Jan

A2 - Rosner, Mike

PB - European Language Resources Association (ELRA)

ER -