The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis

Mihai Lintean; Cristian Moldovan; Vasile Rus; Danielle McNamara

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we investigate the impact of several local and global weighting schemes on the ability of Latent Semantic Analysis (LSA) to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of three local and three global weighting schemes across three standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term-frequency, and log-type. For global weighting, we relied on binary weighting, inverse document frequencies (IDF) collected from the English Wikipedia, and entropy, which is the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts, type frequency local weighting combined with either IDF or binary global weighting works best. For paragraph-level texts, log-type local weighting combined with binary global weighting works best. We also found that global weights have a greater impact on sentence-level similarity, as the local weight is undermined by the small size of such texts.
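The abstract names the weighting schemes but does not give their formulas. As a rough illustration only, the sketch below uses common textbook definitions (log weighting as log(1 + tf), IDF as log(N / df), entropy as 1 + Σ p log p / log N), which are assumptions and may differ from what the authors actually used. The function names, the toy corpus, and the two-dimensional term vectors are all hypothetical; real LSA vectors would come from an SVD of a large term-document matrix. A text is represented as the weighted sum of its term vectors, and two texts are compared by cosine similarity.

```python
import math
from collections import Counter

# Local weights: depend only on a term's raw count (tf) inside one text.
def local_binary(tf):
    return 1.0 if tf > 0 else 0.0

def local_tf(tf):
    return float(tf)

def local_log(tf):
    return math.log(1.0 + tf)  # assumed definition of "log-type"

# Global weights: depend on a term's distribution over a reference corpus
# (a list of token lists). These are textbook forms, assumed for illustration.
def global_binary(term, corpus):
    return 1.0

def global_idf(term, corpus):
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def global_entropy(term, corpus):
    counts = [doc.count(term) for doc in corpus]
    total = sum(counts)
    if total == 0 or len(corpus) < 2:
        return 0.0
    h = sum((c / total) * math.log(c / total) for c in counts if c)
    return 1.0 + h / math.log(len(corpus))

def text_vector(tokens, term_vectors, local, global_w, corpus):
    """Weighted sum of term vectors; unknown terms are skipped."""
    dim = len(next(iter(term_vectors.values())))
    vec = [0.0] * dim
    for term, tf in Counter(tokens).items():
        if term not in term_vectors:
            continue
        w = local(tf) * global_w(term, corpus)
        for k, x in enumerate(term_vectors[term]):
            vec[k] += w * x
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy data (hypothetical values, not real LSA output).
corpus = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "sat"]]
term_vectors = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.3], "sat": [0.2, 0.7],
    "mat": [0.1, 0.9], "log": [0.1, 0.8],
}

a = ["cat", "sat", "mat"]
b = ["dog", "sat", "log"]
for local in (local_binary, local_tf, local_log):
    for global_w in (global_binary, global_idf, global_entropy):
        sim = cosine(text_vector(a, term_vectors, local, global_w, corpus),
                     text_vector(b, term_vectors, local, global_w, corpus))
        print(local.__name__, global_w.__name__, round(sim, 3))
```

The double loop mirrors the "all possible combinations" design described in the abstract: nine local-global pairs, each producing one similarity score per text pair. In a sentence-length text most terms occur once, so the three local weights barely differ for such inputs, which is consistent with the abstract's remark that the local weight matters less for short texts.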

Original language	English (US)
Title of host publication	Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23
Pages	235-240
Number of pages	6
State	Published - 2010
Externally published	Yes
Event	23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23 - Daytona Beach, FL, United States Duration: May 19 2010 → May 21 2010

ASJC Scopus subject areas

Artificial Intelligence
Control and Systems Engineering

Cite this

Lintean, M., Moldovan, C., Rus, V., & McNamara, D. (2010). The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis. In Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23 (pp. 235-240). (Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23).

The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis. / Lintean, Mihai; Moldovan, Cristian; Rus, Vasile et al.
Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23. 2010. p. 235-240 (Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23).

Lintean, M, Moldovan, C, Rus, V & McNamara, D 2010, The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis. in Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23. Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23, pp. 235-240, 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23, Daytona Beach, FL, United States, 5/19/10.

Lintean M, Moldovan C, Rus V, McNamara D. The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis. In Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23. 2010. p. 235-240. (Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23).

Lintean, Mihai ; Moldovan, Cristian ; Rus, Vasile et al. / The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis. Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23. 2010. pp. 235-240 (Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23).

@inproceedings{32470843286d42c98902debec1d307aa,

title = "The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis",

abstract = "In this paper, we investigate the impact of several local and global weighting schemes on the ability of Latent Semantic Analysis (LSA) to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of three local and three global weighting schemes across three standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term-frequency, and log-type. For global weighting, we relied on binary weighting, inverse document frequencies (IDF) collected from the English Wikipedia, and entropy, which is the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts, type frequency local weighting combined with either IDF or binary global weighting works best. For paragraph-level texts, log-type local weighting combined with binary global weighting works best. We also found that global weights have a greater impact on sentence-level similarity, as the local weight is undermined by the small size of such texts.",

author = "Mihai Lintean and Cristian Moldovan and Vasile Rus and Danielle McNamara",

year = "2010",

language = "English (US)",

isbn = "9781577354475",

series = "Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23",

pages = "235--240",

booktitle = "Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23",

note = "23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23 ; Conference date: 19-05-2010 Through 21-05-2010",

}

TY - GEN

T1 - The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis

AU - Lintean, Mihai

AU - Moldovan, Cristian

AU - Rus, Vasile

AU - McNamara, Danielle

PY - 2010

Y1 - 2010

AB - In this paper, we investigate the impact of several local and global weighting schemes on the ability of Latent Semantic Analysis (LSA) to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of three local and three global weighting schemes across three standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term-frequency, and log-type. For global weighting, we relied on binary weighting, inverse document frequencies (IDF) collected from the English Wikipedia, and entropy, which is the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts, type frequency local weighting combined with either IDF or binary global weighting works best. For paragraph-level texts, log-type local weighting combined with binary global weighting works best. We also found that global weights have a greater impact on sentence-level similarity, as the local weight is undermined by the small size of such texts.

UR - http://www.scopus.com/inward/record.url?scp=77957867916&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77957867916&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:77957867916

SN - 9781577354475

T3 - Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23

SP - 235

EP - 240

BT - Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23

T2 - 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23

Y2 - 19 May 2010 through 21 May 2010

ER -
