TY - GEN
T1 - STIF
T2 - 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021
AU - Mousavi, Maryam
AU - Steiner, Elena
AU - Corman, Steven
AU - Ruston, Scott
AU - Weber, Dylan
AU - Davulcu, Hasan
N1 - Funding Information:
This research was supported by a grant from the U.S. Office of Naval Research (N00014-18-1-2692). NLPIR 2021, December 17–20, 2021, Sanya, Hainan Province, China.
Publisher Copyright:
© 2021 ACM.
PY - 2021/12/17
Y1 - 2021/12/17
N2 - In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010 and 2020. We extracted 32,429 noun phrases (NPs) and split them into two categories: General/Ambiguous phrases, which might appear under any topic, and Topical/Non-Ambiguous phrases, which pertain to a topic's specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups, using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and induced a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, coders mapped all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures. We determined that GloVe embeddings with K-Means achieved the highest performance (74% purity) on this real-world dataset.
KW - Taxonomy induction
KW - Text categorization
KW - Topic detection
UR - http://www.scopus.com/inward/record.url?scp=85127293230&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127293230&partnerID=8YFLogxK
U2 - 10.1145/3508230.3508247
DO - 10.1145/3508230.3508247
M3 - Conference contribution
AN - SCOPUS:85127293230
T3 - ACM International Conference Proceeding Series
SP - 115
EP - 123
BT - 2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021
PB - Association for Computing Machinery
Y2 - 17 December 2021 through 20 December 2021
ER -