STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering

Maryam Mousavi; Elena Steiner; Steven Corman; Scott Ruston; Dylan Weber; Hasan Davulcu

doi:10.1145/3508230.3508247

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010 and 2020. We extracted 32,429 noun phrases (NPs) and split them into two categories: general/ambiguous phrases, which might appear under any topic, and topical/non-ambiguous phrases, which pertain to the specifics of a topic. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and induced a two-level taxonomy alongside its codebook. After achieving intercoder reliability of 94%, the coders mapped all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures. We determined that GloVe embeddings with K-Means achieved the highest performance (i.e., 74% purity) on this real-world dataset.
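
The purity figure reported in the abstract can be illustrated with a small sketch. This is a minimal example, assuming the standard definition of cluster purity: each cluster is credited with the count of its most frequent gold label, and the credited counts are summed and divided by the total number of items. The function name `purity`, the toy cluster assignments, and the topic labels below are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Return the fraction of items whose gold label equals the
    majority gold label of the cluster they were assigned to."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each gold label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        label_counts[cluster][label] += 1

    # Credit each cluster with the size of its largest gold-label group.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: 8 phrases, 3 clusters, 3 gold topics.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
gold = ["military", "military", "economy",
        "economy", "economy",
        "politics", "politics", "military"]

print(f"purity = {purity(clusters, gold):.2f}")  # prints "purity = 0.75"
```

In the toy data each cluster contributes two correctly grouped phrases, so 6 of 8 phrases agree with their cluster's majority label. Purity rewards many small clusters (it reaches 1.0 when every item is its own cluster), which is one reason it is usually read alongside other extrinsic and intrinsic measures.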

Original language	English (US)
Title of host publication	2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021
Publisher	Association for Computing Machinery
Pages	115-123
Number of pages	9
ISBN (Electronic)	9781450387354
DOIs	https://doi.org/10.1145/3508230.3508247
State	Published - Dec 17 2021
Event	5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021 - Virtual, Online, China
Duration	Dec 17 2021 → Dec 20 2021

Publication series

Name	ACM International Conference Proceeding Series

Conference

Conference	5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021
Country/Territory	China
City	Virtual, Online
Period	12/17/21 → 12/20/21

Keywords

Taxonomy induction
Text categorization
Topic detection

ASJC Scopus subject areas

Software
Human-Computer Interaction
Computer Vision and Pattern Recognition
Computer Networks and Communications

Access to Document

10.1145/3508230.3508247

Cite this

Mousavi, M., Steiner, E., Corman, S., Ruston, S., Weber, D., & Davulcu, H. (2021). STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering. In 2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021 (pp. 115-123). (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3508230.3508247

STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering. / Mousavi, Maryam; Steiner, Elena; Corman, Steven et al.
2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021. Association for Computing Machinery, 2021. p. 115-123 (ACM International Conference Proceeding Series).

Mousavi, M, Steiner, E, Corman, S, Ruston, S, Weber, D & Davulcu, H 2021, STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering. in 2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021. ACM International Conference Proceeding Series, Association for Computing Machinery, pp. 115-123, 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021, Virtual, Online, China, 12/17/21. https://doi.org/10.1145/3508230.3508247

Mousavi M, Steiner E, Corman S, Ruston S, Weber D, Davulcu H. STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering. In 2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021. Association for Computing Machinery. 2021. p. 115-123. (ACM International Conference Proceeding Series). doi: 10.1145/3508230.3508247

@inproceedings{9a27598809e94e34bee2948afaeb4de4,

title = "STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering",

abstract = "In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010-2020. We extracted 32,429 noun phrases (NPs) and proceeded to split these NPs into a pair of categories: General/Ambiguous phrases, which might appear under any topic vs. Topical/Non-Ambiguous phrases, which pertain to a topic's specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and inducted a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, coders proceeded to map all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures. We determined that GloVe embeddings with K-Means achieved the highest performance (i.e. 74% purity) for this real-world dataset.",

keywords = "Taxonomy induction, Text categorization, Topic detection",

author = "Maryam Mousavi and Elena Steiner and Steven Corman and Scott Ruston and Dylan Weber and Hasan Davulcu",

note = "Funding Information: This research was supported by a grant from the U.S. Office of Naval Research (N00014-18-1-2692). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. NLPIR 2021, December 17–20, 2021, Sanya, Hainan Province, China {\textcopyright} 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8735-4/21/12...$15.00 https://doi.org/10.1145/3508230.3508247 Publisher Copyright: {\textcopyright} 2021 ACM.; 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021 ; Conference date: 17-12-2021 Through 20-12-2021",

year = "2021",

month = dec,

day = "17",

doi = "10.1145/3508230.3508247",

language = "English (US)",

series = "ACM International Conference Proceeding Series",

publisher = "Association for Computing Machinery",

pages = "115--123",

booktitle = "2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021",

}

TY - GEN

T1 - STIF

T2 - 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021

AU - Mousavi, Maryam

AU - Steiner, Elena

AU - Corman, Steven

AU - Ruston, Scott

AU - Weber, Dylan

AU - Davulcu, Hasan

N1 - Funding Information: This research was supported by a grant from the U.S. Office of Naval Research (N00014-18-1-2692). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. NLPIR 2021, December 17–20, 2021, Sanya, Hainan Province, China © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8735-4/21/12...$15.00 https://doi.org/10.1145/3508230.3508247 Publisher Copyright: © 2021 ACM.

PY - 2021/12/17

Y1 - 2021/12/17

N2 - In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010-2020. We extracted 32,429 noun phrases (NPs) and proceeded to split these NPs into a pair of categories: General/Ambiguous phrases, which might appear under any topic vs. Topical/Non-Ambiguous phrases, which pertain to a topic's specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and inducted a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, coders proceeded to map all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures. We determined that GloVe embeddings with K-Means achieved the highest performance (i.e. 74% purity) for this real-world dataset.

KW - Taxonomy induction

KW - Text categorization

KW - Topic detection

UR - http://www.scopus.com/inward/record.url?scp=85127293230&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85127293230&partnerID=8YFLogxK

U2 - 10.1145/3508230.3508247

DO - 10.1145/3508230.3508247

M3 - Conference contribution

AN - SCOPUS:85127293230

T3 - ACM International Conference Proceeding Series

SP - 115

EP - 123

BT - 2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021

PB - Association for Computing Machinery

Y2 - 17 December 2021 through 20 December 2021

ER -
