STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering

Maryam Mousavi, Elena Steiner, Steven Corman, Scott Ruston, Dylan Weber, Hasan Davulcu

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Abstract

In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010 and 2020. We extracted 32,429 noun phrases (NPs) and split them into two categories: general/ambiguous phrases, which might appear under any topic, and topical/non-ambiguous phrases, which pertain to a topic's specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups, a number chosen with the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and induced a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, coders proceeded to map all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures. We determined that GloVe embeddings with K-Means achieved the highest performance (i.e., 74% purity) on this real-world dataset.
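
The purity figure quoted in the abstract can be illustrated with a minimal sketch. It assumes the common definition of cluster purity (each cluster is credited with the size of its majority gold label, and the total is divided by the number of items); the abstract does not spell out the exact formula the authors used. The function name `purity` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Return the fraction of items that carry their cluster's majority gold label."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty input")

    # Count how often each gold label occurs inside each cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        per_cluster[cluster][label] += 1

    # Each cluster contributes the size of its most frequent gold label.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy data invented for illustration only (not from the paper's corpus).
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
gold = ["politics", "politics", "economy", "economy", "economy",
        "military", "military", "politics"]

score = purity(clusters, gold)
print(f"purity = {score:.3f}")
```

In this toy case each of the three clusters has two items with the majority label, so six of the eight phrases count as correctly grouped. Purity rises toward 1.0 as the number of clusters grows, so it is usually read alongside other measures when comparing clusterings with different cluster counts.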

Original language: English (US)
Title of host publication: 2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021
Publisher: Association for Computing Machinery
Pages: 115-123
Number of pages: 9
ISBN (Electronic): 9781450387354
State: Published - Dec 17 2021
Event: 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021 - Virtual, Online, China
Duration: Dec 17 2021 to Dec 20 2021

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021
Country/Territory: China
City: Virtual, Online
Period: 12/17/21 to 12/20/21

Keywords

  • Taxonomy induction
  • Text categorization
  • Topic detection

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
