Extracting unknown words from Sina Weibo via data clustering

Kai Lei, Weiyang Zhang, Kai Zhang, Kuai Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Sina Weibo, a Twitter-like microblogging site attracting over 240 million monthly active users to tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. As many users create new and innovative words in their tweets and comments, it is necessary to extract these emerging words, which do not exist in today's Chinese vocabulary or dictionaries. Towards this end, this paper proposes a novel method, based on data clustering of Weibo users and tweets, for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply hierarchical clustering to divide Weibo data into distinct groups (e.g., sports, news stories, and movies) before extraction. Compared with extraction from unclustered Weibo data, our experimental results demonstrate the benefits of the proposed data clustering scheme for improving the recall and accuracy of extracting unknown Chinese words from tweets and comments.
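
The abstract does not state how user similarity is measured, which linkage the hierarchical clustering uses, or how candidate words are scored, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes each user is described by a hypothetical set of feature tokens, compares users with Jaccard similarity, merges clusters bottom-up with average linkage until no pair reaches a chosen threshold, and then treats frequent character n-grams that are absent from a known-word list as candidate unknown words. The names `jaccard`, `cluster_users`, `candidate_words`, `threshold`, and `min_count` are all illustrative.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_users(features, threshold):
    """Agglomerative (bottom-up) clustering with average linkage.

    features: dict mapping user id -> set of feature tokens (an assumption;
    the abstract does not say how users are represented).
    Merging stops once no pair of clusters has average similarity >= threshold.
    """
    clusters = [[user] for user in sorted(features)]

    def linkage(c1, c2):
        # Average pairwise similarity between members of the two clusters.
        sims = [jaccard(features[u], features[v]) for u in c1 for v in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters (i < j by construction).
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def candidate_words(texts, known_words, n=2, min_count=2):
    """Character n-grams seen at least min_count times and not in known_words.

    A stand-in for the extraction step, chosen only for illustration.
    """
    counts = Counter(
        text[k:k + n] for text in texts for k in range(len(text) - n + 1)
    )
    return {
        gram for gram, count in counts.items()
        if count >= min_count and gram not in known_words
    }


# Toy data: two users with sports-like features, two with film-like features.
features = {
    "u1": {"football", "nba"},
    "u2": {"football", "tennis"},
    "u3": {"film", "actor"},
    "u4": {"film", "cinema"},
}
groups = cluster_users(features, threshold=0.3)
```

In this sketch, extraction would then be run separately on the tweets posted by each group, e.g. `candidate_words(texts_for_group, known_words)`, so that n-gram counts are not diluted by unrelated topics; whether the paper works this way in detail cannot be determined from the abstract alone.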

Original language: English (US)
Title of host publication: 2015 IEEE International Conference on Communications, ICC 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1182-1187
Number of pages: 6
ISBN (Electronic): 9781467364324
State: Published - Sep 9 2015
Event: IEEE International Conference on Communications, ICC 2015 - London, United Kingdom
Duration: Jun 8 2015 → Jun 12 2015

Publication series

Name: IEEE International Conference on Communications
Volume: 2015-September
ISSN (Print): 1550-3607

Other

Other: IEEE International Conference on Communications, ICC 2015
Country/Territory: United Kingdom
City: London
Period: 6/8/15 → 6/12/15

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Electrical and Electronic Engineering
