TY - GEN
T1 - Extracting unknown words from Sina Weibo via data clustering
AU - Lei, Kai
AU - Zhang, Weiyang
AU - Zhang, Kai
AU - Xu, Kuai
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/9/9
Y1 - 2015/9/9
AB - Sina Weibo, a Twitter-like microblogging site with over 240 million monthly active users who tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. Because many users coin new words in their tweets and comments, it is necessary to extract these emerging words, which do not yet appear in existing Chinese vocabularies or dictionaries. To this end, this paper proposes a novel method, based on clustering Weibo users and tweets, for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply hierarchical clustering to divide Weibo data into distinct groups (e.g., sports, news stories, movies) before extraction. Compared with extraction from unclustered Weibo data, our experimental results demonstrate that the proposed data clustering scheme improves the recall and accuracy of extracting unknown Chinese words from tweets and comments.
UR - http://www.scopus.com/inward/record.url?scp=84953729067&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84953729067&partnerID=8YFLogxK
U2 - 10.1109/ICC.2015.7248483
DO - 10.1109/ICC.2015.7248483
M3 - Conference contribution
AN - SCOPUS:84953729067
T3 - IEEE International Conference on Communications
SP - 1182
EP - 1187
BT - 2015 IEEE International Conference on Communications, ICC 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE International Conference on Communications, ICC 2015
Y2 - 8 June 2015 through 12 June 2015
ER -