Abstract

Social media websites allow users to exchange short texts, such as tweets on microblogs and status updates in friendship networks. The limited length of these texts, together with pervasive abbreviations and newly coined acronyms and words, exacerbates the problems of synonymy and polysemy and poses new challenges for data mining applications such as text clustering and classification. To address these issues, we examine some potential causes and devise an efficient approach that enriches the data representation by using machine translation to add features from different languages. We then propose a novel framework that performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively for effectiveness on two social media datasets, from Facebook and Twitter. Having observed a significant performance improvement, we further investigate the factors that may contribute to it.
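The general idea in the abstract (add features from translated copies of each text, then reduce the combined feature space by matrix factorization) can be illustrated with a small sketch. This is not the paper's framework: it assumes plain bag-of-words counts, simple column-wise concatenation of the per-language matrices, and a standard multiplicative-update non-negative matrix factorization as a stand-in for the authors' integration method. The toy texts, the hand-written "translations", and the helper names `bag_of_words` and `nmf` are all hypothetical.

```python
import numpy as np


def bag_of_words(docs):
    """Return (count matrix, vocabulary) for whitespace-tokenized documents."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for tok in d.lower().split():
            X[i, index[tok]] += 1.0
    return X, vocab


def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: find W, H >= 0 with X approximately W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy short texts (hypothetical data, not from the paper's datasets).
english = [
    "gr8 football match tonight",
    "football game was great",
    "new phone battery dies fast",
    "phone battery life is bad",
]
# Hand-written stand-ins for machine-translation output (assumption).
french = [
    "super match de football ce soir",
    "le match de football etait super",
    "la batterie du nouveau telephone se vide vite",
    "la batterie du telephone est mauvaise",
]

X_en, vocab_en = bag_of_words(english)
X_fr, vocab_fr = bag_of_words(french)

# Enriched representation: one row per text, columns from both languages.
X = np.hstack([X_en, X_fr])

# Reduce the enriched feature space to 2 latent dimensions.
W, H = nmf(X, rank=2)

# A crude cluster assignment: the dominant latent dimension of each text.
labels = W.argmax(axis=1)
print(X.shape, W.shape, labels)
```

Each row of `W` is a low-dimensional representation of one text. In this sketch the translated columns simply give texts more chances to share features (for example, "gr8" and "great" have no overlap in English, while both toy translations contain "super").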

Original language: English (US)
Pages (from-to): 88-101
Number of pages: 14
Journal: Frontiers of Computer Science in China
Volume: 6
Issue number: 1
DOI: 10.1007/s11704-011-1167-7
State: Published - Feb 2012

Keywords

  • matrix factorization
  • multi-language knowledge
  • short texts
  • social media
  • text representation

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Enriching short text representation in microblog for clustering. / Tang, Jiliang; Wang, Xufei; Gao, Huiji; Hu, Xia; Liu, Huan.

In: Frontiers of Computer Science in China, Vol. 6, No. 1, 02.2012, p. 88-101.

Research output: Contribution to journal › Article

@article{a6a6ba691a954d1ab9548a32f7538e5a,
title = "Enriching short text representation in microblog for clustering",
abstract = "Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbreviations, and coined acronyms and words exacerbate the problems of synonymy and polysemy, and bring about new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investigate potential factors that contribute to the improved performance.",
keywords = "matrix factorization, multi-language knowledge, short texts, social media, text representation",
author = "Jiliang Tang and Xufei Wang and Huiji Gao and Xia Hu and Huan Liu",
year = "2012",
month = "2",
doi = "10.1007/s11704-011-1167-7",
language = "English (US)",
volume = "6",
pages = "88--101",
journal = "Frontiers of Computer Science",
issn = "2095-2228",
publisher = "Springer Science + Business Media",
number = "1",

}

TY - JOUR

T1 - Enriching short text representation in microblog for clustering

AU - Tang, Jiliang

AU - Wang, Xufei

AU - Gao, Huiji

AU - Hu, Xia

AU - Liu, Huan

PY - 2012/2

Y1 - 2012/2

N2 - Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbreviations, and coined acronyms and words exacerbate the problems of synonymy and polysemy, and bring about new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investigate potential factors that contribute to the improved performance.

KW - matrix factorization

KW - multi-language knowledge

KW - short texts

KW - social media

KW - text representation

UR - http://www.scopus.com/inward/record.url?scp=84863014790&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863014790&partnerID=8YFLogxK

U2 - 10.1007/s11704-011-1167-7

DO - 10.1007/s11704-011-1167-7

M3 - Article

AN - SCOPUS:84863014790

VL - 6

SP - 88

EP - 101

JO - Frontiers of Computer Science

JF - Frontiers of Computer Science

SN - 2095-2228

IS - 1

ER -