TY - GEN
T1 - WiiCluster
T2 - 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014
AU - Zhang, Kezun
AU - Xiao, Yanghua
AU - Tong, Hanghang
AU - Wang, Haixun
AU - Wang, Wei
PY - 2014/11/3
Y1 - 2014/11/3
N2 - Wikipedia has become one of the best sources for creating and sharing a massive volume of human knowledge. Much effort has been devoted to generating and enriching the structured data by automatic information extraction from unstructured text in Wikipedia. Most, if not all, of the existing work share the same paradigm, that is, starting with information extraction over the unstructured text data, followed by supervised machine learning. Although remarkable progresses have been made, this paradigm has its own limitations in terms of effectiveness, scalability as well as the high labeling cost. We present WiiCluster, a scalable platform for automatically generating infobox for articles in Wikipedia. The heart of our system is an effective cluster-then-label algorithm over a rich set of semi-structured data in Wikipedia articles: linked entities. It is totally unsupervised and thus does not require any human label. It is effective in generating semantically meaningful summarization for Wikipedia articles. We further propose a cluster-reuse algorithm to scale up our system. Overall, our WiiCluster is able to generate nearly 10 million new facts. We also develop a web-based platform to demonstrate WiiCluster, which enables the users to access and browse the generated knowledge.
AB - Wikipedia has become one of the best sources for creating and sharing a massive volume of human knowledge. Much effort has been devoted to generating and enriching the structured data by automatic information extraction from unstructured text in Wikipedia. Most, if not all, of the existing work share the same paradigm, that is, starting with information extraction over the unstructured text data, followed by supervised machine learning. Although remarkable progresses have been made, this paradigm has its own limitations in terms of effectiveness, scalability as well as the high labeling cost. We present WiiCluster, a scalable platform for automatically generating infobox for articles in Wikipedia. The heart of our system is an effective cluster-then-label algorithm over a rich set of semi-structured data in Wikipedia articles: linked entities. It is totally unsupervised and thus does not require any human label. It is effective in generating semantically meaningful summarization for Wikipedia articles. We further propose a cluster-reuse algorithm to scale up our system. Overall, our WiiCluster is able to generate nearly 10 million new facts. We also develop a web-based platform to demonstrate WiiCluster, which enables the users to access and browse the generated knowledge.
KW - Cluster visualization
KW - Knowledge extraction
KW - Summarization
UR - http://www.scopus.com/inward/record.url?scp=84937555263&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84937555263&partnerID=8YFLogxK
U2 - 10.1145/2661829.2661840
DO - 10.1145/2661829.2661840
M3 - Conference contribution
AN - SCOPUS:84937555263
T3 - CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
SP - 2033
EP - 2035
BT - CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
Y2 - 3 November 2014 through 7 November 2014
ER -