Abstract

Wikipedia has become one of the best sources for creating and sharing a massive volume of human knowledge. Much effort has been devoted to generating and enriching structured data by automatic information extraction from unstructured text in Wikipedia. Most, if not all, of the existing work shares the same paradigm: information extraction over unstructured text, followed by supervised machine learning. Although remarkable progress has been made, this paradigm is limited in effectiveness and scalability and incurs a high labeling cost. We present WiiCluster, a scalable platform for automatically generating infoboxes for Wikipedia articles. The heart of our system is an effective cluster-then-label algorithm over a rich set of semi-structured data in Wikipedia articles: linked entities. It is fully unsupervised and thus requires no human labels. It is effective in generating semantically meaningful summaries of Wikipedia articles. We further propose a cluster-reuse algorithm to scale up our system. Overall, WiiCluster is able to generate nearly 10 million new facts. We also develop a web-based platform to demonstrate WiiCluster, which enables users to access and browse the generated knowledge.

Language: English (US)
Title of host publication: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery, Inc
Pages: 2033-2035
Number of pages: 3
ISBN (Print): 9781450325981
DOI: https://doi.org/10.1145/2661829.2661840
State: Published - Nov 3 2014
Event: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014 - Shanghai, China
Duration: Nov 3 2014 to Nov 7 2014

Fingerprint

Labels
Labeling
Learning systems
Scalability
Costs
Wikipedia
Information extraction
Paradigm

Keywords

  • Cluster visualization
  • Knowledge extraction
  • Summarization

ASJC Scopus subject areas

  • Information Systems and Management
  • Computer Science Applications
  • Information Systems

Cite this

Zhang, K., Xiao, Y., Tong, H., Wang, H., & Wang, W. (2014). WiiCluster: A platform for Wikipedia infobox generation. In CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management (pp. 2033-2035). Association for Computing Machinery, Inc. https://doi.org/10.1145/2661829.2661840

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

@inproceedings{8439d2bd96d84570a62cc9f2cab731a0,
title = "WiiCluster: A platform for Wikipedia infobox generation",
abstract = "Wikipedia has become one of the best sources for creating and sharing a massive volume of human knowledge. Much effort has been devoted to generating and enriching structured data by automatic information extraction from unstructured text in Wikipedia. Most, if not all, of the existing work shares the same paradigm: information extraction over unstructured text, followed by supervised machine learning. Although remarkable progress has been made, this paradigm is limited in effectiveness and scalability and incurs a high labeling cost. We present WiiCluster, a scalable platform for automatically generating infoboxes for Wikipedia articles. The heart of our system is an effective cluster-then-label algorithm over a rich set of semi-structured data in Wikipedia articles: linked entities. It is fully unsupervised and thus requires no human labels. It is effective in generating semantically meaningful summaries of Wikipedia articles. We further propose a cluster-reuse algorithm to scale up our system. Overall, WiiCluster is able to generate nearly 10 million new facts. We also develop a web-based platform to demonstrate WiiCluster, which enables users to access and browse the generated knowledge.",
keywords = "Cluster visualization, Knowledge extraction, Summarization",
author = "Kezun Zhang and Yanghua Xiao and Hanghang Tong and Haixun Wang and Wei Wang",
year = "2014",
month = "11",
day = "3",
doi = "10.1145/2661829.2661840",
language = "English (US)",
isbn = "9781450325981",
pages = "2033--2035",
booktitle = "CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management",
publisher = "Association for Computing Machinery, Inc",

}

TY  - GEN
T1  - WiiCluster
T2  - A platform for Wikipedia infobox generation
AU  - Zhang, Kezun
AU  - Xiao, Yanghua
AU  - Tong, Hanghang
AU  - Wang, Haixun
AU  - Wang, Wei
PY  - 2014/11/3
Y1  - 2014/11/3
AB  - Wikipedia has become one of the best sources for creating and sharing a massive volume of human knowledge. Much effort has been devoted to generating and enriching structured data by automatic information extraction from unstructured text in Wikipedia. Most, if not all, of the existing work shares the same paradigm: information extraction over unstructured text, followed by supervised machine learning. Although remarkable progress has been made, this paradigm is limited in effectiveness and scalability and incurs a high labeling cost. We present WiiCluster, a scalable platform for automatically generating infoboxes for Wikipedia articles. The heart of our system is an effective cluster-then-label algorithm over a rich set of semi-structured data in Wikipedia articles: linked entities. It is fully unsupervised and thus requires no human labels. It is effective in generating semantically meaningful summaries of Wikipedia articles. We further propose a cluster-reuse algorithm to scale up our system. Overall, WiiCluster is able to generate nearly 10 million new facts. We also develop a web-based platform to demonstrate WiiCluster, which enables users to access and browse the generated knowledge.
KW  - Cluster visualization
KW  - Knowledge extraction
KW  - Summarization
UR  - http://www.scopus.com/inward/record.url?scp=84937555263&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84937555263&partnerID=8YFLogxK
U2  - 10.1145/2661829.2661840
DO  - 10.1145/2661829.2661840
M3  - Conference contribution
SN  - 9781450325981
SP  - 2033
EP  - 2035
BT  - CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
PB  - Association for Computing Machinery, Inc
ER  -