KATARA: Reliable data cleaning with knowledge bases and crowdsourcing

Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye

Research output: Contribution to journal › Article

9 Citations (Scopus)

Abstract

Data cleaning with guaranteed reliability is hard to achieve without accessing external sources, since the truth is not necessarily discoverable from the data at hand. Furthermore, even when external sources are available, mainly knowledge bases (KBs) and humans, leveraging them effectively still poses many challenges, such as aligning heterogeneous data sources and decomposing a complex task into simpler units that humans can handle. We present Katara, a novel end-to-end data cleaning system powered by knowledge bases and crowdsourcing. Given a table, a KB, and a crowd, Katara (i) interprets the table semantics with respect to the given KB; (ii) identifies correct and wrong data; and (iii) generates top-k possible repairs for the wrong data. Users will have the opportunity to experience the following features of Katara: (1) Easy specification: users can define a Katara job with a browser-based specification; (2) Pattern validation: users can help the system resolve the ambiguity among the different table patterns (i.e., table semantics) discovered by Katara; (3) Data annotation: users can play the role of internal crowd workers, helping Katara annotate data. Moreover, Katara will visualize the annotated data as correct data validated by the KB, correct data jointly validated by the KB and the crowd, or erroneous tuples along with their possible repairs.
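
The three steps named in the abstract can be illustrated with a toy sketch. This is not Katara's actual algorithm or API: the in-memory KB (`KB_TYPES`, `KB_FACTS`), the two-column table, the `ask_crowd` callback, and the majority-vote and edit-cost heuristics are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy KB: entity types plus (subject, relation, object) facts.
KB_TYPES = {
    "Italy": "country", "Spain": "country", "France": "country",
    "Rome": "city", "Madrid": "city", "Paris": "city",
}
KB_FACTS = {
    ("Italy", "hasCapital", "Rome"),
    ("Spain", "hasCapital", "Madrid"),
    ("France", "hasCapital", "Paris"),
}


def discover_pattern(table):
    """Step (i): guess a type per column and one relation between the
    two columns by majority vote over what the toy KB already knows."""
    col_types = []
    for c in range(len(table[0])):
        votes = Counter(KB_TYPES[row[c]] for row in table if row[c] in KB_TYPES)
        col_types.append(votes.most_common(1)[0][0] if votes else None)
    rel_votes = Counter(
        rel for (s, rel, o) in KB_FACTS for row in table if (row[0], row[1]) == (s, o)
    )
    relation = rel_votes.most_common(1)[0][0] if rel_votes else None
    return col_types, relation


def annotate(table, relation, ask_crowd):
    """Step (ii): label each tuple as validated by the KB, confirmed with
    the help of the crowd, or erroneous. `ask_crowd` is a hypothetical
    callback that returns True when a human confirms the tuple."""
    labels = []
    for row in table:
        if (row[0], relation, row[1]) in KB_FACTS:
            labels.append("kb")
        elif ask_crowd(row):
            labels.append("kb+crowd")
        else:
            labels.append("error")
    return labels


def top_k_repairs(row, relation, k=2):
    """Step (iii): rank KB facts for the relation by how many cells of the
    tuple would have to change, and return the k cheapest candidates."""
    candidates = [(s, o) for (s, rel, o) in KB_FACTS if rel == relation]

    def cost(candidate):
        return sum(x != y for x, y in zip(row, candidate))

    return sorted(candidates, key=lambda cand: (cost(cand), cand))[:k]


table = [("Italy", "Rome"), ("Spain", "Madrid"), ("Germany", "Berlin"), ("Italy", "Madrid")]
col_types, relation = discover_pattern(table)
# The simulated crowd confirms only the tuple that the toy KB does not cover.
labels = annotate(table, relation, ask_crowd=lambda row: row == ("Germany", "Berlin"))
print(col_types, relation)   # ['country', 'city'] hasCapital
print(labels)                # ['kb', 'kb', 'kb+crowd', 'error']
print(top_k_repairs(("Italy", "Madrid"), relation))
# [('Italy', 'Rome'), ('Spain', 'Madrid')]
```

In this sketch the erroneous tuple receives two equally cheap candidate repairs, each changing a single cell, which is why a ranked list of repairs rather than a single fix is returned.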

Original language: English (US)
Pages (from-to): 1952-1955
Number of pages: 4
Journal: Unknown Journal
Volume: 8
Issue number: 12
State: Published - 2015
Externally published: Yes

Fingerprint

Crowdsourcing
Knowledge Bases
Semantics
Cleaning
Repair
Specifications
Information Storage and Retrieval
Information Systems
Hand

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Chu, X., Morcos, J., Ilyas, I. F., Ouzzani, M., Papotti, P., Tang, N., & Ye, Y. (2015). KATARA: Reliable data cleaning with knowledge bases and crowdsourcing. Unknown Journal, 8(12), 1952-1955.

@article{e27146199f3745898d48cad0360fc13a,
title = "KATARA: Reliable data cleaning with knowledge bases and crowdsourcing",
author = "Xu Chu and John Morcos and Ilyas, {Ihab F.} and Mourad Ouzzani and Paolo Papotti and Nan Tang and Yin Ye",
year = "2015",
language = "English (US)",
volume = "8",
pages = "1952--1955",
journal = "Scanning Electron Microscopy",
issn = "0586-5581",
publisher = "Scanning Microscopy International",
number = "12",

}

TY - JOUR

T1 - KATARA

T2 - Reliable data cleaning with knowledge bases and crowdsourcing

AU - Chu, Xu

AU - Morcos, John

AU - Ilyas, Ihab F.

AU - Ouzzani, Mourad

AU - Papotti, Paolo

AU - Tang, Nan

AU - Ye, Yin

PY - 2015

Y1 - 2015

UR - http://www.scopus.com/inward/record.url?scp=84953868998&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84953868998&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84953868998

VL - 8

SP - 1952

EP - 1955

JO - Scanning Electron Microscopy

JF - Scanning Electron Microscopy

SN - 0586-5581

IS - 12

ER -