A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Venkata Vamsikrishna Meduri; Lucian Popa; Prithviraj Sen; Mohamed Sarwat

doi:10.1145/3318464.3380597

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

39 Scopus citations

Abstract

Entity Matching (EM) is a core data cleaning task that aims to identify different mentions of the same real-world entity. Active learning is one way to address the scarcity of labeled data in practice: it dynamically collects the examples that need to be labeled by an Oracle and refines the learned model (classifier) on them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to which active learning combinations work well for EM. To this end, we perform comprehensive experiments on publicly available EM datasets from the product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, number of labels, and example selection latency. Our most surprising result is that active learning with fewer labels can learn a classifier of quality comparable to that of supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10× without affecting the quality of the model.
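
The loop the abstract describes (select examples, ask an Oracle to label them, refine the classifier) can be sketched roughly as follows. This is a minimal illustration and not the paper's framework: the logistic-regression learner, the closest-to-0.5 uncertainty selector, the synthetic similarity features, the made-up ground truth, and all function names are assumptions introduced for the example.

```python
import math
import random


def predict_proba(w, x):
    """Probability that a record pair is a match under weights w (last entry is the bias)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=300, lr=1.0):
    """Fit a tiny logistic-regression learner with batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def select_most_uncertain(w, candidates, batch_size):
    """Example selector: return positions of the candidates scored closest to 0.5."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: abs(predict_proba(w, candidates[i]) - 0.5))
    return ranked[:batch_size]


def active_learning_loop(pairs, oracle, seed_idx, rounds=5, batch_size=4):
    """Alternate between refitting the learner and querying the oracle."""
    labeled = {i: oracle(i) for i in seed_idx}

    def fit():
        idx = sorted(labeled)
        return train_logreg([pairs[i] for i in idx], [labeled[i] for i in idx])

    for _ in range(rounds):
        w = fit()
        unlabeled = [i for i in range(len(pairs)) if i not in labeled]
        picks = select_most_uncertain(w, [pairs[i] for i in unlabeled], batch_size)
        for p in picks:
            labeled[unlabeled[p]] = oracle(unlabeled[p])
    return fit(), labeled  # final refit on every label collected


def run_demo():
    rng = random.Random(0)
    # Hypothetical data: each record pair is two similarity scores in [0, 1].
    pairs = [(rng.random(), rng.random()) for _ in range(200)]
    truth = [1 if a + b > 1.2 else 0 for a, b in pairs]  # made-up ground truth
    seed_idx = [truth.index(1), truth.index(0)]  # one match, one non-match
    # A "perfect oracle" here simply looks up the ground-truth label.
    w, labeled = active_learning_loop(pairs, lambda i: truth[i], seed_idx)
    preds = [1 if predict_proba(w, x) >= 0.5 else 0 for x in pairs]
    accuracy = sum(p == t for p, t in zip(preds, truth)) / len(truth)
    return w, labeled, truth, accuracy


if __name__ == "__main__":
    w, labeled, truth, accuracy = run_demo()
    print(f"labels used: {len(labeled)}, accuracy on all pairs: {accuracy:.2f}")
```

With the default settings the loop spends 2 seed labels plus 5 rounds of 4 queries. Swapping `train_logreg` or `select_most_uncertain` for another learner or selector is the kind of combination the framework is described as supporting, though the actual interfaces there may differ.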

Original language	English (US)
Title of host publication	SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	1133-1147
Number of pages	15
ISBN (Electronic)	9781450367356
DOIs	https://doi.org/10.1145/3318464.3380597
State	Published - Jun 14 2020
Event	2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020 - Portland, United States
Duration	Jun 14 2020 → Jun 19 2020

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Conference

Conference	2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
Country/Territory	United States
City	Portland
Period	6/14/20 → 6/19/20

Keywords

SVM
blocking dimensions
ensembles
entity matching
example selectors
learner-agnostic selectors
learner-aware selectors
margin
neural networks
perfect and noisy oracles
query by committee
random forests
rule-based models
unified active learning

ASJC Scopus subject areas

Software
Information Systems

Access to Document

10.1145/3318464.3380597

Cite this

Meduri, V. V., Popa, L., Sen, P., & Sarwat, M. (2020). A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching. In SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (pp. 1133-1147). (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3318464.3380597

@inproceedings{d880a1cd81f84b4eacfae154d7a6823c,

title = "A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching",

abstract = "Entity Matching (EM) is a core data cleaning task, aiming to identify different mentions of the same real-world entity. Active learning is one way to address the challenge of scarce labeled data in practice, by dynamically collecting the necessary examples to be labeled by an Oracle and refining the learned model (classifier) upon them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to what active learning combinations will work well for EM. Towards this, we perform comprehensive experiments on publicly available EM datasets from product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, #labels and example selection latencies. Our most surprising result finds that active learning with fewer labels can learn a classifier of comparable quality as supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10× without affecting the quality of the model.",

keywords = "SVM, blocking dimensions, ensembles, entity matching, example selectors, learner-agnostic selectors, learner-aware selectors, margin, neural networks, perfect and noisy oracles, query by committee, random forests, rule-based models, unified active learning",

author = "Meduri, {Venkata Vamsikrishna} and Lucian Popa and Prithviraj Sen and Mohamed Sarwat",

note = "Publisher Copyright: {\textcopyright} 2020 Association for Computing Machinery.; 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020 ; Conference date: 14-06-2020 Through 19-06-2020",

year = "2020",

month = jun,

day = "14",

doi = "10.1145/3318464.3380597",

language = "English (US)",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

pages = "1133--1147",

booktitle = "SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

AU - Meduri, Venkata Vamsikrishna

AU - Popa, Lucian

AU - Sen, Prithviraj

AU - Sarwat, Mohamed

PY - 2020/6/14

Y1 - 2020/6/14

AB - Entity Matching (EM) is a core data cleaning task, aiming to identify different mentions of the same real-world entity. Active learning is one way to address the challenge of scarce labeled data in practice, by dynamically collecting the necessary examples to be labeled by an Oracle and refining the learned model (classifier) upon them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to what active learning combinations will work well for EM. Towards this, we perform comprehensive experiments on publicly available EM datasets from product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, #labels and example selection latencies. Our most surprising result finds that active learning with fewer labels can learn a classifier of comparable quality as supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10× without affecting the quality of the model.

KW - SVM

KW - blocking dimensions

KW - ensembles

KW - entity matching

KW - example selectors

KW - learner-agnostic selectors

KW - learner-aware selectors

KW - margin

KW - neural networks

KW - perfect and noisy oracles

KW - query by committee

KW - random forests

KW - rule-based models

KW - unified active learning

UR - http://www.scopus.com/inward/record.url?scp=85086235419&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85086235419&partnerID=8YFLogxK

U2 - 10.1145/3318464.3380597

DO - 10.1145/3318464.3380597

M3 - Conference contribution

AN - SCOPUS:85086235419

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 1133

EP - 1147

BT - SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

PB - Association for Computing Machinery

T2 - 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020

Y2 - 14 June 2020 through 19 June 2020

ER -
