No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile

Sarah Tahamont; Zubin Jelveh; Melissa McNeill; Shi Yan; Aaron Chalfin; Benjamin Hansen

doi:10.1371/journal.pone.0283811

Watts College of Public Service and Community Solutions (WATTS)

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

While linking records across large administrative datasets [“big data”] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when an algorithm has access to “ground-truth” examples—matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground truth examples. But critically, in many real-world applications, only a relatively small number of tactically-selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground truth examples using a readily available off-the-shelf tool.

Original language	English (US)
Article number	e0283811
Journal	PLOS ONE
Volume	18
Issue number	4
DOIs	https://doi.org/10.1371/journal.pone.0283811
State	Published - Apr 2023

ASJC Scopus subject areas

General

Cite this

@article{a798fe20df874b4791e20abcc893dfad,

title = "No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile",

abstract = "While linking records across large administrative datasets [“big data”] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when an algorithm has access to “ground-truth” examples—matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground truth examples. But critically, in many real-world applications, only a relatively small number of tactically-selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground truth examples using a readily available off-the-shelf tool.",

author = "Sarah Tahamont and Zubin Jelveh and Melissa McNeill and Shi Yan and Aaron Chalfin and Benjamin Hansen",

note = "Funding Information: The authors would like to acknowledge the entities who supported the development of the open-source linking algorithms we use in our analysis: University of Chicago Crime Lab, New York and Arnold Ventures (Name Match) and dedupe.io (dedupe). Publisher Copyright: {\textcopyright} 2023 Tahamont et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.",

year = "2023",

month = apr,

doi = "10.1371/journal.pone.0283811",

language = "English (US)",

volume = "18",

journal = "PLOS ONE",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "4",

}

TY - JOUR

T1 - No ground truth? No problem

T2 - Improving administrative data linking using active learning and a little bit of guile

AU - Tahamont, Sarah

AU - Jelveh, Zubin

AU - McNeill, Melissa

AU - Yan, Shi

AU - Chalfin, Aaron

AU - Hansen, Benjamin

N1 - Funding Information: The authors would like to acknowledge the entities who supported the development of the open-source linking algorithms we use in our analysis: University of Chicago Crime Lab, New York and Arnold Ventures (Name Match) and dedupe.io (dedupe). Publisher Copyright: © 2023 Tahamont et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

PY - 2023/4

Y1 - 2023/4

N2 - While linking records across large administrative datasets [“big data”] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when an algorithm has access to “ground-truth” examples—matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground truth examples. But critically, in many real-world applications, only a relatively small number of tactically-selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground truth examples using a readily available off-the-shelf tool.

UR - http://www.scopus.com/inward/record.url?scp=85151804677&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85151804677&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0283811

DO - 10.1371/journal.pone.0283811

M3 - Article

C2 - 37014897

AN - SCOPUS:85151804677

SN - 1932-6203

VL - 18

JO - PLOS ONE

JF - PLOS ONE

IS - 4

M1 - e0283811

ER -
