TY - JOUR
T1 - No ground truth? No problem
T2 - Improving administrative data linking using active learning and a little bit of guile
AU - Tahamont, Sarah
AU - Jelveh, Zubin
AU - McNeill, Melissa
AU - Yan, Shi
AU - Chalfin, Aaron
AU - Hansen, Benjamin
N1 - Funding Information:
The authors would like to acknowledge the entities who supported the development of the open-source linking algorithms we use in our analysis: University of Chicago Crime Lab, New York and Arnold Ventures (Name Match) and dedupe.io (dedupe).
Publisher Copyright:
© 2023 Tahamont et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2023/4
Y1 - 2023/4
N2 - While linking records across large administrative datasets [“big data”] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when an algorithm has access to “ground-truth” examples—matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground truth examples. But critically, in many real-world applications, only a relatively small number of tactically-selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground truth examples using a readily available off-the-shelf tool.
UR - http://www.scopus.com/inward/record.url?scp=85151804677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85151804677&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0283811
DO - 10.1371/journal.pone.0283811
M3 - Article
C2 - 37014897
AN - SCOPUS:85151804677
SN - 1932-6203
VL - 18
JO - PLoS One
JF - PLoS One
IS - 4
M1 - e0283811
ER -