TY - GEN
T1 - Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses
AU - Tahsin, Tasnia
AU - Lauder, Rob
AU - Beard, Rachel
AU - Weissenbacher, Davy
AU - Rivera, Robert
AU - Wallstrom, Garrick L
AU - Scotch, Matthew
AU - Gonzalez, Graciela
N1 - Funding Information:
Research reported in this publication was supported by the NIAID of the NIH under Award Number R56AI102559 to MS and GG. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
©2014 Association for Computational Linguistics
PY - 2014
Y1 - 2014
N2 - Zoonotic viruses, viruses that are transmittable between animals and humans, represent emerging or re-emerging pathogens that pose significant public health threats throughout the world. It is therefore crucial to advance current surveillance mechanisms for these viruses through outlets such as phylogeography. Phylogeographic techniques may be applied to trace the origins and geographical distribution of these viruses using sequence and location data, which are often obtained from publicly available databases such as GenBank. Despite the abundance of zoonotic viral sequence data in GenBank records, phylogeographic analysis of these viruses is greatly limited by the lack of adequate geographic metadata. Although more detailed information may often be found in the related articles referenced in these records, manual extraction of this information presents a severe bottleneck. In this work, we propose an automated system for extracting this information using Natural Language Processing (NLP) methods. In order to validate the need for such a system, we first determine the percentage of GenBank records with “insufficient” geographic metadata for seven well-studied zoonotic viruses. We then evaluate four different named entity recognition (NER) systems which may help in the automatic extraction of information from related articles that can be used to improve the GenBank geographic metadata. This includes a novel dictionary-based location tagging system that we introduce in this paper.
UR - http://www.scopus.com/inward/record.url?scp=85122539416&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122539416&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85122539416
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 1
EP - 9
BT - ACL 2014 - BioNLP 2014, Workshop on Biomedical Natural Language Processing, Proceedings of the Workshop
PB - Association for Computational Linguistics (ACL)
T2 - ACL 2014 Workshop on Biomedical Natural Language Processing, BioNLP 2014
Y2 - 27 June 2014 through 28 June 2014
ER -