Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses

Tasnia Tahsin, Rob Lauder, Rachel Beard, Davy Weissenbacher, Robert Rivera, Garrick L Wallstrom, Matthew Scotch, Graciela Gonzalez

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Zoonotic viruses, viruses that are transmissible between animals and humans, represent emerging or re-emerging pathogens that pose significant public health threats throughout the world. It is therefore crucial to advance current surveillance mechanisms for these viruses through approaches such as phylogeography. Phylogeographic techniques may be applied to trace the origins and geographical distribution of these viruses using sequence and location data, which are often obtained from publicly available databases such as GenBank. Despite the abundance of zoonotic viral sequence data in GenBank records, phylogeographic analysis of these viruses is greatly limited by the lack of adequate geographic metadata. Although more detailed information may often be found in the related articles referenced in these records, manual extraction of this information presents a severe bottleneck. In this work, we propose an automated system for extracting this information using Natural Language Processing (NLP) methods. To validate the need for such a system, we first determine the percentage of GenBank records with “insufficient” geographic metadata for seven well-studied zoonotic viruses. We then evaluate four different named entity recognition (NER) systems that may help in automatically extracting information from related articles to improve the GenBank geographic metadata. These include a novel dictionary-based location tagging system that we introduce in this paper.

Original language: English (US)
Title of host publication: ACL 2014 - BioNLP 2014, Workshop on Biomedical Natural Language Processing, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 1-9
Number of pages: 9
ISBN (Electronic): 9781941643181
State: Published - 2014
Externally published: Yes
Event: ACL 2014 Workshop on Biomedical Natural Language Processing, BioNLP 2014 - Baltimore, United States
Duration: Jun 27 2014 → Jun 28 2014

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: ACL 2014 Workshop on Biomedical Natural Language Processing, BioNLP 2014
Country/Territory: United States
City: Baltimore
Period: 6/27/14 → 6/28/14

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
