Text Processing and Geospatial Uncertainty for Phylogeography of Zoonotic Viruses

Project: Research project

Project Details

Description

Phylogeography of zoonotic viruses studies the geographical spread and genetic lineages of viruses that are transmissible between animals and humans, such as avian influenza, rabies, and West Nile Virus (WNV). This science can help state public health and agriculture agencies identify the animal hosts that most impact virus propagation in a particular geographic region, the migration path of the virus including its origin, and the patterns of infection in various host populations, including humans, over time. GenBank, maintained by the National Center for Biotechnology Information (NCBI), provides an abundance of viral sequence data for phylogeography. Sequences and their metadata can be downloaded and imported into software applications that generate phylogeographic trees and models for surveillance. However, geospatial metadata such as host location is sparse and inconsistently represented across GenBank entries: our preliminary studies show that only about 20% of GenBank records contain specific information such as a county, town, or region within a state. While this detailed geospatial information might be included in the corresponding journal article, it is not available for immediate use in a bioinformatics or GIS application unless it is manually extracted and linked back to the appropriate sequence. The absence of precise sampling locations from easily computable secondary data sources such as GenBank makes it harder to build accurate phylogeographic models of virus migration. We propose an infrastructure to improve phylogeographic models of virus migration by linking relevant geospatial data from the literature. This work represents the first effort to use automatically extracted geospatial data present in journal articles corresponding to GenBank records in order to enhance modeling of virus migration.
Our research will extend phylogeography and zoonotic surveillance by creating a Natural Language Processing (NLP) infrastructure that will improve the level of detail of geospatial data for phylogeography of zoonotic viruses (Aim 1), developing phylogeographic models from the data extracted in Aim 1 using appropriate biostatistical methods (Aim 2), and evaluating the impact of our approach for phylogeography and surveillance of zoonotic viruses (Aim 3). Thus, this work will provide researchers with a framework for population surveillance using an integrated biomedical informatics approach including NLP, biostatistics, bioinformatics, and database design.
Status: Finished
Effective start/end date: 8/2/13 to 7/31/15

Funding

  • HHS: National Institutes of Health (NIH): $451,478.00
