Knowledge-driven geospatial location resolution for phylogeographic models of virus migration

Davy Weissenbacher, Tasnia Tahsin, Rachel Beard, Mari Figaro, Robert Rivera, Matthew Scotch, Graciela Gonzalez

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Summary: Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles.

Motivation: In this article, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a 'metadata heuristic').

Results: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences.
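The three disambiguation heuristics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe how the heuristics are implemented, so everything here is an assumption made for illustration: the `Candidate` record, the function names, the rule that the metadata heuristic filters candidates by the countries found in GenBank metadata, and the toy gazetteer entries (which are invented and do not describe real places).

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Candidate:
    """One gazetteer entry that a place name could refer to (hypothetical schema)."""
    name: str
    country: str
    lat: float
    lon: float
    population: int


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def population_heuristic(candidates):
    """Pick the most populous candidate."""
    return max(candidates, key=lambda c: c.population)


def distance_heuristic(candidates, anchors):
    """Pick the candidate closest (by summed distance) to already-resolved places.

    Falls back to the population heuristic when no anchors are available.
    """
    if not anchors:
        return population_heuristic(candidates)
    return min(
        candidates,
        key=lambda c: sum(haversine_km(c.lat, c.lon, a.lat, a.lon) for a in anchors),
    )


def metadata_heuristic(candidates, genbank_countries):
    """Assumed rule: prefer candidates whose country appears in the GenBank
    metadata of the linked records, then break ties by population."""
    matching = [c for c in candidates if c.country in genbank_countries]
    return population_heuristic(matching or candidates)


# Invented toy gazetteer: two places sharing the name "Riverton".
riverton = [
    Candidate("Riverton", "Country A", 10.0, 10.0, 500_000),
    Candidate("Riverton", "Country B", 40.0, -75.0, 20_000),
]
# A place already resolved elsewhere in the same (hypothetical) article.
resolved = [Candidate("Hilltown", "Country B", 41.0, -74.0, 100_000)]

by_population = population_heuristic(riverton)
by_distance = distance_heuristic(riverton, resolved)
by_metadata = metadata_heuristic(riverton, {"Country B"})
```

In this toy case the population heuristic selects the larger Riverton in Country A, while the distance and metadata sketches both select the Riverton in Country B, which shows how external knowledge can override a population-based default.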

Original language: English (US)
Pages (from-to): i348-i356
Journal: Bioinformatics
Volume: 31
Issue number: 12
DOIs: https://doi.org/10.1093/bioinformatics/btv259
State: Published - Jun 15 2015

Fingerprint

Metadata
Viruses
Virus
Migration
Heuristics
Nucleic Acid Databases
Phylogeography
Zoonoses
Model
Sufficient
Public Health
Error Analysis
Surveillance
Names
Knowledge
Animals
Mutation
Retrieval

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability

Cite this

Weissenbacher, D., Tahsin, T., Beard, R., Figaro, M., Rivera, R., Scotch, M., & Gonzalez, G. (2015). Knowledge-driven geospatial location resolution for phylogeographic models of virus migration. Bioinformatics, 31(12), i348-i356. https://doi.org/10.1093/bioinformatics/btv259

@article{40ba52058c964fc0833105f33a01e362,
title = "Knowledge-driven geospatial location resolution for phylogeographic models of virus migration",
author = "Davy Weissenbacher and Tasnia Tahsin and Rachel Beard and Mari Figaro and Robert Rivera and Matthew Scotch and Graciela Gonzalez",
year = "2015",
month = "6",
day = "15",
doi = "10.1093/bioinformatics/btv259",
language = "English (US)",
volume = "31",
pages = "i348--i356",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "12",

}

TY - JOUR

T1 - Knowledge-driven geospatial location resolution for phylogeographic models of virus migration

AU - Weissenbacher, Davy

AU - Tahsin, Tasnia

AU - Beard, Rachel

AU - Figaro, Mari

AU - Rivera, Robert

AU - Scotch, Matthew

AU - Gonzalez, Graciela

PY - 2015/6/15

Y1 - 2015/6/15

UR - http://www.scopus.com/inward/record.url?scp=84931067655&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84931067655&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btv259

DO - 10.1093/bioinformatics/btv259

M3 - Article

C2 - 26072502

AN - SCOPUS:84931067655

VL - 31

SP - i348-i356

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 12

ER -