GeoBoost2: A natural language processing pipeline for GenBank metadata enrichment for virus phylogeography

Arjun Magge, Davy Weissenbacher, Karen O'Connor, Tasnia Tahsin, Graciela Gonzalez-Hernandez, Matthew Scotch

Research output: Contribution to journal › Article › peer-review

Abstract

We present GeoBoost2, a natural language processing pipeline that extracts the locations of infected hosts in order to enrich the metadata of nucleotide sequence repositories, such as the National Center for Biotechnology Information's GenBank, for downstream analyses including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and across borders. In this article, we describe the enhancements over our earlier release: improved end-to-end extraction performance and speed, a fully functional web interface, and state-of-the-art deep learning methods for location extraction.

Original language: English (US)
Pages (from-to): 5120-5121
Number of pages: 2
Journal: Bioinformatics
Volume: 36
Issue number: 20
State: Published - Oct 15 2020

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics
