Enhancing phylogeography by improving geographical information from GenBank

Matthew Scotch, Indra Neil Sarkar, Changjiang Mei, Robert Leaman, Kei Hoi Cheung, Pierina Ortiz, Ashutosh Singraur, Graciela Gonzalez

Research output: Contribution to journal › Article

19 Citations (Scopus)

Abstract

Phylogeography is a field that focuses on the geographical lineages of species such as vertebrates or viruses. Here, geographical data, such as the location of a species or viral host, is as important as the sequence information extracted from the species. Together, this information can help illustrate the migration of the species over time within a geographical area, the impact of geography on its evolutionary history, or the expected population of the species within the area. NCBI databases, specifically GenBank, provide an abundance of molecular sequence data for phylogeography. However, geographical data is inconsistently represented and sparse across GenBank entries. This can impede analysis and, in situations where the geographical information is inferred, potentially lead to erroneous results. In this paper, we describe the current state of geographical data in GenBank and illustrate how automated processing techniques such as named entity recognition can enhance the geographical data available for phylogeographic studies.
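
As a rough illustration of the problem the abstract describes, the sketch below reads the geographic location from the `/country` qualifier of a GenBank flat-file record and, when that qualifier is absent, falls back to a simple dictionary lookup against the `/strain` value. This is not the authors' method: the fallback is a toy stand-in for named entity recognition, and the gazetteer, the function name `extract_location`, and the sample records are all hypothetical.

```python
import re

# Hypothetical mini-gazetteer (assumption); a real study would use a full
# geographic resource rather than a hand-written dictionary.
GAZETTEER = {
    "hong kong": "Hong Kong",
    "viet nam": "Viet Nam",
    "vietnam": "Viet Nam",
    "egypt": "Egypt",
}

# Qualifiers as they appear in the FEATURES/source block of a flat-file record.
COUNTRY_RE = re.compile(r'/country="([^"]+)"')
STRAIN_RE = re.compile(r'/strain="([^"]+)"')


def extract_location(record_text):
    """Return (location, provenance) for one GenBank flat-file record."""
    match = COUNTRY_RE.search(record_text)
    if match:
        # Collapse whitespace in case the value was wrapped across lines.
        return " ".join(match.group(1).split()), "country qualifier"

    match = STRAIN_RE.search(record_text)
    if match:
        strain = match.group(1).lower()
        for key, name in GAZETTEER.items():
            # Whole-word match so that short names do not hit substrings.
            if re.search(r"\b" + re.escape(key) + r"\b", strain):
                return name, "inferred from strain"

    return None, "missing"


# Hypothetical record fragments used only for illustration.
with_country = '/organism="Influenza A virus"\n/country="China: Guangdong"\n'
strain_only = '/organism="Influenza A virus"\n/strain="A/chicken/Egypt/1/2008(H5N1)"\n'
no_location = '/organism="Influenza A virus"\n'

for record in (with_country, strain_only, no_location):
    print(extract_location(record))
```

Returning the provenance alongside the location keeps inferred values distinguishable from explicitly annotated ones, which matters because inferred locations are the ones the abstract flags as a possible source of erroneous results.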

Original language: English (US)
Journal: Journal of Biomedical Informatics
Volume: 44
Issue number: SUPPL. 1
DOIs: https://doi.org/10.1016/j.jbi.2011.06.005
State: Published - Dec 2011

Fingerprint

Phylogeography
Nucleic Acid Databases
Viruses
Molecular Sequence Data
Processing
Geography
Vertebrates
History
Population

Keywords

  • Bioinformatics
  • Databases
  • Geographic locations
  • Nucleic acid
  • Phylogeography

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

Scotch, M., Sarkar, I. N., Mei, C., Leaman, R., Cheung, K. H., Ortiz, P., ... Gonzalez, G. (2011). Enhancing phylogeography by improving geographical information from GenBank. Journal of Biomedical Informatics, 44(SUPPL. 1). https://doi.org/10.1016/j.jbi.2011.06.005

@article{d86e28a2a68a4f28bb4de5cba11e30d1,
title = "Enhancing phylogeography by improving geographical information from GenBank",
keywords = "Bioinformatics, Databases, Geographic locations, Nucleic acid, Phylogeography",
author = "Matthew Scotch and Sarkar, {Indra Neil} and Changjiang Mei and Robert Leaman and Cheung, {Kei Hoi} and Pierina Ortiz and Ashutosh Singraur and Graciela Gonzalez",
year = "2011",
month = "12",
doi = "10.1016/j.jbi.2011.06.005",
language = "English (US)",
volume = "44",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",
number = "SUPPL. 1",

}

TY - JOUR
T1 - Enhancing phylogeography by improving geographical information from GenBank
AU - Scotch, Matthew
AU - Sarkar, Indra Neil
AU - Mei, Changjiang
AU - Leaman, Robert
AU - Cheung, Kei Hoi
AU - Ortiz, Pierina
AU - Singraur, Ashutosh
AU - Gonzalez, Graciela
PY - 2011/12
Y1 - 2011/12
KW - Bioinformatics
KW - Databases
KW - Geographic locations
KW - Nucleic acid
KW - Phylogeography
UR - http://www.scopus.com/inward/record.url?scp=83755218843&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83755218843&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2011.06.005
DO - 10.1016/j.jbi.2011.06.005
M3 - Article
C2 - 21723960
AN - SCOPUS:83755218843
VL - 44
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
SN - 1532-0464
IS - SUPPL. 1
ER -