Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research

Tasnia Tahsin, Davy Weissenbacher, Demetrius Jones-Shargani, Daniel Magee, Matteo Vaiente, Graciela Gonzalez, Matthew Scotch

Research output: Contribution to journal › Article

Abstract

GenBank is a popular National Center for Biotechnology Information (NCBI) database for submission and analysis of DNA sequences for biomedical research. The resource is part of the Entrez environment, which enables cross-linking of concepts and entries in other participating NCBI databases such as Taxonomy, PubMed and Protein. For example, a GenBank record of an influenza A hemagglutinin gene DNA sequence might have a link to the Taxonomy database for the organism, a link to the related article in PubMed (if published) and a link to the Protein entry for the hemagglutinin protein. Despite their importance in biomedical research areas such as population genetics, phylogeography and public health surveillance, the host and geospatial metadata of genetic sequences in GenBank are not linked to any database. Therefore, to facilitate biomedical research based on georeferenced DNA sequences and/or DNA sequences with normalized host names, we designed and developed a framework that enriches GenBank entries by linking their host metadata to the NCBI Taxonomy database and their geospatial metadata to GeoNames, a comprehensive knowledge base of geographic locations. Here, we introduce a database created by applying this framework to virus sequences in GenBank, and evaluate our normalization algorithms on a set of manually annotated records pertaining to viruses. Although currently applied to viruses, our framework can be easily extended to other organisms, and we discuss the potential use of our resource for biomedical research.
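
The linking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' framework or their normalization algorithms, which the abstract does not detail; it is a simple dictionary lookup under stated assumptions. The lookup tables, the function names (`parse_qualifiers`, `link_host`, `link_location`, `enrich`) and the sample record text are all hypothetical. The sketch assumes host and location arrive as GenBank-style `/host` and `/country` qualifiers; the identifiers shown are believed to be the NCBI Taxonomy and GeoNames IDs for these entries but should be verified before use.

```python
import re

# Toy lookup tables (assumptions for illustration only; a real system would
# query the full NCBI Taxonomy and GeoNames resources instead).
HOST_TO_TAXID = {
    "homo sapiens": 9606,
    "human": 9606,
    "gallus gallus": 9031,
    "chicken": 9031,
    "sus scrofa": 9823,
    "swine": 9823,
}
COUNTRY_TO_GEONAMES = {
    "usa": 6252001,
    "united states": 6252001,
    "china": 1814991,
}

# Matches GenBank-style qualifiers such as /host="Homo sapiens".
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_qualifiers(feature_text):
    """Return a dict of qualifier name -> value from feature text."""
    return {name: value for name, value in QUALIFIER.findall(feature_text)}


def link_host(host):
    """Map a free-text host name to a taxonomy ID, or None if unknown."""
    if host is None:
        return None
    # Keep only the leading name, dropping extras such as "; male".
    key = host.split(";")[0].strip().lower()
    return HOST_TO_TAXID.get(key)


def link_location(country):
    """Map a 'Country: region' string to a GeoNames ID for the country."""
    if country is None:
        return None
    top_level = country.split(":")[0].strip().lower()
    return COUNTRY_TO_GEONAMES.get(top_level)


def enrich(feature_text):
    """Attach normalized host and location identifiers to one record."""
    qualifiers = parse_qualifiers(feature_text)
    return {
        "host_taxid": link_host(qualifiers.get("host")),
        "geonames_id": link_location(qualifiers.get("country")),
    }


if __name__ == "__main__":
    sample = '/organism="Influenza A virus"\n/host="Homo sapiens; male"\n/country="USA: Arizona"'
    print(enrich(sample))
```

Unmatched values return `None` rather than a guess, so records that a simple lookup cannot resolve stay visibly unlinked and can be passed to a more capable matching step or to manual annotation.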

Original language: English (US)
Article number: bax093
Journal: Database
Volume: 2017
Issue number: 1
DOI: https://doi.org/10.1093/database/bax093
State: Published - Jan 1 2017

Fingerprint

biomedical research
Nucleic Acid Databases
Metadata
Biomedical Research
DNA sequences
Databases
National Center for Biotechnology Information
Information Centers
Taxonomies
Biotechnology
Viruses
nucleotide sequences
Hemagglutinins
hemagglutinins
taxonomy
Proteins
PubMed
viruses
Public Health Surveillance
Phylogeography

ASJC Scopus subject areas

  • Information Systems
  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

Tahsin, T., Weissenbacher, D., Jones-Shargani, D., Magee, D., Vaiente, M., Gonzalez, G., & Scotch, M. (2017). Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research. Database, 2017(1), [bax093]. https://doi.org/10.1093/database/bax093

Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research. / Tahsin, Tasnia; Weissenbacher, Davy; Jones-Shargani, Demetrius; Magee, Daniel; Vaiente, Matteo; Gonzalez, Graciela; Scotch, Matthew.

In: Database, Vol. 2017, No. 1, bax093, 01.01.2017.

Tahsin, T, Weissenbacher, D, Jones-Shargani, D, Magee, D, Vaiente, M, Gonzalez, G & Scotch, M 2017, 'Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research', Database, vol. 2017, no. 1, bax093. https://doi.org/10.1093/database/bax093
Tahsin T, Weissenbacher D, Jones-Shargani D, Magee D, Vaiente M, Gonzalez G et al. Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research. Database. 2017 Jan 1;2017(1). bax093. https://doi.org/10.1093/database/bax093
Tahsin, Tasnia ; Weissenbacher, Davy ; Jones-Shargani, Demetrius ; Magee, Daniel ; Vaiente, Matteo ; Gonzalez, Graciela ; Scotch, Matthew. / Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research. In: Database. 2017 ; Vol. 2017, No. 1.
@article{e8075e71027b45e1bde530a8e170bef8,
title = "Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research",
author = "Tasnia Tahsin and Davy Weissenbacher and Demetrius Jones-Shargani and Daniel Magee and Matteo Vaiente and Graciela Gonzalez and Matthew Scotch",
year = "2017",
month = "1",
day = "1",
doi = "10.1093/database/bax093",
language = "English (US)",
volume = "2017",
journal = "Database : the journal of biological databases and curation",
issn = "1758-0463",
publisher = "Oxford University Press",
number = "1",

}

TY - JOUR

T1 - Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research

AU - Tahsin, Tasnia

AU - Weissenbacher, Davy

AU - Jones-Shargani, Demetrius

AU - Magee, Daniel

AU - Vaiente, Matteo

AU - Gonzalez, Graciela

AU - Scotch, Matthew

PY - 2017/1/1

Y1 - 2017/1/1

UR - http://www.scopus.com/inward/record.url?scp=85040983011&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85040983011&partnerID=8YFLogxK

U2 - 10.1093/database/bax093

DO - 10.1093/database/bax093

M3 - Article

VL - 2017

JO - Database : the journal of biological databases and curation

JF - Database : the journal of biological databases and curation

SN - 1758-0463

IS - 1

M1 - bax093

ER -