TY - JOUR
T1 - Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research
AU - Tahsin, Tasnia
AU - Weissenbacher, Davy
AU - Jones-Shargani, Demetrius
AU - Magee, Daniel
AU - Vaiente, Matteo
AU - Gonzalez, Graciela
AU - Scotch, Matthew
N1 - Funding Information:
National Library of Medicine (NLM) of the National Institutes of Health (NIH) (R01LM012080) to M.S.; National Institute of Allergy and Infectious Diseases (NIAID) of the NIH (R01AI117011) to M.S. and G.G. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Publisher Copyright:
© The Author(s) 2017.
PY - 2017
Y1 - 2017
AB - GenBank is a popular National Center for Biotechnology Information (NCBI) database for submission and analysis of DNA sequences for biomedical research. The resource is part of the Entrez environment, which enables cross-linking of concepts and entries in other participating NCBI databases such as Taxonomy, PubMed and Protein. For example, a GenBank record of an influenza A hemagglutinin gene DNA sequence might have a link to the Taxonomy database for the organism, a link to the related article in PubMed (if published) and a link to the Protein entry for the hemagglutinin protein. Despite their importance in biomedical research such as population genetics, phylogeography and public health surveillance, the host and geospatial metadata of genetic sequences in GenBank are not linked to any database. Therefore, to facilitate biomedical research based on georeferenced DNA sequences and/or DNA sequences with normalized host names, we designed and developed a framework that enriches GenBank entries by linking their host metadata to the NCBI Taxonomy database and their geospatial metadata to a comprehensive knowledge base of geographic locations called GeoNames. Here, we introduce a database created through the application of this framework to virus sequences in GenBank, and evaluate our normalization algorithms on a set of manually annotated records pertaining to viruses. Although currently applied to viruses, our framework can be easily extended to other organisms, and we discuss the potential utilization of our resource for biomedical research.
UR - http://www.scopus.com/inward/record.url?scp=85040983011&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85040983011&partnerID=8YFLogxK
U2 - 10.1093/database/bax093
DO - 10.1093/database/bax093
M3 - Article
C2 - 30412219
AN - SCOPUS:85040983011
SN - 1758-0463
VL - 2017
JO - Database
JF - Database
IS - 1
M1 - bax093
ER -