TY - JOUR
T1 - A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records
AU - Tahsin, Tasnia
AU - Weissenbacher, Davy
AU - Rivera, Robert
AU - Beard, Rachel
AU - Firago, Mari
AU - Wallstrom, Garrick
AU - Scotch, Matthew
AU - Gonzalez, Graciela
N1 - Funding Information:
Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under Award Number R56AI102559 to G.G. and M.S. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.
PY - 2016/9/1
Y1 - 2016/9/1
N2 - Objective: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping it to latitude/longitude coordinates using knowledge derived from external geographical databases. Materials and Methods: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitude of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. Results: We found the precision, recall, and F-measure of our system for linking GenBank records to the latitude/longitude of their LOIH to be 0.832, 0.967, and 0.894, respectively. Discussion: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. Conclusion: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
AB - Objective: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping it to latitude/longitude coordinates using knowledge derived from external geographical databases. Materials and Methods: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitude of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. Results: We found the precision, recall, and F-measure of our system for linking GenBank records to the latitude/longitude of their LOIH to be 0.832, 0.967, and 0.894, respectively. Discussion: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. Conclusion: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
KW - Information extraction
KW - Natural language processing
KW - Phylogeography
UR - http://www.scopus.com/inward/record.url?scp=84995812876&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84995812876&partnerID=8YFLogxK
U2 - 10.1093/jamia/ocv172
DO - 10.1093/jamia/ocv172
M3 - Article
C2 - 26911818
AN - SCOPUS:84995812876
SN - 1067-5027
VL - 23
SP - 934
EP - 941
JO - Journal of the American Medical Informatics Association : JAMIA
JF - Journal of the American Medical Informatics Association : JAMIA
IS - 5
ER -