A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records

Tasnia Tahsin, Davy Weissenbacher, Robert Rivera, Rachel Beard, Mari Firago, Garrick Wallstrom, Matthew Scotch, Graciela Gonzalez

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

Objective: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping them to their latitude/longitudes using knowledge derived from external geographical databases.

Materials and Methods: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus.

Results: We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively.

Discussion: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction.

Conclusion: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
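The three reported scores can be cross-checked against one another. The sketch below assumes that the abstract's "f-measure" is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_measure` is purely illustrative, not part of the authors' system.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the balanced harmonic mean (F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the abstract.
reported_precision = 0.832
reported_recall = 0.967

# Assumption: the reported f-measure is the balanced F1.
f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 3))  # 0.894
```

Under that assumption, the computed value rounds to the 0.894 given in the abstract, so the three figures are mutually consistent.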

Original language: English (US)
Article number: ocv172
Pages (from-to): 934-941
Number of pages: 8
Journal: Journal of the American Medical Informatics Association
Volume: 23
Issue number: 5
DOIs: https://doi.org/10.1093/jamia/ocv172
State: Published - Sep 1 2016

Fingerprint

Nucleic Acid Databases
Information Storage and Retrieval
Influenza A virus
Databases
Metadata
PubMed
Viruses

Keywords

  • Information extraction
  • Natural language processing
  • Phylogeography

ASJC Scopus subject areas

  • Health Informatics

Cite this

Tahsin, T., Weissenbacher, D., Rivera, R., Beard, R., Firago, M., Wallstrom, G., ... Gonzalez, G. (2016). A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records. Journal of the American Medical Informatics Association, 23(5), 934-941. [ocv172]. https://doi.org/10.1093/jamia/ocv172

A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records. / Tahsin, Tasnia; Weissenbacher, Davy; Rivera, Robert; Beard, Rachel; Firago, Mari; Wallstrom, Garrick; Scotch, Matthew; Gonzalez, Graciela.

In: Journal of the American Medical Informatics Association, Vol. 23, No. 5, ocv172, 01.09.2016, p. 934-941.

Tahsin, T, Weissenbacher, D, Rivera, R, Beard, R, Firago, M, Wallstrom, G, Scotch, M & Gonzalez, G 2016, 'A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records', Journal of the American Medical Informatics Association, vol. 23, no. 5, ocv172, pp. 934-941. https://doi.org/10.1093/jamia/ocv172
Tahsin, Tasnia ; Weissenbacher, Davy ; Rivera, Robert ; Beard, Rachel ; Firago, Mari ; Wallstrom, Garrick ; Scotch, Matthew ; Gonzalez, Graciela. / A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records. In: Journal of the American Medical Informatics Association. 2016 ; Vol. 23, No. 5. pp. 934-941.
@article{1cc17faff98d46ada380b4f317fbef38,
title = "A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records",
abstract = "Objective: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping them to their latitude/longitudes using knowledge derived from external geographical databases. Materials and Methods: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. Results: We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively. Discussion: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. Conclusion: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.",
keywords = "Information extraction, Natural language processing, Phylogeography",
author = "Tasnia Tahsin and Davy Weissenbacher and Robert Rivera and Rachel Beard and Mari Firago and Garrick Wallstrom and Matthew Scotch and Graciela Gonzalez",
year = "2016",
month = "9",
day = "1",
doi = "10.1093/jamia/ocv172",
language = "English (US)",
volume = "23",
pages = "934--941",
journal = "Journal of the American Medical Informatics Association : JAMIA",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "5",

}

TY - JOUR

T1 - A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records

AU - Tahsin, Tasnia

AU - Weissenbacher, Davy

AU - Rivera, Robert

AU - Beard, Rachel

AU - Firago, Mari

AU - Wallstrom, Garrick

AU - Scotch, Matthew

AU - Gonzalez, Graciela

PY - 2016/9/1

Y1 - 2016/9/1

AB - Objective: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping them to their latitude/longitudes using knowledge derived from external geographical databases. Materials and Methods: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. Results: We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively. Discussion: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. Conclusion: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.

KW - Information extraction

KW - Natural language processing

KW - Phylogeography

UR - http://www.scopus.com/inward/record.url?scp=84995812876&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84995812876&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocv172

DO - 10.1093/jamia/ocv172

M3 - Article

C2 - 26911818

AN - SCOPUS:84995812876

VL - 23

SP - 934

EP - 941

JO - Journal of the American Medical Informatics Association : JAMIA

JF - Journal of the American Medical Informatics Association : JAMIA

SN - 1067-5027

IS - 5

M1 - ocv172

ER -