Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature

Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, Graciela Gonzalez-Hernandez

Research output: Contribution to journal, Article

Abstract

Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence and disambiguating the locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and of their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91%, and an overall resolution F1 score of 0.88, all significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
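
The abstract names two ideas without spelling them out: disambiguation "using population heuristics" and a "strict" detection F1 score. The following is a minimal Python sketch of one common reading of each, not the authors' implementation. It assumes that the population heuristic means choosing the most populous gazetteer candidate for a toponym, and that "strict" means a predicted span counts only if its boundaries match a gold span exactly. The `Candidate` class, the `TOY_GAZETTEER` dictionary, and every value in it are hypothetical placeholders, not real gazetteer data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    """One hypothetical gazetteer entry for a place name."""
    name: str
    country: str
    population: int
    lat: float
    lon: float


def disambiguate_by_population(
    toponym: str, gazetteer: Dict[str, List[Candidate]]
) -> Optional[Candidate]:
    """Return the most populous candidate for a toponym (assumed heuristic)."""
    candidates = gazetteer.get(toponym.lower(), [])
    if not candidates:
        return None  # toponym not found in the gazetteer
    return max(candidates, key=lambda c: c.population)


# A span is a (start, end) character offset pair.
Span = Tuple[int, int]


def strict_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """F1 where a prediction is correct only on an exact boundary match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Placeholder entries: countries, populations, and coordinates are made up.
TOY_GAZETTEER: Dict[str, List[Candidate]] = {
    "springfield": [
        Candidate("Springfield", "A", 1000, 10.0, 20.0),
        Candidate("Springfield", "B", 50000, 30.0, 40.0),
    ]
}

best = disambiguate_by_population("Springfield", TOY_GAZETTEER)
score = strict_f1({(0, 5), (10, 15)}, {(0, 5), (10, 14)})
```

In this toy run, `best` is the candidate in placeholder country "B" because it has the larger population, and `score` reflects one exact match out of two gold and two predicted spans. The span `(10, 14)` overlaps a gold span but earns no credit under strict matching.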

Original language: English (US)
Pages (from-to): 100-111
Number of pages: 12
Journal: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Volume: 24
State: Published - Jan 1 2019

Fingerprint

Geographic Locations
Neural Networks (Computer)
Natural Language Processing
Phylogeography
Nucleic Acid Databases
Names
Viruses
Research
Population

ASJC Scopus subject areas

  • Medicine(all)

Cite this

Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature. / Magge, Arjun; Weissenbacher, Davy; Sarker, Abeed; Scotch, Matthew; Gonzalez-Hernandez, Graciela.

In: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, Vol. 24, 01.01.2019, p. 100-111.

@article{54c7e4fc8e09412a853c1b2a82f7d307,
title = "Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature",
abstract = "Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. Insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) in the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, disambiguation accuracy of 91{\%} and an overall resolution F1 score of 0.88 that are significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.",
author = "Arjun Magge and Davy Weissenbacher and Abeed Sarker and Matthew Scotch and Graciela Gonzalez-Hernandez",
year = "2019",
month = "1",
day = "1",
language = "English (US)",
volume = "24",
pages = "100--111",
journal = "Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing",
issn = "2335-6936",

}

TY - JOUR

T1 - Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature

AU - Magge, Arjun

AU - Weissenbacher, Davy

AU - Sarker, Abeed

AU - Scotch, Matthew

AU - Gonzalez-Hernandez, Graciela

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. Insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) in the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F1 score of 0.88 that are significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.

UR - http://www.scopus.com/inward/record.url?scp=85062760249&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062760249&partnerID=8YFLogxK

M3 - Article

C2 - 30864314

AN - SCOPUS:85062760249

VL - 24

SP - 100

EP - 111

JO - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

JF - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

SN - 2335-6936

ER -