Deep neural networks and distant supervision for geographic location mention extraction

Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, Graciela Gonzalez-Hernandez

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.

Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER, achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER's capability to embed external features to further boost the system's performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.
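The abstract does not describe how the distantly supervised samples were produced or how the F1-score was computed at the implementation level. As a rough illustration only, the sketch below assumes one common form of distant supervision, labeling tokens by matching them against a gazetteer of place names, and scores token-level predictions with the standard F1 formula. The names `GAZETTEER`, `tokenize`, `distant_labels`, and `f1_score` are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical toy gazetteer (an assumption for illustration);
# a real one would hold many place names from an external resource.
GAZETTEER = {"arizona", "phoenix", "hong kong"}


def tokenize(text):
    """Split text into simple word tokens."""
    return re.findall(r"\w+", text)


def distant_labels(tokens, gazetteer=GAZETTEER, max_len=3):
    """Label each token 1 if it lies inside a gazetteer match, else 0.

    Tries the longest candidate span first so that multi-word names
    such as "Hong Kong" are labeled as a whole.
    """
    labels = [0] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            if i + n > len(tokens):
                continue  # span would run past the end of the sentence
            span = " ".join(tokens[i:i + n]).lower()
            if span in gazetteer:
                labels[i:i + n] = [1] * n
                i += n
                break
        else:
            i += 1  # no span starting here matched
    return labels


def f1_score(gold, pred):
    """Token-level F1 for the positive (toponym) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


tokens = tokenize("Samples were collected in Hong Kong and Phoenix")
silver = distant_labels(tokens)
```

Labels produced this way are noisy (a gazetteer entry can also be a person's name or a common word), which is why such data is typically used to supplement, not replace, human-annotated training examples.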

Original language: English (US)
Pages (from-to): i565-i573
Journal: Bioinformatics
Volume: 34
Issue number: 13
DOI: https://doi.org/10.1093/bioinformatics/bty273
State: Published - Jul 1 2018

Fingerprint

Geographic Locations
Nucleic Acid Databases
Viruses
Virus
Neural Networks
Literature
DNA Viruses
Art
DNA sequences
Feedforward neural networks
Databases
DNA Sequence
Research
Methodology
Modeling
Demonstrate
Experiment
Deep neural networks
Experiments

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Deep neural networks and distant supervision for geographic location mention extraction. / Magge, Arjun; Weissenbacher, Davy; Sarker, Abeed; Scotch, Matthew; Gonzalez-Hernandez, Graciela.

In: Bioinformatics, Vol. 34, No. 13, 01.07.2018, p. i565-i573.

Magge, A, Weissenbacher, D, Sarker, A, Scotch, M & Gonzalez-Hernandez, G 2018, 'Deep neural networks and distant supervision for geographic location mention extraction', Bioinformatics, vol. 34, no. 13, pp. i565-i573. https://doi.org/10.1093/bioinformatics/bty273
@article{36aea5cb3cb048f488eb9a4aeb8ff649,
title = "Deep neural networks and distant supervision for geographic location mention extraction",
abstract = "Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER, achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER's capability to embed external features to further boost the system's performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.",
author = "Arjun Magge and Davy Weissenbacher and Abeed Sarker and Matthew Scotch and Graciela Gonzalez-Hernandez",
year = "2018",
month = "7",
day = "1",
doi = "10.1093/bioinformatics/bty273",
language = "English (US)",
volume = "34",
pages = "i565--i573",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "13",

}

TY - JOUR

T1 - Deep neural networks and distant supervision for geographic location mention extraction

AU - Magge, Arjun

AU - Weissenbacher, Davy

AU - Sarker, Abeed

AU - Scotch, Matthew

AU - Gonzalez-Hernandez, Graciela

PY - 2018/7/1

Y1 - 2018/7/1

AB - Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER, achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER's capability to embed external features to further boost the system's performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.

UR - http://www.scopus.com/inward/record.url?scp=85050803315&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050803315&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bty273

DO - 10.1093/bioinformatics/bty273

M3 - Article

C2 - 29950020

AN - SCOPUS:85050803315

VL - 34

SP - i565-i573

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 13

ER -