Deep neural networks and distant supervision for geographic location mention extraction

Arjun Magge; Davy Weissenbacher; Abeed Sarker; Matthew Scotch; Graciela Gonzalez-Hernandez

doi:10.1093/bioinformatics/bty273

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.

Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER's capability to embed external features to further boost the system's performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.

Original language	English (US)
Pages (from-to)	i565-i573
Journal	Bioinformatics
Volume	34
Issue number	13
DOIs	https://doi.org/10.1093/bioinformatics/bty273
State	Published - Jul 1 2018

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{36aea5cb3cb048f488eb9a4aeb8ff649,

title = "Deep neural networks and distant supervision for geographic location mention extraction",

abstract = "Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER{\textquoteright}s capability to embed external features to further boost the system{\textquoteright}s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.",

author = "Arjun Magge and Davy Weissenbacher and Abeed Sarker and Matthew Scotch and Graciela Gonzalez-Hernandez",

note = "Funding Information: Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under grant number R01AI117011. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Funding Information: AM designed and trained the neural network, ran the experiments, performed the error analysis and wrote most of the manuscript. DW proposed the idea of using distant supervision for improving the CRF NER{\textquoteright}s performance in the previous manuscript, created the distant supervision dataset, supervised the experiments and wrote revisions of the manuscript. AS reviewed, restructured and contributed many sections and revisions of the manuscript. MS and GG provided overall guidance on the work and edited the final manuscript. The authors would also like to acknowledge Karen O{\textquoteright}Connor, Megan Rorison and Briana Trevino for their efforts in the annotation processes. The authors are grateful to the anonymous reviewers for their valuable feedback and comments to improve the quality of the paper. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Publisher Copyright: {\textcopyright} The Author(s) 2018. Published by Oxford University Press. All rights reserved.",

year = "2018",

month = jul,

day = "1",

doi = "10.1093/bioinformatics/bty273",

language = "English (US)",

volume = "34",

pages = "i565--i573",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "13",

}

TY - JOUR

T1 - Deep neural networks and distant supervision for geographic location mention extraction

AU - Magge, Arjun

AU - Weissenbacher, Davy

AU - Sarker, Abeed

AU - Scotch, Matthew

AU - Gonzalez-Hernandez, Graciela

N1 - Funding Information: Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under grant number R01AI117011. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Funding Information: AM designed and trained the neural network, ran the experiments, performed the error analysis and wrote most of the manuscript. DW proposed the idea of using distant supervision for improving the CRF NER’s performance in the previous manuscript, created the distant supervision dataset, supervised the experiments and wrote revisions of the manuscript. AS reviewed, restructured and contributed many sections and revisions of the manuscript. MS and GG provided overall guidance on the work and edited the final manuscript. The authors would also like to acknowledge Karen O’Connor, Megan Rorison and Briana Trevino for their efforts in the annotation processes. The authors are grateful to the anonymous reviewers for their valuable feedback and comments to improve the quality of the paper. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Publisher Copyright: © The Author(s) 2018. Published by Oxford University Press. All rights reserved.

PY - 2018/7/1

Y1 - 2018/7/1

N2 - Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.

AB - Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.

UR - http://www.scopus.com/inward/record.url?scp=85050803315&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050803315&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bty273

DO - 10.1093/bioinformatics/bty273

M3 - Article

C2 - 29950020

AN - SCOPUS:85050803315

SN - 1367-4803

VL - 34

SP - i565-i573

JO - Bioinformatics

JF - Bioinformatics

IS - 13

ER -
