Inter-species normalization of gene mentions with GNAT

Jörg Hakenberg, Conrad Plake, Robert Leaman, Michael Schroeder, Graciela Gonzalez

Research output: Contribution to journal (Article)

Abstract

Motivation: Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words.

Results: We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes.
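
The reported inter-species figures can be checked against each other with a short calculation. The sketch below assumes that the F-measure in the abstract is the balanced F1 score, i.e. the harmonic mean of precision and recall; the abstract does not state the weighting, but the three reported numbers are consistent with it. The function name `f_measure` and its `beta` parameter are illustrative and not taken from GNAT.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1.0 gives the harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced (beta = 1) variant.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for the 13-species benchmark.
f1 = f_measure(0.908, 0.738)
print(f"F-measure: {f1:.1%}")  # prints "F-measure: 81.4%"
```

Under that assumption, 2 x 0.908 x 0.738 / (0.908 + 0.738) is approximately 0.814, which matches the 81.4% stated in the abstract.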

Original language: English (US)
Journal: Bioinformatics
Volume: 24
Issue number: 16
DOI: https://doi.org/10.1093/bioinformatics/btn299
State: Published - Aug 2008

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Hakenberg, J., Plake, C., Leaman, R., Schroeder, M., & Gonzalez, G. (2008). Inter-species normalization of gene mentions with GNAT. Bioinformatics, 24(16). https://doi.org/10.1093/bioinformatics/btn299

@article{fdece55e29324d6f9f127b4a7f0390af,
title = "Inter-species normalization of gene mentions with GNAT",
author = "J{\"o}rg Hakenberg and Conrad Plake and Robert Leaman and Michael Schroeder and Graciela Gonzalez",
year = "2008",
month = "8",
doi = "10.1093/bioinformatics/btn299",
language = "English (US)",
volume = "24",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "16",

}
