Inter-species normalization of gene mentions with GNAT

Jörg Hakenberg, Conrad Plake, Robert Leaman, Michael Schroeder, Graciela Gonzalez

    Research output: Contribution to journal, Article

    79 Citations (Scopus)

    Abstract

    Motivation: Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words.

    Results: We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes.
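
The reported inter-species scores can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes the F-measure in the abstract is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: equal weighting of precision and recall, which the
    abstract does not state explicitly.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the 13-species benchmark.
inter_species_f = f_measure(0.908, 0.738)
print(f"{inter_species_f * 100:.1f}%")
```

Under that assumption the result rounds to the 81.4% given in the abstract, so the three reported figures are mutually consistent.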

    Original language: English (US)
    Journal: Bioinformatics
    Volume: 24
    Issue number: 16
    DOI: https://doi.org/10.1093/bioinformatics/btn299
    State: Published - Aug 2008

    ASJC Scopus subject areas

    • Clinical Biochemistry
    • Computer Science Applications
    • Computational Theory and Mathematics

    Cite this

    Hakenberg, J., Plake, C., Leaman, R., Schroeder, M., & Gonzalez, G. (2008). Inter-species normalization of gene mentions with GNAT. Bioinformatics, 24(16). https://doi.org/10.1093/bioinformatics/btn299

    @article{fdece55e29324d6f9f127b4a7f0390af,
    title = "Inter-species normalization of gene mentions with GNAT",
    author = "J{\"o}rg Hakenberg and Conrad Plake and Robert Leaman and Michael Schroeder and Graciela Gonzalez",
    year = "2008",
    month = "8",
    doi = "10.1093/bioinformatics/btn299",
    language = "English (US)",
    volume = "24",
    journal = "Bioinformatics",
    issn = "1367-4803",
    publisher = "Oxford University Press",
    number = "16",

    }

    TY - JOUR

    T1 - Inter-species normalization of gene mentions with GNAT

    AU - Hakenberg, Jörg

    AU - Plake, Conrad

    AU - Leaman, Robert

    AU - Schroeder, Michael

    AU - Gonzalez, Graciela

    PY - 2008/8

    Y1 - 2008/8

    UR - http://www.scopus.com/inward/record.url?scp=49549120418&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=49549120418&partnerID=8YFLogxK

    U2 - 10.1093/bioinformatics/btn299

    DO - 10.1093/bioinformatics/btn299

    M3 - Article

    C2 - 18689813

    AN - SCOPUS:49549120418

    VL - 24

    JO - Bioinformatics

    JF - Bioinformatics

    SN - 1367-4803

    IS - 16

    ER -