Mining gene-disease relationships from biomedical literature: Weighting protein-protein interactions and connectivity measures

Graciela Gonzalez; Juan C. Uribe; Luis Tari; Colleen Brophy; Chitta Baral

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Motivation: The promise of disease-related discoveries and advances in the post-genome era has yet to be fully realized, and many opportunities for discovery remain hidden in the millions of biomedical papers published since. Public databases give access to data extracted from the literature by teams of experts, but their coverage is often limited and lags behind recent discoveries. We present a computational method that combines data extracted from the literature with data from curated sources in order to uncover possible gene-disease relationships that are not directly stated or were missed by the initial mining.

Method: An initial set of genes and proteins is obtained from gene-disease relationships extracted from PubMed abstracts using natural language processing. Interactions involving the corresponding proteins are similarly extracted and integrated with interactions from curated databases (such as BIND and DIP), and each interaction is assigned a confidence measure that depends on its source. The augmented list of genes and gene products is then ranked by combining two scores: one that reflects the strength of the relationship with the initial set of genes and incorporates user-defined weights, and another that reflects the importance of the gene in maintaining the connectivity of the network. We applied the method to atherosclerosis to assess its effectiveness.

Results: Top-ranked proteins from the method are related to atherosclerosis with accuracy between 0.85 and 1.00 for the top 20 and between 0.64 and 0.80 for the top 90 if duplicates are ignored, with 45% of the top 20 and 75% of the top 90 derived by the method rather than extracted from text. Thus, although the initial gene set and interactions were automatically extracted from text (and are subject to the imprecision of automatic extraction), their use for further hypothesis generation is valuable given adequate computational analysis.
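
The ranking step described under Method can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify these details: the confidence values in `SOURCE_CONFIDENCE`, the `alpha` weight, the weighted-sum combination, and the choice of "extra connected components created by removing a node" as the connectivity measure are all assumptions made for illustration, as are the function names.

```python
from collections import defaultdict

# Hypothetical confidence per interaction source (not values from the paper).
SOURCE_CONFIDENCE = {"curated": 1.0, "text_mined": 0.5}


def build_graph(interactions):
    """Build an undirected weighted graph from (protein_a, protein_b, source) triples."""
    graph = defaultdict(dict)
    for a, b, source in interactions:
        weight = SOURCE_CONFIDENCE[source]
        # If several sources report the same interaction, keep the highest confidence.
        graph[a][b] = max(graph[a].get(b, 0.0), weight)
        graph[b][a] = max(graph[b].get(a, 0.0), weight)
    return dict(graph)


def count_components(graph, skip=None):
    """Count connected components, optionally pretending node `skip` was removed."""
    seen = {skip} if skip is not None else set()
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


def seed_score(graph, node, seeds):
    """Assumed relationship score: total confidence of edges linking node to seed genes."""
    return sum(w for neighbor, w in graph[node].items() if neighbor in seeds)


def connectivity_score(graph, node):
    """Assumed connectivity score: extra components created when node is removed."""
    return max(0, count_components(graph, skip=node) - count_components(graph))


def rank_candidates(graph, seeds, alpha=0.5):
    """Rank all nodes by a weighted sum of the two scores (alpha is user-defined)."""
    scored = [
        (node, alpha * seed_score(graph, node, seeds)
         + (1 - alpha) * connectivity_score(graph, node))
        for node in graph
    ]
    # Highest score first; ties broken alphabetically so the order is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

As a usage example, calling `rank_candidates(build_graph(triples), seeds={"A", "C"})` on a small set of triples returns `(name, score)` pairs in descending order. A protein that interacts with several seed genes and also holds the network together ranks above one that does only one of those things.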

Original language	English (US)
Title of host publication	Pacific Symposium on Biocomputing 2007, PSB 2007
Publisher	World Scientific Publishing Co. Pte Ltd
Pages	28-39
Number of pages	12
ISBN (Print)	9812704175, 9789812704177
State	Published - 2007
Event	Pacific Symposium on Biocomputing, PSB 2007 - Maui, HI, United States Duration: Jan 3 2007 → Jan 7 2007

Publication series

Name	Pacific Symposium on Biocomputing 2007, PSB 2007

Other

Other	Pacific Symposium on Biocomputing, PSB 2007
Country/Territory	United States
City	Maui, HI
Period	1/3/07 → 1/7/07

ASJC Scopus subject areas

Computational Theory and Mathematics
Biomedical Engineering
General Medicine

Cite this

Mining gene-disease relationships from biomedical literature: Weighting protein-protein interactions and connectivity measures. / Gonzalez, Graciela; Uribe, Juan C.; Tari, Luis et al.
Pacific Symposium on Biocomputing 2007, PSB 2007. World Scientific Publishing Co. Pte Ltd, 2007. p. 28-39 (Pacific Symposium on Biocomputing 2007, PSB 2007).

Gonzalez, G, Uribe, JC, Tari, L, Brophy, C & Baral, C 2007, Mining gene-disease relationships from biomedical literature: Weighting protein-protein interactions and connectivity measures. in Pacific Symposium on Biocomputing 2007, PSB 2007. Pacific Symposium on Biocomputing 2007, PSB 2007, World Scientific Publishing Co. Pte Ltd, pp. 28-39, Pacific Symposium on Biocomputing, PSB 2007, Maui, HI, United States, 1/3/07.

@inproceedings{1b00d54a82fb42dab295189a732eaca1,

title = "Mining gene-disease relationships from biomedical literature: Weighting protein-protein interactions and connectivity measures",

author = "Graciela Gonzalez and Uribe, {Juan C.} and Luis Tari and Colleen Brophy and Chitta Baral",

year = "2007",

language = "English (US)",

isbn = "9812704175",

series = "Pacific Symposium on Biocomputing 2007, PSB 2007",

publisher = "World Scientific Publishing Co. Pte Ltd",

pages = "28--39",

booktitle = "Pacific Symposium on Biocomputing 2007, PSB 2007",

note = "Pacific Symposium on Biocomputing, PSB 2007 ; Conference date: 03-01-2007 Through 07-01-2007",

}

TY - GEN

T1 - Mining gene-disease relationships from biomedical literature

T2 - Pacific Symposium on Biocomputing, PSB 2007

AU - Gonzalez, Graciela

AU - Uribe, Juan C.

AU - Tari, Luis

AU - Brophy, Colleen

AU - Baral, Chitta

PY - 2007

Y1 - 2007

UR - http://www.scopus.com/inward/record.url?scp=38349002957&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=38349002957&partnerID=8YFLogxK

M3 - Conference contribution

C2 - 17992743

AN - SCOPUS:38349002957

SN - 9812704175

SN - 9789812704177

T3 - Pacific Symposium on Biocomputing 2007, PSB 2007

SP - 28

EP - 39

BT - Pacific Symposium on Biocomputing 2007, PSB 2007

PB - World Scientific Publishing Co. Pte Ltd

Y2 - 3 January 2007 through 7 January 2007

ER -
