Random forest similarity for protein-protein interaction prediction from multiple sources

Yanjun Qi; Judith Klein-Seetharaman; Ziv Bar-Joseph

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.
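The abstract does not spell out how similarity is read off the forest or how the kNN is modified. As a rough illustration only, the sketch below assumes the common proximity measure for random forests (the fraction of trees in which two samples land in the same leaf) and a plain similarity-weighted kNN vote. The function names, the toy leaf indices, and the labels (1 = interacting, 0 = not interacting) are all hypothetical. In practice the leaf indices would come from a forest trained on features of protein pairs.

```python
from collections import Counter


def forest_similarity(leaves_a, leaves_b):
    """Fraction of trees in which two samples fall into the same leaf.

    Each argument is a sequence with one leaf index per tree in the forest.
    """
    if len(leaves_a) != len(leaves_b):
        raise ValueError("leaf vectors must cover the same trees")
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def knn_classify(query_leaves, train_leaves, train_labels, k=3):
    """Similarity-weighted vote among the k most similar training samples.

    This is a generic weighted kNN, assumed for illustration; it is not
    claimed to be the modified kNN used in the paper.
    """
    scored = sorted(
        (
            (forest_similarity(query_leaves, leaves), label)
            for leaves, label in zip(train_leaves, train_labels)
        ),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical leaf indices: 4 training protein pairs, 4 trees each.
train_leaves = [
    [0, 2, 1, 3],
    [0, 2, 1, 0],
    [1, 0, 2, 2],
    [1, 0, 3, 2],
]
train_labels = [1, 1, 0, 0]  # 1 = interacting, 0 = not interacting

query = [0, 2, 1, 2]  # leaf indices for an unlabeled protein pair
prediction = knn_classify(query, train_leaves, train_labels, k=3)
print(prediction)
```

Because similarity is defined by co-occurrence in leaves and not by a distance over raw feature values, a measure of this kind can tolerate noisy or missing features, which is the motivation the abstract gives for learning the similarity.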

Original language	English (US)
Title of host publication	Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005
Pages	531-542
Number of pages	12
State	Published - 2005
Externally published	Yes
Event	10th Pacific Symposium on Biocomputing, PSB 2005 - Big Island of Hawaii, United States; Duration: Jan 4 2005 → Jan 8 2005

Publication series

Name	Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005

Other

Other	10th Pacific Symposium on Biocomputing, PSB 2005
Country/Territory	United States
City	Big Island of Hawaii
Period	1/4/05 → 1/8/05

ASJC Scopus subject areas

Computational Theory and Mathematics
Biomedical Engineering

Cite this

Random forest similarity for protein-protein interaction prediction from multiple sources. / Qi, Yanjun; Klein-Seetharaman, Judith; Bar-Joseph, Ziv.
Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005. 2005. p. 531-542 (Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005).

Qi, Y, Klein-Seetharaman, J & Bar-Joseph, Z 2005, Random forest similarity for protein-protein interaction prediction from multiple sources. in Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005. Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005, pp. 531-542, 10th Pacific Symposium on Biocomputing, PSB 2005, Big Island of Hawaii, United States, 1/4/05.

@inproceedings{5eb8660455564ecda3b8e9ede08498de,

title = "Random forest similarity for protein-protein interaction prediction from multiple sources",

abstract = "One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.",

author = "Yanjun Qi and Judith Klein-Seetharaman and Ziv Bar-Joseph",

year = "2005",

language = "English (US)",

isbn = "9812560467",

series = "Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005",

pages = "531--542",

booktitle = "Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005",

note = "10th Pacific Symposium on Biocomputing, PSB 2005 ; Conference date: 04-01-2005 Through 08-01-2005",

}

TY - GEN

T1 - Random forest similarity for protein-protein interaction prediction from multiple sources

AU - Qi, Yanjun

AU - Klein-Seetharaman, Judith

AU - Bar-Joseph, Ziv

PY - 2005

Y1 - 2005

N2 - One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.

AB - One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.

UR - http://www.scopus.com/inward/record.url?scp=15944418607&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=15944418607&partnerID=8YFLogxK

M3 - Conference contribution

C2 - 15759657

AN - SCOPUS:15944418607

SN - 9812560467

SN - 9789812560469

T3 - Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005

SP - 531

EP - 542

BT - Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005

T2 - 10th Pacific Symposium on Biocomputing, PSB 2005

Y2 - 4 January 2005 through 8 January 2005

ER -
