TY - GEN
T1 - Random forest similarity for protein-protein interaction prediction from multiple sources
AU - Qi, Yanjun
AU - Klein-Seetharaman, Judith
AU - Bar-Joseph, Ziv
PY - 2005
Y1 - 2005
N2 - One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.
AB - One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.
UR - http://www.scopus.com/inward/record.url?scp=15944418607&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=15944418607&partnerID=8YFLogxK
M3 - Conference contribution
C2 - 15759657
AN - SCOPUS:15944418607
SN - 9812560467
SN - 9789812560469
T3 - Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005
SP - 531
EP - 542
BT - Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005
T2 - 10th Pacific Symposium on Biocomputing, PSB 2005
Y2 - 4 January 2005 through 8 January 2005
ER -