Techniques to cope with missing data in host-pathogen protein interaction prediction

Meghana Kshirsagar; Jaime Carbonell; Judith Klein-Seetharaman

doi:10.1093/bioinformatics/bts375

Research output: Contribution to journal › Article › peer-review

37 Scopus citations

Abstract

Motivation: Approaches that use supervised machine learning techniques for protein-protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host-pathogen PPI datasets have a large fraction of missing values, in the range of 58-85%, which makes it challenging to apply machine learning algorithms.

Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross-species information in combination with machine learning techniques such as group lasso with l₁/l₂ regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example, Salmonella-human PPI prediction, we obtain high prediction accuracy, with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 points in F1 score over the next best technique. We also apply our method successfully to Yersinia-human PPI prediction, demonstrating the generality of our approach.
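As a reading aid for the figures quoted in the abstract, the following minimal sketch shows how an F1 score relates to precision and recall, using the standard harmonic-mean definition. The function name `f1_score` is a hypothetical helper introduced here for illustration, and the value computed is only the harmonic mean of the two quoted numbers, not necessarily the exact F1 figure reported in the article.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # Conventionally F1 is taken to be 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract for Salmonella-human PPI prediction.
precision = 0.776
recall = 0.84

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.807"
```

On this definition, the "improvement of 9 in F1 score" is most naturally read as a difference on the percentage scale.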

Original language	English (US)
Article number	bts375
Pages (from-to)	i466-i472
Journal	Bioinformatics
Volume	28
Issue number	18
DOIs	https://doi.org/10.1093/bioinformatics/bts375
State	Published - Sep 2012
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Access to Document

10.1093/bioinformatics/bts375

Cite this

@article{de3a473632af43918a4b3180735f26b3,
  title = "Techniques to cope with missing data in host-pathogen protein interaction prediction",
  abstract = "Motivation: Approaches that use supervised machine learning techniques for protein-protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host-pathogen PPI datasets have a large fraction, in the range of 58-85% of missing values, which makes it challenging to apply machine learning algorithms. Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross species information in combination with machine learning techniques like Group lasso with l1/l2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella-human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. We also apply our method to Yersinia-human PPI prediction successfully, demonstrating the generality of our approach.",
  author = "Meghana Kshirsagar and Jaime Carbonell and Judith Klein-Seetharaman",
  note = "Funding Information: Funding: In part, Richard King Mellon Foundation, EraSysBio+ from the European Union and BMBF to SHIPREC, NIH (P50GM082251 and 2RO1LM007994-05), and NSF (CCF-1144281).",
  year = "2012",
  month = sep,
  doi = "10.1093/bioinformatics/bts375",
  language = "English (US)",
  volume = "28",
  pages = "i466--i472",
  journal = "Bioinformatics",
  issn = "1367-4803",
  publisher = "Oxford University Press",
  number = "18",
}

TY - JOUR
T1 - Techniques to cope with missing data in host-pathogen protein interaction prediction
AU - Kshirsagar, Meghana
AU - Carbonell, Jaime
AU - Klein-Seetharaman, Judith
N1 - Funding Information: Funding: In part, Richard King Mellon Foundation, EraSysBio+ from the European Union and BMBF to SHIPREC, NIH (P50GM082251 and 2RO1LM007994-05), and NSF (CCF-1144281).
PY - 2012/9
Y1 - 2012/9

N2 - Motivation: Approaches that use supervised machine learning techniques for protein-protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host-pathogen PPI datasets have a large fraction, in the range of 58-85% of missing values, which makes it challenging to apply machine learning algorithms. Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross species information in combination with machine learning techniques like Group lasso with l1/l2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella-human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. We also apply our method to Yersinia-human PPI prediction successfully, demonstrating the generality of our approach.

AB - Motivation: Approaches that use supervised machine learning techniques for protein-protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host-pathogen PPI datasets have a large fraction, in the range of 58-85% of missing values, which makes it challenging to apply machine learning algorithms. Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross species information in combination with machine learning techniques like Group lasso with l1/l2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella-human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. We also apply our method to Yersinia-human PPI prediction successfully, demonstrating the generality of our approach.

UR - http://www.scopus.com/inward/record.url?scp=84866452395&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866452395&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bts375
DO - 10.1093/bioinformatics/bts375
M3 - Article
C2 - 22962468
AN - SCOPUS:84866452395
SN - 1367-4803
VL - 28
SP - i466
EP - i472
JO - Bioinformatics
JF - Bioinformatics
IS - 18
M1 - bts375
ER -
