Comparative study of classification algorithms for immunosignaturing data

Muskan Kukreja; Stephen Johnston; Phillip Stafford

doi:10.1186/1471-2105-13-139

Comparative study of classification algorithms for immunosignaturing data

Muskan Kukreja, Stephen Johnston, Phillip Stafford

Research output: Contribution to journal › Article › peer-review

34 Scopus citations

Abstract

Background: High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data.Results: We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found 'Naïve Bayes' far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy.Conclusions: 'Naïve Bayes' algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties.

Original language	English (US)
Article number	139
Journal	BMC bioinformatics
Volume	13
Issue number	1
DOIs	https://doi.org/10.1186/1471-2105-13-139
State	Published - Jun 21 2012

Keywords

Classification algorithms
Data mining
Immunosignature
Naïve Bayes
Random peptide microarray

ASJC Scopus subject areas

Structural Biology
Biochemistry
Molecular Biology
Computer Science Applications
Applied Mathematics

Access to Document

10.1186/1471-2105-13-139

Cite this

@article{f658500a5ef1499b84fd1107682ae73d,

title = "Comparative study of classification algorithms for immunosignaturing data",

abstract = "Background: High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data.Results: We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found 'Na{\"i}ve Bayes' far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy.Conclusions: 'Na{\"i}ve Bayes' algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties.",

keywords = "Classification algorithms, Data mining, Immunosignature, Na{\"i}ve Bayes, Random peptide microarray",

author = "Muskan Kukreja and Stephen Johnston and Phillip Stafford",

note = "Funding Information: We are thankful to Dr. AbnerNotkins, NIH, National Institute of Dental and Craniofacial Research for providing type 1 diabetes sample, DrLucusRestrepo for providing Alzheimer{\textquoteright}s dataset. Dr Bart Legutki and Dr Rebecca Halperin for providing Antibodies dataset, University of Arizona, department of Pharmacy and Pharmacology (Serine Lau, Donata Vercelli, Marilyn Halonen) for providing Asthma samples, Peptide Microarray Core (DrZbigniewCichacz) for providing Asthma datasets, PradeepKanwar for implementing the time function in JAVA and ValentinDinu for invaluable discussion regarding algorithm selection. This work was supported by an Innovator Award from the DoD Breast Cancer Program to SAJ.",

year = "2012",

month = jun,

day = "21",

doi = "10.1186/1471-2105-13-139",

language = "English (US)",

volume = "13",

journal = "BMC bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Comparative study of classification algorithms for immunosignaturing data

AU - Kukreja, Muskan

AU - Johnston, Stephen

AU - Stafford, Phillip

N1 - Funding Information: We are thankful to Dr. AbnerNotkins, NIH, National Institute of Dental and Craniofacial Research for providing type 1 diabetes sample, DrLucusRestrepo for providing Alzheimer’s dataset. Dr Bart Legutki and Dr Rebecca Halperin for providing Antibodies dataset, University of Arizona, department of Pharmacy and Pharmacology (Serine Lau, Donata Vercelli, Marilyn Halonen) for providing Asthma samples, Peptide Microarray Core (DrZbigniewCichacz) for providing Asthma datasets, PradeepKanwar for implementing the time function in JAVA and ValentinDinu for invaluable discussion regarding algorithm selection. This work was supported by an Innovator Award from the DoD Breast Cancer Program to SAJ.

PY - 2012/6/21

Y1 - 2012/6/21

N2 - Background: High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data.Results: We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found 'Naïve Bayes' far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy.Conclusions: 'Naïve Bayes' algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties.

AB - Background: High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data.Results: We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found 'Naïve Bayes' far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy.Conclusions: 'Naïve Bayes' algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties.

KW - Classification algorithms

KW - Data mining

KW - Immunosignature

KW - Naïve Bayes

KW - Random peptide microarray

UR - http://www.scopus.com/inward/record.url?scp=84862491822&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862491822&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-13-139

DO - 10.1186/1471-2105-13-139

M3 - Article

C2 - 22720696

AN - SCOPUS:84862491822

SN - 1471-2105

VL - 13

JO - BMC bioinformatics

JF - BMC bioinformatics

IS - 1

M1 - 139

ER -

Comparative study of classification algorithms for immunosignaturing data

Abstract

Keywords

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this