Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios

Lawrence A. Adutwum; A. Paulina de la Mata; Heather Bean; Jane E. Hill; James J. Harynuk

doi:10.1007/s00216-017-0628-8

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results. The start and stop numbers are also known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets, and the optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS on two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for both datasets when the start and stop numbers were calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.
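
The two ingredients named in the abstract, Fisher ratios and an overlapping coefficient between two probability density functions, can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not give the empirical equations or say how the constant d enters them, so neither is reproduced here. The sketch assumes that the null distribution is obtained by permuting class labels and that the densities are estimated with histograms on a log scale; the function names `fisher_ratios` and `overlapping_coefficient` and the synthetic data are hypothetical.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio: between-class over within-class variance."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k))


def overlapping_coefficient(a, b, bins=50):
    """Histogram estimate of the area shared by the densities of a and b."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    return float(np.sum(np.minimum(f, g) * np.diff(edges)))


# Synthetic two-class data (assumption): 200 variables, the first 20 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :20] += 2.0

observed = fisher_ratios(X, y)
# Assumed null distribution: Fisher ratios after shuffling the class labels.
null = fisher_ratios(X, rng.permutation(y))

# Compare the two distributions on a log scale, where Fisher ratios spread out.
ovl = overlapping_coefficient(np.log10(observed), np.log10(null))
print(f"overlapping coefficient: {ovl:.3f}")
```

An overlapping coefficient near 1 would suggest that the observed Fisher ratios are hard to distinguish from the null, while a smaller value would suggest that some variables carry class information. How such a comparison maps onto start and stop numbers is defined by the paper's empirical equations, which are available only in the full text.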

Original language	English (US)
Pages (from-to)	6699-6708
Number of pages	10
Journal	Analytical and Bioanalytical Chemistry
Volume	409
Issue number	28
DOIs	https://doi.org/10.1007/s00216-017-0628-8
State	Published - Nov 1 2017

Keywords

Chemometrics
Classification
Cluster resolution
Feature selection
Fisher ratio
Overlapping coefficient

ASJC Scopus subject areas

Analytical Chemistry
Biochemistry

Cite this

Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios. / Adutwum, Lawrence A.; de la Mata, A. Paulina; Bean, Heather et al.
In: Analytical and Bioanalytical Chemistry, Vol. 409, No. 28, 01.11.2017, p. 6699-6708.

@article{a438c4ad95fb4009a1e03a00af184d53,

title = "Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios",

abstract = "Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results. The start and stop numbers are also known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets, and the optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS on two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90 and 96 to 100% for both datasets when the start and stop numbers were calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.",

keywords = "Chemometrics, Classification, Cluster resolution, Feature selection, Fisher ratio, Overlapping coefficient",

author = "Adutwum, {Lawrence A.} and {de la Mata}, {A. Paulina} and Bean, Heather and Hill, {Jane E.} and Harynuk, {James J.}",

note = "Funding Information: The authors wish to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), Genome Canada and Genome Alberta, as well as a Cystic Fibrosis Foundation Postdoctoral Fellowship (Bean12F0) and the CF Isolate Core at Seattle Children{\textquoteright}s Research Institute (NIH P30 DK089507) for funding this research. They also wish to thank Dr. Aiko Barsch (Bruker Daltonics) for the coffee data used in this study. Publisher Copyright: {\textcopyright} 2017, Springer-Verlag GmbH Germany.",

year = "2017",

month = nov,

day = "1",

doi = "10.1007/s00216-017-0628-8",

language = "English (US)",

volume = "409",

pages = "6699--6708",

journal = "Analytical and Bioanalytical Chemistry",

issn = "1618-2642",

publisher = "Springer Verlag",

number = "28",

}

TY - JOUR

T1 - Estimation of start and stop numbers for cluster resolution feature selection algorithm

T2 - an empirical approach using null distribution analysis of Fisher ratios

AU - Adutwum, Lawrence A.

AU - de la Mata, A. Paulina

AU - Bean, Heather

AU - Hill, Jane E.

AU - Harynuk, James J.

N1 - Funding Information: The authors wish to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), Genome Canada and Genome Alberta, as well as a Cystic Fibrosis Foundation Postdoctoral Fellowship (Bean12F0) and the CF Isolate Core at Seattle Children’s Research Institute (NIH P30 DK089507) for funding this research. They also wish to thank Dr. Aiko Barsch (Bruker Daltonics) for the coffee data used in this study. Publisher Copyright: © 2017, Springer-Verlag GmbH Germany.

PY - 2017/11/1

Y1 - 2017/11/1

AB - Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results. The start and stop numbers are also known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets, and the optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS on two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for both datasets when the start and stop numbers were calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.

KW - Chemometrics

KW - Classification

KW - Cluster resolution

KW - Feature selection

KW - Fisher ratio

KW - Overlapping coefficient

UR - http://www.scopus.com/inward/record.url?scp=85030174327&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85030174327&partnerID=8YFLogxK

U2 - 10.1007/s00216-017-0628-8

DO - 10.1007/s00216-017-0628-8

M3 - Article

AN - SCOPUS:85030174327

SN - 1618-2642

VL - 409

SP - 6699

EP - 6708

JO - Analytical and Bioanalytical Chemistry

JF - Analytical and Bioanalytical Chemistry

IS - 28

ER -
