Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios

Lawrence A. Adutwum, A. Paulina de la Mata, Heather Bean, Jane E. Hill, James J. Harynuk

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables used for the SBE, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Empirical equations for estimating the start and stop numbers for a dataset were developed, drawing inspiration from overlapping coefficients, a method for comparing two probability density functions. All of the parameters in the empirical equations, except a constant termed d, are obtained from the comparison of the two probability density functions. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. With start and stop numbers calculated using this approach, partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for the two datasets. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.
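The abstract does not reproduce the empirical equations, so the following is only a minimal Python sketch of the two ingredients it names: per-variable Fisher ratios compared against a null distribution, and an overlapping coefficient between two probability density functions. Several choices here are assumptions rather than the authors' method: the Fisher ratio is computed in its one-way ANOVA form, the null distribution is generated by permuting class labels, the densities are estimated with histograms, and all function names are hypothetical. The start/stop equations and the constant d are deliberately not implemented.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio (assumed here: one-way ANOVA F form,
    between-class variance divided by within-class variance)."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k))


def null_fisher_ratios(X, y, n_perm=50, seed=0):
    """Null Fisher ratios from label permutation (an assumed way of
    building the null distribution; the paper may do this differently)."""
    rng = np.random.default_rng(seed)
    return np.concatenate(
        [fisher_ratios(X, rng.permutation(y)) for _ in range(n_perm)]
    )


def overlapping_coefficient(a, b, bins=50):
    """Histogram estimate of OVL, the integral of min(f, g) for two
    densities f and g. 0 means disjoint, 1 means identical."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    return float(np.sum(np.minimum(f, g) * np.diff(edges)))


# Synthetic illustration only: 40 samples, 200 variables, two classes,
# with class information planted in the first 10 variables.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[y == 1, :10] += 2.0

observed = fisher_ratios(X, y)
null = null_fisher_ratios(X, y)
print("overlap between observed and null Fisher ratios:",
      overlapping_coefficient(observed, null))
```

An overlap close to 1 would suggest that few variables stand out from the null distribution, while a smaller overlap suggests more discriminating variables. How such a comparison maps onto start and stop numbers is given by the empirical equations in the full text.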

Original language: English (US)
Pages (from-to): 1-10
Number of pages: 10
Journal: Analytical and Bioanalytical Chemistry
DOI: 10.1007/s00216-017-0628-8
State: Accepted/In press - Sep 29 2017

Fingerprint

Feature extraction
Probability density function
Discriminant analysis
Least-Squares Analysis
Datasets

Keywords

  • Chemometrics
  • Classification
  • Cluster resolution
  • Feature selection
  • Fisher ratio
  • Overlapping coefficient

ASJC Scopus subject areas

  • Analytical Chemistry
  • Biochemistry

Cite this

Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios. / Adutwum, Lawrence A.; de la Mata, A. Paulina; Bean, Heather; Hill, Jane E.; Harynuk, James J.

In: Analytical and Bioanalytical Chemistry, 29.09.2017, p. 1-10.

@article{a438c4ad95fb4009a1e03a00af184d53,
title = "Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios",
abstract = "Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables used for the SBE, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Empirical equations for estimating the start and stop numbers for a dataset were developed, drawing inspiration from overlapping coefficients, a method for comparing two probability density functions. All of the parameters in the empirical equations, except a constant termed d, are obtained from the comparison of the two probability density functions. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. With start and stop numbers calculated using this approach, partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90{\%} and 96{\%} to 100{\%} for the two datasets. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.",
keywords = "Chemometrics, Classification, Cluster resolution, Feature selection, Fisher ratio, Overlapping coefficient",
author = "Adutwum, {Lawrence A.} and {de la Mata}, {A. Paulina} and Bean, Heather and Hill, {Jane E.} and Harynuk, {James J.}",
year = "2017",
month = "9",
day = "29",
doi = "10.1007/s00216-017-0628-8",
language = "English (US)",
pages = "1--10",
journal = "Analytical and Bioanalytical Chemistry",
issn = "1618-2642",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - Estimation of start and stop numbers for cluster resolution feature selection algorithm

T2 - an empirical approach using null distribution analysis of Fisher ratios

AU - Adutwum, Lawrence A.

AU - de la Mata, A. Paulina

AU - Bean, Heather

AU - Hill, Jane E.

AU - Harynuk, James J.

PY - 2017/9/29

Y1 - 2017/9/29

AB - Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables used for the SBE, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Empirical equations for estimating the start and stop numbers for a dataset were developed, drawing inspiration from overlapping coefficients, a method for comparing two probability density functions. All of the parameters in the empirical equations, except a constant termed d, are obtained from the comparison of the two probability density functions. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. With start and stop numbers calculated using this approach, partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for the two datasets. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.

KW - Chemometrics

KW - Classification

KW - Cluster resolution

KW - Feature selection

KW - Fisher ratio

KW - Overlapping coefficient

UR - http://www.scopus.com/inward/record.url?scp=85030174327&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85030174327&partnerID=8YFLogxK

U2 - 10.1007/s00216-017-0628-8

DO - 10.1007/s00216-017-0628-8

M3 - Article

AN - SCOPUS:85030174327

SP - 1

EP - 10

JO - Analytical and Bioanalytical Chemistry

JF - Analytical and Bioanalytical Chemistry

SN - 1618-2642

ER -