Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios

Lawrence A. Adutwum, A. Paulina de la Mata, Heather Bean, Jane E. Hill, James J. Harynuk

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained, and the appropriate start and stop numbers are known to vary from dataset to dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from the comparison of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets, and the optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for the two datasets when start and stop numbers calculated with this approach were used. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.
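
The building blocks named in the title and abstract (Fisher ratios, a null distribution, and an overlapping coefficient between two probability density functions) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Fisher ratio is taken as the common between-class over within-class variance ratio, the null distribution is generated by permuting class labels, and the overlapping coefficient is estimated from histograms. The function names are hypothetical, and the sketch does not reproduce the authors' empirical equations for the start and stop numbers or the constant d, which the abstract does not state.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio, assumed here to be the ratio of
    between-class variance to within-class variance (one-way ANOVA F)."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k))


def null_fisher_ratios(X, y, n_permutations=20, seed=0):
    """Fisher ratios after shuffling the class labels (an assumed way of
    building an empirical null distribution)."""
    rng = np.random.default_rng(seed)
    return np.concatenate(
        [fisher_ratios(X, rng.permutation(y)) for _ in range(n_permutations)]
    )


def overlapping_coefficient(a, b, bins=50):
    """Histogram estimate of the overlapping coefficient, i.e. the area
    under min(f, g) for the densities f and g of samples a and b."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    return float(np.sum(np.minimum(f, g) * np.diff(edges)))


if __name__ == "__main__":
    # Synthetic two-class data: 200 variables, the first 10 made informative.
    rng = np.random.default_rng(1)
    y = np.repeat([0, 1], 20)
    X = rng.normal(size=(40, 200))
    X[y == 1, :10] += 3.0

    observed = fisher_ratios(X, y)
    null = null_fisher_ratios(X, y)
    # Compare the two distributions on a log scale for better bin resolution.
    ovl = overlapping_coefficient(np.log10(observed), np.log10(null))
    print(f"overlap between observed and null Fisher ratios: {ovl:.3f}")
```

An overlap close to 1 would suggest that the observed Fisher ratios are hard to distinguish from those expected by chance, while a smaller overlap indicates that some variables carry class information. How such a comparison is turned into start and stop numbers is specific to the paper and is not attempted here.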

Original language: English (US)
Pages (from-to): 6699-6708
Number of pages: 10
Journal: Analytical and Bioanalytical Chemistry
Volume: 409
Issue number: 28
State: Published - Nov 1 2017

Keywords

  • Chemometrics
  • Classification
  • Cluster resolution
  • Feature selection
  • Fisher ratio
  • Overlapping coefficient

ASJC Scopus subject areas

  • Analytical Chemistry
  • Biochemistry
