Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios

Lawrence A. Adutwum, A. Paulina de la Mata, Heather Bean, Jane E. Hill, James J. Harynuk

    Research output: Contribution to journal, Article

    Abstract

    Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for both datasets using start and stop numbers calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.
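The abstract names two ingredients, Fisher ratios and the overlapping coefficient between two probability density functions, without giving the empirical equations themselves. As a rough illustration only, the sketch below computes per-variable Fisher ratios, builds a null distribution by permuting the class labels, and estimates the overlap of the two distributions from histograms. The function names, the permutation-based null, the histogram estimator, and the synthetic data are all assumptions made for this example; this is not the authors' implementation, and it does not reproduce their equations or the constant d.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio: between-class over within-class variance.

    X is (samples x variables), y holds one class label per sample.
    Assumes every variable has nonzero within-class variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k))


def overlapping_coefficient(a, b, bins=50):
    """Histogram estimate of OVL, the integral of min(f, g) over a shared grid.

    Returns a value near 1 for identical samples and 0 for disjoint ones.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    return float(np.sum(np.minimum(f, g) * np.diff(edges)))


def demo(seed=0):
    """Hypothetical two-class dataset: 10 informative variables out of 200."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], 20)
    X = rng.normal(size=(40, 200))
    X[y == 1, :10] += 2.0  # shift class 1 on the informative variables only
    observed = fisher_ratios(X, y)
    null = fisher_ratios(X, rng.permutation(y))  # labels shuffled: null case
    return overlapping_coefficient(observed, null)


if __name__ == "__main__":
    print("Estimated overlap between observed and null Fisher ratios:", demo())
```

A histogram estimate is sensitive to the number of bins, and Fisher ratios tend to be strongly right-skewed, so a kernel density estimate or a log transform may be a better choice in practice.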

    Original language: English (US)
    Pages (from-to): 1-10
    Number of pages: 10
    Journal: Analytical and Bioanalytical Chemistry
    DOI: 10.1007/s00216-017-0628-8
    State: Accepted/In press - Sep 29 2017

    Fingerprint

    Feature extraction
    Probability density function
    Discriminant analysis
    Least-Squares Analysis
    Datasets

    Keywords

    • Chemometrics
    • Classification
    • Cluster resolution
    • Feature selection
    • Fisher ratio
    • Overlapping coefficient

    ASJC Scopus subject areas

    • Analytical Chemistry
    • Biochemistry

    Cite this

    Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios. / Adutwum, Lawrence A.; de la Mata, A. Paulina; Bean, Heather; Hill, Jane E.; Harynuk, James J.

    In: Analytical and Bioanalytical Chemistry, 29.09.2017, p. 1-10.

    @article{a438c4ad95fb4009a1e03a00af184d53,
    title = "Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios",
    abstract = "Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90{\%} and 96{\%} to 100{\%} for both datasets using start and stop numbers calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.",
    keywords = "Chemometrics, Classification, Cluster resolution, Feature selection, Fisher ratio, Overlapping coefficient",
    author = "Adutwum, {Lawrence A.} and {de la Mata}, {A. Paulina} and Bean, Heather and Hill, {Jane E.} and Harynuk, {James J.}",
    year = "2017",
    month = "9",
    day = "29",
    doi = "10.1007/s00216-017-0628-8",
    language = "English (US)",
    pages = "1--10",
    journal = "Analytical and Bioanalytical Chemistry",
    issn = "1618-2642",
    publisher = "Springer Verlag",

    }

    TY - JOUR

    T1 - Estimation of start and stop numbers for cluster resolution feature selection algorithm

    T2 - an empirical approach using null distribution analysis of Fisher ratios

    AU - Adutwum, Lawrence A.

    AU - de la Mata, A. Paulina

    AU - Bean, Heather

    AU - Hill, Jane E.

    AU - Harynuk, James J.

    PY - 2017/9/29

    Y1 - 2017/9/29

    AB - Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) and sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained. The start and stop numbers are known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets. The optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for both datasets using start and stop numbers calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.

    KW - Chemometrics

    KW - Classification

    KW - Cluster resolution

    KW - Feature selection

    KW - Fisher ratio

    KW - Overlapping coefficient

    UR - http://www.scopus.com/inward/record.url?scp=85030174327&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85030174327&partnerID=8YFLogxK

    U2 - 10.1007/s00216-017-0628-8

    DO - 10.1007/s00216-017-0628-8

    M3 - Article

    SP - 1

    EP - 10

    JO - Analytical and Bioanalytical Chemistry

    JF - Analytical and Bioanalytical Chemistry

    SN - 1618-2642

    ER -