TY - JOUR
T1 - Gaussian mixture model with feature selection
T2 - An embedded approach
AU - Fu, Yinlin
AU - Liu, Xiaonan
AU - Sarkar, Suryadipto
AU - Wu, Teresa
N1 - Publisher Copyright:
© 2020 Elsevier Ltd
PY - 2021/2
Y1 - 2021/2
AB - The Gaussian Mixture Model (GMM) is a popular clustering algorithm due to its neat statistical properties, which enable “soft” clustering and the determination of the number of clusters. Expectation-Maximization (EM) is usually applied to estimate the GMM parameters. While promising, the inclusion of features that do not contribute to clustering may confuse the model and increase computational cost. Recognizing this issue, in this paper we propose a new algorithm, termed Expectation Selection Maximization (ESM), which adds a feature selection step (S). Specifically, we introduce a relevancy index (RI), a metric indicating the probability of assigning a data point to a specific cluster. The RI reveals the contribution of each feature to the clustering process and thus can assist feature selection. We conduct theoretical analysis to justify the use of the RI for feature selection. Furthermore, to demonstrate the efficacy of the proposed ESM, two synthetic datasets, four benchmark datasets, and an Alzheimer's Disease dataset are studied.
KW - Expectation Maximization (EM)
KW - Feature selection
KW - Gaussian Mixture Model (GMM)
UR - http://www.scopus.com/inward/record.url?scp=85099266810&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099266810&partnerID=8YFLogxK
U2 - 10.1016/j.cie.2020.107000
DO - 10.1016/j.cie.2020.107000
M3 - Article
AN - SCOPUS:85099266810
SN - 0360-8352
VL - 152
JO - Computers & Industrial Engineering
JF - Computers & Industrial Engineering
M1 - 107000
ER -