Exploiting multilabel information for noise-resilient feature selection

Ling Jian; Jundong Li; Huan Liu

doi:10.1145/3158675

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

In a conventional supervised learning paradigm, each data instance is associated with a single class label. Multilabel learning differs in that data instances may belong to multiple concepts simultaneously, a setting that arises naturally in a variety of high-impact domains, ranging from bioinformatics and information retrieval to multimedia analysis. It aims to leverage the multiple-label information of data instances to build a predictive model that can classify unlabeled instances into one or more predefined target classes. In multilabel learning, even though each instance is associated with a rich set of class labels, the label information can be noisy and incomplete because the labeling process is both time-consuming and labor-intensive, leading to missing or even erroneous annotations. Noisy and missing labels can degrade the performance of the underlying learning algorithms. In addition, multilabeled data often has noisy, irrelevant, and redundant features of high dimensionality. These uninformative features may also weaken the predictive power of the learning model because of the curse of dimensionality. Feature selection, as an effective dimensionality reduction technique, has been shown to be powerful in preparing high-dimensional data for numerous data mining and machine-learning tasks. However, the vast majority of existing multilabel feature selection algorithms either boil down to solving multiple single-label feature selection problems or directly use the imperfect labels to guide the selection of representative features. As a result, they may not be able to obtain discriminative features shared across multiple labels.
In this article, to bridge the gap between the rich source of multilabel information and its imperfections in practical use, we propose a novel noise-resilient multilabel informed feature selection framework (MIFS) that exploits the correlations among different labels. In particular, to reduce the negative effects of imperfect label information when obtaining label correlations, we decompose the multilabel information of data instances into a low-dimensional space and then employ the reduced label representation to guide the feature selection phase via a joint sparse regression framework. Empirical studies on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of the proposed MIFS framework.
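
The two steps named in the abstract (reducing the label matrix to a low-dimensional representation, then using that representation as the target of a sparse regression) can be illustrated with a minimal Python sketch. It is a simplified stand-in, not the MIFS algorithm itself: the abstract does not state the objective function, so the truncated SVD, the l2,1-regularized least-squares objective, the iteratively reweighted solver, the toy data, and all names and parameters (`reduce_labels`, `l21_regression`, `rank_features`, `make_toy_data`, `k`, `gamma`) are assumptions chosen for illustration. The abstract describes a joint sparse regression framework, whereas this sketch simply runs the two steps one after the other.

```python
import numpy as np


def reduce_labels(Y, k):
    """Map an n x c label matrix to an n x k representation (assumed: truncated SVD)."""
    Yc = Y - Y.mean(axis=0)                      # centre each label column
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :k] * s[:k]


def l21_regression(X, V, gamma=1.0, n_iter=50, eps=1e-8):
    """Approximately minimise ||XW - V||_F^2 + gamma * ||W||_{2,1}.

    Uses iteratively reweighted least squares: each pass solves
    (X^T X + gamma * D) W = X^T V with D_ii = 1 / (2 * ||w_i||_2).
    """
    d = X.shape[1]
    XtX, XtV = X.T @ X, X.T @ V
    D = np.eye(d)
    for _ in range(n_iter):
        W = np.linalg.solve(XtX + gamma * D, XtV)
        row_norms = np.linalg.norm(W, axis=1)
        D = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
    return W


def rank_features(W):
    """Order features by the l2 norm of their row in W, largest first."""
    return np.argsort(-np.linalg.norm(W, axis=1))


def make_toy_data(n=500, d=30, c=6, informative=5, flip=0.05, seed=0):
    """Hypothetical synthetic data: only the first `informative` features drive the labels."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    latent = X[:, :informative] @ rng.standard_normal((informative, 2))
    Y = (latent @ rng.standard_normal((2, c)) > 0).astype(float)
    noise = rng.random(Y.shape) < flip           # flip a small fraction of labels
    Y[noise] = 1.0 - Y[noise]
    return X, Y


if __name__ == "__main__":
    X, Y = make_toy_data()
    V = reduce_labels(Y, k=2)                    # low-dimensional label representation
    W = l21_regression(X, V, gamma=10.0)         # row-sparse regression onto V
    print("top-ranked features:", rank_features(W)[:5])
```

Rows of `W` that shrink toward zero mark features the regression treats as uninformative. On the toy data the first five columns should receive the larger scores on average, though the exact ranking depends on the random seed and on `gamma`.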

Original language	English (US)
Article number	52
Journal	ACM Transactions on Intelligent Systems and Technology
Volume	9
Issue number	5
DOIs	https://doi.org/10.1145/3158675
State	Published - Apr 2018

Keywords

Feature selection
Label correlations
Multilabel learning
Noise resilient

ASJC Scopus subject areas

Theoretical Computer Science
Artificial Intelligence

Access to Document

10.1145/3158675

Cite this

@article{a79d81ed60784ef48071c91f4b181add,

title = "Exploiting multilabel information for noise-resilient feature selection",

keywords = "Feature selection, Label correlations, Multilabel learning, Noise resilient",

author = "Ling Jian and Jundong Li and Huan Liu",

note = "Funding Information: L. Jian is supported by the National Natural Science Foundation of China under Grant No. 61403419 and 61503412, and Fundamental Research Funds for the Central Universities under Grant No. 16CX02048A and 17CX05015B. J. Li and H. Liu are supported by the National Science Foundation under Grant No. 1217466 and 1614576. Authors{\textquoteright} addresses: L. Jian, College of Science, China University of Petroleum, Qingdao, 266580, China; email: bebetter@upc.edu.cn; J. Li and H. Liu, Computer Science and Engineering, Arizona State University, Tempe, AZ, USA; emails: {jundongl, huanliu}@asu.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. {\textcopyright} 2018 ACM 2157-6904/2018/06-ART52 \$15.00 https://doi.org/10.1145/3158675 Publisher Copyright: {\textcopyright} 2018 ACM.",

year = "2018",

month = apr,

doi = "10.1145/3158675",

language = "English (US)",

volume = "9",

journal = "ACM Transactions on Intelligent Systems and Technology",

issn = "2157-6904",

publisher = "Association for Computing Machinery (ACM)",

number = "5",

}

TY - JOUR

T1 - Exploiting multilabel information for noise-resilient feature selection

AU - Jian, Ling

AU - Li, Jundong

AU - Liu, Huan

N1 - Funding Information: L. Jian is supported by the National Natural Science Foundation of China under Grant No. 61403419 and 61503412, and Fundamental Research Funds for the Central Universities under Grant No. 16CX02048A and 17CX05015B. J. Li and H. Liu are supported by the National Science Foundation under Grant No. 1217466 and 1614576. Authors’ addresses: L. Jian, College of Science, China University of Petroleum, Qingdao, 266580, China; email: bebetter@upc.edu.cn; J. Li and H. Liu, Computer Science and Engineering, Arizona State University, Tempe, AZ, USA; emails: {jundongl, huanliu}@asu.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2018 ACM 2157-6904/2018/06-ART52 $15.00 https://doi.org/10.1145/3158675 Publisher Copyright: © 2018 ACM.

PY - 2018/4

Y1 - 2018/4

KW - Feature selection

KW - Label correlations

KW - Multilabel learning

KW - Noise resilient

UR - http://www.scopus.com/inward/record.url?scp=85054026398&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054026398&partnerID=8YFLogxK

U2 - 10.1145/3158675

DO - 10.1145/3158675

M3 - Article

AN - SCOPUS:85054026398

SN - 2157-6904

VL - 9

JO - ACM Transactions on Intelligent Systems and Technology

JF - ACM Transactions on Intelligent Systems and Technology

IS - 5

M1 - 52

ER -
