Abstract

In the conventional supervised learning paradigm, each data instance is associated with a single class label. Multilabel learning differs in that data instances may belong to multiple concepts simultaneously, a setting that arises naturally in a variety of high-impact domains, ranging from bioinformatics and information retrieval to multimedia analysis. It aims to leverage the multiple labels of data instances to build a predictive model that can classify unlabeled instances into one or more predefined target classes. In multilabel learning, even though each instance is associated with a rich set of class labels, the label information can be noisy and incomplete because the labeling process is both time-consuming and labor-intensive, leading to missing or even erroneous annotations. Such noisy and missing labels can negatively affect the performance of the underlying learning algorithms. Moreover, multilabeled data often contain noisy, irrelevant, and redundant features of high dimensionality, and these uninformative features may further degrade the predictive power of the learning model due to the curse of dimensionality. Feature selection, an effective dimensionality reduction technique, has proven powerful in preparing high-dimensional data for numerous data mining and machine learning tasks. However, the vast majority of existing multilabel feature selection algorithms either reduce the problem to multiple single-label feature selection problems or directly use the imperfect labels to guide the selection of representative features. As a result, they may fail to obtain discriminative features shared across multiple labels. In this article, to bridge the gap between the rich multilabel information and its imperfections in practical use, we propose a novel noise-resilient multilabel informed feature selection framework (MIFS) that exploits the correlations among different labels. In particular, to reduce the negative effects of imperfect label information when capturing label correlations, we decompose the multilabel information of data instances into a low-dimensional space and then employ the reduced label representation to guide the feature selection phase via a joint sparse regression framework. Empirical studies on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of the proposed MIFS framework.
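
To make the idea in the abstract concrete, the sketch below illustrates the general recipe it describes: factor the (possibly noisy) label matrix Y into a low-dimensional latent representation V, fit a joint regression X W ≈ V with a row-sparsity (ℓ2,1-style) penalty on W, and rank features by the row norms of W. This is not the authors' MIFS algorithm or its optimization scheme; the function name, the smoothed penalty, the plain gradient updates, and all parameter choices are illustrative assumptions made for this sketch.

# Hypothetical sketch of the high-level idea in the abstract (not the exact
# MIFS objective or solver): factorize the noisy label matrix Y into a
# low-dimensional latent representation V, fit a joint sparse regression
# X W ~= V, and rank features by the row norms of W. All symbols, weights,
# step sizes, and stopping rules are illustrative assumptions.
import numpy as np

def latent_label_feature_selection(X, Y, k=5, alpha=1.0, gamma=0.1,
                                   n_iter=200, lr=1e-3, seed=0):
    """Rank features using a reduced label representation.

    X : (n, d) feature matrix
    Y : (n, c) binary (possibly noisy) label matrix
    k : dimensionality of the latent label space (k < c)
    alpha : weight of the label-reconstruction term
    gamma : weight of the (smoothed) l2,1-style penalty on W
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    c = Y.shape[1]
    W = rng.normal(scale=0.01, size=(d, k))   # feature weights
    V = rng.normal(scale=0.01, size=(n, k))   # latent label representation
    B = rng.normal(scale=0.01, size=(k, c))   # latent-to-label mapping

    for _ in range(n_iter):
        # Gradient steps on: ||XW - V||_F^2 + alpha * ||Y - VB||_F^2
        #                    + gamma * sum_i ||W_i||_2  (smoothed)
        R1 = X @ W - V                  # regression residual
        R2 = Y - V @ B                  # label-reconstruction residual
        row_norms = np.sqrt((W ** 2).sum(axis=1, keepdims=True) + 1e-8)
        grad_W = 2 * X.T @ R1 + gamma * W / row_norms
        grad_V = -2 * R1 - 2 * alpha * R2 @ B.T
        grad_B = -2 * alpha * V.T @ R2
        W -= lr * grad_W
        V -= lr * grad_V
        B -= lr * grad_B

    scores = np.linalg.norm(W, axis=1)  # larger row norm => more informative feature
    return np.argsort(-scores), scores

# Toy usage: 100 instances, 20 features (only the first 4 informative), 6 labels.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 20))
    true_W = np.zeros((20, 6))
    true_W[:4] = rng.normal(size=(4, 6))
    Y = (X @ true_W + 0.5 * rng.normal(size=(100, 6)) > 0).astype(float)
    ranking, scores = latent_label_feature_selection(X, Y, k=3)
    print("Top-5 features:", ranking[:5])

In practice the latent dimensionality k and the trade-off weights would need tuning, and the paper's own formulation and solver should be consulted for the actual MIFS method.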

Original language: English (US)
Article number: 52
Journal: ACM Transactions on Intelligent Systems and Technology
Volume: 9
Issue number: 5
DOI: 10.1145/3158675
State: Published - Apr 1 2018

Keywords

  • Feature selection
  • Label correlations
  • Multilabel learning
  • Noise resilient

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Artificial Intelligence

Cite this

Exploiting multilabel information for noise-resilient feature selection. / Jian, Ling; Li, Jundong; Liu, Huan.

In: ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 5, 52, 01.04.2018.

Research output: Contribution to journal › Article

@article{a79d81ed60784ef48071c91f4b181add,
title = "Exploiting multilabel information for noise-resilient feature selection",
keywords = "Feature selection, Label correlations, Multilabel learning, Noise resilient",
author = "Ling Jian and Jundong Li and Huan Liu",
year = "2018",
month = "4",
day = "1",
doi = "10.1145/3158675",
language = "English (US)",
volume = "9",
journal = "ACM Transactions on Intelligent Systems and Technology",
issn = "2157-6904",
publisher = "Association for Computing Machinery (ACM)",
number = "5",

}
