Querying discriminative and representative samples for batch mode active learning

Zheng Wang, Jieping Ye

Research output: Contribution to journal › Article

23 Citations (Scopus)

Abstract

Empirical risk minimization (ERM) provides a principled guideline for many machine learning and data mining algorithms. Under the ERM principle, one minimizes an upper bound of the true risk, which is approximated by the summation of empirical risk and the complexity of the candidate classifier class. To guarantee a satisfactory learning performance, ERM requires that the training data are i.i.d. sampled from the unknown source distribution. However, this may not be the case in active learning, where one selects the most informative samples to label, and these data may not follow the source distribution. In this article, we generalize the ERM principle to the active learning setting. We derive a novel form of upper bound for the true risk in the active learning setting; by minimizing this upper bound, we develop a practical batch mode active learning method. The proposed formulation involves a nonconvex integer programming optimization problem. We solve it efficiently by an alternating optimization method. Our method is shown to query the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. We further extend our method to multiclass active learning by introducing novel pseudolabels in the multiclass case and developing an efficient algorithm. Experiments on benchmark datasets and real-world applications demonstrate the superior performance of our proposed method compared to state-of-the-art methods.
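
To make the abstract's recipe concrete, the sketch below shows one plausible reading: score each unlabeled point by classifier uncertainty (the discriminative term) and penalize batches that pull the labeled-plus-queried set away from the pool distribution, measured by an empirical maximum mean discrepancy (the representative term). This is a hypothetical illustration, not the authors' algorithm: the paper solves a nonconvex integer program by alternating optimization, whereas this sketch substitutes a simple greedy search; the names query_batch and lam, and the logistic-regression base learner, are assumptions made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

def mmd_sq(K, idx_a, idx_b):
    """Empirical squared maximum mean discrepancy between two index
    sets, estimated from a precomputed kernel matrix K."""
    Kaa = K[np.ix_(idx_a, idx_a)].mean()
    Kbb = K[np.ix_(idx_b, idx_b)].mean()
    Kab = K[np.ix_(idx_a, idx_b)].mean()
    return Kaa - 2.0 * Kab + Kbb

def query_batch(X_lab, y_lab, X_pool, batch_size=5, lam=1.0):
    """Greedy sketch of a discriminative + representative batch query.
    Discriminative: prefer pool points with a small margin between the
    top two class probabilities under the current classifier.
    Representative: prefer batches that keep labeled + queried data
    close to the pool distribution under the MMD penalty.
    NOTE: a greedy stand-in for the paper's alternating optimization
    over binary selection variables; lam is a hypothetical trade-off
    parameter, not the paper's notation."""
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = np.sort(clf.predict_proba(X_pool), axis=1)
    uncertainty = 1.0 - (proba[:, -1] - proba[:, -2])  # small margin = uncertain

    Z = np.vstack([X_lab, X_pool])    # one kernel matrix over all points
    K = rbf_kernel(Z)
    n_lab = len(X_lab)
    lab_idx = list(range(n_lab))
    pool_idx = list(range(n_lab, len(Z)))

    selected, remaining = [], set(range(len(X_pool)))
    for _ in range(batch_size):
        def score(i):
            # Labeled set plus the batch so far plus candidate i.
            train = lab_idx + [n_lab + j for j in selected + [i]]
            return lam * uncertainty[i] - mmd_sq(K, train, pool_idx)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    batch = query_batch(X[:20], y[:20], X[20:], batch_size=5)
    print("queried pool indices:", batch)

Driving the MMD term down keeps the queried batch representative of the source distribution, which is the i.i.d.-preserving behavior the abstract describes; the uncertainty term supplies the informativeness. The actual method optimizes both jointly in a relaxed integer program and extends to the multiclass case via pseudolabels.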

Original language: English (US)
Pages (from-to): 17
Number of pages: 1
Journal: ACM Transactions on Knowledge Discovery from Data
Volume: 9
Issue number: 3
DOIs: 10.1145/2700408
State: Published - Feb 1 2015

Keywords

  • Active learning
  • Empirical risk minimization
  • Maximum mean discrepancy
  • Representative and discriminative

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Querying discriminative and representative samples for batch mode active learning. / Wang, Zheng; Ye, Jieping.

In: ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, 01.02.2015, p. 17.

Research output: Contribution to journal › Article

@article{0b7f5e14bb4b44038baf76642e866ac5,
title = "Querying discriminative and representative samples for batch mode active learning",
abstract = "Empirical risk minimization (ERM) provides a principled guideline for many machine learning and data mining algorithms. Under the ERM principle, one minimizes an upper bound of the true risk, which is approximated by the summation of empirical risk and the complexity of the candidate classifier class. To guarantee a satisfactory learning performance, ERM requires that the training data are i.i.d. sampled from the unknown source distribution. However, this may not be the case in active learning, where one selects the most informative samples to label, and these data may not follow the source distribution. In this article, we generalize the ERM principle to the active learning setting. We derive a novel form of upper bound for the true risk in the active learning setting; by minimizing this upper bound,we develop a practical batch mode active learning method. The proposed formulation involves a nonconvex integer programming optimization problem. We solve it efficiently by an alternating optimization method. Our method is shown to query the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. We further extend our method to multiclass active learning by introducing novel pseudolabels in the multiclass case and developing an efficient algorithm. Experiments on benchmark datasets and real-world applications demonstrate the superior performance of our proposed method compared to state-of-the-art methods.",
keywords = "Active learning, Empirical risk minimization, Maximum mean discrepancy, Representative and discriminative",
author = "Zheng Wang and Jieping Ye",
year = "2015",
month = "2",
day = "1",
doi = "10.1145/2700408",
language = "English (US)",
volume = "9",
pages = "17",
journal = "ACM Transactions on Knowledge Discovery from Data",
issn = "1556-4681",
publisher = "Association for Computing Machinery (ACM)",
number = "3",

}

TY - JOUR

T1 - Querying discriminative and representative samples for batch mode active learning

AU - Wang, Zheng

AU - Ye, Jieping

PY - 2015/2/1

Y1 - 2015/2/1

N2 - Empirical risk minimization (ERM) provides a principled guideline for many machine learning and data mining algorithms. Under the ERM principle, one minimizes an upper bound of the true risk, which is approximated by the summation of empirical risk and the complexity of the candidate classifier class. To guarantee a satisfactory learning performance, ERM requires that the training data are i.i.d. sampled from the unknown source distribution. However, this may not be the case in active learning, where one selects the most informative samples to label, and these data may not follow the source distribution. In this article, we generalize the ERM principle to the active learning setting. We derive a novel form of upper bound for the true risk in the active learning setting; by minimizing this upper bound, we develop a practical batch mode active learning method. The proposed formulation involves a nonconvex integer programming optimization problem. We solve it efficiently by an alternating optimization method. Our method is shown to query the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. We further extend our method to multiclass active learning by introducing novel pseudolabels in the multiclass case and developing an efficient algorithm. Experiments on benchmark datasets and real-world applications demonstrate the superior performance of our proposed method compared to state-of-the-art methods.

AB - Empirical risk minimization (ERM) provides a principled guideline for many machine learning and data mining algorithms. Under the ERM principle, one minimizes an upper bound of the true risk, which is approximated by the summation of empirical risk and the complexity of the candidate classifier class. To guarantee a satisfactory learning performance, ERM requires that the training data are i.i.d. sampled from the unknown source distribution. However, this may not be the case in active learning, where one selects the most informative samples to label, and these data may not follow the source distribution. In this article, we generalize the ERM principle to the active learning setting. We derive a novel form of upper bound for the true risk in the active learning setting; by minimizing this upper bound, we develop a practical batch mode active learning method. The proposed formulation involves a nonconvex integer programming optimization problem. We solve it efficiently by an alternating optimization method. Our method is shown to query the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. We further extend our method to multiclass active learning by introducing novel pseudolabels in the multiclass case and developing an efficient algorithm. Experiments on benchmark datasets and real-world applications demonstrate the superior performance of our proposed method compared to state-of-the-art methods.

KW - Active learning

KW - Empirical risk minimization

KW - Maximum mean discrepancy

KW - Representative and discriminative

UR - http://www.scopus.com/inward/record.url?scp=84923696228&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84923696228&partnerID=8YFLogxK

U2 - 10.1145/2700408

DO - 10.1145/2700408

M3 - Article

AN - SCOPUS:84923696228

VL - 9

SP - 17

JO - ACM Transactions on Knowledge Discovery from Data

JF - ACM Transactions on Knowledge Discovery from Data

SN - 1556-4681

IS - 3

ER -