Abstract

Nowadays, massive graph streams are produced from various real-world applications, such as financial fraud detection, sensor networks, wireless networks. In contrast to the high volume of data, it is usually the case that only a small percentage of nodes within the time-evolving graphs might be of interest to people. Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing techniques for RCD are designed for static data sets, thus not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are constructed upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models enabling a “time-flexible” RCD. Moreover, to deal with the cases where the exact priors of the minority classes are not available, we further propose a modified version named BIRD-LI based on BIRD. Besides, we also identify a critical task in RCD named query distribution, which targets to allocate the limited budget among multiple time steps, such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.

Original languageEnglish (US)
Pages (from-to)1-24
Number of pages24
JournalData Mining and Knowledge Discovery
DOIs
StateAccepted/In press - Sep 28 2016

Fingerprint

Labeling
Sensor networks
Data mining
Wireless networks
Costs

Keywords

  • Incremental learning
  • Rare category detection
  • Time-evolving graph

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications

Cite this

Discovering rare categories from graph streams. / Zhou, Dawei; Karthikeyan, Arun; Wang, Kangyang; Cao, Nan; He, Jingrui.

In: Data Mining and Knowledge Discovery, 28.09.2016, p. 1-24.

Research output: Contribution to journalArticle

Zhou, Dawei ; Karthikeyan, Arun ; Wang, Kangyang ; Cao, Nan ; He, Jingrui. / Discovering rare categories from graph streams. In: Data Mining and Knowledge Discovery. 2016 ; pp. 1-24.
@article{01f9cef709bf47ffb9300940aaac27cc,
title = "Discovering rare categories from graph streams",
abstract = "Nowadays, massive graph streams are produced from various real-world applications, such as financial fraud detection, sensor networks, wireless networks. In contrast to the high volume of data, it is usually the case that only a small percentage of nodes within the time-evolving graphs might be of interest to people. Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing techniques for RCD are designed for static data sets, thus not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are constructed upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models enabling a “time-flexible” RCD. Moreover, to deal with the cases where the exact priors of the minority classes are not available, we further propose a modified version named BIRD-LI based on BIRD. Besides, we also identify a critical task in RCD named query distribution, which targets to allocate the limited budget among multiple time steps, such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.",
keywords = "Incremental learning, Rare category detection, Time-evolving graph",
author = "Dawei Zhou and Arun Karthikeyan and Kangyang Wang and Nan Cao and Jingrui He",
year = "2016",
month = "9",
day = "28",
doi = "10.1007/s10618-016-0478-6",
language = "English (US)",
pages = "1--24",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer Netherlands",

}

TY - JOUR

T1 - Discovering rare categories from graph streams

AU - Zhou, Dawei

AU - Karthikeyan, Arun

AU - Wang, Kangyang

AU - Cao, Nan

AU - He, Jingrui

PY - 2016/9/28

Y1 - 2016/9/28

N2 - Nowadays, massive graph streams are produced from various real-world applications, such as financial fraud detection, sensor networks, wireless networks. In contrast to the high volume of data, it is usually the case that only a small percentage of nodes within the time-evolving graphs might be of interest to people. Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing techniques for RCD are designed for static data sets, thus not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are constructed upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models enabling a “time-flexible” RCD. Moreover, to deal with the cases where the exact priors of the minority classes are not available, we further propose a modified version named BIRD-LI based on BIRD. Besides, we also identify a critical task in RCD named query distribution, which targets to allocate the limited budget among multiple time steps, such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.

AB - Nowadays, massive graph streams are produced from various real-world applications, such as financial fraud detection, sensor networks, wireless networks. In contrast to the high volume of data, it is usually the case that only a small percentage of nodes within the time-evolving graphs might be of interest to people. Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing techniques for RCD are designed for static data sets, thus not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are constructed upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models enabling a “time-flexible” RCD. Moreover, to deal with the cases where the exact priors of the minority classes are not available, we further propose a modified version named BIRD-LI based on BIRD. Besides, we also identify a critical task in RCD named query distribution, which targets to allocate the limited budget among multiple time steps, such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.

KW - Incremental learning

KW - Rare category detection

KW - Time-evolving graph

UR - http://www.scopus.com/inward/record.url?scp=84988950684&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84988950684&partnerID=8YFLogxK

U2 - 10.1007/s10618-016-0478-6

DO - 10.1007/s10618-016-0478-6

M3 - Article

SP - 1

EP - 24

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

ER -