Identification of outliers through clustering and semi-supervised learning for all sky surveys

Sharmodeep Bhattacharyya, Joseph W. Richards, John Rice, Dan L. Starr, Nathaniel Butler, Joshua S. Bloom

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Recently there has been a huge surge of data in astronomy, making outlier or novelty detection a crucial step in analyzing these data. Here, we introduce a clustering-based semi-supervised approach for outlier detection. The training data, (X_1, Y_1), . . . , (X_n, Y_n), where n = 1,542, come from the Hipparcos and Optical Gravitational Lensing Experiment (OGLE) surveys, with X_i ∈ R^p (p = 64) as the features and Y_i a categorical variable taking one of 25 class labels. The set of 64 periodic and non-periodic features is extracted from the light curves. The test data are Z_1, . . . , Z_m, where m = 11,375 and Z_i ∈ R^p. We select these 11,375 low-noise variable light sources for our analysis from a set of unlabeled light curves of ∼50,000 variable light sources from the All Sky Automated Survey (ASAS). Our goal is to find outlier data points in the unlabeled data set whose labels cannot be properly predicted by the information in the labeled data set. We propose a new hierarchical algorithm for outlier detection in this partially labeled setup based on clustering and semi-supervised learning. We apply our method, together with the training data, to identify interesting sources in the ASAS data set. We present the ASAS light curves of some of these interesting sources, and elaborate on the possible physical mechanisms driving their variability.
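
The partially labeled setup in the abstract can be illustrated with a small sketch. This is not the hierarchical algorithm of the paper, which the abstract does not specify; it is a deliberately simple stand-in that summarizes each labeled class by a centroid and a radius, then flags unlabeled points that fall outside every class region. The function names, the 0.95 quantile, the two class names, and the synthetic 4-dimensional data (instead of the real 64 features and 25 classes) are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def fit_class_regions(X, y, quantile=0.95):
    """Summarize each labeled class as (centroid, radius). Hypothetical helper."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    regions = {}
    for label, pts in groups.items():
        p = len(pts[0])
        centroid = [sum(pt[j] for pt in pts) / len(pts) for j in range(p)]
        dists = sorted(math.dist(pt, centroid) for pt in pts)
        # Radius: the chosen quantile of training distances to the centroid.
        idx = min(len(dists) - 1, int(quantile * len(dists)))
        regions[label] = (centroid, dists[idx])
    return regions


def flag_outliers(Z, regions):
    """Return indices of unlabeled points that lie outside every class region."""
    flagged = []
    for i, z in enumerate(Z):
        inside_any = any(math.dist(z, c) <= r for c, r in regions.values())
        if not inside_any:
            flagged.append(i)
    return flagged


def make_demo(seed=0, p=4, n_per_class=50):
    """Synthetic stand-in for the labeled (X, y) and unlabeled Z sets."""
    rng = random.Random(seed)
    centers = {"class_a": [0.0] * p, "class_b": [10.0] * p}
    X, y = [], []
    for label, c in centers.items():
        for _ in range(n_per_class):
            X.append([cj + rng.gauss(0.0, 1.0) for cj in c])
            y.append(label)
    # Two unlabeled points at the class centers, one far from both classes.
    Z = [[0.0] * p, [10.0] * p, [50.0] * p]
    return X, y, Z


X, y, Z = make_demo()
regions = fit_class_regions(X, y)
outlier_idx = flag_outliers(Z, regions)
print(outlier_idx)
```

In this toy setting only the third unlabeled point is far from both labeled classes, so it is the one flagged. A method along the lines the abstract describes would replace the centroid-and-radius summary with clustering and a semi-supervised model.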

Original language: English (US)
Title of host publication: Lecture Notes in Statistics
Publisher: Springer Science and Business Media, LLC
Pages: 483-485
Number of pages: 3
Volume: 209
ISBN (Print): 9781461435198
State: Published - 2012
Externally published: Yes
Event: 5th Statistical Challenges in Modern Astronomy Symposium, SCMA 2011 - University Park, PA, United States
Duration: Jun 13 2011 to Jun 15 2011

Fingerprint

Semi-supervised Learning
Outlier
Clustering
Outlier Detection
Curve
Novelty Detection
Gravitational Lensing
Categorical variable
Surge
Astronomy
Survey Data

ASJC Scopus subject areas

  • Statistics, Probability and Uncertainty
  • Statistics and Probability

Cite this

Bhattacharyya, S., Richards, J. W., Rice, J., Starr, D. L., Butler, N., & Bloom, J. S. (2012). Identification of outliers through clustering and semi-supervised learning for all sky surveys. In Lecture Notes in Statistics (Vol. 209, pp. 483-485). Springer Science and Business Media, LLC.

@inproceedings{550592a8e0374f5f921926a4aebff4b5,
title = "Identification of outliers through clustering and semi-supervised learning for all sky surveys",
author = "Sharmodeep Bhattacharyya and Richards, {Joseph W.} and John Rice and Starr, {Dan L.} and Nathaniel Butler and Bloom, {Joshua S.}",
year = "2012",
language = "English (US)",
isbn = "9781461435198",
volume = "209",
pages = "483--485",
booktitle = "Lecture Notes in Statistics",
publisher = "Springer Science and Business Media, LLC",

}

TY - GEN

T1 - Identification of outliers through clustering and semi-supervised learning for all sky surveys

AU - Bhattacharyya, Sharmodeep

AU - Richards, Joseph W.

AU - Rice, John

AU - Starr, Dan L.

AU - Butler, Nathaniel

AU - Bloom, Joshua S.

PY - 2012

Y1 - 2012

UR - http://www.scopus.com/inward/record.url?scp=84896663765&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84896663765&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84896663765

SN - 9781461435198

VL - 209

SP - 483

EP - 485

BT - Lecture Notes in Statistics

PB - Springer Science and Business Media, LLC

ER -