A clustering algorithm for identifying multiple outliers in linear regression

David M. Sebert, Douglas Montgomery, Dwayne A. Rollier

Research output: Contribution to journal › Article

24 Citations (Scopus)

Abstract

Identifying outliers is a fundamental step in the regression model building process. However, current outlier diagnostics are often inadequate when data sets contain multiple outlying observations. This paper proposes a new clustering-based approach for multiple outlier identification that utilizes the predicted and residual values obtained from a least squares fit of the data. The procedure is described and is shown to perform well on classic multiple-outlier data sets found in the literature. Also, the performance characteristics of the proposed methodology are demonstrated and explored by applying the procedure to simulated data sets that have various outlier scenarios.
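The abstract's core idea (cluster the observations in the space of predicted and residual values from a least squares fit, then treat points outside the main cluster as suspect) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: a single predictor, standardization of both coordinates, single-linkage grouping, the hypothetical `cutoff` parameter, and the rule that the largest cluster is the clean set. The function names are likewise hypothetical.

```python
import math
from statistics import mean, pstdev


def ols_fit(x, y):
    """Least squares fit of y = a + b*x; returns (predicted, residuals)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    predicted = [a + b * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    return predicted, residuals


def standardize(values):
    """Center and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def single_linkage_clusters(points, cutoff):
    """Label connected components of the graph linking points within
    `cutoff` of each other. This matches cutting a single-linkage
    dendrogram at height `cutoff`."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            j = stack.pop()
            for k in range(len(points)):
                if labels[k] == -1 and math.dist(points[j], points[k]) <= cutoff:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels


def flag_outliers(x, y, cutoff):
    """Return indices of observations outside the largest cluster.
    `cutoff` is an illustrative tuning parameter (assumption)."""
    predicted, residuals = ols_fit(x, y)
    points = list(zip(standardize(predicted), standardize(residuals)))
    labels = single_linkage_clusters(points, cutoff)
    main = max(set(labels), key=labels.count)
    return [i for i, lab in enumerate(labels) if lab != main]


# Ten points on the line y = x, with two of them shifted up by 10.
x = list(range(10))
y = [float(v) for v in x]
y[4] += 10
y[5] += 10
print(flag_outliers(x, y, cutoff=1.5))  # [4, 5]
```

In this toy data set the two shifted observations form their own small cluster, well separated in the residual direction from the eight remaining points, so they are the ones flagged. The result depends on the chosen `cutoff`; how the paper selects the cut height is not stated in the abstract.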

Original language: English (US)
Pages (from-to): 461-484
Number of pages: 24
Journal: Computational Statistics and Data Analysis
Volume: 27
Issue number: 4
DOI: 10.1016/S0167-9473(98)00021-8
State: Published - Jun 5 1998

Fingerprint

Linear regression
Clustering algorithms
Outlier
Clustering Algorithm
Outlier Identification
Least Squares
Regression Model
Diagnostics
Clustering
Scenarios
Clustering algorithm
Outliers
Methodology

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Statistics, Probability and Uncertainty
  • Electrical and Electronic Engineering
  • Computational Mathematics
  • Numerical Analysis
  • Statistics and Probability

Cite this

A clustering algorithm for identifying multiple outliers in linear regression. / Sebert, David M.; Montgomery, Douglas; Rollier, Dwayne A.

In: Computational Statistics and Data Analysis, Vol. 27, No. 4, 05.06.1998, p. 461-484.

@article{b43249ee8bb042f4844f09a795e4e364,
title = "A clustering algorithm for identifying multiple outliers in linear regression",
abstract = "Identifying outliers is a fundamental step in the regression model building process. However, current outlier diagnostics are often inadequate when data sets contain multiple outlying observations. This paper proposes a new clustering-based approach for multiple outlier identification that utilizes the predicted and residual values obtained from a least squares fit of the data. The procedure is described and is shown to perform well on classic multiple-outlier data sets found in the literature. Also, the performance characteristics of the proposed methodology are demonstrated and explored by applying the procedure to simulated data sets that have various outlier scenarios.",
author = "Sebert, {David M.} and Montgomery, Douglas and Rollier, {Dwayne A.}",
year = "1998",
month = "6",
day = "5",
doi = "10.1016/S0167-9473(98)00021-8",
language = "English (US)",
volume = "27",
pages = "461--484",
journal = "Computational Statistics and Data Analysis",
issn = "0167-9473",
publisher = "Elsevier",
number = "4",
}

TY - JOUR

T1 - A clustering algorithm for identifying multiple outliers in linear regression

AU - Sebert, David M.

AU - Montgomery, Douglas

AU - Rollier, Dwayne A.

PY - 1998/6/5

Y1 - 1998/6/5

N2 - Identifying outliers is a fundamental step in the regression model building process. However, current outlier diagnostics are often inadequate when data sets contain multiple outlying observations. This paper proposes a new clustering-based approach for multiple outlier identification that utilizes the predicted and residual values obtained from a least squares fit of the data. The procedure is described and is shown to perform well on classic multiple-outlier data sets found in the literature. Also, the performance characteristics of the proposed methodology are demonstrated and explored by applying the procedure to simulated data sets that have various outlier scenarios.

UR - http://www.scopus.com/inward/record.url?scp=0032486019&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032486019&partnerID=8YFLogxK

U2 - 10.1016/S0167-9473(98)00021-8

DO - 10.1016/S0167-9473(98)00021-8

M3 - Article

AN - SCOPUS:0032486019

VL - 27

SP - 461

EP - 484

JO - Computational Statistics and Data Analysis

JF - Computational Statistics and Data Analysis

SN - 0167-9473

IS - 4

ER -