Resampling methods for variable selection in robust regression

James W. Wisnowski, James R. Simpson, Douglas Montgomery, George Runger

Research output: Contribution to journal › Article

23 Citations (Scopus)

Abstract

With the inundation of large data sets requiring analysis and empirical model building, outliers have become commonplace. Fortunately, several standard statistical software packages have allowed practitioners to use robust regression estimators to easily fit data sets that are contaminated with outliers. However, little guidance is available for selecting the best subset of the predictor variables when using these robust estimators. We initially consider cross-validation and bootstrap resampling methods that have performed well for least-squares variable selection. It turns out that these variable selection methods cannot be directly applied to contaminated data sets using a robust estimation scheme. The prediction errors, inflated by the outliers, are not reliable measures of how well the robust model fits the data. As a result, new resampling variable selection methods are proposed by introducing alternative estimates of prediction error in the contaminated model. We demonstrate that, although robust estimation and resampling variable selection are computationally complex procedures, we can combine both techniques for superior results using modest computational resources. Monte Carlo simulation is used to evaluate the proposed variable selection procedures against alternatives through a designed experiment approach. The experiment factors include percentage of outliers, outlier geometry, bootstrap sample size, number of bootstrap samples, and cross-validation assessment size. The results are summarized and recommendations for use are provided.
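The core idea in the abstract — fit candidate predictor subsets with a robust estimator, then score them with a prediction-error estimate that outliers cannot inflate — can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' procedure: `huber_irls` (a Huber M-estimate via iteratively reweighted least squares) and `trimmed_cv_error` (cross-validation with the largest squared errors trimmed) are assumed stand-ins for the paper's robust estimators and its proposed prediction-error alternatives.

```python
import numpy as np

def huber_irls(X, y, k=1.345, iters=30):
    """Huber M-estimate of regression coefficients via
    iteratively reweighted least squares (IRLS)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12  # MAD scale
        u = r / (k * scale)
        w = np.where(np.abs(u) <= 1.0, 1.0, 1.0 / np.abs(u))  # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def trimmed_cv_error(X, y, cols, alpha=0.2, folds=5, seed=0):
    """Cross-validated prediction error for the predictor subset `cols`,
    robustified by trimming the largest alpha-fraction of squared errors
    so outliers in the assessment sets do not dominate the criterion."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    sq_errs = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)
        beta = huber_irls(X[np.ix_(train, cols)], y[train])
        resid = y[fold] - X[np.ix_(fold, cols)] @ beta
        sq_errs.extend((resid ** 2).tolist())
    sq_errs = np.sort(np.array(sq_errs))
    keep = int(np.ceil((1.0 - alpha) * len(sq_errs)))
    return float(sq_errs[:keep].mean())

# Usage: contaminated data, compare a relevant vs. an irrelevant subset.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                       # irrelevant predictor
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)
outliers = rng.choice(n, size=20, replace=False)
y[outliers] += 15.0                           # 10% contamination
X = np.column_stack([np.ones(n), x1, x2])
err_true = trimmed_cv_error(X, y, [0, 1])     # intercept + x1
err_null = trimmed_cv_error(X, y, [0, 2])     # intercept + noise only
```

With an untrimmed criterion, the outliers' huge squared errors swamp the comparison in every fold — which is precisely the failure mode the abstract identifies before proposing alternative prediction-error estimates.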

Original language: English (US)
Pages (from-to): 341-355
Number of pages: 15
Journal: Computational Statistics and Data Analysis
Volume: 43
Issue number: 3
DOI: 10.1016/S0167-9473(02)00235-9
State: Published - Jul 28 2003


Keywords

  • Bootstrap
  • Cross-validation
  • Outliers
  • Robust regression
  • Variable selection

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Statistics, Probability and Uncertainty
  • Electrical and Electronic Engineering
  • Computational Mathematics
  • Numerical Analysis
  • Statistics and Probability

Cite this

Resampling methods for variable selection in robust regression. / Wisnowski, James W.; Simpson, James R.; Montgomery, Douglas; Runger, George.

In: Computational Statistics and Data Analysis, Vol. 43, No. 3, 28.07.2003, p. 341-355.

Research output: Contribution to journal › Article

@article{bd633984587848ed94d5a328eda7e579,
title = "Resampling methods for variable selection in robust regression",
abstract = "With the inundation of large data sets requiring analysis and empirical model building, outliers have become commonplace. Fortunately, several standard statistical software packages have allowed practitioners to use robust regression estimators to easily fit data sets that are contaminated with outliers. However, little guidance is available for selecting the best subset of the predictor variables when using these robust estimators. We initially consider cross-validation and bootstrap resampling methods that have performed well for least-squares variable selection. It turns out that these variable selection methods cannot be directly applied to contaminated data sets using a robust estimation scheme. The prediction errors, inflated by the outliers, are not reliable measures of how well the robust model fits the data. As a result, new resampling variable selection methods are proposed by introducing alternative estimates of prediction error in the contaminated model. We demonstrate that, although robust estimation and resampling variable selection are computationally complex procedures, we can combine both techniques for superior results using modest computational resources. Monte Carlo simulation is used to evaluate the proposed variable selection procedures against alternatives through a designed experiment approach. The experiment factors include percentage of outliers, outlier geometry, bootstrap sample size, number of bootstrap samples, and cross-validation assessment size. The results are summarized and recommendations for use are provided.",
keywords = "Bootstrap, Cross-validation, Outliers, Robust regression, Variable selection",
author = "Wisnowski, {James W.} and Simpson, {James R.} and Douglas Montgomery and George Runger",
year = "2003",
month = jul,
day = "28",
doi = "10.1016/S0167-9473(02)00235-9",
language = "English (US)",
volume = "43",
pages = "341--355",
journal = "Computational Statistics and Data Analysis",
issn = "0167-9473",
publisher = "Elsevier",
number = "3",
}

TY  - JOUR
T1  - Resampling methods for variable selection in robust regression
AU  - Wisnowski, James W.
AU  - Simpson, James R.
AU  - Montgomery, Douglas
AU  - Runger, George
PY  - 2003/7/28
Y1  - 2003/7/28
N2  - With the inundation of large data sets requiring analysis and empirical model building, outliers have become commonplace. Fortunately, several standard statistical software packages have allowed practitioners to use robust regression estimators to easily fit data sets that are contaminated with outliers. However, little guidance is available for selecting the best subset of the predictor variables when using these robust estimators. We initially consider cross-validation and bootstrap resampling methods that have performed well for least-squares variable selection. It turns out that these variable selection methods cannot be directly applied to contaminated data sets using a robust estimation scheme. The prediction errors, inflated by the outliers, are not reliable measures of how well the robust model fits the data. As a result, new resampling variable selection methods are proposed by introducing alternative estimates of prediction error in the contaminated model. We demonstrate that, although robust estimation and resampling variable selection are computationally complex procedures, we can combine both techniques for superior results using modest computational resources. Monte Carlo simulation is used to evaluate the proposed variable selection procedures against alternatives through a designed experiment approach. The experiment factors include percentage of outliers, outlier geometry, bootstrap sample size, number of bootstrap samples, and cross-validation assessment size. The results are summarized and recommendations for use are provided.
AB  - With the inundation of large data sets requiring analysis and empirical model building, outliers have become commonplace. Fortunately, several standard statistical software packages have allowed practitioners to use robust regression estimators to easily fit data sets that are contaminated with outliers. However, little guidance is available for selecting the best subset of the predictor variables when using these robust estimators. We initially consider cross-validation and bootstrap resampling methods that have performed well for least-squares variable selection. It turns out that these variable selection methods cannot be directly applied to contaminated data sets using a robust estimation scheme. The prediction errors, inflated by the outliers, are not reliable measures of how well the robust model fits the data. As a result, new resampling variable selection methods are proposed by introducing alternative estimates of prediction error in the contaminated model. We demonstrate that, although robust estimation and resampling variable selection are computationally complex procedures, we can combine both techniques for superior results using modest computational resources. Monte Carlo simulation is used to evaluate the proposed variable selection procedures against alternatives through a designed experiment approach. The experiment factors include percentage of outliers, outlier geometry, bootstrap sample size, number of bootstrap samples, and cross-validation assessment size. The results are summarized and recommendations for use are provided.
KW  - Bootstrap
KW  - Cross-validation
KW  - Outliers
KW  - Robust regression
KW  - Variable selection
UR  - http://www.scopus.com/inward/record.url?scp=0038502343&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0038502343&partnerID=8YFLogxK
U2  - 10.1016/S0167-9473(02)00235-9
DO  - 10.1016/S0167-9473(02)00235-9
M3  - Article
VL  - 43
SP  - 341
EP  - 355
JO  - Computational Statistics and Data Analysis
JF  - Computational Statistics and Data Analysis
SN  - 0167-9473
IS  - 3
ER  -