Goodness-of-fit testing in sparse contingency tables when the number of variables is large

Research output: Contribution to journal > Review article

Abstract

The Pearson and likelihood ratio statistics are commonly used to test goodness of fit for models applied to data from a multinomial distribution. When data are from a table formed by the cross-classification of a large number of variables, the common statistics may have low power and inaccurate Type I error level due to sparseness. One approach to finding a valid approximation to the achieved significance level (ASL) is to use a bootstrap distribution for the test statistic. For a composite null hypothesis with unknown parameters, the parametric bootstrap has been employed. The parametric bootstrap can be computationally demanding, but a recent development provides a method for computationally efficient calculation of the Pearson–Fisher statistic for very large sparse tables. Another approach employs orthogonal components of the Pearson–Fisher statistic obtained from lower-order marginal distributions of a large cross-classified table rather than the joint distribution. These statistics are used essentially for focused tests and have mostly been applied to latent variable models. They have very good performance for Type I error rate and power, even when applied to a sparse table. However, there are limitations when the number of variables becomes larger than 20. Some related statistics also employ lower-order marginals, but they are not components of the Pearson–Fisher statistic. The performance of these approaches is compared for obtaining a valid ASL for a goodness-of-fit test applied to a very large multi-way contingency table. The approaches are compared with a small simulation study.

This article is categorized under: Types and Structure > Categorical Data; Statistical and Graphical Methods of Data Analysis > Bootstrap and Resampling; Statistical Models > Fitting Models.
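As a rough illustration of the parametric bootstrap idea mentioned in the abstract, the sketch below approximates the ASL of the Pearson statistic for a sparse table of binary variables. It is not the article's method: the mutual-independence null model, the function names, and the sample sizes are all assumptions chosen to keep the example small, and it uses neither the computationally efficient calculation nor the orthogonal components the article reviews.

```python
import numpy as np

def cell_probs(p):
    """Joint probabilities of all 2^q response patterns under independence."""
    q = len(p)
    # Row i holds the binary digits of pattern i (bit j = variable j).
    patterns = (np.arange(2 ** q)[:, None] >> np.arange(q)) & 1
    return np.prod(np.where(patterns == 1, p, 1 - p), axis=1)

def cell_counts(data):
    """Observed counts for each of the 2^q cells of the cross-classification."""
    q = data.shape[1]
    idx = (data << np.arange(q)).sum(axis=1)
    return np.bincount(idx, minlength=2 ** q)

def pearson_stat(counts, probs):
    """Pearson X^2; cells with zero expected count are skipped."""
    expected = counts.sum() * probs
    mask = expected > 0
    return float(((counts[mask] - expected[mask]) ** 2 / expected[mask]).sum())

def bootstrap_asl(data, n_boot=500, seed=0):
    """Parametric bootstrap ASL for an assumed independence null model."""
    rng = np.random.default_rng(seed)
    n, q = data.shape
    p_hat = data.mean(axis=0)  # MLE of the unknown parameters
    observed = pearson_stat(cell_counts(data), cell_probs(p_hat))
    exceed = 0
    for _ in range(n_boot):
        # Simulate from the fitted model, then refit on each bootstrap sample.
        sim = (rng.random((n, q)) < p_hat).astype(np.int64)
        p_star = sim.mean(axis=0)
        stat = pearson_stat(cell_counts(sim), cell_probs(p_star))
        exceed += stat >= observed
    return observed, (exceed + 1) / (n_boot + 1)

# Hypothetical data: 200 observations on 8 binary variables (256 cells, sparse).
rng = np.random.default_rng(1)
data = (rng.random((200, 8)) < 0.3).astype(np.int64)
x2, asl = bootstrap_asl(data, n_boot=200)
```

Refitting the parameters on every simulated sample is what makes this a bootstrap for a composite null hypothesis; it is also the step that becomes expensive as the table grows, since the number of cells doubles with each added binary variable.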

Original language: English (US)
Article number: e1470
Journal: Wiley Interdisciplinary Reviews: Computational Statistics
DOI: 10.1002/wics.1470
State: Published - Jan 1 2019

Fingerprint

Contingency Table
Goodness of fit
Statistic
Parametric Bootstrap
Table
Significance level
Goodness of Fit Test
Statistics
Bootstrap
Testing
Valid
Multinomial Distribution
Latent Variable Models
Graphical Methods
Likelihood Ratio Statistic
Type I Error Rate
Nominal or categorical data
Type I error
Model Fitting
Resampling

Keywords

  • Focused test
  • multivariate discrete distribution
  • orthogonal components
  • overlapping cells
  • parametric bootstrap
  • Pearson–Fisher statistic

ASJC Scopus subject areas

  • Statistics and Probability

Cite this

@article{cd3f1fb69668470fb25293339f0f0a07,
title = "Goodness-of-fit testing in sparse contingency tables when the number of variables is large",
abstract = "The Pearson and likelihood ratio statistics are commonly used to test goodness of fit for models applied to data from a multinomial distribution. When data are from a table formed by the cross-classification of a large number of variables, the common statistics may have low power and inaccurate Type I error level due to sparseness. One approach to finding a valid approximation to the achieved significance level (ASL) is to use a bootstrap distribution for the test statistic. For a composite null hypothesis with unknown parameters, the parametric bootstrap has been employed. The parametric bootstrap can be computationally demanding, but a recent development provides a method for computationally efficient calculation of the Pearson–Fisher statistic for very large sparse tables. Another approach employs orthogonal components of the Pearson–Fisher statistic obtained from lower-order marginal distributions of a large cross-classified table rather than the joint distribution. These statistics are used essentially for focused tests and have mostly been applied to latent variable models. They have very good performance for Type I error rate and power, even when applied to a sparse table. However, there are limitations when the number of variables becomes larger than 20. Some related statistics also employ lower-order marginals, but they are not components of the Pearson–Fisher statistic. The performance of these approaches is compared for obtaining a valid ASL for a goodness-of-fit test applied to a very large multi-way contingency table. The approaches are compared with a small simulation study. This article is categorized under: Types and Structure > Categorical Data; Statistical and Graphical Methods of Data Analysis > Bootstrap and Resampling; Statistical Models > Fitting Models.",
keywords = "Focused test, multivariate discrete distribution, orthogonal components, overlapping cells, parametric bootstrap, Pearson–Fisher statistic",
author = "Mark Reiser",
year = "2019",
month = "1",
day = "1",
doi = "10.1002/wics.1470",
language = "English (US)",
journal = "Wiley Interdisciplinary Reviews: Computational Statistics",
issn = "1939-5108",
publisher = "John Wiley and Sons Inc.",

}

TY - JOUR

T1 - Goodness-of-fit testing in sparse contingency tables when the number of variables is large

AU - Reiser, Mark

PY - 2019/1/1

Y1 - 2019/1/1

AB - The Pearson and likelihood ratio statistics are commonly used to test goodness of fit for models applied to data from a multinomial distribution. When data are from a table formed by the cross-classification of a large number of variables, the common statistics may have low power and inaccurate Type I error level due to sparseness. One approach to finding a valid approximation to the achieved significance level (ASL) is to use a bootstrap distribution for the test statistic. For a composite null hypothesis with unknown parameters, the parametric bootstrap has been employed. The parametric bootstrap can be computationally demanding, but a recent development provides a method for computationally efficient calculation of the Pearson–Fisher statistic for very large sparse tables. Another approach employs orthogonal components of the Pearson–Fisher statistic obtained from lower-order marginal distributions of a large cross-classified table rather than the joint distribution. These statistics are used essentially for focused tests and have mostly been applied to latent variable models. They have very good performance for Type I error rate and power, even when applied to a sparse table. However, there are limitations when the number of variables becomes larger than 20. Some related statistics also employ lower-order marginals, but they are not components of the Pearson–Fisher statistic. The performance of these approaches is compared for obtaining a valid ASL for a goodness-of-fit test applied to a very large multi-way contingency table. The approaches are compared with a small simulation study. This article is categorized under: Types and Structure > Categorical Data; Statistical and Graphical Methods of Data Analysis > Bootstrap and Resampling; Statistical Models > Fitting Models.

KW - Focused test

KW - multivariate discrete distribution

KW - orthogonal components

KW - overlapping cells

KW - parametric bootstrap

KW - Pearson–Fisher statistic

UR - http://www.scopus.com/inward/record.url?scp=85067359076&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85067359076&partnerID=8YFLogxK

U2 - 10.1002/wics.1470

DO - 10.1002/wics.1470

M3 - Review article

AN - SCOPUS:85067359076

JO - Wiley Interdisciplinary Reviews: Computational Statistics

JF - Wiley Interdisciplinary Reviews: Computational Statistics

SN - 1939-5108

M1 - e1470

ER -