Bayes and big data: The consensus Monte Carlo algorithm

Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, Robert McCulloch

Research output: Contribution to journal › Article

75 Citations (Scopus)

Abstract

A useful definition of ‘big data’ is data that is too big to process comfortably on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be eliminated by splitting data across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging individual Monte Carlo draws across machines. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single-machine algorithm for a very long time. Examples of consensus Monte Carlo are shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).
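The combination step described in the abstract can be sketched for a toy conjugate Gaussian model, where each machine's sub-posterior has a closed form. Everything below (the data, shard count, prior values, and the choice of precision weights for the averaging step) is an illustrative assumption for this sketch, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "big" data: N observations from a normal with known sigma.
sigma = 1.0
data = rng.normal(2.0, sigma, size=10_000)

S = 10                        # number of machines (shards)
shards = np.array_split(data, S)
G = 2_000                     # Monte Carlo draws per machine

# Full-data prior N(0, tau0^2). Each worker uses the fractional prior
# p(theta)^(1/S), which for a Gaussian means inflating the prior variance by S,
# so that the product of the S sub-posteriors recovers the full posterior.
tau0_sq = 100.0
frac_prior_var = S * tau0_sq

draws = np.empty((S, G))
weights = np.empty(S)
for s, shard in enumerate(shards):
    n = shard.size
    # Conjugate normal sub-posterior on shard s.
    post_var = 1.0 / (1.0 / frac_prior_var + n / sigma**2)
    post_mean = post_var * shard.sum() / sigma**2
    draws[s] = rng.normal(post_mean, np.sqrt(post_var), size=G)
    weights[s] = 1.0 / post_var   # weight by sub-posterior precision

# Consensus draws: weighted average across machines, draw by draw.
# Only these (S x G) draws need to be communicated, once, at the end.
consensus = (weights[:, None] * draws).sum(axis=0) / weights.sum()

# consensus.mean() is close to the sample mean of the full data set,
# as the single-machine posterior would be under this weak prior.
```

Precision weighting is one natural choice here; for this Gaussian case the weighted average of sub-posterior draws matches the full-data posterior exactly, which is the sense in which the consensus draws can be "nearly indistinguishable" from single-machine draws.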

Original language: English (US)
Pages (from-to): 78-88
Number of pages: 11
Journal: International Journal of Management Science and Engineering Management
Volume: 11
Issue number: 2
DOIs: 10.1080/17509653.2016.1142191
State: Published - 2016
Externally published: Yes

Fingerprint

  • Data storage equipment
  • Communication
  • Big data
  • Single machine
  • Graphics processing unit
  • Hierarchical model
  • Regression tree

Keywords

  • Bayesian inference
  • Big data
  • Distributed computing
  • Embarrassingly parallel
  • Markov chain Monte Carlo

ASJC Scopus subject areas

  • Strategy and Management
  • Information Systems and Management
  • Management Science and Operations Research
  • Mechanical Engineering
  • Engineering (miscellaneous)

Cite this

Bayes and big data: The consensus Monte Carlo algorithm. / Scott, Steven L.; Blocker, Alexander W.; Bonassi, Fernando V.; Chipman, Hugh A.; George, Edward I.; McCulloch, Robert.

In: International Journal of Management Science and Engineering Management, Vol. 11, No. 2, 2016, p. 78-88.

Scott, Steven L. ; Blocker, Alexander W. ; Bonassi, Fernando V. ; Chipman, Hugh A. ; George, Edward I. ; McCulloch, Robert. / Bayes and big data: The consensus Monte Carlo algorithm. In: International Journal of Management Science and Engineering Management. 2016 ; Vol. 11, No. 2. pp. 78-88.
@article{d58fbd41bb21419fbe2299aa295a1b89,
title = "Bayes and big data: The consensus monte carlo algorithm",
abstract = "A useful definition of ‘big data’ is data that is too big to process comfortably on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be eliminated by splitting data across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging individual Monte Carlo draws across machines. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single-machine algorithm for a very long time. Examples of consensus Monte Carlo are shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).",
keywords = "Bayesian inference, Big data, Distributed computing, Embarrassingly parallel, Markov chain Monte Carlo",
author = "Scott, {Steven L.} and Blocker, {Alexander W.} and Bonassi, {Fernando V.} and Chipman, {Hugh A.} and George, {Edward I.} and Robert McCulloch",
year = "2016",
doi = "10.1080/17509653.2016.1142191",
language = "English (US)",
volume = "11",
pages = "78--88",
journal = "International Journal of Management Science and Engineering Management",
issn = "1750-9653",
publisher = "Taylor and Francis Ltd.",
number = "2",

}

TY - JOUR
T1 - Bayes and big data
T2 - The consensus Monte Carlo algorithm
AU - Scott, Steven L.
AU - Blocker, Alexander W.
AU - Bonassi, Fernando V.
AU - Chipman, Hugh A.
AU - George, Edward I.
AU - McCulloch, Robert
PY - 2016
Y1 - 2016
N2 - A useful definition of ‘big data’ is data that is too big to process comfortably on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be eliminated by splitting data across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging individual Monte Carlo draws across machines. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single-machine algorithm for a very long time. Examples of consensus Monte Carlo are shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).
AB - A useful definition of ‘big data’ is data that is too big to process comfortably on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be eliminated by splitting data across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging individual Monte Carlo draws across machines. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single-machine algorithm for a very long time. Examples of consensus Monte Carlo are shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).
KW - Bayesian inference
KW - Big data
KW - Distributed computing
KW - Embarrassingly parallel
KW - Markov chain Monte Carlo
UR - http://www.scopus.com/inward/record.url?scp=85015380095&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85015380095&partnerID=8YFLogxK
U2 - 10.1080/17509653.2016.1142191
DO - 10.1080/17509653.2016.1142191
M3 - Article
AN - SCOPUS:85015380095
VL - 11
SP - 78
EP - 88
JO - International Journal of Management Science and Engineering Management
JF - International Journal of Management Science and Engineering Management
SN - 1750-9653
IS - 2
ER -