Hype and heavy tails

A closer look at data breaches

Benjamin Edwards, Steven Hofmeyr, Stephanie Forrest

Research output: Contribution to journal › Article

19 Citations (Scopus)

Abstract

Recent widely publicized data breaches have exposed the personal information of hundreds of millions of people. Some reports point to alarming increases in both the size and frequency of data breaches, spurring institutions around the world to address what appears to be a worsening situation. But is the problem actually growing worse? In this article, we study a popular public dataset and develop Bayesian Generalized Linear Models to investigate trends in data breaches. Analysis of the model shows that neither size nor frequency of data breaches has increased over the past decade. We find that the increases that have attracted attention can be explained by the heavy-tailed statistical distributions underlying the dataset. Specifically, we find that the size of data breaches is well modeled by the log-normal family of distributions and that the daily frequency of breaches is described by a negative binomial distribution. These distributions may provide clues to the generative mechanisms that are responsible for the breaches. Additionally, our model predicts the likelihood of breaches of a particular size in the future. For example, we find that between 15 September 2015 and 16 September 2016 there is only a 53.6% chance of a breach of 10 million records or more in the USA. Regardless of any trend, data breaches are costly, and we combine the model with two different cost models to project that in the next 3 years breaches could cost up to $179 billion.
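The way the two distributions named in the abstract combine into a statement such as "the chance of at least one breach above a given size within a year" can be illustrated with a minimal sketch. It is not the authors' Bayesian model: it assumes no time trend, independent days, independent breach sizes, and fixed parameters. The function names and every numeric parameter value below are hypothetical placeholders chosen only for illustration, not estimates from the paper.

```python
import math


def lognormal_tail(threshold, mu, sigma):
    """P(size >= threshold) when log(size) is Normal(mu, sigma)."""
    z = (math.log(threshold) - mu) / sigma
    # Upper tail of the standard normal, written with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prob_no_large_breach_one_day(q, r, p):
    """E[(1 - q)^N] for a daily count N ~ NegativeBinomial(r, p).

    Uses the probability generating function G(z) = (p / (1 - (1 - p) z))^r,
    where N counts failures before the r-th success and p is the success probability.
    """
    return (p / (1.0 - (1.0 - p) * (1.0 - q))) ** r


def prob_large_breach(threshold, days, mu, sigma, r, p):
    """P(at least one breach of size >= threshold within `days` independent days)."""
    q = lognormal_tail(threshold, mu, sigma)
    return 1.0 - prob_no_large_breach_one_day(q, r, p) ** days


if __name__ == "__main__":
    # Hypothetical parameters (assumptions, not values from the paper).
    chance = prob_large_breach(
        threshold=10_000_000, days=365, mu=8.0, sigma=3.0, r=1.0, p=0.5
    )
    print(f"Chance of at least one breach of 10 million records or more: {chance:.3f}")
```

Because the log-normal tail decays slowly, even a small per-breach probability of exceeding the threshold accumulates over many days of breach counts, which is how a stationary model can still produce occasional very large breaches.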

Original language: English (US)
Pages (from-to): 3-14
Number of pages: 12
Journal: Journal of Cybersecurity
Volume: 2
Issue number: 1
DOI: 10.1093/cybsec/tyw003
State: Published - Dec 1 2016
Externally published: Yes

Fingerprint

Binomial Distribution
Statistical Distributions
Costs and Cost Analysis
Normal Distribution
Linear Models
trend
Costs
linear model
Datasets

Keywords

  • Bayesian linear model
  • Data breaches
  • Heavy tails
  • Log-normal
  • Negative binomial

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Networks and Communications
  • Safety, Risk, Reliability and Quality
  • Social Psychology
  • Law
  • Political Science and International Relations

Cite this

Hype and heavy tails: A closer look at data breaches. / Edwards, Benjamin; Hofmeyr, Steven; Forrest, Stephanie.

In: Journal of Cybersecurity, Vol. 2, No. 1, 01.12.2016, p. 3-14.

Edwards, Benjamin; Hofmeyr, Steven; Forrest, Stephanie. / Hype and heavy tails: A closer look at data breaches. In: Journal of Cybersecurity. 2016; Vol. 2, No. 1. pp. 3-14.
@article{270a3c152eab4a98a03ca42c7d262e61,
title = "Hype and heavy tails: A closer look at data breaches",
abstract = "Recent widely publicized data breaches have exposed the personal information of hundreds of millions of people. Some reports point to alarming increases in both the size and frequency of data breaches, spurring institutions around the world to address what appears to be a worsening situation. But is the problem actually growing worse? In this article, we study a popular public dataset and develop Bayesian Generalized Linear Models to investigate trends in data breaches. Analysis of the model shows that neither size nor frequency of data breaches has increased over the past decade. We find that the increases that have attracted attention can be explained by the heavy-tailed statistical distributions underlying the dataset. Specifically, we find that the size of data breaches is well modeled by the log-normal family of distributions and that the daily frequency of breaches is described by a negative binomial distribution. These distributions may provide clues to the generative mechanisms that are responsible for the breaches. Additionally, our model predicts the likelihood of breaches of a particular size in the future. For example, we find that between 15 September 2015 and 16 September 2016 there is only a 53.6{\%} chance of a breach of 10 million records or more in the USA. Regardless of any trend, data breaches are costly, and we combine the model with two different cost models to project that in the next 3 years breaches could cost up to \$179 billion.",
keywords = "Bayesian linear model, Data breaches, Heavy tails, Log-normal, Negative binomial",
author = "Benjamin Edwards and Steven Hofmeyr and Stephanie Forrest",
year = "2016",
month = "12",
day = "1",
doi = "10.1093/cybsec/tyw003",
language = "English (US)",
volume = "2",
pages = "3--14",
journal = "Journal of Cybersecurity",
issn = "2057-2093",
publisher = "Oxford University Press",
number = "1",

}

TY - JOUR

T1 - Hype and heavy tails

T2 - A closer look at data breaches

AU - Edwards, Benjamin

AU - Hofmeyr, Steven

AU - Forrest, Stephanie

PY - 2016/12/1

Y1 - 2016/12/1

N2 - Recent widely publicized data breaches have exposed the personal information of hundreds of millions of people. Some reports point to alarming increases in both the size and frequency of data breaches, spurring institutions around the world to address what appears to be a worsening situation. But is the problem actually growing worse? In this article, we study a popular public dataset and develop Bayesian Generalized Linear Models to investigate trends in data breaches. Analysis of the model shows that neither size nor frequency of data breaches has increased over the past decade. We find that the increases that have attracted attention can be explained by the heavy-tailed statistical distributions underlying the dataset. Specifically, we find that the size of data breaches is well modeled by the log-normal family of distributions and that the daily frequency of breaches is described by a negative binomial distribution. These distributions may provide clues to the generative mechanisms that are responsible for the breaches. Additionally, our model predicts the likelihood of breaches of a particular size in the future. For example, we find that between 15 September 2015 and 16 September 2016 there is only a 53.6% chance of a breach of 10 million records or more in the USA. Regardless of any trend, data breaches are costly, and we combine the model with two different cost models to project that in the next 3 years breaches could cost up to $179 billion.

KW - Bayesian linear model

KW - Data breaches

KW - Heavy tails

KW - Log-normal

KW - Negative binomial

UR - http://www.scopus.com/inward/record.url?scp=85029531911&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85029531911&partnerID=8YFLogxK

U2 - 10.1093/cybsec/tyw003

DO - 10.1093/cybsec/tyw003

M3 - Article

VL - 2

SP - 3

EP - 14

JO - Journal of Cybersecurity

JF - Journal of Cybersecurity

SN - 2057-2093

IS - 1

ER -