High-dimensional disease outbreak detection using tree-based ensembles

Saylisse Dávila; George Runger; Eugene Tuv; Paola Pacheco

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A common goal of most public health surveillance programs is to detect disease outbreaks before they become a threat to the public. In this work, we propose a novel and computationally feasible approach to this problem. By tackling public health surveillance with a supervised learner that can handle high-dimensional, mixed-type data, and even missing values, we developed a method that can accurately detect changes in disease incidence rates, even in high dimensions. We use probability estimates from random forests to develop an alternative signal criterion that can detect when there is a concentration of disease incidences within a particular geographic region and/or subpopulation that is unlikely to have occurred by chance. A series of simulated experiments suggests this method is able to accurately detect the presence of disease clusters, on average, 88% of the time. Simulated results also suggest a feasible combination of the method's parameters that can significantly reduce the method's computational cost, to an average system time of 1.9 minutes (s = 0.48 minutes) for a data set containing 1,000 incidences running on an Intel Core i5 processor.
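
The abstract does not spell out the signal criterion, so the following is only a rough illustration of the general idea, not the authors' method. It assumes each incidence already carries a probability estimate (for instance from a random forest, which is not trained here) and swaps in a plain permutation test to ask whether one region's mean estimate is higher than chance alone would explain. The function name `region_signal`, its parameters, and the toy data in the usage note are all hypothetical.

```python
import random
from statistics import mean


def region_signal(probs, regions, target, n_perm=999, alpha=0.05, seed=0):
    """Permutation test on the mean probability estimate within one region.

    probs   -- per-incidence probability estimates (assumed given, e.g. from
               a random forest; this sketch does not compute them)
    regions -- region label for each incidence, same length as probs
    target  -- the region being checked for a concentration of cases
    Returns (observed_mean, p_value, signal_flag).
    """
    rng = random.Random(seed)
    in_region = [p for p, r in zip(probs, regions) if r == target]
    k = len(in_region)
    if k == 0:
        raise ValueError("no incidences found in the target region")
    observed = mean(in_region)

    # Null hypothesis: region labels carry no information, so any k
    # incidences drawn at random are as likely as the ones observed.
    exceed = 0
    for _ in range(n_perm):
        if mean(rng.sample(probs, k)) >= observed:
            exceed += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value, p_value <= alpha
```

As a usage sketch with made-up numbers: if ten incidences in region "A" have estimates of 0.9 and thirty incidences elsewhere have 0.1, then `region_signal(probs, regions, "A")` flags region "A", while the same call for region "B" does not. The significance level `alpha` and the number of permutations `n_perm` are illustrative defaults, not values taken from the paper.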

Original language	English (US)
Title of host publication	IIE Annual Conference and Expo 2013
Publisher	Institute of Industrial Engineers
Pages	2551-2560
Number of pages	10
State	Published - 2013
Event	IIE Annual Conference and Expo 2013, San Juan, Puerto Rico. Duration: May 18, 2013 → May 22, 2013

Other

Other	IIE Annual Conference and Expo 2013
Country/Territory	Puerto Rico
City	San Juan
Period	5/18/13 → 5/22/13

ASJC Scopus subject areas

Industrial and Manufacturing Engineering

Cite this

@inproceedings{7d8cf6730d7b491c8ef0da04039b5207,
title = "High-dimensional disease outbreak detection using tree-based ensembles",
abstract = "A common goal of most public health surveillance programs is to detect disease outbreaks before they become a threat to the public. In this work, we propose a novel and computationally feasible approach to this problem. By tackling public health surveillance with a supervised learner that can handle high-dimensional, mixed-type data, and even missing values, we developed a method that can accurately detect changes in disease incidence rates, even in high dimensions. We use probability estimates from random forests to develop an alternative signal criterion that can detect when there is a concentration of disease incidences within a particular geographic region and/or subpopulation that is unlikely to have occurred by chance. A series of simulated experiments suggests this method is able to accurately detect the presence of disease clusters, on average, 88% of the time. Simulated results also suggest a feasible combination of the method's parameters that can significantly reduce the method's computational cost, to an average system time of 1.9 minutes (s = 0.48 minutes) for a data set containing 1,000 incidences running on an Intel Core i5 processor.",
author = "Saylisse D{\'a}vila and George Runger and Eugene Tuv and Paola Pacheco",
year = "2013",
language = "English (US)",
pages = "2551--2560",
booktitle = "IIE Annual Conference and Expo 2013",
publisher = "Institute of Industrial Engineers",
note = "IIE Annual Conference and Expo 2013 ; Conference date: 18-05-2013 Through 22-05-2013",
}

TY - GEN
T1 - High-dimensional disease outbreak detection using tree-based ensembles
AU - Dávila, Saylisse
AU - Runger, George
AU - Tuv, Eugene
AU - Pacheco, Paola
PY - 2013
Y1 - 2013
N2 - A common goal of most public health surveillance programs is to detect disease outbreaks before they become a threat to the public. In this work, we propose a novel and computationally feasible approach to this problem. By tackling public health surveillance with a supervised learner that can handle high-dimensional, mixed-type data, and even missing values, we developed a method that can accurately detect changes in disease incidence rates, even in high dimensions. We use probability estimates from random forests to develop an alternative signal criterion that can detect when there is a concentration of disease incidences within a particular geographic region and/or subpopulation that is unlikely to have occurred by chance. A series of simulated experiments suggests this method is able to accurately detect the presence of disease clusters, on average, 88% of the time. Simulated results also suggest a feasible combination of the method's parameters that can significantly reduce the method's computational cost, to an average system time of 1.9 minutes (s = 0.48 minutes) for a data set containing 1,000 incidences running on an Intel Core i5 processor.
AB - A common goal of most public health surveillance programs is to detect disease outbreaks before they become a threat to the public. In this work, we propose a novel and computationally feasible approach to this problem. By tackling public health surveillance with a supervised learner that can handle high-dimensional, mixed-type data, and even missing values, we developed a method that can accurately detect changes in disease incidence rates, even in high dimensions. We use probability estimates from random forests to develop an alternative signal criterion that can detect when there is a concentration of disease incidences within a particular geographic region and/or subpopulation that is unlikely to have occurred by chance. A series of simulated experiments suggests this method is able to accurately detect the presence of disease clusters, on average, 88% of the time. Simulated results also suggest a feasible combination of the method's parameters that can significantly reduce the method's computational cost, to an average system time of 1.9 minutes (s = 0.48 minutes) for a data set containing 1,000 incidences running on an Intel Core i5 processor.
UR - http://www.scopus.com/inward/record.url?scp=84900334917&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84900334917&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84900334917
SP - 2551
EP - 2560
BT - IIE Annual Conference and Expo 2013
PB - Institute of Industrial Engineers
T2 - IIE Annual Conference and Expo 2013
Y2 - 18 May 2013 through 22 May 2013
ER -