A practical data repository for causal learning with big data

Lu Cheng; Ruocheng Guo; Raha Moraffah; K. Selçuk Candan; Adrienne Raglin; Huan Liu

doi:10.1007/978-3-030-49556-5_23

Ira A. Fulton Schools of Engineering (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

The recent success of machine learning (ML) has led to a massive emergence of AI applications and rising expectations for AI systems to achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, while human-level AI often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use because they are too “ideal” (small, clean, homogeneous, low-dimensional) to describe real-world scenarios, where data is often large, noisy, heterogeneous and high-dimensional. This mismatch severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.

Original language	English (US)
Title of host publication	Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers
Editors	Wanling Gao, Jianfeng Zhan, Geoffrey Fox, Xiaoyi Lu, Dan Stanzione
Publisher	Springer
Pages	234-248
Number of pages	15
ISBN (Print)	9783030495558
DOIs	https://doi.org/10.1007/978-3-030-49556-5_23
State	Published - 2020
Event	2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019 - Denver, United States
Duration	Nov 14 2019 → Nov 16 2019

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12093 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019
Country/Territory	United States
City	Denver
Period	11/14/19 → 11/16/19

Keywords

Benchmarking
Big data
Causal discovery
Causal learning
Datasets
Treatment effect estimation

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-030-49556-5_23

Cite this

Cheng, L., Guo, R., Moraffah, R., Candan, K. S., Raglin, A., & Liu, H. (2020). A practical data repository for causal learning with big data. In W. Gao, J. Zhan, G. Fox, X. Lu, & D. Stanzione (Eds.), Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers (pp. 234-248). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12093 LNCS). Springer. https://doi.org/10.1007/978-3-030-49556-5_23

A practical data repository for causal learning with big data. / Cheng, Lu; Guo, Ruocheng; Moraffah, Raha et al.
Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers. ed. / Wanling Gao; Jianfeng Zhan; Geoffrey Fox; Xiaoyi Lu; Dan Stanzione. Springer, 2020. p. 234-248 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12093 LNCS).

Cheng, L, Guo, R, Moraffah, R, Candan, KS, Raglin, A & Liu, H 2020, A practical data repository for causal learning with big data. in W Gao, J Zhan, G Fox, X Lu & D Stanzione (eds), Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12093 LNCS, Springer, pp. 234-248, 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019, Denver, United States, 11/14/19. https://doi.org/10.1007/978-3-030-49556-5_23

Cheng L, Guo R, Moraffah R, Candan KS, Raglin A, Liu H. A practical data repository for causal learning with big data. In Gao W, Zhan J, Fox G, Lu X, Stanzione D, editors, Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers. Springer. 2020. p. 234-248. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-49556-5_23

Cheng, Lu ; Guo, Ruocheng ; Moraffah, Raha et al. / A practical data repository for causal learning with big data. Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers. editor / Wanling Gao ; Jianfeng Zhan ; Geoffrey Fox ; Xiaoyi Lu ; Dan Stanzione. Springer, 2020. pp. 234-248 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{cb0d50e5bd994dab814908f871346652,

title = "A practical data repository for causal learning with big data",

keywords = "Benchmarking, Big data, Causal discovery, Causal learning, Datasets, Treatment effect estimation",

author = "Lu Cheng and Ruocheng Guo and Raha Moraffah and Candan, {K. Sel{\c c}uk} and Adrienne Raglin and Huan Liu",

year = "2020",

doi = "10.1007/978-3-030-49556-5_23",

language = "English (US)",

isbn = "9783030495558",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer",

pages = "234--248",

editor = "Wanling Gao and Jianfeng Zhan and Geoffrey Fox and Xiaoyi Lu and Dan Stanzione",

booktitle = "Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers",

note = "2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019 ; Conference date: 14-11-2019 Through 16-11-2019",

}

TY - GEN

T1 - A practical data repository for causal learning with big data

AU - Cheng, Lu

AU - Guo, Ruocheng

AU - Moraffah, Raha

AU - Candan, K. Selçuk

AU - Raglin, Adrienne

AU - Liu, Huan

PY - 2020

Y1 - 2020

KW - Benchmarking

KW - Big data

KW - Causal discovery

KW - Causal learning

KW - Datasets

KW - Treatment effect estimation

UR - http://www.scopus.com/inward/record.url?scp=85087008761&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85087008761&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-49556-5_23

DO - 10.1007/978-3-030-49556-5_23

M3 - Conference contribution

AN - SCOPUS:85087008761

SN - 9783030495558

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 234

EP - 248

BT - Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers

A2 - Gao, Wanling

A2 - Zhan, Jianfeng

A2 - Fox, Geoffrey

A2 - Lu, Xiaoyi

A2 - Stanzione, Dan

PB - Springer

T2 - 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019

Y2 - 14 November 2019 through 16 November 2019

ER -
