TY - GEN
T1 - A practical data repository for causal learning with big data
AU - Cheng, Lu
AU - Guo, Ruocheng
AU - Moraffah, Raha
AU - Candan, K. Selçuk
AU - Raglin, Adrienne
AU - Liu, Huan
PY - 2020
Y1 - 2020
N2 - The recent success in machine learning (ML) has led to a massive emergence of AI applications and increased expectations for AI systems to achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, while human-level intelligence often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use as they are too “ideal”, i.e., small, clean, homogeneous, and low-dimensional, to describe real-world scenarios where data is often large, noisy, heterogeneous and high-dimensional. This severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.
AB - The recent success in machine learning (ML) has led to a massive emergence of AI applications and increased expectations for AI systems to achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, while human-level intelligence often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use as they are too “ideal”, i.e., small, clean, homogeneous, and low-dimensional, to describe real-world scenarios where data is often large, noisy, heterogeneous and high-dimensional. This severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.
KW - Benchmarking
KW - Big data
KW - Causal discovery
KW - Causal learning
KW - Datasets
KW - Treatment effect estimation
UR - http://www.scopus.com/inward/record.url?scp=85087008761&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85087008761&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-49556-5_23
DO - 10.1007/978-3-030-49556-5_23
M3 - Conference contribution
AN - SCOPUS:85087008761
SN - 9783030495558
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 234
EP - 248
BT - Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers
A2 - Gao, Wanling
A2 - Zhan, Jianfeng
A2 - Fox, Geoffrey
A2 - Lu, Xiaoyi
A2 - Stanzione, Dan
PB - Springer
T2 - 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019
Y2 - 14 November 2019 through 16 November 2019
ER -