A practical data repository for causal learning with big data

Lu Cheng, Ruocheng Guo, Raha Moraffah, K. Selçuk Candan, Adrienne Raglin, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The recent success of machine learning (ML) has led to a massive emergence of AI applications and to rising expectations that AI systems will achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, whereas human-level intelligence often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is therefore crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use because they are too “ideal” (i.e., small, clean, homogeneous and low-dimensional) to describe real-world scenarios, where data is often large, noisy, heterogeneous and high-dimensional. This mismatch severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.

Original language: English (US)
Title of host publication: Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers
Editors: Wanling Gao, Jianfeng Zhan, Geoffrey Fox, Xiaoyi Lu, Dan Stanzione
Publisher: Springer
Pages: 234-248
Number of pages: 15
ISBN (Print): 9783030495558
DOIs: https://doi.org/10.1007/978-3-030-49556-5_23
State: Published - 2020
Event: 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019 - Denver, United States
Duration: Nov 14 2019 to Nov 16 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12093 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019
Country: United States
City: Denver
Period: 11/14/19 to 11/16/19

Keywords

  • Benchmarking
  • Big data
  • Causal discovery
  • Causal learning
  • Datasets
  • Treatment effect estimation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Cheng, L., Guo, R., Moraffah, R., Candan, K. S., Raglin, A., & Liu, H. (2020). A practical data repository for causal learning with big data. In W. Gao, J. Zhan, G. Fox, X. Lu, & D. Stanzione (Eds.), Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers (pp. 234-248). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12093 LNCS). Springer. https://doi.org/10.1007/978-3-030-49556-5_23