A practical data repository for causal learning with big data

Lu Cheng, Ruocheng Guo, Raha Moraffah, K. Selçuk Candan, Adrienne Raglin, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution



The recent success of machine learning (ML) has led to a massive emergence of AI applications and to rising expectations that AI systems will achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations from the dependencies in real-world data, whereas human-level AI often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is therefore crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference are of limited use because they are too "ideal" (small, clean, homogeneous, and low-dimensional) to describe real-world scenarios, where data is often large, noisy, heterogeneous, and high-dimensional. This severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then offer hindsight on the advantages, disadvantages, and limitations of these datasets. All datasets discussed in this work are listed in our GitHub repository (https://github.com/rguo12/awesome-causality-data).

Original language: English (US)
Title of host publication: Benchmarking, Measuring, and Optimizing - 2nd BenchCouncil International Symposium, Bench 2019, Revised Selected Papers
Editors: Wanling Gao, Jianfeng Zhan, Geoffrey Fox, Xiaoyi Lu, Dan Stanzione
Number of pages: 15
ISBN (Print): 9783030495558
State: Published - 2020
Event: 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019 - Denver, United States
Duration: Nov 14 2019 to Nov 16 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12093 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 2nd International Symposium on Benchmarking, Measuring, and Optimization, Bench 2019
Country/Territory: United States


Keywords

  • Benchmarking
  • Big data
  • Causal discovery
  • Causal learning
  • Datasets
  • Treatment effect estimation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

