The bounded data reuse problem in scientific workflows

Mohsen Zohrevandi, Rida Bazzi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

Large datasets and time-consuming processes have become the norm in scientific computing applications. The exploration phase in the development of scientific workflows involves trial and error with workflow components, which can take a long time because the workflow tasks themselves are time-consuming. This suggests that development time can be reduced by reusing intermediate data whenever possible. However, storage space is always limited. This raises a problem: given a limited amount of storage, which intermediate datasets from one workflow should be kept for reuse in another workflow? For the general class of series-parallel graphs, we model this problem using a non-linear integer programming formulation and show that it is NP-hard. We provide an optimal branch-and-bound algorithm as well as efficient heuristics. We conducted experiments over a large set of randomly generated workflows as well as a smaller set of synthetic workflows based on real-world workflows used by scientists in different disciplines. Our experiments show that the best solution produced by the heuristics differs from the optimal value by less than 1% on average.
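To make the storage trade-off concrete, here is a minimal Python sketch of a toy version of the problem. It is not the paper's formulation or algorithm: the task names, runtimes, sizes, and the helpers `rerun_cost` and `best_subset` are all hypothetical, and the search is plain exhaustive enumeration rather than branch and bound. It assumes a second run needs the output of one target task, that a kept dataset removes the need to rerun its producer, and that upstream tasks rerun only if some rerun task still needs their output.

```python
from itertools import combinations

# Hypothetical diamond-shaped (series-parallel) workflow: a -> {b, c} -> d.
# "time" is task runtime, "size" is the storage needed to keep its output.
tasks = {
    "a": {"time": 10, "size": 4, "inputs": []},
    "b": {"time": 5, "size": 3, "inputs": ["a"]},
    "c": {"time": 7, "size": 3, "inputs": ["a"]},
    "d": {"time": 2, "size": 10, "inputs": ["b", "c"]},
}

def rerun_cost(tasks, target, kept):
    """Total runtime of the tasks that must rerun to produce `target`,
    given the set `kept` of tasks whose outputs are already stored."""
    rerun, stack, cost = set(), [target], 0.0
    while stack:
        t = stack.pop()
        if t in rerun or t in kept:
            continue  # already counted, or its output is available in storage
        rerun.add(t)
        cost += tasks[t]["time"]
        stack.extend(tasks[t]["inputs"])  # a rerun task needs all its inputs
    return cost

def best_subset(tasks, target, capacity):
    """Exhaustively pick the outputs to keep (total size <= capacity)
    that minimize the rerun cost. Exponential; for tiny examples only."""
    names = list(tasks)
    best = (rerun_cost(tasks, target, set()), ())
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            if sum(tasks[n]["size"] for n in combo) > capacity:
                continue  # does not fit in the storage budget
            cost = rerun_cost(tasks, target, set(combo))
            if cost < best[0]:
                best = (cost, combo)
    return best

print(best_subset(tasks, "d", capacity=6))  # (2.0, ('b', 'c'))
```

In this toy instance, keeping only `b` saves 5 time units and keeping only `c` saves 7, but keeping both saves 22 because task `a` no longer has to rerun. Savings that depend on which other datasets are kept are one way to see why a simple additive (knapsack-style) objective would not capture the problem.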

Original language: English (US)
Title of host publication: Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013
Pages: 1051-1062
Number of pages: 12
DOI: https://doi.org/10.1109/IPDPS.2013.71
State: Published - 2013
Event: 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013 - Boston, MA, United States
Duration: May 20, 2013 to May 24, 2013

Other

Other: 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013
Country: United States
City: Boston, MA
Period: 5/20/13 to 5/24/13

Fingerprint

Natural sciences computing
Integer programming
Experiments

Keywords

  • Data Reuse
  • Intermediate Data
  • Scientific Workflows
  • Series-Parallel

ASJC Scopus subject areas

  • Software

Cite this

Zohrevandi, M., & Bazzi, R. (2013). The bounded data reuse problem in scientific workflows. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013 (pp. 1051-1062). [6569884] https://doi.org/10.1109/IPDPS.2013.71

@inproceedings{d724eaac23cd4c479cfdca3476e72916,
title = "The bounded data reuse problem in scientific workflows",
abstract = "Large datasets and time-consuming processes have become the norm in scientific computing applications. The exploration phase in the development of scientific workflows involves trial-and-error with workflow components, which can take a lot of time given the time-consuming nature of the workflow tasks. These facts suggest the possibility of reducing the development time by reusing intermediate data whenever possible. However the storage space is always limited. This introduces a problem: which intermediate datasets from one workflow should be kept to be reused in another workflow, with a limited amount of storage. For the general class of series parallel graphs, we model this problem using a non-linear integer programming formulation and show that it is NP-Hard. We provide a branch and bound optimal algorithm as well as efficient heuristics. We conducted experiments over a large set of randomly-generated workflows as well as a smaller set of synthetic workflows which are based on real-world workflows used by scientists in different disciplines. Our experiments show that the best solution produced by the heuristics only differs from the optimal value by less than 1{\%} on average.",
keywords = "Data Reuse, Intermediate Data, Scientific Workflows, Series-Parallel",
author = "Mohsen Zohrevandi and Rida Bazzi",
year = "2013",
doi = "10.1109/IPDPS.2013.71",
language = "English (US)",
pages = "1051--1062",
booktitle = "Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013",

}

TY - GEN

T1 - The bounded data reuse problem in scientific workflows

AU - Zohrevandi, Mohsen

AU - Bazzi, Rida

PY - 2013

Y1 - 2013

N2 - Large datasets and time-consuming processes have become the norm in scientific computing applications. The exploration phase in the development of scientific workflows involves trial-and-error with workflow components, which can take a lot of time given the time-consuming nature of the workflow tasks. These facts suggest the possibility of reducing the development time by reusing intermediate data whenever possible. However the storage space is always limited. This introduces a problem: which intermediate datasets from one workflow should be kept to be reused in another workflow, with a limited amount of storage. For the general class of series parallel graphs, we model this problem using a non-linear integer programming formulation and show that it is NP-Hard. We provide a branch and bound optimal algorithm as well as efficient heuristics. We conducted experiments over a large set of randomly-generated workflows as well as a smaller set of synthetic workflows which are based on real-world workflows used by scientists in different disciplines. Our experiments show that the best solution produced by the heuristics only differs from the optimal value by less than 1% on average.

KW - Data Reuse

KW - Intermediate Data

KW - Scientific Workflows

KW - Series-Parallel

UR - http://www.scopus.com/inward/record.url?scp=84884862421&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84884862421&partnerID=8YFLogxK

U2 - 10.1109/IPDPS.2013.71

DO - 10.1109/IPDPS.2013.71

M3 - Conference contribution

AN - SCOPUS:84884862421

SP - 1051

EP - 1062

BT - Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013

ER -