Exploiting common subexpressions for cloud query processing

Yasin Silva, Paul Ake Larson, Jingren Zhou

Research output: Chapter in Book/Report/Conference proceedingConference contribution

19 Citations (Scopus)

Abstract

Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common sub expressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common sub expression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on a column A and another one partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common sub expression. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.

Original languageEnglish (US)
Title of host publicationProceedings - International Conference on Data Engineering
Pages1337-1348
Number of pages12
DOIs
StatePublished - 2012
EventIEEE 28th International Conference on Data Engineering, ICDE 2012 - Arlington, VA, United States
Duration: Apr 1 2012Apr 5 2012

Other

OtherIEEE 28th International Conference on Data Engineering, ICDE 2012
CountryUnited States
CityArlington, VA
Period4/1/124/5/12

Fingerprint

Query processing
Costs
Servers
Industry

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Silva, Y., Larson, P. A., & Zhou, J. (2012). Exploiting common subexpressions for cloud query processing. In Proceedings - International Conference on Data Engineering (pp. 1337-1348). [6228202] https://doi.org/10.1109/ICDE.2012.106

Exploiting common subexpressions for cloud query processing. / Silva, Yasin; Larson, Paul Ake; Zhou, Jingren.

Proceedings - International Conference on Data Engineering. 2012. p. 1337-1348 6228202.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Silva, Y, Larson, PA & Zhou, J 2012, Exploiting common subexpressions for cloud query processing. in Proceedings - International Conference on Data Engineering., 6228202, pp. 1337-1348, IEEE 28th International Conference on Data Engineering, ICDE 2012, Arlington, VA, United States, 4/1/12. https://doi.org/10.1109/ICDE.2012.106
Silva Y, Larson PA, Zhou J. Exploiting common subexpressions for cloud query processing. In Proceedings - International Conference on Data Engineering. 2012. p. 1337-1348. 6228202 https://doi.org/10.1109/ICDE.2012.106
Silva, Yasin ; Larson, Paul Ake ; Zhou, Jingren. / Exploiting common subexpressions for cloud query processing. Proceedings - International Conference on Data Engineering. 2012. pp. 1337-1348
@inproceedings{6b7a935471f249558028c456eb728b6a,
title = "Exploiting common subexpressions for cloud query processing",
abstract = "Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common sub expressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common sub expression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on a column A and another one partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common sub expression. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57{\%} lower estimated costs.",
author = "Yasin Silva and Larson, {Paul Ake} and Jingren Zhou",
year = "2012",
doi = "10.1109/ICDE.2012.106",
language = "English (US)",
pages = "1337--1348",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - Exploiting common subexpressions for cloud query processing

AU - Silva, Yasin

AU - Larson, Paul Ake

AU - Zhou, Jingren

PY - 2012

Y1 - 2012

N2 - Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common sub expressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common sub expression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on a column A and another one partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common sub expression. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.

AB - Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common sub expressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common sub expression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on a column A and another one partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common sub expression. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.

UR - http://www.scopus.com/inward/record.url?scp=84864252206&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84864252206&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2012.106

DO - 10.1109/ICDE.2012.106

M3 - Conference contribution

AN - SCOPUS:84864252206

SP - 1337

EP - 1348

BT - Proceedings - International Conference on Data Engineering

ER -