Exploiting common subexpressions for cloud query processing

Yasin Silva, Paul Ake Larson, Jingren Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Scopus citations

Abstract

Many companies now routinely run massive data analysis jobs - expressed in some scripting language - on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another may want it partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.

Original language: English (US)
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 1337-1348
Number of pages: 12
State: Published - 2012
Event: IEEE 28th International Conference on Data Engineering, ICDE 2012 - Arlington, VA, United States
Duration: Apr 1 2012 → Apr 5 2012

Other

Other: IEEE 28th International Conference on Data Engineering, ICDE 2012
Country: United States
City: Arlington, VA
Period: 4/1/12 → 4/5/12

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software
