Data-utility sensitive query processing on server clusters to support scalable data analysis services

Renwei Yu; Mithila Nagendra; Parth Nagarkar; Kasim Candan; Jong Wook Kim

doi:10.1007/978-3-642-19294-4_7

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize has led to the increasing popularity of batch-oriented, highly-parallelizable cluster frameworks to support data analysis services. These frameworks, however, are known to have shortcomings for certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g., its precision) or interpreted. However, since existing batch-oriented data processing frameworks do not consider variations in data utility, they are not able to focus on the best results. Even when the user is interested in only a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, including low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the uSplit data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how uSplit adaptively samples data from "upstream" operators to help allocate resources in a work-balanced and wasted-work-avoiding manner for top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable uSplit to return top-k results with high confidence and low overhead (up to ∼9× faster than alternative schemes on 10 servers).

Original language	English (US)
Title of host publication	New Frontiers in Information and Software as Services
Subtitle of host publication	Service and Application Design Challenges in the Cloud
Publisher	Springer Verlag
Pages	155-184
Number of pages	30
ISBN (Print)	9783642192937
DOIs	https://doi.org/10.1007/978-3-642-19294-4_7
State	Published - 2011

Publication series

Name	Lecture Notes in Business Information Processing
Volume	74 LNBIP
ISSN (Print)	1865-1348

ASJC Scopus subject areas

Control and Systems Engineering
Management Information Systems
Business and International Management
Information Systems
Modeling and Simulation
Information Systems and Management

Access to Document

10.1007/978-3-642-19294-4_7

Cite this

Yu, R., Nagendra, M., Nagarkar, P., Candan, K., & Kim, J. W. (2011). Data-utility sensitive query processing on server clusters to support scalable data analysis services. In New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud (pp. 155-184). (Lecture Notes in Business Information Processing; Vol. 74 LNBIP). Springer Verlag. https://doi.org/10.1007/978-3-642-19294-4_7

Data-utility sensitive query processing on server clusters to support scalable data analysis services. / Yu, Renwei; Nagendra, Mithila; Nagarkar, Parth et al.
New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud. Springer Verlag, 2011. p. 155-184 (Lecture Notes in Business Information Processing; Vol. 74 LNBIP).

Yu, R, Nagendra, M, Nagarkar, P, Candan, K & Kim, JW 2011, Data-utility sensitive query processing on server clusters to support scalable data analysis services. in New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud. Lecture Notes in Business Information Processing, vol. 74 LNBIP, Springer Verlag, pp. 155-184. https://doi.org/10.1007/978-3-642-19294-4_7

Yu R, Nagendra M, Nagarkar P, Candan K, Kim JW. Data-utility sensitive query processing on server clusters to support scalable data analysis services. In New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud. Springer Verlag. 2011. p. 155-184. (Lecture Notes in Business Information Processing). doi: 10.1007/978-3-642-19294-4_7

@inproceedings{63f31707ebdb43949428bb86035f52f5,
  title = "Data-utility sensitive query processing on server clusters to support scalable data analysis services",
  abstract = "The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize has resulted in increasing popularity of batch-oriented, highly-parallelizable cluster frameworks to support data analysis services. These frameworks, however, are known to have shortcomings for certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g. its precision) or interpreted. However, since existing batch oriented data processing frameworks do not consider variations in data utility, they are not able to focus on the best results. Even if the user is interested in obtaining a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, even if these sets contain low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the uSplit data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how uSplit adaptively samples data from {"}upstream{"} operators to help allocate resources in a work-balanced and wasted-work avoiding manner for top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable uSplit to return top-k results with high confidence and low-overhead (up to ∼ 9× faster than alternative schemes on 10 servers).",
  author = "Renwei Yu and Mithila Nagendra and Parth Nagarkar and Kasim Candan and Kim, {Jong Wook}",
  note = "Funding Information: This work is partially funded by a HP Labs Innovation Research Program Grant “Data-Quality Aware Middleware for Scalable Data Analysis”.",
  year = "2011",
  doi = "10.1007/978-3-642-19294-4_7",
  language = "English (US)",
  isbn = "9783642192937",
  series = "Lecture Notes in Business Information Processing",
  publisher = "Springer Verlag",
  pages = "155--184",
  booktitle = "New Frontiers in Information and Software as Services",
}

TY  - GEN
T1  - Data-utility sensitive query processing on server clusters to support scalable data analysis services
AU  - Yu, Renwei
AU  - Nagendra, Mithila
AU  - Nagarkar, Parth
AU  - Candan, Kasim
AU  - Kim, Jong Wook
N1  - Funding Information: This work is partially funded by a HP Labs Innovation Research Program Grant “Data-Quality Aware Middleware for Scalable Data Analysis”.
PY  - 2011
Y1  - 2011
N2  - The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize has resulted in increasing popularity of batch-oriented, highly-parallelizable cluster frameworks to support data analysis services. These frameworks, however, are known to have shortcomings for certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g. its precision) or interpreted. However, since existing batch oriented data processing frameworks do not consider variations in data utility, they are not able to focus on the best results. Even if the user is interested in obtaining a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, even if these sets contain low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the uSplit data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how uSplit adaptively samples data from "upstream" operators to help allocate resources in a work-balanced and wasted-work avoiding manner for top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable uSplit to return top-k results with high confidence and low-overhead (up to ∼ 9× faster than alternative schemes on 10 servers).
AB  - The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize has resulted in increasing popularity of batch-oriented, highly-parallelizable cluster frameworks to support data analysis services. These frameworks, however, are known to have shortcomings for certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g. its precision) or interpreted. However, since existing batch oriented data processing frameworks do not consider variations in data utility, they are not able to focus on the best results. Even if the user is interested in obtaining a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, even if these sets contain low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the uSplit data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how uSplit adaptively samples data from "upstream" operators to help allocate resources in a work-balanced and wasted-work avoiding manner for top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable uSplit to return top-k results with high confidence and low-overhead (up to ∼ 9× faster than alternative schemes on 10 servers).
UR  - http://www.scopus.com/inward/record.url?scp=84876336720&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84876336720&partnerID=8YFLogxK
U2  - 10.1007/978-3-642-19294-4_7
DO  - 10.1007/978-3-642-19294-4_7
M3  - Conference contribution
AN  - SCOPUS:84876336720
SN  - 9783642192937
T3  - Lecture Notes in Business Information Processing
SP  - 155
EP  - 184
BT  - New Frontiers in Information and Software as Services
PB  - Springer Verlag
ER  -
