TY - GEN
T1 - Data-utility sensitive query processing on server clusters to support scalable data analysis services
AU - Yu, Renwei
AU - Nagendra, Mithila
AU - Nagarkar, Parth
AU - Candan, Kasim
AU - Kim, Jong Wook
N1 - Funding Information:
This work is partially funded by an HP Labs Innovation Research Program Grant “Data-Quality Aware Middleware for Scalable Data Analysis”.
PY - 2011
Y1 - 2011
N2 - The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize has resulted in the increasing popularity of batch-oriented, highly parallelizable cluster frameworks to support data analysis services. These frameworks, however, are known to have shortcomings for certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g., its precision) or interpreted. However, since existing batch-oriented data processing frameworks do not consider variations in data utility, they are not able to focus on the best results. Even if the user is interested in obtaining only a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, even when these sets contain low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the uSplit data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how uSplit adaptively samples data from "upstream" operators to help allocate resources in a work-balanced and wasted-work-avoiding manner for top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable uSplit to return top-k results with high confidence and low overhead (up to ∼9× faster than alternative schemes on 10 servers).
AB - The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize has resulted in the increasing popularity of batch-oriented, highly parallelizable cluster frameworks to support data analysis services. These frameworks, however, are known to have shortcomings for certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g., its precision) or interpreted. However, since existing batch-oriented data processing frameworks do not consider variations in data utility, they are not able to focus on the best results. Even if the user is interested in obtaining only a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, even when these sets contain low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the uSplit data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how uSplit adaptively samples data from "upstream" operators to help allocate resources in a work-balanced and wasted-work-avoiding manner for top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable uSplit to return top-k results with high confidence and low overhead (up to ∼9× faster than alternative schemes on 10 servers).
UR - http://www.scopus.com/inward/record.url?scp=84876336720&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84876336720&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-19294-4_7
DO - 10.1007/978-3-642-19294-4_7
M3 - Conference contribution
AN - SCOPUS:84876336720
SN - 9783642192937
T3 - Lecture Notes in Business Information Processing
SP - 155
EP - 184
BT - New Frontiers in Information and Software as Services
PB - Springer Verlag
ER -