Data locality in MapReduce: A network perspective

Weina Wang; Lei Ying

doi:10.1109/ALLERTON.2014.7028579

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Scopus citations

Abstract

In MapReduce, placing computation near its input data is considered desirable, since otherwise the data transmission adds delay to the task execution. This data locality problem has been studied in the literature, and most existing MapReduce scheduling algorithms focus on improving performance by increasing locality. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk in advance to the machine where it will be processed, then processing a remote task is the same as processing a local task. In other words, instead of bringing computation close to data, we can also bring data close to computation to improve the system performance. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. To show that the Joint Scheduler has superior performance, we prove that it can support any load that can be supported by some other algorithm, i.e., it achieves the maximum capacity region. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput significantly (by more than 30% in our simulations) compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard. The delay performance is also evaluated through simulations, which show a significant delay reduction under the Joint Scheduler at moderate to heavy load.
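The capacity argument in the abstract can be illustrated with a toy calculation. The sketch below is not the Joint Scheduler and is not taken from the paper; it is a minimal model under stated assumptions: every machine serves `mu` tasks per unit time, each task type reads one data chunk, and the network is treated as unconstrained (so challenge (ii), congestion, is ignored). All names (`local_only_supportable`, `with_routing_supportable`, `rates`, `replicas`) are hypothetical. It compares the loads that are supportable when tasks may only run on machines holding a replica with the loads supportable when chunks can be routed to any machine in advance.

```python
from itertools import combinations


def local_only_supportable(rates, replicas, mu=1.0):
    """Check whether a load can be served using local tasks only.

    rates:    dict chunk -> task arrival rate for that chunk
    replicas: dict chunk -> set of machine ids storing the chunk
    mu:       service rate of each machine (assumed identical)

    By a max-flow/min-cut (Hall-type) argument, the load is supportable
    iff every subset of chunks has total rate at most mu times the number
    of machines holding at least one of those chunks. Brute force over
    subsets, so only suitable for small toy instances.
    """
    chunks = list(rates)
    for k in range(1, len(chunks) + 1):
        for subset in combinations(chunks, k):
            machines = set().union(*(replicas[c] for c in subset))
            if sum(rates[c] for c in subset) > mu * len(machines) + 1e-12:
                return False
    return True


def with_routing_supportable(rates, num_machines, mu=1.0):
    """Check the load when any machine may process any chunk.

    Assumes chunks are routed ahead of time and the network never
    congests, so only total computing capacity matters.
    """
    return sum(rates.values()) <= mu * num_machines + 1e-12


# Popularity skew: the "hot" chunk is requested far more often than "cold".
rates = {"hot": 2.5, "cold": 0.5}
replicas = {"hot": {0, 1}, "cold": {2, 3}}

print(local_only_supportable(rates, replicas))  # False: 2.5 > 2 local machines
print(with_routing_supportable(rates, 4))       # True: 3.0 <= 4 machines
```

In this toy instance the two machines holding the hot chunk are overloaded while the other two sit mostly idle, which is the situation where moving data to computation can enlarge the set of supportable loads. A real scheduler would additionally have to respect link capacities when routing chunks.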

Original language	English (US)
Title of host publication	2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1110-1117
Number of pages	8
ISBN (Electronic)	9781479980093
DOIs	https://doi.org/10.1109/ALLERTON.2014.7028579
State	Published - Jan 30 2014
Event	2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014 - Monticello, United States. Duration: Sep 30 2014 → Oct 3 2014

Publication series

Name	2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014

Other

Other	2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014
Country/Territory	United States
City	Monticello
Period	9/30/14 → 10/3/14

ASJC Scopus subject areas

Computer Networks and Communications
Computer Science Applications

Access to Document

10.1109/ALLERTON.2014.7028579

Cite this

Wang, W., & Ying, L. (2014). Data locality in MapReduce: A network perspective. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014 (pp. 1110-1117). Article 7028579 (2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ALLERTON.2014.7028579

Data locality in MapReduce: A network perspective. / Wang, Weina; Ying, Lei.
2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014. Institute of Electrical and Electronics Engineers Inc., 2014. p. 1110-1117 7028579 (2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014).

Wang, W & Ying, L 2014, Data locality in MapReduce: A network perspective. in 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014., 7028579, 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014, Institute of Electrical and Electronics Engineers Inc., pp. 1110-1117, 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014, Monticello, United States, 9/30/14. https://doi.org/10.1109/ALLERTON.2014.7028579

Wang W, Ying L. Data locality in MapReduce: A network perspective. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014. Institute of Electrical and Electronics Engineers Inc. 2014. p. 1110-1117. 7028579. (2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014). doi: 10.1109/ALLERTON.2014.7028579

@inproceedings{5628c2560dfb476e9acb4fdbe5e52dd5,

title = "Data locality in MapReduce: A network perspective",

abstract = "In MapReduce, placing computation near its input data is considered desirable, since otherwise the data transmission adds delay to the task execution. This data locality problem has been studied in the literature, and most existing MapReduce scheduling algorithms focus on improving performance by increasing locality. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk in advance to the machine where it will be processed, then processing a remote task is the same as processing a local task. In other words, instead of bringing computation close to data, we can also bring data close to computation to improve the system performance. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. To show that the Joint Scheduler has superior performance, we prove that it can support any load that can be supported by some other algorithm, i.e., it achieves the maximum capacity region. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput significantly (by more than 30% in our simulations) compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard. The delay performance is also evaluated through simulations, which show a significant delay reduction under the Joint Scheduler at moderate to heavy load.",

author = "Weina Wang and Lei Ying",

note = "Funding Information: This work was supported in part by NSF Grant ECCS-1255425 . Weina Wang received her B.E. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2009. She is currently pursuing a Ph.D. degree in the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe, AZ. Her research interests include resource allocation in stochastic networks, data privacy and game theory. She won the Joseph A. Barkson Fellowship for the 2015–16 academic year. Lei Ying (M{\textquoteright}08) received his B.E. degree from Tsinghua University, Beijing, China, and his M.S. and Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign. He currently is an Associate Professor at the School of Electrical, Computer and Energy Engineering at Arizona State University, and an Associate Editor of the IEEE/ACM Transactions on Networking. His research interest is broadly in the area of stochastic networks, including cloud computing, communication networks and social networks. He is coauthor with R. Srikant of the book Communication Networks: An Optimization, Control and Stochastic Networks Perspective, Cambridge University Press, 2014. The book has been selected as a notable book in the Computing Reviews{\textquoteright} 19th Annual Best of Computing list. He won the Young Investigator Award from the Defense Threat Reduction Agency (DTRA) in 2009 and NSF CAREER Award in 2010. He was the Northrop Grumman Assistant Professor in the Department of Electrical and Computer Engineering at Iowa State University from 2010 to 2012. He received the best paper award at IEEE INFOCOM 2015. Publisher Copyright: {\textcopyright} 2014 IEEE.; 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014 ; Conference date: 30-09-2014 Through 03-10-2014",

year = "2014",

month = jan,

day = "30",

doi = "10.1109/ALLERTON.2014.7028579",

language = "English (US)",

series = "2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "1110--1117",

booktitle = "2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014",

}

TY - GEN

T1 - Data locality in MapReduce

T2 - 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014

AU - Wang, Weina

AU - Ying, Lei

N1 - Funding Information: This work was supported in part by NSF Grant ECCS-1255425 . Weina Wang received her B.E. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2009. She is currently pursuing a Ph.D. degree in the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe, AZ. Her research interests include resource allocation in stochastic networks, data privacy and game theory. She won the Joseph A. Barkson Fellowship for the 2015–16 academic year. Lei Ying (M’08) received his B.E. degree from Tsinghua University, Beijing, China, and his M.S. and Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign. He currently is an Associate Professor at the School of Electrical, Computer and Energy Engineering at Arizona State University, and an Associate Editor of the IEEE/ACM Transactions on Networking. His research interest is broadly in the area of stochastic networks, including cloud computing, communication networks and social networks. He is coauthor with R. Srikant of the book Communication Networks: An Optimization, Control and Stochastic Networks Perspective, Cambridge University Press, 2014. The book has been selected as a notable book in the Computing Reviews’ 19th Annual Best of Computing list. He won the Young Investigator Award from the Defense Threat Reduction Agency (DTRA) in 2009 and NSF CAREER Award in 2010. He was the Northrop Grumman Assistant Professor in the Department of Electrical and Computer Engineering at Iowa State University from 2010 to 2012. He received the best paper award at IEEE INFOCOM 2015. Publisher Copyright: © 2014 IEEE.

PY - 2014/1/30

Y1 - 2014/1/30

N2 - In MapReduce, placing computation near its input data is considered desirable, since otherwise the data transmission adds delay to the task execution. This data locality problem has been studied in the literature, and most existing MapReduce scheduling algorithms focus on improving performance by increasing locality. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk in advance to the machine where it will be processed, then processing a remote task is the same as processing a local task. In other words, instead of bringing computation close to data, we can also bring data close to computation to improve the system performance. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. To show that the Joint Scheduler has superior performance, we prove that it can support any load that can be supported by some other algorithm, i.e., it achieves the maximum capacity region. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput significantly (by more than 30% in our simulations) compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard. The delay performance is also evaluated through simulations, which show a significant delay reduction under the Joint Scheduler at moderate to heavy load.

UR - http://www.scopus.com/inward/record.url?scp=84946692901&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84946692901&partnerID=8YFLogxK

U2 - 10.1109/ALLERTON.2014.7028579

DO - 10.1109/ALLERTON.2014.7028579

M3 - Conference contribution

AN - SCOPUS:84946692901

T3 - 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014

SP - 1110

EP - 1117

BT - 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 30 September 2014 through 3 October 2014

ER -
