Data locality in MapReduce: A network perspective

Weina Wang; Lei Ying

doi:10.1016/j.peva.2015.12.002

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Data locality, a critical consideration for the performance of task scheduling in MapReduce, has been addressed in the literature by increasing the number of locally processed tasks. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk to the machine where it will be processed in advance, then processing a remote task is the same as processing a local task. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. We prove that the Joint Scheduler is throughput optimal; i.e., it supports any load that is supportable by any other algorithm. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput and delay performance significantly compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard.

Original language	English (US)
Pages (from-to)	1-11
Number of pages	11
Journal	Performance Evaluation
Volume	96
DOIs	https://doi.org/10.1016/j.peva.2015.12.002
State	Published - Feb 2016

Keywords

Data locality
MapReduce
Routing
Scheduling
Throughput

ASJC Scopus subject areas

Software
Modeling and Simulation
Hardware and Architecture
Computer Networks and Communications

Cite this

@article{ab603a3f8ff84bebb320ca0c0660e772,

title = "Data locality in MapReduce: A network perspective",

abstract = "Data locality, a critical consideration for the performance of task scheduling in MapReduce, has been addressed in the literature by increasing the number of locally processed tasks. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk to the machine where it will be processed in advance, then processing a remote task is the same as processing a local task. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. We prove that the Joint Scheduler is throughput optimal; i.e., it supports any load that is supportable by any other algorithm. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput and delay performance significantly compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard.",

keywords = "Data locality, MapReduce, Routing, Scheduling, Throughput",

author = "Weina Wang and Lei Ying",

note = "Funding Information: This work was supported in part by NSF Grant ECCS-1255425. Weina Wang received her B.E. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2009. She is currently pursuing a Ph.D. degree in the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe, AZ. Her research interests include resource allocation in stochastic networks, data privacy and game theory. She won the Joseph A. Barkson Fellowship for the 2015–16 academic year. Lei Ying (M{\textquoteright}08) received his B.E. degree from Tsinghua University, Beijing, China, and his M.S. and Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign. He is currently an Associate Professor at the School of Electrical, Computer and Energy Engineering at Arizona State University, and an Associate Editor of the IEEE/ACM Transactions on Networking. His research interest is broadly in the area of stochastic networks, including cloud computing, communication networks and social networks. He is coauthor with R. Srikant of the book Communication Networks: An Optimization, Control and Stochastic Networks Perspective, Cambridge University Press, 2014. The book has been selected as a notable book in the Computing Reviews{\textquoteright} 19th Annual Best of Computing list. He won the Young Investigator Award from the Defense Threat Reduction Agency (DTRA) in 2009 and NSF CAREER Award in 2010. He was the Northrop Grumman Assistant Professor in the Department of Electrical and Computer Engineering at Iowa State University from 2010 to 2012. He received the best paper award at IEEE INFOCOM 2015. Publisher Copyright: {\textcopyright} 2015 Elsevier B.V. All rights reserved.",

year = "2016",

month = feb,

doi = "10.1016/j.peva.2015.12.002",

language = "English (US)",

volume = "96",

pages = "1--11",

journal = "Performance Evaluation",

issn = "0166-5316",

publisher = "Elsevier",

}

TY - JOUR

T1 - Data locality in MapReduce

T2 - A network perspective

AU - Wang, Weina

AU - Ying, Lei

N1 - Funding Information: This work was supported in part by NSF Grant ECCS-1255425. Weina Wang received her B.E. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2009. She is currently pursuing a Ph.D. degree in the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe, AZ. Her research interests include resource allocation in stochastic networks, data privacy and game theory. She won the Joseph A. Barkson Fellowship for the 2015–16 academic year. Lei Ying (M’08) received his B.E. degree from Tsinghua University, Beijing, China, and his M.S. and Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign. He is currently an Associate Professor at the School of Electrical, Computer and Energy Engineering at Arizona State University, and an Associate Editor of the IEEE/ACM Transactions on Networking. His research interest is broadly in the area of stochastic networks, including cloud computing, communication networks and social networks. He is coauthor with R. Srikant of the book Communication Networks: An Optimization, Control and Stochastic Networks Perspective, Cambridge University Press, 2014. The book has been selected as a notable book in the Computing Reviews’ 19th Annual Best of Computing list. He won the Young Investigator Award from the Defense Threat Reduction Agency (DTRA) in 2009 and NSF CAREER Award in 2010. He was the Northrop Grumman Assistant Professor in the Department of Electrical and Computer Engineering at Iowa State University from 2010 to 2012. He received the best paper award at IEEE INFOCOM 2015. Publisher Copyright: © 2015 Elsevier B.V. All rights reserved.

PY - 2016/2

Y1 - 2016/2

AB - Data locality, a critical consideration for the performance of task scheduling in MapReduce, has been addressed in the literature by increasing the number of locally processed tasks. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk to the machine where it will be processed in advance, then processing a remote task is the same as processing a local task. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. We prove that the Joint Scheduler is throughput optimal; i.e., it supports any load that is supportable by any other algorithm. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput and delay performance significantly compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard.

KW - Data locality

KW - MapReduce

KW - Routing

KW - Scheduling

KW - Throughput

UR - http://www.scopus.com/inward/record.url?scp=84961164965&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961164965&partnerID=8YFLogxK

U2 - 10.1016/j.peva.2015.12.002

DO - 10.1016/j.peva.2015.12.002

M3 - Article

AN - SCOPUS:84961164965

SN - 0166-5316

VL - 96

SP - 1

EP - 11

JO - Performance Evaluation

JF - Performance Evaluation

ER -
