A throughput optimal algorithm for map task scheduling in MapReduce with data locality

Weina Wang; Kai Zhu; Lei Ying; Jian Tan; Li Zhang

doi:10.1145/2479942.2479947

Research output: Chapter in Book/Report/Conference proceeding › Chapter

21 Scopus citations

Abstract

The MapReduce/Hadoop framework has been widely used to process large-scale datasets on computing clusters. Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Much work has been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, the fundamental limits of MapReduce computing clusters with data locality, including the capacity region and throughput-optimal algorithms, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data locality and load balancing to maximize throughput. We present a new queueing architecture and propose a map task scheduling algorithm that combines the Join the Shortest Queue policy with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm can stabilize any arrival rate vector strictly within this outer bound. This shows that the algorithm is throughput optimal and that the outer bound coincides with the actual capacity region. The proofs in this paper deal with random processing times with different parameters and with nonpreemptive tasks, which differentiates our work from much prior work, so the proof technique itself is also a contribution of this paper. Copyright is held by author/owner(s).
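The abstract names the two building blocks of the algorithm, Join the Shortest Queue for routing arriving tasks and MaxWeight for picking what an idle machine serves next, but this record does not give their details. The Python sketch below is only a rough illustration of how such a combination can look. It assumes a hypothetical architecture with one local queue per machine plus a single shared remote queue, and hypothetical service rates `alpha` (local) and `gamma` (remote); the class, function names, rates, and tie-breaking rules are all assumptions and do not come from this record.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Cluster:
    """Toy queue state. The layout is an assumption, not taken from the record."""
    num_machines: int
    alpha: float = 0.8  # hypothetical service rate for a local task
    gamma: float = 0.4  # hypothetical service rate for a remote task
    local: List[int] = field(init=False, default_factory=list)
    remote: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # One local queue length per machine, all initially empty.
        self.local = [0] * self.num_machines


def route_jsq(cluster: Cluster, local_machines: Sequence[int]) -> int:
    """Join the Shortest Queue: send an arriving task to the shortest queue
    among its local machines' queues and the shared remote queue.

    Returns the chosen machine index, or -1 for the remote queue.
    Ties are broken in favour of a local queue (an assumption).
    """
    best = min(local_machines, key=lambda m: cluster.local[m])
    if cluster.local[best] <= cluster.remote:
        cluster.local[best] += 1
        return best
    cluster.remote += 1
    return -1


def schedule_maxweight(cluster: Cluster, machine: int) -> Optional[str]:
    """MaxWeight: an idle machine serves the queue with the larger
    rate-weighted length. Returns "local", "remote", or None if both
    queues are empty. Ties go to the local queue (an assumption).
    """
    w_local = cluster.alpha * cluster.local[machine]
    w_remote = cluster.gamma * cluster.remote
    if w_local == 0 and w_remote == 0:
        return None
    if w_local >= w_remote:
        cluster.local[machine] -= 1
        return "local"
    cluster.remote -= 1
    return "remote"


if __name__ == "__main__":
    cluster = Cluster(num_machines=3)
    # Three tasks whose data is stored on machines 0 and 1.
    for _ in range(3):
        print("routed to", route_jsq(cluster, [0, 1]))
    # Machine 2 holds no local work, so it can only help with remote tasks.
    print("machine 2 serves", schedule_maxweight(cluster, 2))
```

Weighting queue lengths by service rate is what lets an idle machine prefer its faster local work unless the remote backlog is large enough to justify the slower remote service, which is the locality versus load-balancing trade-off the abstract describes.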

Original language	English (US)
Title of host publication	Performance Evaluation Review
Pages	33-42
Number of pages	10
Volume	40
Edition	4
DOIs	https://doi.org/10.1145/2479942.2479947
State	Published - Apr 2013

ASJC Scopus subject areas

Computer Networks and Communications
Hardware and Architecture
Software

Cite this

@inbook{72d8507ad4ef409c860ede19c8a145f4,

title = "A throughput optimal algorithm for map task scheduling in MapReduce with data locality",

abstract = "MapReduce/Hadoop framework has been widely used to process large-scale datasets on computing clusters. Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and throughput optimal algorithms, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to maximize throughput. We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm can stabilize any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. The proofs in this paper deal with random processing time with different parameters and nonpreemptive tasks, which differentiate our work from many other works, so the proof technique itself is also a contribution of this paper. Copyright is held by author/owner(s).",

author = "Weina Wang and Kai Zhu and Lei Ying and Jian Tan and Li Zhang",

year = "2013",

month = apr,

doi = "10.1145/2479942.2479947",

language = "English (US)",

volume = "40",

pages = "33--42",

booktitle = "Performance Evaluation Review",

edition = "4",

}

TY - CHAP

T1 - A throughput optimal algorithm for map task scheduling in MapReduce with data locality

AU - Wang, Weina

AU - Zhu, Kai

AU - Ying, Lei

AU - Tan, Jian

AU - Zhang, Li

PY - 2013/4

Y1 - 2013/4

AB - MapReduce/Hadoop framework has been widely used to process large-scale datasets on computing clusters. Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and throughput optimal algorithms, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to maximize throughput. We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm can stabilize any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. The proofs in this paper deal with random processing time with different parameters and nonpreemptive tasks, which differentiate our work from many other works, so the proof technique itself is also a contribution of this paper. Copyright is held by author/owner(s).

UR - http://www.scopus.com/inward/record.url?scp=84883467278&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883467278&partnerID=8YFLogxK

U2 - 10.1145/2479942.2479947

DO - 10.1145/2479942.2479947

M3 - Chapter

AN - SCOPUS:84883467278

VL - 40

SP - 33

EP - 42

BT - Performance Evaluation Review

ER -
