A throughput optimal algorithm for map task scheduling in MapReduce with data locality

Weina Wang, Kai Zhu, Lei Ying, Jian Tan, Li Zhang

Research output: Chapter in Book/Report/Conference proceeding › Chapter

19 Citations (Scopus)

Abstract

The MapReduce/Hadoop framework has been widely used to process large-scale datasets on computing clusters. Scheduling map tasks to improve data locality is crucial to the performance of MapReduce, and many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, the fundamental limits of MapReduce computing clusters with data locality, including the capacity region and throughput-optimal algorithms, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data locality and load balancing to maximize throughput. We present a new queueing architecture and propose a map task scheduling algorithm that combines the Join the Shortest Queue policy with the MaxWeight policy. We identify an outer bound on the capacity region and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. This shows that the algorithm is throughput optimal and that the outer bound coincides with the actual capacity region. The proofs in this paper handle random processing times with different parameters and nonpreemptive tasks, which differentiates our work from many others, so the proof technique itself is also a contribution of this paper. Copyright is held by author/owner(s).
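As a rough illustration of how a Join the Shortest Queue routing step and a MaxWeight scheduling step can fit together, the sketch below uses a toy model. The abstract does not specify the queueing architecture, so everything concrete here is an assumption made for illustration: one local queue per machine, a single shared remote queue, the two service rates `local_rate` and `remote_rate`, the tie-breaking rules, and the names `Cluster`, `route`, and `next_task`. It should not be read as the authors' implementation.

```python
from typing import List, Optional, Sequence


class Cluster:
    """Toy state: one local queue per machine plus one shared remote queue.

    The layout and both rates are illustrative assumptions, not taken
    from the abstract.
    """

    def __init__(self, num_machines: int, local_rate: float, remote_rate: float):
        if local_rate <= 0 or remote_rate <= 0:
            raise ValueError("service rates must be positive")
        self.local: List[int] = [0] * num_machines  # queue length per machine
        self.remote: int = 0                        # shared remote queue length
        self.local_rate = local_rate                # assumed rate for data-local tasks
        self.remote_rate = remote_rate              # assumed rate for remote tasks

    def route(self, local_machines: Sequence[int]) -> Optional[int]:
        """Join the Shortest Queue among the task's local queues and the remote queue.

        Returns the chosen machine index, or None if the task joins the
        remote queue. Ties favor locality (an assumption).
        """
        best = min(local_machines, key=lambda m: self.local[m])
        if self.local[best] <= self.remote:
            self.local[best] += 1
            return best
        self.remote += 1
        return None

    def next_task(self, machine: int) -> Optional[str]:
        """MaxWeight: an idle machine serves the queue with the larger rate * length."""
        local_weight = self.local_rate * self.local[machine]
        remote_weight = self.remote_rate * self.remote
        if local_weight == 0 and remote_weight == 0:
            return None  # nothing to serve, machine stays idle
        if local_weight >= remote_weight:
            self.local[machine] -= 1
            return "local"
        self.remote -= 1
        return "remote"


cluster = Cluster(num_machines=3, local_rate=1.0, remote_rate=0.5)
placements = [cluster.route([0, 1]) for _ in range(3)]
served = [cluster.next_task(2), cluster.next_task(0), cluster.next_task(2)]
```

In this toy run, three tasks whose data lives on machines 0 and 1 arrive in sequence; the third one spills into the remote queue once both local queues hold a task, and machine 2 (which has no local work) later picks it up. This is the locality versus load-balancing trade-off the abstract describes, in miniature.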

Original language: English (US)
Title of host publication: Performance Evaluation Review
Pages: 33-42
Number of pages: 10
Volume: 40
Edition: 4
DOI: https://doi.org/10.1145/2479942.2479947
State: Published - Apr 2013

Fingerprint

Cluster computing
Scheduling
Throughput
Scheduling algorithms
Resource allocation
Processing

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Software

Cite this

Wang, W., Zhu, K., Ying, L., Tan, J., & Zhang, L. (2013). A throughput optimal algorithm for map task scheduling in MapReduce with data locality. In Performance Evaluation Review (4 ed., Vol. 40, pp. 33-42) https://doi.org/10.1145/2479942.2479947

@inbook{72d8507ad4ef409c860ede19c8a145f4,
title = "A throughput optimal algorithm for map task scheduling in MapReduce with data locality",
author = "Weina Wang and Kai Zhu and Lei Ying and Jian Tan and Li Zhang",
year = "2013",
month = "4",
doi = "10.1145/2479942.2479947",
language = "English (US)",
volume = "40",
pages = "33--42",
booktitle = "Performance Evaluation Review",
edition = "4",

}

TY - CHAP

T1 - A throughput optimal algorithm for map task scheduling in MapReduce with data locality

AU - Wang, Weina

AU - Zhu, Kai

AU - Ying, Lei

AU - Tan, Jian

AU - Zhang, Li

PY - 2013/4

Y1 - 2013/4

UR - http://www.scopus.com/inward/record.url?scp=84883467278&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883467278&partnerID=8YFLogxK

U2 - 10.1145/2479942.2479947

DO - 10.1145/2479942.2479947

M3 - Chapter

VL - 40

SP - 33

EP - 42

BT - Performance Evaluation Review

ER -