TY - GEN
T1 - Map task scheduling in MapReduce with data locality
T2 - 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013
AU - Wang, Weina
AU - Zhu, Kai
AU - Ying, Lei
AU - Tan, Jian
AU - Zhang, Li
PY - 2013/9/2
Y1 - 2013/9/2
N2 - Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay. We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance based on Little's law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay optimal in the heavy-traffic regime.
AB - Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay. We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance based on Little's law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay optimal in the heavy-traffic regime.
UR - http://www.scopus.com/inward/record.url?scp=84883072667&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84883072667&partnerID=8YFLogxK
U2 - 10.1109/INFCOM.2013.6566957
DO - 10.1109/INFCOM.2013.6566957
M3 - Conference contribution
AN - SCOPUS:84883072667
SN - 9781467359467
T3 - Proceedings - IEEE INFOCOM
SP - 1609
EP - 1617
BT - 2013 Proceedings IEEE INFOCOM 2013
Y2 - 14 April 2013 through 19 April 2013
ER -