Data locality in MapReduce: A network perspective

Weina Wang, Lei Ying

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

Data locality, a critical consideration for the performance of task scheduling in MapReduce, has been addressed in the literature by increasing the number of locally processed tasks. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk to the machine where it will be processed in advance, then processing a remote task is the same as processing a local task. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. We prove that the Joint Scheduler is throughput optimal; i.e., it supports any load that is supportable by any other algorithm. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput and delay performance significantly compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard.
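
Challenge (i) in the abstract, balancing tasks between local and remote machines, can be illustrated with a toy dispatch rule. This is a minimal sketch only and is not the paper's Joint Scheduler, whose details the abstract does not give. The function name `assign_task`, the per-machine backlog dictionary, and the fixed `remote_penalty` (standing in for the cost of moving a data chunk over the network) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def assign_task(queues: Dict[str, int], local: Set[str], remote_penalty: int = 2) -> str:
    """Assign one task to the machine with the lowest estimated cost.

    queues maps machine name -> current backlog (number of queued tasks).
    local is the set of machines that already hold the task's data chunk.
    remote_penalty is a hypothetical extra cost for fetching the chunk remotely.
    """
    def cost(machine: str) -> int:
        return queues[machine] + (0 if machine in local else remote_penalty)

    # Iterating in sorted order makes tie-breaking deterministic.
    chosen = min(sorted(queues), key=cost)
    queues[chosen] += 1
    return chosen


# Popularity skew: every task reads a chunk stored only on machine "a".
backlog = {"a": 0, "b": 0, "c": 0}
placements: List[str] = [assign_task(backlog, {"a"}) for _ in range(6)]
print(placements, backlog)
```

Under this rule a popular chunk is served locally until the local backlog exceeds a remote backlog by more than the penalty, after which work spills over to remote machines. Avoiding network congestion, challenge (ii), is not modeled here.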

Original language: English (US)
Pages (from-to): 1-11
Number of pages: 11
Journal: Performance Evaluation
Volume: 96
DOI: 10.1016/j.peva.2015.12.002
State: Published - Feb 1 2016

Keywords

  • Data locality
  • MapReduce
  • Routing
  • Scheduling
  • Throughput

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Software
  • Modeling and Simulation

Cite this

Data locality in MapReduce: A network perspective. / Wang, Weina; Ying, Lei.

In: Performance Evaluation, Vol. 96, 01.02.2016, p. 1-11.

@article{ab603a3f8ff84bebb320ca0c0660e772,
title = "Data locality in {MapReduce}: A network perspective",
abstract = "Data locality, a critical consideration for the performance of task scheduling in MapReduce, has been addressed in the literature by increasing the number of locally processed tasks. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk to the machine where it will be processed in advance, then processing a remote task is the same as processing a local task. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. We prove that the Joint Scheduler is throughput optimal; i.e., it supports any load that is supportable by any other algorithm. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput and delay performance significantly compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard.",
keywords = "Data locality, MapReduce, Routing, Scheduling, Throughput",
author = "Weina Wang and Lei Ying",
year = "2016",
month = "2",
day = "1",
doi = "10.1016/j.peva.2015.12.002",
language = "English (US)",
volume = "96",
pages = "1--11",
journal = "Performance Evaluation",
issn = "0166-5316",
publisher = "Elsevier",
}

TY  - JOUR
T1  - Data locality in MapReduce
T2  - A network perspective
AU  - Wang, Weina
AU  - Ying, Lei
PY  - 2016/2/1
Y1  - 2016/2/1
N2  - Data locality, a critical consideration for the performance of task scheduling in MapReduce, has been addressed in the literature by increasing the number of locally processed tasks. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk to the machine where it will be processed in advance, then processing a remote task is the same as processing a local task. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. We prove that the Joint Scheduler is throughput optimal; i.e., it supports any load that is supportable by any other algorithm. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput and delay performance significantly compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard.
AB  - Data locality, a critical consideration for the performance of task scheduling in MapReduce, has been addressed in the literature by increasing the number of locally processed tasks. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk to the machine where it will be processed in advance, then processing a remote task is the same as processing a local task. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. We prove that the Joint Scheduler is throughput optimal; i.e., it supports any load that is supportable by any other algorithm. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput and delay performance significantly compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard.
KW  - Data locality
KW  - MapReduce
KW  - Routing
KW  - Scheduling
KW  - Throughput
UR  - http://www.scopus.com/inward/record.url?scp=84961164965&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84961164965&partnerID=8YFLogxK
U2  - 10.1016/j.peva.2015.12.002
DO  - 10.1016/j.peva.2015.12.002
M3  - Article
AN  - SCOPUS:84961164965
VL  - 96
SP  - 1
EP  - 11
JO  - Performance Evaluation
JF  - Performance Evaluation
SN  - 0166-5316
ER  -