Parallel processing on networks of workstations: a fault-tolerant, high performance approach

Partha Dasgupta, Zvi M. Kedem, Michael O. Rabin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Citations (Scopus)

Abstract

One of the most sought-after software innovations of this decade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and reliability of supercomputers. Using completely novel techniques: eager scheduling, evasive memory layouts and dispersed data management, it is possible to build an execution environment for parallel programs on workstation networks. These techniques were originally developed in a theoretical framework for an abstract machine which models a shared-memory asynchronous multiprocessor. The network-of-workstations platform presents an inherently asynchronous environment for the execution of our parallel programs. This gives rise to substantial problems of correctness of the computation and of proper automatic load balancing of the work amongst the processors, so that a slow processor will not hold up the total computation. A limiting case of asynchrony is when a processor becomes infinitely slow, i.e., fails. Our methodology copes with all these problems, as well as with memory failures. An interesting feature of this system is that it is neither a fault-tolerant system extended for parallel processing nor a parallel processing system extended for fault tolerance. The same novel mechanisms ensure both properties.
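The eager-scheduling idea the abstract refers to — idle processors re-execute any work not yet known to be complete, so a slow or failed processor cannot stall overall progress — can be illustrated with a minimal thread-based sketch. All names and structure below are illustrative assumptions, not the paper's actual system; the key property shown is that tasks may be executed redundantly and the first completion wins.

```python
import threading

class EagerScheduler:
    """Toy eager scheduler: workers repeatedly pick any not-yet-done task.

    Tasks must be idempotent and side-effect-free, since a task may be
    executed by several workers; only the first completion is recorded.
    """

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.results = [None] * len(self.tasks)
        self.done = [False] * len(self.tasks)
        self.start = 0                      # rotating hint to spread workers out
        self.lock = threading.Lock()

    def next_undone(self):
        # Return the index of some task not yet completed (duplicates allowed),
        # or None once everything is done.
        with self.lock:
            n = len(self.done)
            for k in range(n):
                i = (self.start + k) % n
                if not self.done[i]:
                    self.start = (i + 1) % n
                    return i
        return None

    def worker(self):
        while True:
            i = self.next_undone()
            if i is None:
                return                      # all tasks completed by someone
            r = self.tasks[i]()             # may run concurrently on other workers
            with self.lock:
                if not self.done[i]:        # first completion wins
                    self.results[i] = r
                    self.done[i] = True

    def run(self, n_workers=4):
        threads = [threading.Thread(target=self.worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.results
```

A worker that hangs mid-task simply means another worker re-executes that task; the computation still terminates with correct results, which is the sense in which the same mechanism gives both load balancing and fault tolerance.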

Original language: English (US)
Title of host publication: Proceedings - International Conference on Distributed Computing Systems
Editors: Anon
Place of Publication: Piscataway, NJ, United States
Publisher: IEEE
Pages: 467-474
Number of pages: 8
State: Published - 1995
Event: Proceedings of the 15th International Conference on Distributed Computing Systems - Vancouver, Can
Duration: May 30 1995 - Jun 2 1995

Other

Other: Proceedings of the 15th International Conference on Distributed Computing Systems
City: Vancouver, Can
Period: 5/30/95 - 6/2/95

Fingerprint

Data storage equipment
Processing
Computer workstations
Supercomputers
Parallel processing systems
Fault tolerance
Information management
Resource allocation
Innovation
Scheduling

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Dasgupta, P., Kedem, Z. M., & Rabin, M. O. (1995). Parallel processing on networks of workstations: a fault-tolerant, high performance approach. In Anon (Ed.), Proceedings - International Conference on Distributed Computing Systems (pp. 467-474). Piscataway, NJ, United States: IEEE.

Parallel processing on networks of workstations : a fault-tolerant, high performance approach. / Dasgupta, Partha; Kedem, Zvi M.; Rabin, Michael O.

Proceedings - International Conference on Distributed Computing Systems. ed. / Anon. Piscataway, NJ, United States : IEEE, 1995. p. 467-474.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Dasgupta, P, Kedem, ZM & Rabin, MO 1995, Parallel processing on networks of workstations: a fault-tolerant, high performance approach. in Anon (ed.), Proceedings - International Conference on Distributed Computing Systems. IEEE, Piscataway, NJ, United States, pp. 467-474, Proceedings of the 15th International Conference on Distributed Computing Systems, Vancouver, Can, 5/30/95.
Dasgupta P, Kedem ZM, Rabin MO. Parallel processing on networks of workstations: a fault-tolerant, high performance approach. In Anon, editor, Proceedings - International Conference on Distributed Computing Systems. Piscataway, NJ, United States: IEEE. 1995. p. 467-474
Dasgupta, Partha ; Kedem, Zvi M. ; Rabin, Michael O. / Parallel processing on networks of workstations : a fault-tolerant, high performance approach. Proceedings - International Conference on Distributed Computing Systems. editor / Anon. Piscataway, NJ, United States : IEEE, 1995. pp. 467-474
@inproceedings{0ccf6ed89efe49a68330c5bff589769d,
title = "Parallel processing on networks of workstations: a fault-tolerant, high performance approach",
abstract = "One of the most sought-after software innovations of this decade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and reliability of supercomputers. Using completely novel techniques: eager scheduling, evasive memory layouts and dispersed data management, it is possible to build an execution environment for parallel programs on workstation networks. These techniques were originally developed in a theoretical framework for an abstract machine which models a shared-memory asynchronous multiprocessor. The network-of-workstations platform presents an inherently asynchronous environment for the execution of our parallel programs. This gives rise to substantial problems of correctness of the computation and of proper automatic load balancing of the work amongst the processors, so that a slow processor will not hold up the total computation. A limiting case of asynchrony is when a processor becomes infinitely slow, i.e., fails. Our methodology copes with all these problems, as well as with memory failures. An interesting feature of this system is that it is neither a fault-tolerant system extended for parallel processing nor a parallel processing system extended for fault tolerance. The same novel mechanisms ensure both properties.",
author = "Partha Dasgupta and Kedem, {Zvi M.} and Rabin, {Michael O.}",
year = "1995",
language = "English (US)",
pages = "467--474",
editor = "Anon",
booktitle = "Proceedings - International Conference on Distributed Computing Systems",
publisher = "IEEE",

}

TY - GEN

T1 - Parallel processing on networks of workstations

T2 - a fault-tolerant, high performance approach

AU - Dasgupta, Partha

AU - Kedem, Zvi M.

AU - Rabin, Michael O.

PY - 1995

Y1 - 1995

N2 - One of the most sought-after software innovations of this decade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and reliability of supercomputers. Using completely novel techniques: eager scheduling, evasive memory layouts and dispersed data management, it is possible to build an execution environment for parallel programs on workstation networks. These techniques were originally developed in a theoretical framework for an abstract machine which models a shared-memory asynchronous multiprocessor. The network-of-workstations platform presents an inherently asynchronous environment for the execution of our parallel programs. This gives rise to substantial problems of correctness of the computation and of proper automatic load balancing of the work amongst the processors, so that a slow processor will not hold up the total computation. A limiting case of asynchrony is when a processor becomes infinitely slow, i.e., fails. Our methodology copes with all these problems, as well as with memory failures. An interesting feature of this system is that it is neither a fault-tolerant system extended for parallel processing nor a parallel processing system extended for fault tolerance. The same novel mechanisms ensure both properties.

AB - One of the most sought-after software innovations of this decade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and reliability of supercomputers. Using completely novel techniques: eager scheduling, evasive memory layouts and dispersed data management, it is possible to build an execution environment for parallel programs on workstation networks. These techniques were originally developed in a theoretical framework for an abstract machine which models a shared-memory asynchronous multiprocessor. The network-of-workstations platform presents an inherently asynchronous environment for the execution of our parallel programs. This gives rise to substantial problems of correctness of the computation and of proper automatic load balancing of the work amongst the processors, so that a slow processor will not hold up the total computation. A limiting case of asynchrony is when a processor becomes infinitely slow, i.e., fails. Our methodology copes with all these problems, as well as with memory failures. An interesting feature of this system is that it is neither a fault-tolerant system extended for parallel processing nor a parallel processing system extended for fault tolerance. The same novel mechanisms ensure both properties.

UR - http://www.scopus.com/inward/record.url?scp=0029217792&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0029217792&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0029217792

SP - 467

EP - 474

BT - Proceedings - International Conference on Distributed Computing Systems

A2 - Anon

PB - IEEE

CY - Piscataway, NJ, United States

ER -