Calypso: a novel software system for fault-tolerant parallel processing on distributed platforms

Arash Baratloo, Partha Dasgupta, Zvi M. Kedem

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

35 Citations (Scopus)

Abstract

The importance of adapting networks of workstations for use as parallel processing platforms is well established. However, current solutions do not always address important issues that exist in real networks. External factors like the sharing of resources, unpredictable behavior of the network, and failures, are present in multiuser networks and must be addressed. Calypso is a prototype software system for writing and executing parallel programs on non-dedicated platforms, based on COTS networked workstations, operating systems, and compilers. Among notable properties of the system are: (1) simple programming paradigm incorporating shared memory constructs and separating the programming and the execution parallelism, (2) transparent utilization of unreliable shared resources by providing dynamic load balancing and fault tolerance, and (3) effective performance for large classes of coarse-grained computations. We present the system and report our initial experiments and performance results in settings that closely resemble the dynamic behavior of a 'real' network. Under varying work-load conditions, resource availability and process failures, the efficiency of the test program we present ranged from 84% to 94% bench-marked against a sequential program.

Original language: English (US)
Title of host publication: IEEE International Symposium on High Performance Distributed Computing, Proceedings
Publisher: IEEE
Pages: 122-129
Number of pages: 8
State: Published - 1995
Externally published: Yes
Event: Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing - Washington, DC, USA
Duration: Aug 2 1995 - Aug 4 1995

Other

Other: Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing
City: Washington, DC, USA
Period: 8/2/95 - 8/4/95

Fingerprint

Computer programming
Dynamic loads
Processing
Fault tolerance
Resource allocation
Availability
Data storage equipment
Experiments

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Baratloo, A., Dasgupta, P., & Kedem, Z. M. (1995). Calypso: a novel software system for fault-tolerant parallel processing on distributed platforms. In IEEE International Symposium on High Performance Distributed Computing, Proceedings (pp. 122-129). IEEE.

Calypso: a novel software system for fault-tolerant parallel processing on distributed platforms. / Baratloo, Arash; Dasgupta, Partha; Kedem, Zvi M.

IEEE International Symposium on High Performance Distributed Computing, Proceedings. IEEE, 1995. p. 122-129.

Baratloo, A, Dasgupta, P & Kedem, ZM 1995, Calypso: a novel software system for fault-tolerant parallel processing on distributed platforms. in IEEE International Symposium on High Performance Distributed Computing, Proceedings. IEEE, pp. 122-129, Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, 8/2/95.

Baratloo A, Dasgupta P, Kedem ZM. Calypso: a novel software system for fault-tolerant parallel processing on distributed platforms. In IEEE International Symposium on High Performance Distributed Computing, Proceedings. IEEE. 1995. p. 122-129.

Baratloo, Arash; Dasgupta, Partha; Kedem, Zvi M. / Calypso: a novel software system for fault-tolerant parallel processing on distributed platforms. IEEE International Symposium on High Performance Distributed Computing, Proceedings. IEEE, 1995. pp. 122-129.
@inproceedings{b8e1a929b7ee4f5b8a1ba6f0f14d4be3,
title = "Calypso: a novel software system for fault-tolerant parallel processing on distributed platforms",
abstract = "The importance of adapting networks of workstations for use as parallel processing platforms is well established. However, current solutions do not always address important issues that exist in real networks. External factors like the sharing of resources, unpredictable behavior of the network, and failures, are present in multiuser networks and must be addressed. Calypso is a prototype software system for writing and executing parallel programs on non-dedicated platforms, based on COTS networked workstations, operating systems, and compilers. Among notable properties of the system are: (1) simple programming paradigm incorporating shared memory constructs and separating the programming and the execution parallelism, (2) transparent utilization of unreliable shared resources by providing dynamic load balancing and fault tolerance, and (3) effective performance for large classes of coarse-grained computations. We present the system and report our initial experiments and performance results in settings that closely resemble the dynamic behavior of a 'real' network. Under varying work-load conditions, resource availability and process failures, the efficiency of the test program we present ranged from 84{\%} to 94{\%} bench-marked against a sequential program.",
author = "Arash Baratloo and Partha Dasgupta and Kedem, {Zvi M.}",
year = "1995",
language = "English (US)",
pages = "122--129",
booktitle = "IEEE International Symposium on High Performance Distributed Computing, Proceedings",
publisher = "IEEE",

}

TY - GEN

T1 - Calypso

T2 - a novel software system for fault-tolerant parallel processing on distributed platforms

AU - Baratloo, Arash

AU - Dasgupta, Partha

AU - Kedem, Zvi M.

PY - 1995

Y1 - 1995

N2 - The importance of adapting networks of workstations for use as parallel processing platforms is well established. However, current solutions do not always address important issues that exist in real networks. External factors like the sharing of resources, unpredictable behavior of the network, and failures, are present in multiuser networks and must be addressed. Calypso is a prototype software system for writing and executing parallel programs on non-dedicated platforms, based on COTS networked workstations, operating systems, and compilers. Among notable properties of the system are: (1) simple programming paradigm incorporating shared memory constructs and separating the programming and the execution parallelism, (2) transparent utilization of unreliable shared resources by providing dynamic load balancing and fault tolerance, and (3) effective performance for large classes of coarse-grained computations. We present the system and report our initial experiments and performance results in settings that closely resemble the dynamic behavior of a 'real' network. Under varying work-load conditions, resource availability and process failures, the efficiency of the test program we present ranged from 84% to 94% bench-marked against a sequential program.

UR - http://www.scopus.com/inward/record.url?scp=0029507111&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0029507111&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0029507111

SP - 122

EP - 129

BT - IEEE International Symposium on High Performance Distributed Computing, Proceedings

PB - IEEE

ER -