Recovery in the Calypso File System

Murthy Devarakonda; Bill Kish; Ajay Mohindra

doi:10.1145/233557.233560

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

This article presents the design and implementation of the recovery scheme in Calypso. Calypso is a cluster-optimized, distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of clients. The recovery scheme in Calypso is nondisruptive, meaning that open files remain open, client modified data are saved, and in-flight operations are properly handled across server recovery. The scheme uses distributed state among the clients to reconstruct the server state on a backup node if disks are multiported or on the rebooted server node. It guarantees data consistency during recovery and provides congestion control. Measurements show that the state reconstruction can be quite fast: for example, in a 32-node cluster, when an average node contains state for about 420 files, the reconstruction time is about 3.3 seconds. However, the time to update a file system after a failure can be a major factor in the overall recovery time, even when using journaling techniques.

Original language	English (US)
Pages (from-to)	287-310
Number of pages	24
Journal	ACM Transactions on Computer Systems
Volume	14
Issue number	3
DOIs	https://doi.org/10.1145/233557.233560
State	Published - Aug 1996
Externally published	Yes

Keywords

C.4 [Computer Systems Organization]: Performance of Systems
D.4.3 [Operating Systems]: File Systems Management - distributed file systems
D.4.5 [Operating Systems]: Reliability - fault-tolerance

ASJC Scopus subject areas

General Computer Science

Cite this

@article{d58b2e63f50a4a858e6996cc91e6089a,
  title = "Recovery in the Calypso File System",
  abstract = "This article presents the design and implementation of the recovery scheme in Calypso. Calypso is a cluster-optimized, distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of clients. The recovery scheme in Calypso is nondisruptive, meaning that open files remain open, client modified data are saved, and in-flight operations are properly handled across server recovery. The scheme uses distributed state among the clients to reconstruct the server state on a backup node if disks are multiported or on the rebooted server node. It guarantees data consistency during recovery and provides congestion control. Measurements show that the state reconstruction can be quite fast: for example, in a 32-node cluster, when an average node contains state for about 420 files, the reconstruction time is about 3.3 seconds. However, the time to update a file system after a failure can be a major factor in the overall recovery time, even when using journaling techniques.",
  keywords = "C.4 [Computer Systems Organization]: Performance of Systems, D.4.3 [Operating Systems]: File Systems Management - distributed file systems, D.4.5 [Operating Systems]: Reliability - fault-tolerance",
  author = "Murthy Devarakonda and Bill Kish and Ajay Mohindra",
  year = "1996",
  month = aug,
  doi = "10.1145/233557.233560",
  language = "English (US)",
  volume = "14",
  pages = "287--310",
  journal = "ACM Transactions on Computer Systems",
  issn = "0734-2071",
  publisher = "Association for Computing Machinery (ACM)",
  number = "3",
}

TY  - JOUR
T1  - Recovery in the Calypso File System
AU  - Devarakonda, Murthy
AU  - Kish, Bill
AU  - Mohindra, Ajay
PY  - 1996/8
Y1  - 1996/8
N2  - This article presents the design and implementation of the recovery scheme in Calypso. Calypso is a cluster-optimized, distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of clients. The recovery scheme in Calypso is nondisruptive, meaning that open files remain open, client modified data are saved, and in-flight operations are properly handled across server recovery. The scheme uses distributed state among the clients to reconstruct the server state on a backup node if disks are multiported or on the rebooted server node. It guarantees data consistency during recovery and provides congestion control. Measurements show that the state reconstruction can be quite fast: for example, in a 32-node cluster, when an average node contains state for about 420 files, the reconstruction time is about 3.3 seconds. However, the time to update a file system after a failure can be a major factor in the overall recovery time, even when using journaling techniques.
AB  - This article presents the design and implementation of the recovery scheme in Calypso. Calypso is a cluster-optimized, distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of clients. The recovery scheme in Calypso is nondisruptive, meaning that open files remain open, client modified data are saved, and in-flight operations are properly handled across server recovery. The scheme uses distributed state among the clients to reconstruct the server state on a backup node if disks are multiported or on the rebooted server node. It guarantees data consistency during recovery and provides congestion control. Measurements show that the state reconstruction can be quite fast: for example, in a 32-node cluster, when an average node contains state for about 420 files, the reconstruction time is about 3.3 seconds. However, the time to update a file system after a failure can be a major factor in the overall recovery time, even when using journaling techniques.
KW  - C.4 [Computer Systems Organization]: Performance of Systems
KW  - D.4.3 [Operating Systems]: File Systems Management - distributed file systems
KW  - D.4.5 [Operating Systems]: Reliability - fault-tolerance
UR  - http://www.scopus.com/inward/record.url?scp=0030215945&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0030215945&partnerID=8YFLogxK
U2  - 10.1145/233557.233560
DO  - 10.1145/233557.233560
M3  - Article
AN  - SCOPUS:0030215945
SN  - 0734-2071
VL  - 14
SP  - 287
EP  - 310
JO  - ACM Transactions on Computer Systems
JF  - ACM Transactions on Computer Systems
IS  - 3
ER  -
