NTRS - NASA Technical Reports Server

Error recovery in shared memory multiprocessors using private caches
The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
Document ID
19900050528
Document Type
Reprint (Version printed in journal)
Authors
Wu, Kun-Lung
(Illinois Univ. Urbana, IL, United States)
Fuchs, W. Kent
(Illinois Univ. Urbana, IL, United States)
Patel, Janak H.
(Illinois Univ. Urbana, IL, United States)
Date Acquired
August 14, 2013
Publication Date
April 1, 1990
Publication Information
Publication: IEEE Transactions on Parallel and Distributed Systems
Volume: 1
ISSN: 1045-9219
Subject Category
COMPUTER SYSTEMS
Funding Number(s)
CONTRACT_GRANT: N00014-89-K-0070
CONTRACT_GRANT: NAG1-613
Distribution Limits
Public
Copyright
Other