NTRS - NASA Technical Reports Server

Algorithm-Based Fault Tolerance for Numerical Subroutines
A software library implements a new methodology of detecting faults in numerical subroutines, thus enabling application programs that contain the subroutines to recover transparently from single-event upsets. The software library in question is fault-detecting middleware that is wrapped around the numerical subroutines. Conventional serial versions (based on LAPACK and FFTW) and a parallel version (based on ScaLAPACK) exist. The source code of the application program that contains the numerical subroutines is not modified, and the middleware is transparent to the user. The methodology used is a type of algorithm-based fault tolerance (ABFT). In ABFT, a checksum is computed before a computation and compared with the checksum of the computational result; an error is declared if the difference between the checksums exceeds some threshold. Novel normalization methods are used in the checksum comparison to ensure correct fault detections independent of algorithm inputs. In tests of this software reported in the peer-reviewed literature, this library was shown to enable detection of 99.9 percent of significant faults while generating no false alarms.
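The checksum comparison described in the abstract can be illustrated with a minimal sketch for matrix multiplication. This is not the library's code: the function names (`checked_matmul`, `matmul`, `matvec`, `max_abs_row_sum`), the `tol_factor` parameter, and in particular the normalization rule (scaling the threshold by the infinity norms of the inputs) are assumptions made for illustration. The record does not describe the library's actual normalization methods.

```python
import sys

def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def max_abs_row_sum(a):
    """Infinity norm of a matrix: largest sum of absolute values along a row."""
    return max(sum(abs(x) for x in row) for row in a)

def checked_matmul(a, b, product=matmul, tol_factor=100.0):
    """Return (c, fault_detected) for c = product(a, b).

    `product` stands in for the wrapped numerical subroutine (hypothetical
    hook, so a faulty routine can be substituted in a test).
    """
    ones = [1.0] * len(b[0])
    # Checksum from the inputs: A (B 1), computed without forming A B.
    expected = matvec(a, matvec(b, ones))
    c = product(a, b)
    # Checksum from the result: C 1, i.e. the row sums of the product.
    actual = matvec(c, ones)
    # Assumed normalization: scale the threshold by the input magnitudes so
    # the test behaves the same whether the data are tiny or huge.
    scale = max_abs_row_sum(a) * max_abs_row_sum(b)
    threshold = tol_factor * len(b) * sys.float_info.epsilon * scale
    diff = max(abs(e - g) for e, g in zip(expected, actual))
    return c, diff > threshold
```

Calling `checked_matmul(a, b)` with a correct `product` should report no fault, while passing a `product` that perturbs one entry of the result by more than rounding error should set the fault flag. The caller's code does not change beyond the call site, which loosely mirrors the wrapper idea in the abstract.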
Document ID
20100011232
Acquisition Source
Jet Propulsion Laboratory
Document Type
Other - NASA Tech Brief
Authors
Turmon, Michael
(California Inst. of Tech. Pasadena, CA, United States)
Granat, Robert
(California Inst. of Tech. Pasadena, CA, United States)
Lou, John
(California Inst. of Tech. Pasadena, CA, United States)
Date Acquired
August 24, 2013
Publication Date
November 1, 2007
Publication Information
Publication: NASA Tech Briefs, November 2007
Subject Category
Man/System Technology And Life Support
Report/Patent Number
NPO-42193
Distribution Limits
Public
Copyright
Public Use Permitted.