NTRS - NASA Technical Reports Server

Failure analysis and modeling of a multicomputer system

This thesis describes the results of an extensive measurement-based analysis of real error data collected from a 7-machine DEC VaxCluster multicomputer system. In addition to evaluating basic system error and failure characteristics, we develop reward models to analyze the impact of failures and errors on the system. The results show that, although 98 percent of errors in the shared resources are recovered from, they account for 48 percent of all system failures. The analysis of rewards shows that the expected reward rate for the VaxCluster decreases to 0.5 in 100 days for a 3-out-of-7 model, which is well over 100 times that for a 7-out-of-7 model. A comparison of the reward rates for a range of k-out-of-n models indicates that the maximum increase in reward rate (0.25) occurs in going from the 6-out-of-7 model to the 5-out-of-7 model. The analysis also shows that software errors have the lowest reward rate (0.2, versus 0.91 for network errors). The large loss in reward rate for software errors arises because a large proportion (94 percent) of software errors lead to failure. In comparison, the high reward rate for network errors is due to fast recovery from a majority of these errors (the median recovery duration is 0 seconds).
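
The abstract compares reward rates across k-out-of-n models without stating how a reward is assigned. The following is a minimal sketch of one common reading, assuming a binary reward: an interval earns reward 1 when at least k of the n machines are operational and 0 otherwise. The function name `k_out_of_n_reward` and the sample data are hypothetical and are not taken from the thesis, which may define its reward functions differently.

```python
from typing import Sequence


def k_out_of_n_reward(up_counts: Sequence[int], k: int) -> float:
    """Return the fraction of intervals in which at least k machines are up.

    Assumption: binary reward (1 if the k-out-of-n condition holds, else 0),
    averaged over equal-length observation intervals.
    """
    if not up_counts:
        raise ValueError("up_counts must not be empty")
    return sum(1 for up in up_counts if up >= k) / len(up_counts)


# Hypothetical numbers of operational machines (out of 7) in ten equal
# intervals. Illustrative only; these are not measurements from the thesis.
samples = [7, 7, 6, 7, 5, 7, 6, 7, 7, 4]

# Average reward for each k-out-of-7 model, from the least to the most strict.
rewards = {k: k_out_of_n_reward(samples, k) for k in range(3, 8)}

for k, reward in rewards.items():
    print(f"{k}-out-of-7: {reward:.2f}")
```

Under this assumption, relaxing the requirement (lowering k) can never reduce the reward, which is consistent with the abstract's observation that less strict models retain a higher reward rate.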
Document ID
19900014664
Document Type
Thesis/Dissertation
Authors
Subramani, Sujatha Srinivasan (Illinois Univ. Urbana-Champaign, IL, United States)
Date Acquired
August 14, 2013
Publication Date
February 1, 1990
Subject Category
COMPUTER SYSTEMS
Report/Patent Number
CSG-120
AD-A219653
NAS 1.26:186697
NASA-CR-186697
UILU-ENG-90-2206
Funding Number(s)
CONTRACT_GRANT: NCA2-184
CONTRACT_GRANT: N00014-84-C-0149
CONTRACT_GRANT: NCA2-301
Distribution Limits
Public
Copyright
Work of the US Gov. Public Use Permitted.