NTRS - NASA Technical Reports Server

Automatic recognition of intermittent failures - An experimental study of field data

A methodology is proposed for recognizing the symptoms of persistent problems in large systems. The system error rate is used to identify the error states among which relationships may exist. Statistical techniques are used to validate and quantify the strength of the relationship among these error states. As input, the approach takes the raw error logs containing a single entry for each error that is detected as an isolated event. As output, it produces a list of symptoms that characterize persistent errors. Thus, given a failure, it is determined whether the failure is an intermittent manifestation of a common fault or whether it is an isolated (transient) incident. The technique is shown to work on two CYBER systems and an IBM 3081 multiprocessor system. Comparisons to real failure/repair information obtained from field engineers show that, in about 85 percent of the cases, the error symptoms recognized by this approach correspond to real problems. The remaining 15 percent of the cases, although not directly supported by field data, are confirmed as being valid problems.
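
The pipeline summarized in the abstract (raw error log in, related error symptoms out) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not name the statistical techniques used, so a Pearson chi-square test on a 2x2 co-occurrence table stands in for them here. The function names, the `(timestamp, error_type)` log layout, the window length, the `min_errors` rate threshold, and the 3.84 critical value (chi-square, 1 degree of freedom, 5 percent level) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def window_error_types(events, window):
    """Group (timestamp, error_type) log entries into fixed-length time windows."""
    buckets = defaultdict(list)
    for ts, etype in events:
        buckets[int(ts // window)].append(etype)
    return buckets


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def related_error_pairs(events, window=60.0, min_errors=3, critical=3.84):
    """Return (type_x, type_y, chi2) for error types that tend to occur together.

    Assumed stand-in procedure:
    1. Windows with at least `min_errors` entries count as high-error-rate
       periods; only error types seen in them are candidates.
    2. Each candidate pair is tested for positive association over all
       windows between the first and last logged error.
    """
    buckets = window_error_types(events, window)
    if not buckets:
        return []
    total = max(buckets) - min(buckets) + 1  # includes error-free windows
    seen = {idx: set(types) for idx, types in buckets.items()}

    candidates = set()
    for types in buckets.values():
        if len(types) >= min_errors:
            candidates.update(types)

    results = []
    for x, y in combinations(sorted(candidates), 2):
        both = sum(1 for s in seen.values() if x in s and y in s)
        only_x = sum(1 for s in seen.values() if x in s) - both
        only_y = sum(1 for s in seen.values() if y in s) - both
        neither = total - both - only_x - only_y
        chi2 = chi_square_2x2(both, only_x, only_y, neither)
        # Keep only positive associations that exceed the critical value.
        if both * neither > only_x * only_y and chi2 >= critical:
            results.append((x, y, chi2))
    return results
```

Under these assumptions, error types that repeatedly appear together in bursts are reported as a candidate symptom of a persistent problem, while an error type that only ever occurs alone in quiet periods is left out as a likely transient. With the small counts typical of a short log, the chi-square approximation is rough, so the statistic is better read as a ranking aid than as a formal test.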
Document ID
19900050569
Acquisition Source
Legacy CDMS
Document Type
Reprint (Version printed in journal)
External Source(s)
Authors
Iyer, Ravishankar K.
(Illinois Univ. Urbana, IL, United States)
Young, Luke T.
(Illinois Univ. Urbana, IL, United States)
Krishna Iyer, P. V.
(Illinois Univ. Urbana, IL, United States)
Date Acquired
August 14, 2013
Publication Date
April 1, 1990
Publication Information
Publication: IEEE Transactions on Computers
Volume: 39
ISSN: 0018-9340
Subject Category
Computer Systems
Accession Number
90A37624
Funding Number(s)
CONTRACT_GRANT: N00014-84-C-0149
CONTRACT_GRANT: NAG1-613
Distribution Limits
Public
Copyright
Other
