

#### A New Approach to System-Level Single Event Survivability Prediction



To be presented by Melanie Berg at the Microelectronics Reliability & Qualification Working Meeting (MRQW), El Segundo, CA, February 6-7, 2018

## Acronyms



- Combinatorial logic (CL)
- Commercial off the shelf (COTS)
- Complementary metal-oxide semiconductor (CMOS)
- Device under test (DUT)
- Edge-triggered flip-flops (DFFs)
- Electronic design automation (EDA)
- Error rate (λ)
- Error rate per bit ($\lambda_{bit}$)
- Error rate per system ($\lambda_{system}$)
- Field programmable gate array (FPGA)
- Global triple modular redundancy (GTMR)
- Hardware description language (HDL)
- Input output (I/O)
- Intellectual Property (IP)
- Linear energy transfer (LET)
- Mean fluence to failure (MFTF)
- Mean time to failure (MTTF)
- Number of used bits (#UsedBits)
- Operational frequency (fs)
- Personal Computer (PC)
- Probability of configuration upsets (P<sub>configuration</sub>)
- Probability of Functional Logic upsets (P<sub>functionalLogic</sub>)
- Probability of single event functional interrupt (P<sub>SEFI</sub>)
- Probability of system failure (P<sub>system</sub>)
- Processor (PC)
- Radiation Effects and Analysis Group (REAG)
- Reliability over time (R(t))
- Reliability over fluence (R( $\Phi$ ))
- Single event effect (SEE)
- Single event functional interrupt (SEFI)
- Single event latch-up (SEL)
- Single event transient (SET)
- Single event upset (SEU)
- Single event upset cross-section (σ<sub>SEU</sub>)
- System on a chip (SoC)
- Windowed Shift Register (WSR)
- Xilinx Virtex 5 field programmable gate array (V5)
- Xilinx Virtex 5 field programmable gate array radiation hardened (V5QV)



### **Problem Statement and Abstract**

- The process of applying single event upset (SEU) data to characterize system performance in radiation environments needs improvement.
- We are investigating the application of classical reliability performance metrics combined with standard SEU analysis data to improve system survivability prediction.

This presentation describes a simplified approach to extrapolating SEU data to complex systems. Future work will incorporate additional details.

# **Background (1): FPGA SEU Susceptibility**



SEU Cross Section ( $\sigma_{SEU}$ )

- $\sigma_{SEU}$ values (per category) are calculated from SEU test and analysis.
- $\sigma_{SEU}$ values are calculated per particle linear energy transfer (LET).
- It is commonly believed that the dominant σ<sub>SEU</sub> contributions are per bit (configuration memory or flip-flops (DFFs)). However, global routes are also significant (more so than DFFs).





#### Background (2): Conventional Conversion of SEU Cross-Sections to Error Rates for Complex Systems

- **Bottom-Up approach (transistor level)**:
  - Given $\sigma_{SEU}$ (per bit), use an error rate calculator (such as CRÈME96) to obtain an error rate per bit ($\lambda_{bit}$).
  - Multiply $\lambda_{bit}$ by the number of used memory bits (#UsedBits, covering configuration bits and DFFs) in the target design to obtain a system error rate ($\lambda_{system}$).
- **Top-Down approach (system level):**
  - Given $\sigma_{SEU}$ (per system), use an error rate calculator (such as CRÈME96) to obtain a system error rate ($\lambda_{system}$).
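The bottom-up step is a single multiplication. The sketch below shows it with hypothetical placeholder values (the per-bit rate and used-bit count are invented, not results from this work); `bottom_up_system_rate` is an illustrative name. In the top-down approach no scaling is needed, because the calculator is fed a per-system cross-section and already returns $\lambda_{system}$.

```python
# Minimal sketch of the bottom-up conversion: lambda_system = lambda_bit * #UsedBits.
# The numbers below are hypothetical placeholders, not measured data.

def bottom_up_system_rate(lambda_bit: float, used_bits: int) -> float:
    """Scale a per-bit error rate (e.g. from an error rate calculator) by the used-bit count."""
    if used_bits < 0:
        raise ValueError("used_bits must be non-negative")
    return lambda_bit * used_bits


if __name__ == "__main__":
    lambda_bit = 1.0e-8      # hypothetical errors per bit per day
    used_bits = 2_000_000    # hypothetical count of used configuration bits + DFFs
    rate = bottom_up_system_rate(lambda_bit, used_bits)
    print(f"bottom-up lambda_system = {rate:.3g} errors/day")
```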





# **Technical Problems with Current Methods of Error Rate Calculation**

- For submission to CRÈME96, σ<sub>SEU</sub> data (in log-linear form) are fitted to a Weibull curve.
  - The curve-fitting process can introduce a large amount of error.
  - Consequently, resultant error rates for the same design can vary by decades.
- Because of the error rate calculation process, σ<sub>SEU</sub> data are blended together, and it is nearly impossible to home in on the problem spots. This matters when deciding where to insert mitigation.
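To see why the fit matters, the sketch below evaluates the four-parameter Weibull form commonly used for cross-section versus LET, $\sigma(L) = \sigma_{sat}\left[1 - e^{-((L-L_0)/W)^s}\right]$ for $L > L_0$. It is not the CRÈME96 fitting procedure itself; the two parameter sets are hypothetical and simply agree at high LET while diverging by orders of magnitude near threshold, which is where a small change in fit can move a computed error rate substantially.

```python
import math


def weibull_cross_section(let, sigma_sat, l0, width, shape):
    """Four-parameter Weibull form for sigma_SEU versus LET (assumed form).

    let: LET (MeV*cm^2/mg); sigma_sat: saturation cross-section (cm^2);
    l0: onset LET; width: width parameter W; shape: shape parameter s.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if let <= l0:
        return 0.0
    return sigma_sat * (1.0 - math.exp(-(((let - l0) / width) ** shape)))


# Two hypothetical fits with the same saturation value but different threshold behavior.
fit_a = dict(sigma_sat=1e-7, l0=0.5, width=10.0, shape=1.5)
fit_b = dict(sigma_sat=1e-7, l0=1.0, width=15.0, shape=2.5)

if __name__ == "__main__":
    for let in (1.5, 3.0, 10.0, 40.0):
        a = weibull_cross_section(let, **fit_a)
        b = weibull_cross_section(let, **fit_b)
        print(f"LET={let:5.1f}  fit A={a:.2e}  fit B={b:.2e}  ratio={a / b:.1f}")
```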



#### Technical Problems with Bottom-Up Analysis Method

- Multiplying λ<sub>bit</sub> by every used bit in a design is not an effective method of system error rate prediction.
  - It works well for memory structures, but complex systems do not operate or respond like memories.
  - If an SEU affects a bit that is inactive, disabled, or masked, a system malfunction might not occur.
    - Using the same multiplication factor across all DFFs will therefore produce extreme over-estimates.



$$\lambda_{system} < \lambda_{bit} \times \#UsedBits$$
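One hypothetical way to see where the inequality comes from: give each used bit an assumed probability that an upset in it actually reaches the system output. The propagation probabilities below are invented for illustration and are not derived from SEU data; the function names are likewise illustrative.

```python
# Hypothetical illustration of lambda_system < lambda_bit * #UsedBits.
# Each used bit is given an ASSUMED probability that an upset in it actually
# reaches the system output (it may be inactive, disabled, or masked).


def naive_system_rate(lambda_bit: float, used_bits: int) -> float:
    """Bottom-up estimate: every used bit counts fully."""
    return lambda_bit * used_bits


def derated_system_rate(lambda_bit: float, propagation_probs: list[float]) -> float:
    """Weight each bit's contribution by its assumed propagation probability."""
    return sum(lambda_bit * p for p in propagation_probs)


if __name__ == "__main__":
    lambda_bit = 1.0e-8  # hypothetical errors per bit per day
    # Hypothetical design: 1,000 bits always observable, 9,000 bits masked 95% of the time.
    probs = [1.0] * 1_000 + [0.05] * 9_000
    naive = naive_system_rate(lambda_bit, len(probs))
    derated = derated_system_rate(lambda_bit, probs)
    print(f"naive = {naive:.3g}, derated = {derated:.3g}, over-estimate x{naive / derated:.1f}")
```

The point is qualitative: any bit whose upsets are sometimes masked contributes less than a full λ<sub>bit</sub>, so the bottom-up product can only over-estimate.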

#### Let's Not Reinvent The Wheel... A Proven Solution Can Be Found in Classical Reliability System-Level Analysis

The exponential reliability model, $R(t)=e^{-t/MTTF}$ or equivalently $R(t)=e^{-\lambda t}$ (a Weibull model with slope = 1), assumes that during the useful lifetime:

- Failures are independent.
- Error rate is constant.
- MTTF = $1/\lambda$.
- For a given LET (across fluence):
  - SEUs are independent.
  - $\sigma_{SEU}$ is constant.
  - MFTF = $1/\sigma_{SEU}$.
- Hence, mapping from the time domain to the fluence domain (per LET) is straightforward:
  - $t \Leftrightarrow \Phi$
  - MTTF ⇔ MFTF
  - $\lambda \Leftrightarrow \sigma_{SEU}$
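Because the mapping only renames variables, one exponential function serves both domains. In this sketch `exponential_reliability` is an illustrative name, and the rate and cross-section values are hypothetical; they are chosen equal so that both domains give the same reliability.

```python
import math


def exponential_reliability(x: float, mean_to_failure: float) -> float:
    """R(x) = exp(-x / mean_to_failure) for a constant failure rate.

    Time domain:    x = t,   mean_to_failure = MTTF = 1 / lambda
    Fluence domain: x = Phi, mean_to_failure = MFTF = 1 / sigma_SEU (at one LET)
    """
    if mean_to_failure <= 0:
        raise ValueError("mean_to_failure must be positive")
    return math.exp(-x / mean_to_failure)


if __name__ == "__main__":
    lam = 1.0e-3    # hypothetical errors per hour
    sigma = 1.0e-3  # hypothetical sigma_SEU in cm^2 at a single LET
    r_time = exponential_reliability(100.0, 1.0 / lam)       # t = 100 hours
    r_fluence = exponential_reliability(100.0, 1.0 / sigma)  # Phi = 100 particles/cm^2
    print(r_time, r_fluence)  # same model and numbers, different domain
```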

#### Mapping Classical Reliability Models from the Time Domain to the Fluence Domain

The exponential model relates reliability to MTTF:

$$R(t) = e^{-t/MTTF}$$

Parallel between time and fluence:

$$\lambda_{system} = \frac{\#\text{errors}}{\text{time}} \qquad \sigma_{SEU} = \frac{\#\text{errors}}{\text{fluence}}$$

$$R(\Phi) = e^{-\Phi/MFTF}$$
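In practice, the fluence-domain quantities come straight from beam-test counts at one LET. A minimal sketch, assuming at least one system error was observed; the function names and the test outcome (4 errors over 1.0×10<sup>5</sup> particles/cm<sup>2</sup>) are hypothetical.

```python
import math


def sigma_seu(num_errors: int, fluence: float) -> float:
    """sigma_SEU = #errors / fluence (cm^2), measured at a single LET."""
    return num_errors / fluence


def mftf_from_test(num_errors: int, fluence: float) -> float:
    """MFTF = 1 / sigma_SEU = fluence / #errors (particles/cm^2); needs num_errors > 0."""
    if num_errors <= 0:
        raise ValueError("at least one observed error is required")
    return fluence / num_errors


def reliability_vs_fluence(phi: float, mftf: float) -> float:
    """R(Phi) = exp(-Phi / MFTF)."""
    return math.exp(-phi / mftf)


if __name__ == "__main__":
    # Hypothetical beam-test outcome at one LET.
    errors, fluence = 4, 1.0e5
    mftf = mftf_from_test(errors, fluence)
    print(f"sigma_SEU = {sigma_seu(errors, fluence):.3g} cm^2, MFTF = {mftf:.3g} particles/cm^2")
    print(f"R(Phi = 1.0) = {reliability_vs_fluence(1.0, mftf):.6f}")
```

No curve fit is involved: the reliability curve at each LET follows directly from the measured MFTF.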



### Example of Proposed Methodology Application

- Mission requirements:
  - Selection shall be made between a Xilinx V5QV (relatively expensive device) and a Xilinx V5 with embedded PowerPC (relatively cheap device).
  - FPGA operation shall have reliability of 3-nines (99.9%) within a 10-minute window at Geosynchronous Equatorial Orbit (GEO).
- Proposed methodology:
  - Create a histogram of particle flux versus LET for a 10-minute window of time in the target environment.
  - Calculate MFTF per LET (from SEU test data).
  - Graph R($\Phi$) for a variety of LET values and their associated MFTFs, where R($\Phi$) = e<sup>-$\Phi$/MFTF</sup>.
  - For selected ranges of LET, use an upper bound of particle flux (particles/(cm<sup>2</sup>•10 minutes)) to determine whether the system will meet the mission's reliability requirements.
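The last two steps can be sketched as a per-bin check. The `LetBin` structure, the function names, and the two bins below are hypothetical; the bin values are invented and are not the GEO data used later. Each bin is tested on its own against the requirement, matching the piecewise analysis in this presentation. Combining bins into a single system reliability would need an additional assumption that is not made here.

```python
import math
from dataclasses import dataclass


@dataclass
class LetBin:
    """One LET range from the environment histogram (hypothetical structure)."""
    label: str
    flux_upper_bound: float  # particles/(cm^2 * 10 minutes) in this LET range
    mftf: float              # particles/cm^2 for the design in this LET range


def bin_reliability(b: LetBin) -> float:
    """R(Phi) = exp(-Phi / MFTF) evaluated at the bin's upper-bound flux."""
    return math.exp(-b.flux_upper_bound / b.mftf)


def check_bins(bins: list[LetBin], requirement: float = 0.999) -> dict[str, bool]:
    """Piecewise check: report whether each LET bin alone meets the requirement."""
    return {b.label: bin_reliability(b) >= requirement for b in bins}


if __name__ == "__main__":
    # Invented bins for illustration only.
    bins = [LetBin("low LET", 0.5, 1.0e4), LetBin("high LET", 0.05, 20.0)]
    for b in bins:
        print(f"{b.label}: R = {bin_reliability(b):.5f}")
    print(check_bins(bins))
```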

# **Environment Data: Flux versus LET Histogram for a 10-Minute Window**



Figure: Flux versus LET histogram for Geosynchronous Equatorial Orbit (GEO), 100-mil shielding.



#### MFTF versus LET for the Xilinx V5 Embedded PowerPC Core and the Xilinx V5QV MicroBlaze Soft Processor Core


- V5QV: no system errors were observed below LET = 1.8 MeV•cm<sup>2</sup>/mg, with total fluence > 5.0×10<sup>8</sup> particles/cm<sup>2</sup>.
- PowerPC:
  - No system errors were observed below LET = 0.07 MeV•cm<sup>2</sup>/mg, with total fluence = 3.0×10<sup>7</sup> particles/cm<sup>2</sup>.
  - Hence, at LET = 0.07 MeV•cm<sup>2</sup>/mg, we assume a conservative MFTF = 3.0×10<sup>7</sup> particles/cm<sup>2</sup> (no failure was observed up to this fluence).
  - More tests would increase the MFTF for this bin.
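When a bin shows no errors, MFTF = fluence/#errors is undefined. The sketch below follows the convention used for the PowerPC bin at LET = 0.07 MeV•cm<sup>2</sup>/mg: take the total test fluence as the MFTF. `conservative_mftf` is a hypothetical helper name, not an established tool.

```python
def conservative_mftf(num_errors: int, fluence: float) -> float:
    """MFTF estimate (particles/cm^2) from a beam test at a single LET.

    Errors observed: MFTF = fluence / num_errors.
    Zero errors: use the total test fluence (the convention used for the
    PowerPC bin at LET = 0.07); more error-free fluence raises the value.
    """
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    if num_errors < 0:
        raise ValueError("num_errors must be non-negative")
    if num_errors == 0:
        return fluence
    return fluence / num_errors


if __name__ == "__main__":
    # PowerPC bin at LET = 0.07 MeV*cm^2/mg: zero errors over 3.0e7 particles/cm^2.
    print(f"MFTF = {conservative_mftf(0, 3.0e7):.2g} particles/cm^2")
```

The design choice mirrors the last bullet above: the zero-error estimate can only grow with more test fluence, so it stays on the pessimistic side.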



#### Reliability across Fluence up to LET = 0.07 MeV•cm<sup>2</sup>/mg



#### Reliability across Fluence up to LET = 0.14 MeV•cm<sup>2</sup>/mg





#### Reliability across Fluence up to LET = 1.8 MeV•cm<sup>2</sup>/mg





#### Reliability across Fluence up to LET = 3.6 MeV•cm<sup>2</sup>/mg





Within this LET range, reliability at 0.23 particles/(cm<sup>2</sup>•10 minutes) exceeds 99.999% for both design implementations.

#### Reliability across Fluence at LET = 40 MeV•cm<sup>2</sup>/mg

Binned GEO environment data show approximately 0.07 particles/(cm<sup>2</sup>•10 minutes) in the range of 3.6 MeV•cm<sup>2</sup>/mg to 40.0 MeV•cm<sup>2</sup>/mg.

- V5QV: MFTF = 2.0×10<sup>4</sup> particles/cm<sup>2</sup>, $R(\Phi) = e^{-\Phi/2.0 \times 10^4}$
- PowerPC: MFTF = 2.8×10<sup>2</sup> particles/cm<sup>2</sup>, $R(\Phi) = e^{-\Phi/2.8 \times 10^2}$
- The PowerPC curve falls below 99.99% at approximately 0.028 particles/cm<sup>2</sup> ($\Phi = -2.8 \times 10^2 \ln 0.9999$).

Figure: Reliability (0.9994 to 0.9999) versus fluence (0 to 0.1 particles/cm<sup>2</sup>) for both designs.

Within this LET range, reliability at 0.07 particles/(cm<sup>2</sup>•10 minutes) exceeds 99.9% for both design implementations. The analysis can be refined by using smaller LET bins.
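The arithmetic behind the LET = 40 MeV•cm<sup>2</sup>/mg result can be reproduced from the MFTF values and the 0.07 particles/(cm<sup>2</sup>•10 minutes) flux quoted for the 3.6 to 40.0 MeV•cm<sup>2</sup>/mg range. The helper names are hypothetical. `fluence_at_reliability` and `min_mftf_for` simply invert $R(\Phi) = e^{-\Phi/MFTF}$.

```python
import math


def reliability(phi: float, mftf: float) -> float:
    """R(Phi) = exp(-Phi / MFTF)."""
    return math.exp(-phi / mftf)


def fluence_at_reliability(target: float, mftf: float) -> float:
    """Invert R = exp(-Phi / MFTF): fluence at which R drops to target."""
    return -mftf * math.log(target)


def min_mftf_for(target: float, phi: float) -> float:
    """Smallest MFTF that keeps R(phi) >= target."""
    return -phi / math.log(target)


if __name__ == "__main__":
    flux_window = 0.07  # particles/(cm^2 * 10 minutes), 3.6-40 MeV*cm^2/mg bin
    mftf = {"V5QV": 2.0e4, "PowerPC": 2.8e2}
    for name, m in mftf.items():
        print(f"{name}: R(0.07) = {reliability(flux_window, m):.6f}, "
              f"99.99% reached at {fluence_at_reliability(0.9999, m):.3g} particles/cm^2")
    print(f"MFTF needed for 99.9% at 0.07: {min_mftf_for(0.999, flux_window):.3g}")
```

The last line gives a quick pass/fail test per bin: any design whose MFTF in that LET range exceeds the printed minimum meets the 3-nines requirement for that bin.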



# **Example Conclusion**

- Using the proposed methodology, the commercial Xilinx V5 device will meet project requirements.
- In this case, the project is able to save money by selecting the significantly cheaper FPGA device and gain performance because of the embedded PowerPC.



# Conclusions



- This study transforms proven classical reliability models into the SEU particle fluence domain. The intent is to better characterize SEU responses for complex systems.
- The method for reliability-model application is as follows:
  - SEU data are obtained as MFTF.
  - Reliability curves (in the fluence domain) are calculated using MFTF and analyzed piecemeal (per LET range).
  - Environment data are then used to determine particle flux exposure within required windows of mission operation.
- The proposed method does not rely on data fitting and hence removes a significant source of error.
- The proposed method provides information about highly SEU-susceptible scenarios and hence enables a better choice of mitigation strategy.
- This is preliminary work. There is more to come regarding environment data transformation.

This methodology expresses SEU behavior and response in terms that missions understand via classical reliability metrics.

# Acknowledgements

- Some of this work has been sponsored by the NASA Electronic Parts and Packaging (NEPP) Program.
- Thanks go to the NASA Goddard Radiation Effects and Analysis Group (REAG) for their technical assistance and support. REAG is led by Kenneth LaBel and Jonathan Pellish.

**Contact Information:** 

Melanie Berg: NASA Goddard REAG FPGA Principal Investigator: Melanie.D.Berg@NASA.GOV