FPGA Mitigation Strategies for Critical Applications

Melanie Berg, SSAI in support of NASA/GSFC
Melanie.D.Berg@NASA.gov

To be presented by Melanie Berg at the School on the Effects of Radiation on Embedded Systems for Space Applications (SERESSA), Seville, Spain, December 5, 2019.
Acronyms

- Application specific integrated circuit (ASIC)
- Block random access memory (BRAM)
- Block Triple Modular Redundancy (BTMR)
- Clock (CLK or CLKB)
- Clock domain crossing (CDC)
- Clock period ($\tau_{clk}$)
- Combinatorial logic (CL)
- Configurable Logic Block (CLB)
- Constant false alarm rate filter (CFAR)
- Device under test (DUT)
- Digital Signal Processing Block (DSP)
- Distributed triple modular redundancy (DTMR)
- Dual interlocked storage cell (DICE)
- Edge-triggered flip-flops (DFFs)
- Error detection and correction (EDAC)
- Error rate ($dE/dt$)
- Field programmable gate array (FPGA)
- Finite impulse response filter (FIR)
- Gate Level Netlist (EDF, EDIF, GLN)
- Global triple modular redundancy (GTMR)
- Input – output (I/O)
- Intellectual property (IP)
- INV (inverter)
- Linear energy transfer (LET)
- Local triple modular redundancy (LTMR)
- Logic equivalency checking (LEC)
- Look up table (LUT)
- Mean fluence to failure (MFTF)
- Mean Time to Failure (MTTF)
- One time programmable (OTP)
- Operational frequency ($f_s$)
- Power on reset (POR)
- Place and Route (PR)
- Probability of flip-flop upset ($P_{DFFSEU\rightarrow SEU}$)
- Probability of logic masking ($P_{logic}$)
- Probability of transient generation ($P_{gen}$)
- Probability of transient capture ($P_{SET\rightarrow SEU}$)
- Probability of transient propagation ($P_{prop}$)
- Radiation Effects and Analysis Group (REAG)
- Reprogrammable (RP)
- Single event functional interrupt (SEFI)
- Single event effects (SEEs)
- Single event latch-up (SEL)
- Single event transient (SET)
- Single event upset (SEU)
- Single event upset cross-section ($\sigma_{SEU}$)
- Static random access memory (SRAM)
- Static timing analysis (STA)
- System on a chip (SOC)
- Time delay ($\tau_{dly}$)
- Total Ionizing Dose (TID)
- Transient width ($\tau_{width}$)
- Universal Serial Bus (USB)
- Virtex-5QV (V5QV)
- Windowed Shift Register (WSR)
Agenda

- Field Programmable Gate Array (FPGA) Devices: Challenges for Critical Applications and Space Radiation Environments
- Single Event Upsets (SEUs) and FPGA Configuration
- SEUs and Single Event Transients (SETs) in FPGA Data-paths
- Fail-Safe Strategies for Critical Applications
Motivation: Concerns for using FPGA Devices in Critical Applications

- **Safety**: can life be negatively impacted?
- **Reliability**: will the device operate as expected?
- **Availability**: Includes down-time… is it acceptable?
- **Recoverability**: if the device malfunctions, can the system come back to a working state? Destructive?
- **Trust**: Will the insertion of the device compromise security?
Minimizing Risk: Failure Is Not An Option

- Critical applications: Scrutiny of design and assurance is a must in order to avoid malfunction or catastrophe.
  - Carefully evaluate mitigation… is it applicable?
  - Over evaluate limitations/disadvantages/failure modes… everything must be scrutinized… what can go wrong?
  - Figure out how to use/implement a solution and what other solutions may be necessary to close gaps of risk.
- Question… challenge … discuss … absorb … innovate!

Community Effort

Your solution might go further than a publication
FPGA Devices: Challenges for Critical Applications and Space Radiation Environments
Protecting A Critical System from Failure

- Always take into account mission requirements.

- Investigate failure modes – evaluate and minimize risk:
  - Reliability and functional testing (temperature, voltage, mechanical, and logic switching stresses).
  - Radiation testing: Single event effects (SEE), total ionizing dose (TID), and other types.
  - Verification of assurance: analyze test procedures, results, and application to mission solutions.

- Wisely add mitigation – requirements driven:
  - Replication with or without correction.
  - Detection:
    - Switch to another device
    - Try to recover state
    - Start over
    - Alert
    - Do nothing…
  - Filtration: e.g., time delay filter (TD).
  - Masking: Protect system operation from failures.
Investigating Failure Modes: Radiation Testing and SEU Cross Sections

System failures due to SEEs are second order:

- Probability that a transistor will change state.
- Probability the SEU or SET will cause malfunction.

**Terminology:**

- Flux: Particles/(sec-cm²)
- Fluence: Particles/cm²
- Linear energy transfer (LET)

SEU cross-sections ($\sigma_{\text{seu}}$) are calculated at several LET values (particle spectrum).

Mean fluence to failure (MFTF) is the inverse of $\sigma_{\text{seu}}$.

$$\sigma_{\text{seu}} = \frac{\#\text{errors}}{\text{fluence}} \quad \text{MFTF} = \frac{1}{\sigma_{\text{seu}}}$$
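As a minimal sketch of the two definitions above, the following computes a cross-section and its MFTF from one test run. The function names and the numbers (50 errors at a fluence of 1e7 particles/cm²) are hypothetical, chosen only for illustration.

```python
def seu_cross_section(num_errors: int, fluence: float) -> float:
    """Return sigma_SEU in cm^2: observed errors divided by fluence (particles/cm^2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return num_errors / fluence


def mean_fluence_to_failure(sigma_seu: float) -> float:
    """Return MFTF in particles/cm^2, the inverse of sigma_SEU."""
    if sigma_seu <= 0:
        raise ValueError("sigma_seu must be positive")
    return 1.0 / sigma_seu


# Hypothetical run at a single LET value: 50 errors at 1e7 particles/cm^2.
sigma = seu_cross_section(50, 1.0e7)
mftf = mean_fluence_to_failure(sigma)
print(f"sigma_SEU = {sigma:.2e} cm^2, MFTF = {mftf:.2e} particles/cm^2")
```

In practice the calculation is repeated for each LET value tested, giving one cross-section point per LET.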

NEPP Heavy-ion testing at Texas A&M University

Device specific and design specific failure modes.

\[ P(f_s)_{\text{error}} \propto f\left(P(f_s)_{\text{Configuration}},\ P(f_s)_{\text{functionalLogic}},\ P(f_s)_{\text{SEFI}}\right) \]

The design \( \sigma_{\text{SEU}} \) combines three contributions: configuration \( \sigma_{\text{SEU}} \), functional logic \( \sigma_{\text{SEU}} \), and SEFI \( \sigma_{\text{SEU}} \).
Preliminary Design Considerations for Mitigation And Trade Space

\[ P(f_s)_{\text{error}} = f\left(P(f_s)_{\text{Configuration}},\ P(f_s)_{\text{functionalLogic}},\ P(f_s)_{\text{SEFI}}\right) \]

- Based on mission requirements and target device susceptibilities...
  - Does the designer need to add mitigation?
  - How should the designer add mitigation?
  - How can we assure the mitigation?
- Will there be compromises?
  - Performance and speed
  - Power
  - Schedule
  - Mitigating the susceptible components?
  - Reliability (working and mitigating as expected)?

Security is now an additional component in the trade space.
Verify Applied Mitigation and Protection:
THIS IS CHALLENGING!
Theoretical assumptions and modeling do not always match reality…
Too many unknowns to model
Single Event Upsets and FPGA Configuration

\[ P_{\text{configuration}} + P_{\text{SEFI}} + P_{\text{functionalLogic}} \]
FPGA Design Process and Configuration Creation

Product Requirements

Design flow (figure):

- Hardware Description Language (HDL): custom design and IP acquisition
- Gate Level Netlist
- Mapped Place and Route
- Bitstream
- Download Bitstream to FPGA Configuration

Verification along the flow (figure annotations): simulation, static timing analysis (STA), formal methods, linting, clock domain crossing (CDC) checks, logic equivalency checking (LEC), etc.

Figure label: Reverse engineering

FPGA Devices Are Defined by Their Configuration Type

**HDL Mapping**

**Configuration Defines:**
Arrangement of pre-existing logic via programmable switches
- Functionality (logic cluster)
- Connectivity (routes)

**Programming Switch Types:**
- **Antifuse:** One time Programmable
- **SRAM:** Reprogrammable
- **Flash:** Reprogrammable
FPGA Configuration Implementation and SEU Susceptibility

- Antifuse (one time programmable): configuration is SEU immune.
- SRAM (reprogrammable): configuration is SEU susceptible.

Configuration SEU Test Results and the REAG FPGA SEU Model

Table shows the most significant SEE responses during accelerated radiation testing.

\[ P(f_s)_{\text{error}} \propto P(f_s)_{\text{Configuration}} + P(f_s)_{\text{functionalLogic}} + P(f_s)_{\text{SEFI}} \]

<table>
<thead>
<tr>
<th>Configuration Type</th>
<th>REAG Model</th>
</tr>
</thead>
<tbody>
<tr>
<td>Antifuse</td>
<td>\( P(f_s)_{\text{error}} \)</td>
</tr>
<tr>
<td>SRAM (non-mitigated)</td>
<td>\( P(f_s)_{\text{Configuration}} + P(f_s)_{\text{SEFI}} \)</td>
</tr>
<tr>
<td>Flash</td>
<td>\( P(f_s)_{\text{functionalLogic}} + P(f_s)_{\text{SEFI}} \)</td>
</tr>
<tr>
<td>Hardened SRAM</td>
<td>\( P(f_s)_{\text{Configuration}} + P(f_s)_{\text{functionalLogic}} + P(f_s)_{\text{SEFI}} \)</td>
</tr>
</tbody>
</table>
### What Does The Last Slide Mean?

<table>
<thead>
<tr>
<th>FPGA Configuration Type</th>
<th>Susceptibility</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Antifuse</strong></td>
<td>Configuration has been designated as hard regarding SEEs. Susceptibilities only exist in the data paths and global routes. However, global routes are hardened and have a low SEU susceptibility.</td>
</tr>
<tr>
<td><strong>SRAM (non-mitigated)</strong></td>
<td>Configuration has been designated as the most susceptible portion of circuitry. All other upsets (except for global routes) are too statistically insignificant to take into account. For example, studying data path transients is a waste of time; clock transient studies, however, are significant.</td>
</tr>
<tr>
<td><strong>Flash</strong></td>
<td>Configuration has been designated as hard (but NOT immune) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).</td>
</tr>
<tr>
<td><strong>Hardened SRAM</strong></td>
<td>Configuration has been designated as hardened (but NOT hard) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).</td>
</tr>
</tbody>
</table>
Take Note: Configuration SRAM is NOT Utilized the Same Way as Traditional SRAM

- Direct connections from configuration to user logic.

An affected active/used bit has the ability to instantaneously cause an unexpected effect.

No Read-Write cycle required!
Example: Routing Configuration
Upsets in a Xilinx Virtex FPGA

One configuration bit flip can cause significant malfunction. Mitigate appropriately (Single Event Functional Interrupt (SEFI)).

Fixing SRAM-based Configuration…Scrubbing Definition

- We address configuration susceptibility via scrubbing. Scrubbing is the act of writing into FPGA configuration memory while the device’s functional logic is operating, with the intent of correcting configuration memory bit errors.

Configuration scrubbing only pertains to SRAM-based configuration devices.
Scrubbers: Internal versus External

- Internal and external scrubbers are implemented to correct configuration bit-flips:
  - Internal scrubber: is created out of hard cores that reside inside the FPGA device; or is created out of user fabric logic blocks located inside the FPGA device.
  - External scrubber is implemented in a separate device.
- Typically, external scrubbers are implemented in antifuse or flash-based FPGAs.
- Internal scrubbers are more susceptible than external scrubbers.
Scrubbers: When Reality Defies Theory

- Internal scrubbers are expected to provide satisfactory results in proton and neutron environments.
  - Scrubber clock circuitry is not highly susceptible to protons or neutrons because of its high drive strength.
  - Scrubber should not require a large amount of circuitry.
- Note: Proton radiation testing of the Intel Cyclone 10 showed the device’s internal scrubber does not work as expected.
  - Scrubber failed to remain operable with a fluence of $1 \times 10^8$ particles/cm$^2$ at 100MeV.
  - Results are unexpected.
- Implementation of the scrubber means everything!
  - Did Intel use a processor based internal scrubber?
  - Use of memory will cause the scrubber to be more susceptible than expected.
  - Is the scrubber based on single error correct, double error detect (SECDED) coding? If so, multiple-bit upsets can break the scrubber and potentially cause it to write incorrect data into the configuration.
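The risk of a code-based scrubber writing bad data can be illustrated with a toy model. The sketch below is not a model of Intel's or any vendor's scrubber: it assumes a plain Hamming(7,4) single-error-correcting code, and all function names are hypothetical. A single upset is repaired, while a double upset is "corrected" at the wrong position, leaving three bad bits. A SECDED code would flag the double upset instead, but three or more upsets in one word can still be miscorrected in a similar way.

```python
def encode(data_bits):
    """Encode 4 data bits as a Hamming(7,4) word; list index 0 is position 1."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def scrub_word(word):
    """Single-error correction: flip the bit that the syndrome points at."""
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    repaired = list(word)
    if syndrome:
        repaired[syndrome - 1] ^= 1
    return repaired


def flip(word, *positions):
    """Return a copy of word with the given 1-based positions upset."""
    upset = list(word)
    for position in positions:
        upset[position - 1] ^= 1
    return upset


def bit_errors(word, golden):
    """Count the bits that differ from the golden word."""
    return sum(a != b for a, b in zip(word, golden))


golden = encode([1, 0, 1, 1])
after_single = scrub_word(flip(golden, 3))     # one upset: repaired
after_double = scrub_word(flip(golden, 3, 6))  # two upsets: miscorrected
print(bit_errors(after_single, golden))  # 0
print(bit_errors(after_double, golden))  # 3
```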
Scrubbing Warning!

Correcting a configuration bit does not fix the state in the functional logic path.

Reliably getting to an expected state after a configuration-bit SEU (that affects the design’s functionality) requires one of the following:

- **Fix configuration bit + (reset or correct DFFs) or**
- **Full reconfiguration.**
Example: Routing Configuration
Upsets in a Xilinx Virtex FPGA

Configuration + design state must be corrected after a configuration
SEU hit.
Data-path SEU Susceptibility and Analysis: the NASA Electronic Parts and Packaging (NEPP) FPGA Model

Synchronous Design Methodology and SEU Modeling

- Design topology dictates SEU susceptibility
  - Topology: how are components connected and how is data flow controlled:
    - Edge triggered Flip flops (DFFs) versus latches
    - Logic cells: Sea of gates versus look up tables (LUTs)
    - High capacitive routing
    - Large fan-out or large fan-in or feedback paths
- Synchronous design has proven to be the most reliable means of design and is the most common design topology used worldwide
  - Makes state transitioning deterministic (reliability matters).
  - Provides distinct boundary points – relates state to clock cycle.
  - Easiest method to verify.
  - Automated design tools are geared for synchronous methodology.

This presentation pertains to synchronous designs.
Synchronous Design Data Path Components

• Designs are comprised of:
  • Combinatorial Logic (CL)
  • Edge Triggered Flip-Flops (DFFs)
• Each DFF has a cone of logic that feeds its input pin (modular approach to system analysis).
  • EndPoint is DFF point of analysis.
  • Startpoints are DFFs (or inputs) that feed the EndPoint.
  • If an EndPoint has feedback, then it is its own StartPoint.
• There is a delay from StartPoint DFF to EndPoint DFF (routes and CL): $\tau_{dly}$
  • $\tau_{dly}$ is CL compute time

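Figure: DFFs and CL cone of logic. StartPoint DFFs A, B, and C feed the EndPoint DFF through routes and combinatorial logic (including A XOR B); each segment is annotated with a delay between 0.5 ns and 4 ns. The annotated path delay is $\tau_{dly} = 9.5$ ns.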
Synchronous Design Clocks And Data Path Components

- All DFFs are connected to a clock (clock tree).
- Clock tree is a balanced structure such that each DFF will experience a clock edge at virtually the same moment in time.
- Clock period: \( \tau_{\text{clk}} \)
- Clock frequency: \( f_s \)

\[ \tau_{\text{clk}} = \frac{1}{f_s} \]

\[ \tau_{\text{dly}} < \tau_{\text{clk}} - (\text{setup time} + \text{overhead}) \]

- Synchronous design: compute and hold in a deterministic manner.
- DFF is locked during compute time (its state cannot be affected)
- Created to reduce faults due to transistor switching noise.
- Less susceptible to SEEs than asynchronous design.
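The timing relation on this slide (path delay must be less than the clock period minus setup time and overhead) can be checked numerically. This is a minimal sketch; the function name and the 0.5 ns setup value are assumptions for illustration, and the 9.5 ns path delay is borrowed from the cone-of-logic example.

```python
def meets_timing(tau_dly_ns, f_s_mhz, setup_ns, overhead_ns=0.0):
    """True if tau_dly < tau_clk - (setup time + overhead), all in nanoseconds."""
    tau_clk_ns = 1.0e3 / f_s_mhz  # clock period in ns for a frequency in MHz
    return tau_dly_ns < tau_clk_ns - (setup_ns + overhead_ns)


# Hypothetical check of a 9.5 ns path with 0.5 ns setup time.
for f_s_mhz in (80.0, 100.0):
    print(f_s_mhz, "MHz:", meets_timing(9.5, f_s_mhz, setup_ns=0.5))
```

At 80 MHz the period is 12.5 ns and the path fits; at 100 MHz the period is 10 ns and the same path no longer meets the constraint.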
SEUs and SETs in Combinatorial Logic and Edge Triggered Flip Flops

<table>
<thead>
<tr>
<th>Combinatorial (CL)</th>
<th>Sequential (DFF)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Logic function generation (computation)</td>
<td>Captures and holds state of data input at rising edge of clock</td>
</tr>
</tbody>
</table>

**SET**

Glitch in the CL:

- **Single Sided**
- **Double Sided**

Example is an edge triggered DFF. Its effects are significantly different than a latch due to master-slave topology.
SEEs and How They Affect Synchronous System Next State

SEEs are asynchronous – most occur somewhere in the clock cycle.

Question: If an SEU or an SET occurs, will it cause the next state of the system to be incorrect?

Modularization: Every DFF has a cone of logic.

EndPoint DFF SEUs + StartPoint DFF SEUs + CL SETs

- EndPoint DFF SEUs: DFF upsets that occur at the clock edge. Internal DFF transient gets latched.
- StartPoint DFF SEUs: DFF upsets that occur between clock edges and are captured by EndPoints. Internal DFF latch flips state.
- CL SETs: Single event transients captured by EndPoints.
FPGA SEU Model Data Path Components

- DFF SEU: Generation in DFF ($P_{\text{DFFSEU}}$)

- StartPoint DFF SEU Capture $P(f_s)_{\text{DFFSEU} \rightarrow \text{SEU}}$:
  - Logic Masking ($P_{\text{logic}}$)
  - Capture: DFFs are masked from what happens during compute time. Did the StartPoint SEU occur early enough in the clock cycle for the EndPoint to capture the upset? Is the StartPoint Logically masked from the EndPoint?

- CL SET Capture $P(f_s)_{\text{SET} \rightarrow \text{SEU}}$:
  - SET Generation ($P_{\text{gen}}$)
  - SET Propagation strength ($P_{\text{prop}}$)
  - Logic Masking ($P_{\text{logic}}$)
  - Capture: Can the SET propagate without being masked to reach the EndPoint DFF at the next clock edge?
StartPoint SEUs And System Next State:

Does the StartPoint Flip Early Enough In The Clock Cycle for The EndPoint To Detect?

If DFF$_D$ flips its state @ time=$\tau$:

$$0 < \tau + \tau_{dly} < \tau_{clk}$$

Will the upset have time to get caught?
Potential For A StartPoint SEU To Affect Its EndPoint Next State (Temporal Masking)

StartPoint SEU occurs at a point in time within a clock cycle

\[ \tau + \tau_{dly} < \tau_{clk} \]

\(\tau_{dly}\) (delay from StartPoint to EndPoint) is fixed (road block). The EndPoint will only be affected if the change reaches its pin before the next clock edge.

Only portion of clock cycle that StartPoint can flip and affect next state

\[ \frac{\tau}{\tau_{clk}} < \frac{\tau_{clk} - \tau_{dly}}{\tau_{clk}} = 1 - \frac{\tau_{dly}}{\tau_{clk}} \]

Probability for EndPoint to detect flipped StartPoint at clock edge (next state). The path delay bounds probability of capture.

\[ \tau \cdot f_s < 1 - \tau_{dly} \cdot f_s \]

The probability that a StartPoint DFF SEU will affect the system next state decreases as system frequency increases: the bound \( 1 - \tau_{dly} f_s \) falls linearly with \( f_s \).
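A short worked example of the temporal-masking bound \( 1 - \tau_{dly} f_s \) may help. The function name and the 4 ns path delay are hypothetical; the result is clamped at zero for paths that do not fit in a clock period.

```python
def startpoint_capture_window(tau_dly_ns, f_s_mhz):
    """Fraction of the clock cycle in which a StartPoint flip can still reach
    the EndPoint before the next clock edge: 1 - tau_dly * f_s."""
    fraction = 1.0 - tau_dly_ns * f_s_mhz * 1.0e-3  # ns * MHz * 1e-3 is unitless
    return max(0.0, fraction)


# Hypothetical 4 ns StartPoint-to-EndPoint delay at three clock frequencies.
for f_s_mhz in (50.0, 100.0, 200.0):
    window = startpoint_capture_window(4.0, f_s_mhz)
    print(f"{f_s_mhz:5.0f} MHz -> capture window {window:.2f}")
```

The window shrinks from 0.80 to 0.60 to 0.20 of the cycle as the clock goes from 50 to 100 to 200 MHz.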
Logic Masking: $P_{\text{logic}}$

$P_{\text{logic}}$: Probability that an SET can logically propagate through a cone of logic. Based on state of the combinatorial logic gates and their potential masking.

“AND” gate reduces probability that SET will logically propagate

\[ 0 < P_{\text{logic}} < 1 \]

- Determining $P_{\text{logic}}$ for a complex system can be very difficult in real applications.
- Simulation and fault injection will only provide a glimpse of the state space; and is not sufficient to determine $P_{\text{logic}}$. 
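For a very small cone of logic, logic masking can be enumerated exhaustively. The sketch below assumes that all input vectors are equally likely, which is rarely true in a real design, and the function name is hypothetical. It illustrates the concept only. As noted above, this kind of enumeration does not scale to a real application's state space.

```python
from itertools import product


def p_logic(cone, num_inputs, flipped_input):
    """Fraction of input vectors for which flipping one input changes the cone
    output (assumes uniformly distributed inputs)."""
    sensitive = 0
    total = 0
    for bits in product((0, 1), repeat=num_inputs):
        flipped = list(bits)
        flipped[flipped_input] ^= 1
        if cone(*bits) != cone(*flipped):
            sensitive += 1
        total += 1
    return sensitive / total


def and2(a, b):
    return a & b


def and3(a, b, c):
    return a & b & c


print(p_logic(and2, 2, 0))  # 0.5: input A matters only when B is 1
print(p_logic(and3, 3, 0))  # 0.25: input A matters only when B and C are 1
```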
Details of Capturing StartPoint DFFs

\[ \forall\,\text{DFF}: \quad \sum_{j=1}^{\#\text{StartPoint DFFs}} \beta\, P(f_s)_{DFFSEU(j)} \left(1 - \tau_{dly(j)} f_s\right) P_{logic(j)} \]

- SEU generation occurs in a StartPoint between rising clock edges \((\beta P(f_s)_{DFFSEU})\).
- Design topology and temporal effects \((\tau_{dly})\):
  - Increase path delay – decrease probability of capture
  - Increase frequency – decrease probability of capture
- StartPoint upsets can be masked by logic between the StartPoint and its EndPoint \((P_{logic})\).
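The StartPoint sum for one EndPoint can be evaluated directly once its terms are known. This sketch uses hypothetical values throughout. It also treats each \(P(f_s)_{DFFSEU(j)}\) as a number already measured or estimated at the frequency of interest, which is an assumption.

```python
def startpoint_term(startpoints, f_s_hz, beta=1.0):
    """Sum over StartPoint DFFs of beta * P_DFFSEU * (1 - tau_dly * f_s) * P_logic.

    startpoints: iterable of (p_dffseu, tau_dly_seconds, p_logic) tuples.
    """
    total = 0.0
    for p_dffseu, tau_dly_s, p_logic in startpoints:
        temporal = max(0.0, 1.0 - tau_dly_s * f_s_hz)  # temporal masking factor
        total += beta * p_dffseu * temporal * p_logic
    return total


# Hypothetical cone with two StartPoints: a short path and a long path.
cone = [(1.0e-7, 2.0e-9, 0.5), (1.0e-7, 8.0e-9, 0.9)]
for f_s_hz in (50.0e6, 100.0e6):
    print(f"{f_s_hz / 1e6:.0f} MHz: {startpoint_term(cone, f_s_hz):.3e}")
```

With these numbers the term falls as the frequency rises, and the long path contributes less than the short one at a given frequency despite its higher \(P_{logic}\).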
Synchronous System: CL SET Capture

Difference between a CL SET and StartPoint DFF-SEU:
Double sided glitch versus single sided state switch.

Probability of SET capture by EndPoint DFF is proportional to the width of the SET.
SET Propagation to an EndPoint DFF: $P_{prop}$

- In order for the data path SET to become an upset, it must propagate and be captured by its Endpoint DFF.

- $P_{prop}$ only pertains to electrical medium (resistance and capacitance (RC) of path)
  - RC can cause SET amplitude reshaping
  - RC can cause SET width reshaping
  - RC can cause SET oscillation

- Small SETs or paths with high RC have low $P_{prop}$

- Depends on LET – be careful of fault injection results – they do not take into account the correlation between LET and SET strength (size).
Details of CL SET Capture in a Synchronous System: $P(f_s)_{\text{SET} \rightarrow \text{SEU}}$

\[
\forall\,\text{DFF}: \quad \sum_{i=1}^{\#\text{Combinatorial Cells}} P_{\text{gen}(i)}\, P_{\text{prop}(i)}\, P_{\text{logic}(i)}\, \tau_{\text{width}(i)}\, f_s
\]

- Note: difference in probability analysis between CL SET and StartPoint SEU is due to double edge function versus single sided.
- Increase frequency – increase probability of capture.
- Increase CL – increase probability of capture?? Might create more masking (error detection and correction is the perfect example).
- Increase LET – increase the width of the SET.
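The CL SET sum has the opposite frequency behavior to the StartPoint term, because each cell contributes in proportion to \(\tau_{width} f_s\). A minimal sketch, with a hypothetical function name and made-up per-cell values:

```python
def set_capture_term(cells, f_s_hz):
    """Sum over combinatorial cells of P_gen * P_prop * P_logic * tau_width * f_s.

    cells: iterable of (p_gen, p_prop, p_logic, tau_width_seconds) tuples.
    """
    return sum(
        p_gen * p_prop * p_logic * tau_width_s * f_s_hz
        for p_gen, p_prop, p_logic, tau_width_s in cells
    )


# Hypothetical single cell producing 0.5 ns transients.
cells = [(1.0e-6, 0.8, 0.5, 0.5e-9)]
for f_s_hz in (100.0e6, 200.0e6):
    print(f"{f_s_hz / 1e6:.0f} MHz: {set_capture_term(cells, f_s_hz):.3e}")
```

Doubling the clock frequency doubles the term, and a wider transient (for example at higher LET) scales it the same way.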

NEPP FPGA Model: Putting it All Together

... Analyzed Per Particle LET

\[
\forall\,\text{EndPoint DFF}(k): \quad \alpha P(f_s)_{\text{DFFSEU}(k)} + \sum_{j=1}^{\#\text{StartPoint DFFs}} \beta P(f_s)_{\text{DFFSEU}(j)} \left( 1 - \tau_{\text{dly}(j)} f_s \right) P_{\text{logic}(j)} + \sum_{i=1}^{\#\text{CL}} P_{\text{gen}(i)}\, P_{\text{prop}(i)}\, P_{\text{logic}(i)}\, \tau_{\text{width}(i)}\, f_s
\]

- Model is not expected to qualify a design (\(P_{\text{logic}}\) is too difficult to predict).
- Model is expected to assist in data analysis (clarifies events).
- Can determine if DFFs are more dominant than SETs:
  - Error rate decreases as frequency increases: DFFs are dominant.
  - Error rate increases as frequency increases: SETs are dominant.
- Same philosophy can be used to determine mitigation strength.
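One way to apply the frequency test to measured data is to look at the sign of the trend of cross-section versus clock frequency. The sketch below is an illustration only: the function names and data points are hypothetical, it uses a plain least-squares slope, and it ignores error bars, which a real analysis would need to consider before drawing a conclusion.

```python
def frequency_slope(freqs_hz, sigmas):
    """Least-squares slope of cross-section versus clock frequency."""
    n = len(freqs_hz)
    mean_f = sum(freqs_hz) / n
    mean_s = sum(sigmas) / n
    num = sum((f - mean_f) * (s - mean_s) for f, s in zip(freqs_hz, sigmas))
    den = sum((f - mean_f) ** 2 for f in freqs_hz)
    return num / den


def dominant_mechanism(freqs_hz, sigmas):
    """Classify the trend: rising with frequency suggests SETs, falling suggests DFFs."""
    slope = frequency_slope(freqs_hz, sigmas)
    if slope > 0:
        return "SETs dominant"
    if slope < 0:
        return "DFFs dominant"
    return "indeterminate"


# Hypothetical cross-sections (cm^2) at one LET value, three clock frequencies.
freqs = [50.0e6, 100.0e6, 200.0e6]
print(dominant_mechanism(freqs, [1.0e-9, 2.0e-9, 4.0e-9]))
print(dominant_mechanism(freqs, [4.0e-9, 3.0e-9, 1.5e-9]))
```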
**NEPP FPGA Model: Mitigation…**

**Analyzed Per Particle LET**

\[
\begin{aligned}
\text{EndPoint:} & \quad \alpha P(f_s)_{\text{DFFSEU}(k)} \; + \\
\text{StartPoints:} & \quad \sum_{j=1}^{\#\text{StartPoint DFFs}} \beta P(f_s)_{\text{DFFSEU}(j)} \left(1 - \tau_{\text{dly}(j)} f_s\right) P_{\text{logic}(j)} \; + \\
\text{CL:} & \quad \sum_{i=1}^{\#\text{CL}} P_{\text{gen}(i)}\, P_{\text{prop}(i)}\, P_{\text{logic}(i)}\, \tau_{\text{width}(i)}\, f_s
\end{aligned}
\]

- **\( P_{\text{logic}} \)** can be implemented by a designer for SEE mitigation.
- **\( P_{\text{prop}} \)** and **\( P_{\text{gen}} \)** can mitigate SEE by process, technology geometries, or user placement.
- **\( P_{\text{logic}} \)** manipulation is the most common method of mitigation implementation (triple modular redundancy, disable, error detection and correction (EDAC)). Deterministic mitigation is essential.
- Note: simply increasing the number of DFFs or increasing CL does not mean you have increased your susceptibility (because of **\( P_{\text{logic}} \)**). Current prediction methodologies fail us!
Warning: Clock Trees and SETs

- Examples only considered data paths.
- However, clock and reset trees (global routes) are susceptible to SETs.
- Clock trees in ASICs and FPGAs are the most overlooked mechanism of failure due to ionization.
- Global route susceptibilities must be taken into account when determining system risk.
- Global route susceptibilities are different for each FPGA device.

There is not much a user can do to mitigate clock tree SETs. However, it is imperative to know susceptibilities – probability of occurrence and associated error signatures.
Fail-safe Strategies for Data-Path Single Event Upsets (SEUs)

• The following slides will demonstrate commonly used mitigation strategies for FPGA devices.

• What you should learn:
  – The differences between mitigation strategies.
  – Strengths and weaknesses of various strategies.
  – Questions to ask or considerations to make when evaluating mitigation schemes.
  – Which mitigation schemes are best for various types of FPGA devices.

• The scope of this presentation will cover fail-safe strategies for configuration and data-path SEUs
Goal for critical applications: Limit the probability of system error propagation and/or provide detection-recovery mechanisms via fail-safe strategies.
Differentiating Fail-Safe Strategies:

• Detection:
  – Watchdog (state or logic monitoring).
  – Can range from simplistic checking to complex Decoding.
  – Action (alerting, correction, or recovery).

• Masking (does not mean correction):
  – Preventing error propagation to other logic.
  – Requires redundancy + mitigation or detection.
  – Turn off faulty path.

• Correction (error may not be masked):
  – Error state (memory) is changed/fixed.
  – Need feedback or new data flush cycle.

• Recovery:
  – Bring system to a deterministic state.
  – Might include correction.
Redundancy Is Not Enough

• Simply adding redundancy to a system is not enough to assume that the system is well protected.

• Questions/Concerns that must be addressed for a critical system expecting redundancy to cure all (or most):
  – Define system failure... what is tolerable and what is not.
  – How does the system recover from SEE?
  – How is redundancy implemented?
  – What portions of your system are protected? Does the protection comply with the results from radiation testing?
  – Is detection of malfunction required to switch to a redundant system or to recover?
  – If detection is necessary, how quickly can the detection be performed and responded to?
  – Is detection enough?... Does the system require correction?

Listed are crucial concerns that should be addressed at design reviews and prior to design implementation.
Embedded Mitigation versus User Inserted Mitigation
Radiation Hardened (per SEU) versus Commercial FPGA Devices

• For this presentation, a radiation hardened (per SEU) device is a device that has embedded mitigation (implemented by FPGA manufacturer – not the user).

• Radiation hardened FPGA devices are available to users. They make the design cycle much easier!

• SEU mitigation is generally applied to the following:
  – Data-path elements:
    • Localized redundancy inserted into library cell flip-flops (DFFs).
      – Localized Triple Modular Redundancy (LTMR) or
      – Dual interlocked Cell (DICE)
    • SET filters inserted on the DFF data input pin.
    • SET filters inserted on the DFF clock input pin.
  – Global routes.
  – Memory cells.
Localized Redundancy Embedded in Manufacturer DFF Cells

Warning! These figures are simplified schematics of the actual implementation.

**Dual Interlocked Cell (DICE)**

**Localized Triple Modular Redundancy (LTMR)**

Problem! Although DFFs are protected, SETs from the combinatorial logic in the data path and SETs in the global routes can cause incorrect data to be captured by the DFF.

Embedded Temporal Redundancy (TR): SET Filtration in The Data Path

- Temporal Filter placed directly before DFF.
- Localized scheme that reduces SET capture in the data path.
- Delays must be well controlled.
  - Every delay path shall consistently have a predefined delay and must be verified.
  - Remember: critical applications require deterministic mitigation.
- Do not implement TR as a user inserted mitigation scheme. Delay must be deterministic and it is too difficult to manage with place and route tools (for real applications).
- Maximum Clock frequency is reduced by the amount of new delay.

Crude example of TR implementation

Embedded cell implementation is ok. User fabric implementation is not ok.
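The filtering behavior can be explored with a behavioral model. The sketch below assumes one common form of temporal filter, in which the signal and two delayed copies are majority voted; the slide's own figure is not reproduced here, so this structure, the function names, and the sample values are all assumptions. It is a simulation aid only, not a suggestion to build TR in user fabric.

```python
def majority(a, b, c):
    """Two-out-of-three vote on single-bit values."""
    return (a & b) | (a & c) | (b & c)


def temporal_filter(samples, delay):
    """Vote each sample against copies delayed by `delay` and `2 * delay` steps.

    Before enough history exists, the first sample is used for the delayed taps.
    """
    out = []
    for t in range(len(samples)):
        now = samples[t]
        tap1 = samples[t - delay] if t >= delay else samples[0]
        tap2 = samples[t - 2 * delay] if t >= 2 * delay else samples[0]
        out.append(majority(now, tap1, tap2))
    return out


# Hypothetical waveforms, one sample per time step, with a delay of 3 steps.
narrow = [0] * 20
narrow[5:7] = [1, 1]    # 2-step transient, shorter than the delay
wide = [0] * 20
wide[5:10] = [1] * 5    # 5-step transient, longer than the delay

print(temporal_filter(narrow, 3))  # transient is filtered out
print(temporal_filter(wide, 3))    # transient passes through, delayed
```

A transient no wider than the delay can reach at most one of the three taps at a time, so the vote rejects it. The price is the added delay in the data path, which is the frequency penalty listed above.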
Embedded Radiation Hardened Global Routes: SET Filtration in The Global Route Path

- Some FPGAs contain radiation-hardened clock trees and other global routes (Microsemi products only).
- Global structures are generally hardened by using larger buffers.
- TR has also been used on the DFF clock pin... (Xilinx V5QV only).

Global route susceptibility is often overlooked. Beware, many devices do not have hardened global routes.
# FPGA Devices and Manufacturer Embedded Mitigation

**DFF: flip flop**

**DICE: Dual interlocked Cell**

<table>
<thead>
<tr>
<th>Configuration Type</th>
<th>Short List of Device Families</th>
<th>Embedded Mitigation</th>
<th>Most Susceptible Components</th>
</tr>
</thead>
<tbody>
<tr>
<td>SRAM</td>
<td>Stratix, Virtex, Kintex</td>
<td>No</td>
<td>Configuration and clock trees</td>
</tr>
<tr>
<td>Antifuse</td>
<td>RTAX, RTSXS</td>
<td>DFFs and clocks (configuration is already hardened by nature)</td>
<td>Combinatorial logic (however susceptibility considered low)</td>
</tr>
<tr>
<td>Flash</td>
<td>ProASIC3, RTG4, SmartFusion(2)</td>
<td>Configuration is already hardened by nature.</td>
<td>ProASIC3 and SmartFusion: DFFs and clocks; RTG4: clocks and SETs</td>
</tr>
<tr>
<td>Hardened SRAM</td>
<td>Virtex-5QV (V5QV)</td>
<td>Configuration + DICE DFFs + SET filters</td>
<td>Clocks. In some cases additional mitigation may be necessary for configuration and DFFs</td>
</tr>
</tbody>
</table>

Go to [http://radhome.gsfc.nasa.gov](http://radhome.gsfc.nasa.gov), manufacturer websites, and other space agency sites for more information on SEU data and total ionizing dose data.
User Inserted Mitigation:
Flushing, Dual Redundancy, Cold Sparing, and Triple Modular Redundancy (TMR)
Most Effective Mitigation Strategies

• Hire knowledgeable and experienced design/verification engineers. Implementation of design matters.

• Understand the target environment and mission requirements.
  – Reliability
  – Availability
  – Weigh consequences of failure

• Plan ahead:
  – How to partition and manage power domains: how often can you power flush, and what circuitry is affected during a power flush? (Power flushing is currently used for micro-latchup.)
  – How to efficiently insert mitigation.
  – How to alert.
  – How to synchronize redundant circuits (if necessary).
  – Beware of separate clock domain drift.

• Use data/information that best suits you: differentiate between un-vetted research and application-oriented data.
Most Commonly Implemented System Level Mitigation: Reset or Flush… Keeping It Simple!

- Critical applications require all registers (flip-flops) to be connected to a reset.
- A reset is used to force the system to a known (expected) state in a deterministic time period.
- Requires detection of malfunction; or user controlled maintenance scheme.
- All elements are expected to be able to operate from the reset state. However:
  - For some FPGAs, a reset is not enough. The configuration might also have to be flushed (reconfigure or scrub).
  - Availability is affected.
  - Next state information during event is most likely lost.
  - All must be taken into account when determining the effect of activating a reset in a system.

**Warning: Resets are susceptible to SEEs**
Dual Redundancy

• Dual redundant systems cannot correct (roll-back is an exception).
• Dual redundant systems are great for detection (watch-dogs).
• “Compare and Alert” systems must be highly reliable and verifiable.
• Generally not all I/O can be monitored or compared.
  – Best used for data calculation and manipulation… easiest to place compares on data buses.
• Can run in lockstep or free running. Each has unique advantages and limitations.
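A lockstep "compare and alert" check can be sketched in a few lines. The function name and data values are hypothetical. The comparison reports which cycles disagree, but it cannot tell which copy is correct, which is why dual redundancy detects rather than corrects.

```python
def lockstep_mismatches(outputs_a, outputs_b):
    """Return the cycle indices at which two lockstep copies disagree."""
    return [
        cycle
        for cycle, (a, b) in enumerate(zip(outputs_a, outputs_b))
        if a != b
    ]


# Hypothetical data-bus values per clock cycle; copy B takes an upset in cycle 2.
copy_a = [3, 7, 12, 5]
copy_b = [3, 7, 12 ^ 0b100, 5]

alerts = lockstep_mismatches(copy_a, copy_b)
print(alerts)  # [2]
```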
Cold Sparing: Elongation of System Operation

- One active system and alternate inactive systems.
- Upon active system failure, an inactive system is turned on.
- System operation is able to be elongated after failure.
- However:
  - Availability is affected… there is downtime.
  - Can your system afford the downtime (critical application)?
  - How clean is the system switch over?
  - How long is the system switch over?
- Can the system ping-pong between active and inactive components or is that portion of the system considered dead after failure?
  - Ping-ponging can be used for systems that have a low probability of destructive failures.
  - Ping-ponging can be complex and can affect availability.

Mostly used for degradation mitigation (no ping-pong)
Multiple Flushable Components (Sensor Example)

- Each sensor captures a frame of data.
- Time-tag each frame of data.
- Central unit processes and organizes frames.
- Synchronization signal to start frames.
- Synchronization is challenging... clock skew or system drift.

- If one or more components fall out (fail), then synchronize on next frame (not always easy).
- Must strategize for bulk failures.
Partial Mitigation

- Implementation of mitigation can be limited by:
  - Gate count
  - Timing
  - Accessibility to logic (e.g. IP cores)
  - Efficacy of tool

- How do we handle the challenge?
  - Research world: build tools to perform partial mitigation
  - Critical application world: requirements dictate that all inserted mitigation must be proven protection. Currently, partial mitigation is driven by requirements and noted criticality of components regarding mission success.

- Having an automated tool decide where to put protection is risky:
  - Tools are not ready to handle the abundance of parameters yet.
  - Mitigation becomes difficult to assure after implementation (theory versus reality)... adds to risk factor.
System versus Design Mitigation

• The previous slides covered system-level mitigation.

• System level mitigation generally has:
  – Detection, masking, no correction, downtime, and recovery actions.

• The following slides will discuss triple modular redundancy (TMR) techniques that can be implemented as system or design-level mitigation.

• Most of the TMR techniques will incorporate masking and detection with no downtime (unless there is a single event functional interrupt (SEFI)).

• Hence, TMR can improve system performance and availability, and extend operation time.
Mitigation – Fail Safe Strategies That Do Not Require Fault Detection but Provide SEU Masking and/or Correction: Triple Modular Redundancy (TMR)… best two out of three.
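
A minimal sketch of the "best two out of three" vote, written as a bitwise majority over three redundant words. The function name and the sample values are illustrative only; in an FPGA the same function would be a small piece of logic per bit.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise best-two-out-of-three vote over three redundant words."""
    return (a & b) | (a & c) | (b & c)


good = 0b1011_0010
upset = good ^ 0b0000_1000  # a single bit flipped in one copy

voted_one_bad = majority_vote(good, upset, good)    # one bad copy is masked
voted_two_bad = majority_vote(upset, upset, good)   # two matching bad copies win the vote
print(bin(voted_one_bad), bin(voted_two_bad))
```

The second call shows the limit of the scheme: once two copies agree on a wrong value, the voter passes the error through.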
How To Insert TMR into A Design:

- **Functional Specification**
- **HDL**
- **Synthesis**
- **Place and Route**
- **Create Configuration**

**HDL:** Hardware description language

**Output of synthesis is a gate netlist that represents the given HDL function.**

TMR can be written into the HDL. This is generally not done because it is too difficult.

TMR can be inserted during synthesis or post synthesis.

If inserted post synthesis, the gate level net-list is replicated, ripped apart, and voters + feedback are inserted.
Various TMR Schemes: Different Topologies

Block diagram of block TMR (BTMR): a complex function containing combinatorial logic (CL) and flip-flops (DFFs) is triplicated as three black boxes; majority voters are placed at the outputs of the triplet.

Block diagram of local TMR (LTMR): only flip-flops (DFFs) are triplicated and data-paths stay singular; voters are brought into the design and placed in front of the DFFs.

Block Diagram of distributed TMR (DTMR): the entire design is triplicated except for the global routes (e.g., clocks); voters are brought into the design and placed after the flip-flops (DFFs). DTMR masks and corrects most single event upsets (SEUs).

Same Definitions used by Mentor and Synopsys
TMR Implementation

• As previously illustrated, TMR can be implemented in a variety of ways.
• The definition of TMR depends on what portion of the circuit is triplicated and where the voters are placed.
• The strongest TMR implementation will triplicate all data-paths and contain separate voters for each data-path.
  – However, this can be costly: area, power, and complexity.
  – Hence a trade is performed to determine the TMR scheme that requires the least amount of effort and circuitry that will meet project requirements.
• Presentation scope: Block TMR (BTMR), Localized TMR (LTMR), Distributed TMR (DTMR), Global TMR (GTMR).
Block Triple Modular Redundancy: BTMR

• Need Feedback to Correct.
• Cannot apply internal correction from voted outputs.
• If blocks are not regularly flushed (e.g. reset), errors can accumulate – may not be an effective technique.
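
A small simulation sketch of error accumulation in BTMR, assuming three free-running 8-bit counters that are voted only at the output, with no feedback into the blocks. The upset schedule, the flush cycle, and all names are hypothetical. The first upset is masked by the voter; the second one, in a different block, is not, unless a flush happens in between.

```python
def majority_vote(a, b, c):
    return (a & b) | (a & c) | (b & c)


def run_btmr_counter(cycles, upsets, flush_cycles=()):
    """Simulate three 8-bit counters voted only at the output (no feedback).

    upsets maps a cycle number to (block index, bit mask to flip).
    Returns the cycles at which the voted output differs from the reference.
    """
    copies = [0, 0, 0]
    reference = 0
    bad_cycles = []
    for cycle in range(cycles):
        if cycle in flush_cycles:
            copies, reference = [0, 0, 0], 0  # flush: reset every block
        copies = [(c + 1) & 0xFF for c in copies]
        reference = (reference + 1) & 0xFF
        if cycle in upsets:
            index, mask = upsets[cycle]
            copies[index] ^= mask  # inject an SEU into one block
        if majority_vote(*copies) != reference:
            bad_cycles.append(cycle)
    return bad_cycles


# Hypothetical upsets: block 0 at cycle 3, block 1 at cycle 6 (same bit).
upsets = {3: (0, 0x10), 6: (1, 0x10)}
no_flush = run_btmr_counter(10, upsets)
with_flush = run_btmr_counter(10, upsets, flush_cycles={5})
print(no_flush, with_flush)
```

Without the flush, block 0 stays wrong after cycle 3, so the upset in block 1 at cycle 6 leaves two blocks in error and the voted output fails from then on.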
Examples of Flushable BTMR Designs

- Shift Registers.
- Transmission channels: It is typical for transmission channels to send and reset after every sent packet.
- Systems that can be reset (or power-cycled) every so often. Yes, that includes processors.

Transmission channel example:
Explanation of BTMR Strength and Weakness using Classical Reliability Models

<table>
<thead>
<tr>
<th>Reliability for 1 block ($R_{\text{block}}$)</th>
<th>Reliability for BTMR ($R_{\text{BTMR}}$)</th>
<th>Mean Time to Failure for 1 block (MTTF$_{\text{block}}$)</th>
<th>Mean Time to Failure BTMR (MTTF$_{\text{BTMR}}$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>$e^{-\lambda t}$</td>
<td>$3e^{-2\lambda t} - 2e^{-3\lambda t}$</td>
<td>$1/\lambda$</td>
<td>$5/(6\lambda) \approx 0.833/\lambda$</td>
</tr>
</tbody>
</table>
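
The table entries follow from the classical 2-out-of-3 model. As a sketch, assume three identical, independent blocks with constant failure rate $\lambda$ and a perfect voter (these are assumptions of the textbook model, not claims about any specific device):

$$R_{\text{BTMR}}(t) = 3R_{\text{block}}^{2} - 2R_{\text{block}}^{3} = 3e^{-2\lambda t} - 2e^{-3\lambda t}$$

$$\text{MTTF}_{\text{BTMR}} = \int_{0}^{\infty} R_{\text{BTMR}}(t)\,dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$$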

**Reliability across Fluence: Simplex System versus BTMR Version**

- **System No TMR**
- **BTMR System**

Operating a BTMR design in this time interval will provide an increase in reliability. However, over time, BTMR reliability drops off faster than a system with No TMR.

$$\lambda = \frac{\text{Failures}}{\text{Time}}$$
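
A short numerical sketch of the window in which BTMR helps, using the reliability expressions from the table. The value of `lam` is hypothetical. Setting $R_{\text{BTMR}} = R_{\text{block}}$ gives $R_{\text{block}} = 0.5$, so under this model the crossover sits at $t = \ln 2/\lambda \approx 0.693/\lambda$, which is earlier than the MTTF of a single block.

```python
import math


def r_block(lam, t):
    """Reliability of one unmitigated block."""
    return math.exp(-lam * t)


def r_btmr(lam, t):
    """Reliability of the 2-out-of-3 BTMR system from the table."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)


lam = 1.0  # hypothetical failure rate, in failures per unit time
crossover = math.log(2) / lam  # time at which R_block drops to 0.5

for t in (0.1, crossover, 2.0):
    print(f"t={t:.3f}  R_block={r_block(lam, t):.4f}  R_BTMR={r_btmr(lam, t):.4f}")
```

Before the crossover the BTMR curve lies above the simplex curve; after it, BTMR is the less reliable of the two.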
Classical Reliability: BTMR Bottom Line

• Concerns and limitations:
  – What is your reliable window of operation relative to the MTTF for one unmitigated block?
  – Over time, a BTMR system has lower reliability than an unmitigated system.
  – Applying additional replicated blocks (e.g., N-out-of-M) will only increase the reliability during the short window near start time. Over time, however, the reliability of an N-out-of-M system will fall faster as M (the number of replicated blocks) grows.

• Benefits:
  – BTMR can block an error from propagating to other areas of the system.
  – BTMR is a good (simple) solution for flushable-systems.
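
The N-out-of-M concern listed above can be checked numerically. This is a sketch under the same classical assumptions (independent, identical blocks, perfect voter), comparing majority voting over three blocks (2-out-of-3) and over five blocks (3-out-of-5). The failure rate and the sample times are hypothetical.

```python
from math import comb, exp


def r_k_of_m(k, m, r):
    """Probability that at least k of m independent blocks work,
    where each block has reliability r."""
    return sum(comb(m, i) * r**i * (1 - r) ** (m - i) for i in range(k, m + 1))


lam = 1.0  # hypothetical failure rate, in failures per unit time
for t in (0.1, 1.0):
    r = exp(-lam * t)
    print(
        f"t={t}: simplex={r:.4f}  "
        f"2-of-3={r_k_of_m(2, 3, r):.4f}  3-of-5={r_k_of_m(3, 5, r):.4f}"
    )
```

Near start time the ordering is 3-of-5, then 2-of-3, then simplex; later the ordering reverses, with the larger replicated system falling fastest.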
Additional BTMR Warnings

- With BTMR, not all I/O can be monitored.
- Address the first system failure when it occurs and correct the system state. Doing so usually requires an additional detection signal to know when one of the systems is in failure.
- AVAILABILITY!
What Should be Done If Availability Needs to be Increased?

- If the blocks within the BTMR have a relatively high upset rate with respect to the availability window, then stronger mitigation must be implemented.
- Bring the voting/correcting inside of the modules... bring the voting to the module DFFs.

The following slides illustrate the various forms of TMR that include voter insertion in the data-path.

<table>
<thead>
<tr>
<th>TMR Nomenclature</th>
<th>Description</th>
<th>TMR Acronym</th>
</tr>
</thead>
<tbody>
<tr>
<td>Local TMR</td>
<td>DFFs are triplicated</td>
<td>LTMR</td>
</tr>
<tr>
<td>Distributed TMR</td>
<td>DFFs and CL-data-paths are triplicated</td>
<td>DTMR</td>
</tr>
<tr>
<td>Global TMR</td>
<td>DFFs, CL-data-paths and global routes are triplicated</td>
<td>GTMR or XTMR</td>
</tr>
</tbody>
</table>
Describing Mitigation Effectiveness Using A Model

\[ P(\text{error}) \propto P_{\text{configuration}} + P(\text{functional Logic}) + P_{\text{SEFI}} \]

where the functional-logic term, \( P(\text{functional Logic}) \), is made up of:

\[ P(\text{DFF SEU} \rightarrow \text{SEU}) + P(\text{SET} \rightarrow \text{SEU}) \]

- \( P(\text{DFF SEU} \rightarrow \text{SEU}) \): Probability that an SEU in a DFF will manifest as an error in the next system clock cycle.
- \( P(\text{SET} \rightarrow \text{SEU}) \): Probability that an SET in a CL gate will manifest as an error in the next system clock cycle.

DFF: Edge triggered flip-flop

CL: Combinatorial Logic

Local Triple Modular Redundancy (LTMR)

- Only DFFs are triplicated. Data-paths are kept singular.
- LTMR masks upsets from DFFs and corrects DFF upsets if feedback is used.

- Good for devices where DFFs are most susceptible and configuration and CL susceptibility is insignificant; e.g., Microsemi products.
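
A behavioral sketch of one LTMR register bit, assuming three DFFs that share a single data input and a voter whose output is fed back during hold cycles. The class and method names are hypothetical, and the model ignores timing. It shows why LTMR covers DFF upsets but not a transient captured from the singular data path.

```python
def majority_vote(a, b, c):
    return (a & b) | (a & c) | (b & c)


class LtmrRegister:
    """One register bit built from three DFFs with a shared data input."""

    def __init__(self, value=0):
        self.dffs = [value, value, value]

    @property
    def q(self):
        return majority_vote(*self.dffs)

    def upset(self, index):
        self.dffs[index] ^= 1  # SEU in one of the three DFFs

    def clock(self, d=None):
        # d=None models a hold cycle: the voted output is fed back.
        next_value = self.q if d is None else d
        self.dffs = [next_value] * 3


reg = LtmrRegister(1)
reg.upset(0)
masked = reg.q            # the voter masks the single DFF upset
reg.clock()               # feedback reloads all three DFFs with the voted value
corrected = list(reg.dffs)

# Suppose the intended data is 1, but an SET on the shared data path
# presents 0 at the clock edge: all three DFFs capture the wrong value.
reg.clock(d=0)
set_captured = list(reg.dffs)
print(masked, corrected, set_captured)
```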
Windowed Shift Registers (WSRs): NEPP Test Structure

\[ \tau_{dly,\,\text{WSR}_8} > \tau_{dly,\,\text{WSR}_0} \]

\[ \tau_{dly} = \text{path delay from DFF to DFF} \]

Block diagrams of the $\text{WSR}_0$ and $\text{WSR}_8$ test structures; the combinatorial logic is built from inverters.
Adding LTMR to a Microsemi ProASIC3 Device versus RTAXs Embedded LTMR

- At lower LETs, applying LTMR to a ProASIC3 design yields an SEU response similar to (a little higher than) that of the Microsemi RTAXs series.
- At higher LETs, clock tree upsets start to dominate and LTMR in the ProASIC3 is not as effective.
- Depending on your target radiation environment, the ProASIC3 may be acceptable for your application.
- Note: RTAX2000 INV=8 has a lower SEU response than RTAX2000 INV=0 at low LET. This is from SET filtering (high capacitive routing – 130nm antifuse).
LTMR Should Not Be Used in An SRAM Based FPGA

Proven via NEPP experiments: SEU data for LTMR implemented in Xilinx FPGA devices are similar to or worse than those with no added mitigation.
Distributed Triple Modular Redundancy (DTMR)

- Triplicate all data-paths (NOT clocks) and add voters after DFFs.
- DTMR masks upsets from configuration + DFFs + CL and corrects captured upsets if feedback is used.
- Good for devices where configuration or DFFs + CL are more susceptible than project requirements; e.g., Xilinx and Altera commercial FPGAs.

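\[
P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functional\ Logic} + P_{SEFI}
\]

where the functional-logic term is made up of:

\[
P(f_s)_{DFF\ SEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}
\]

Slide annotations on the equation terms: "Low" and "Minimally Lowered".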
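
A behavioral sketch of a DTMR 8-bit counter, assuming three triplicated counter domains, one voter per domain placed after the DFFs, and voted feedback into each domain's next-state logic. The upset schedule and all names are hypothetical, and timing and configuration upsets are not modeled. A single upset in one domain is masked and then corrected on the next clock edge.

```python
def majority_vote(a, b, c):
    return (a & b) | (a & c) | (b & c)


def run_dtmr_counter(cycles, upsets):
    """Simulate three 8-bit counter domains with voted feedback.

    upsets maps a cycle number to (domain index, bit mask to flip).
    Returns (cycles with a wrong voted output, final DFF values).
    """
    copies = [0, 0, 0]
    reference = 0
    bad_cycles = []
    for cycle in range(cycles):
        # One voter per domain; each sees all three DFF outputs.
        voted = [majority_vote(*copies) for _ in range(3)]
        copies = [(v + 1) & 0xFF for v in voted]
        reference = (reference + 1) & 0xFF
        if cycle in upsets:
            index, mask = upsets[cycle]
            copies[index] ^= mask  # inject an SEU into one domain
        if majority_vote(*copies) != reference:
            bad_cycles.append(cycle)
    return bad_cycles, copies


# Hypothetical upsets: domain 0 at cycle 3, domain 1 at cycle 6 (same bit).
bad_cycles, final_copies = run_dtmr_counter(10, {3: (0, 0x10), 6: (1, 0x10)})
print(bad_cycles, final_copies)
```

Because each upset is repaired before the next one arrives, the two upsets never coexist and the voted output stays correct.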
Xilinx Kintex UltraScale Mitigation Study: 8-bit Counters

Plot annotations:

- First observed DTMR partition failure.
- LTMR and BTMR perform near No-TMR!
- LTMR was not tested at this LET.
Comparison of V5QV and Kintex-UltraScale with Mitigation

V5QV Counters:  
*Embedded Mitigation*

Kintex UltraScale DTMR Counters:  
*User Inserted Mitigation*

**DTMR inserted with Synopsys synthesis tool**
Theoretically, GTMR Is The Strongest Mitigation Strategy… BUT…

• Triplicate all clocks, data-paths and add voters after DFFs
• Triplicating a design and its global routes takes up a lot of power and area.
• Skew between clock domains must be minimized such that it is less than the shortest routing delay from DFF to DFF (hold time violation or race condition):
  – Is skew between clock trees in the FPGA small enough? **Most likely not.**
  – Limit skew of clocks coming into the FPGA.
  – Limit skew of clocks from their input pin to their clock tree.
• Difficult to verify.
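
The skew condition above can be written as a simple check. This is a sketch only: the function name, the clock arrival times, and the minimum path delay are hypothetical numbers, not values for any real device or tool report.

```python
def gtmr_skew_ok(clock_arrival_ns, min_dff_to_dff_delay_ns):
    """Check that the worst-case skew between the triplicated clock domains
    is less than the shortest DFF-to-DFF routing delay.

    Returns (condition met, skew in ns).
    """
    skew = max(clock_arrival_ns) - min(clock_arrival_ns)
    return skew < min_dff_to_dff_delay_ns, skew


# Hypothetical arrival times (ns) of the three clock trees at their DFFs.
ok, skew = gtmr_skew_ok([1.20, 1.35, 1.90], min_dff_to_dff_delay_ns=0.50)
print(ok, round(skew, 2))
```

When the check fails, a fast path between domains can violate hold time, which is the race condition the slide warns about.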
TMR and Verification

- If a system is required to be protected using TMR, improper insertion can jeopardize the reliability and security of the system.
- There are two primary concerns to TMR insertion:
  - Did the insertion cause incorrect logic implementation?
  - Is the insertion the correct topology?
    - Are all voters inserted where expected?
    - Are all components triplicated as expected?
  - Example: must be able to differentiate between LTMR, DTMR, or some other implementation.
- Due to the complexity of the verification process and the complexity of digital designs, there are currently no available techniques that can provide complete and reliable confirmation of TMR insertion.
- Critical applications: if protection is required, then its implementation must be assured.

We are working on it!
TMR Rules of Thumb

• FPGAs with embedded mitigation do not usually require additional (user inserted) TMR.
• FPGAs with soft configuration will only benefit from DTMR or BTMR (in appropriate situations).
• FPGAs with hard configuration and no other embedded mitigation will benefit from local mitigation strategies.
TMR Warnings

• There are significant differences between TMR schemes. Select the correct type for your application and requirements.
• Do not use LTMR in a Xilinx Device.
• BTMR is a sufficient mitigation strategy if the required reliability window is relatively small compared to the MTTF of a non-redundant (non-mitigated) system.
• Most FPGAs cannot accommodate the clock skew between clock trees to properly implement GTMR. Best to stay away.
• TMR is difficult to verify. Fault injection is not sufficient for critical applications.
Some Thoughts
Concerns and Challenges of Today and Tomorrow for Mitigation Insertion (1)

- User insertion of mitigation strategies in most FPGA and ASIC devices has proven to be a challenging task because of reliability, performance, area, and power constraints.
  - Difficult to synchronize across triplicated systems.
  - Mitigation insertion slows down the system.
  - Can’t fit a triplicated version of a design into one device.
  - Power and thermal hot-spots are increased.

- The newer commercial devices have a significant increase in gate count and lower power. This helps accommodate area and power constraints when triplicating a design. However, it increases the challenge of module synchronization.
Concerns and Challenges of Today and Tomorrow for Mitigation Insertion (2)

- Embedded mitigation has helped in the design process. However, it is proving to be an ever-increasing challenge for manufacturers.
  - We (users) want embedded mitigation: cheaper design flow process, faster, and less power hungry.
  - However, heritage has proven that for critical applications, embedded systems have provided excellent performance and reliability.

- Tool availability… Getting better… IP Cores are still problematic.

- Users are not selecting the correct mitigation scheme for their target FPGA.

- Mitigation is too complex to fully verify.
Warning

- You should not mitigate failure mechanisms that have insignificant contribution to the overall failure rate:
  - This adds risk.
  - Slows down system.
  - Can provide a false sense of protection.
  - Gain is not significant.

\[ P(f_s)_{\text{error}} \propto P(f_s)_{\text{Configuration}} + P(f_s)_{\text{functionalLogic}} + P(f_s)_{\text{SEFI}} \]
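
One way to apply this warning is to rank the terms of the model before choosing a mitigation. The sketch below assumes the terms can be treated as additive relative contributions; the numbers are hypothetical placeholders, not measured data for any device.

```python
def contribution_fractions(terms):
    """Return each term's share of the summed error contribution."""
    total = sum(terms.values())
    return {name: value / total for name, value in terms.items()}


# Hypothetical relative contributions for some device and design.
terms = {"configuration": 1e-9, "functional_logic": 4e-7, "sefi": 5e-8}

fractions = contribution_fractions(terms)
dominant = max(fractions, key=fractions.get)
for name, share in sorted(fractions.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1%}")
```

In this made-up case the configuration term contributes well under 1% of the total, so mitigating it would add risk and cost for little gain.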
Summary

• For critical applications, mitigation might be required.
• Determine the correct mitigation scheme for your mission while incorporating given requirements:
  – Understand the susceptibility of the target FPGA and potential necessity of other devices.
  – Investigate if the selected mitigation strategy is compatible with the target FPGA device.
  – Calculate the reliability of the mitigation strategy to determine if the final system will satisfy requirements.
  – Ask the right questions regarding functional expectation, mitigation, requirement satisfaction, and verification of expectations.
• Although it is desirable from a user’s perspective to have embedded mitigation, cost seems to be driving the market towards unmitigated commercial FPGA devices. Hence, it will be necessary for users to familiarize themselves with optimal mitigation insertion and usage.