Challenges Regarding IP Core Functional Reliability.

Melanie Berg\(^1\), Kenneth LaBel\(^2\)

1. AS&D in support of NASA/GSFC
   Melanie.D.Berg@NASA.gov

2. NASA/GSFC
   Kenneth.A.LaBel@NASA.gov

To be presented by Melanie Berg at the Microelectronics Reliability & Qualification Working Meeting (MRQW), El Segundo, CA February 7-8, 2017
Acronyms

- 10 gigabit attachment unit (XAUI XGS)
- Advanced Encryption Standard (AES)
- Advanced extensible Interface (AXI)
- Advanced High-performance Bus (AHB)
- Agile Mixed Signal (AMS)
- ARM Holdings Public Limited Company (ARM)
- Block random access memory (BRAM)
- Block triple modular redundancy (BTMR)
- Built-in-self-test (BIST)
- Cache Coherent Interconnect (CCI)
- Combinatorial logic (CL)
- Commercial off the shelf (COTS)
- Complementary metal-oxide semiconductor (CMOS)
- Computer aided design (CAD)
- Controller Area Network (CAN)
- Device under test (DUT)
- Digital Signal Processing (DSP)
- Direct Memory Access (DMA)
- Distributed triple modular redundancy (DTMR)
- Double Data Rate (DDR3 = Generation 3; DDR4 = Generation 4)
- Edge-triggered flip-flops (DFFs)
- Equipment Monitor And Control (EMAC)
- Error-Correcting Code (ECC)
- Field programmable gate array (FPGA)
- Floating Point Unit (FPU)
- General purpose input/output (GPIO)
- Global Industry Classification (GIC)
- Global triple modular redundancy (GTMR)
- Hardware description language (HDL)
- High Performance Input/Output (HPIO)
- High Speed Bus Interface (PS-GTR)
- Input – output (I/O)
- Intellectual Property (IP)
- Inter-Integrated Circuit (IIC)
- Internal configuration access port (ICAP)
- Joint test action group (JTAG)
- Lightwatt High Pressure Sodium (LW HPS)
- Linear energy transfer (LET)
- Local triple modular redundancy (LTMR)
- Look up table (LUT)
- Low Power (LP)
- Low-Voltage Differential Signaling (LVDS)
- Memory Management Unit (MMU)
- Microprocessor (MP)
- Multi-die Interconnect Bridge (EMIB)
- MultiMediaCard (MMC)
- Multiport Front-End (MPFE)
- Not OR logic gate (NOR)
- Operational frequency (fs)
- Oscillator (RC OSC)
- Peripheral Component Interconnect Express (PCIe)
- Personal Computer (PC)
- Phase locked loop (PLL)
- Physical layer (PHY)
- Power on reset (POR)
- Probability of flip-flop upset (PFFSEU)
- Probability of logic masking (PLoc)
- Probability of transient generation (Pgen)
- Probability of transient propagation (Pprop)
- Processor (PC)
- Radiation Effects and Analysis Group (REAG)
- Radiation Tolerant (RT)
- Secondary Control Unit (SCU)
- Secure Digital (SD)
- Secure Digital embedded MultiMediaCard (SD/eMMC)
- Secure Digital Input/Output (SDIO)
- Serial Advanced Technology Attachment (SATA)
- Serial Peripheral Interface (SPI)
- Serial Quad Input/Output (QSPI)
- Serializer/deserializer (Serdes EPCS)
- Single event functional interrupt (SEFI)
- Single event latch-up (SEL)
- Single event transient (SET)
- Single event upset (SEU)
- Single event upset cross-section (σSEU)
- Spatial-Division-Multiplexing (SDM)
- Static random access memory (SRAM)
- System Memory Management Unit (SMMU)
- System on a chip (SOC)
- Transceiver Type (GTH/GTY)
- Transient width (width)
- Triple modular redundancy (TMR)
- Universal Asynchronous Receiver/Transmitter (UART)
- Universal Synchronous Receiver/Transmitter (USRT)
- Universal Serial Bus (USB)
- Universal Serial Bus On-the-go (USB OTG)
- Watchdog Timer (WDT)
- Windowed Shift Register (WSR)
Problem Statement

- For many years, intellectual property (IP) cores have been incorporated into field programmable gate array (FPGA) and application specific integrated circuit (ASIC) design flows.
- However, the usage of large complex IP cores were limited within products that required a high level of reliability.
- This is no longer the case. IP core insertion has become mainstream …including their use in highly reliable products.
- Due to limited visibility and control, challenges exist when using IP cores and subsequently compromise product reliability.
IP Core Terminology Regarding FPGA Insertion

• IP cores are blocks of logic elements:
  – Reduce Time-to-Market.
  – Eliminate Design Risks.
  – Reduce Development Costs.

• IP cores can be “Soft” or “Hard.”
  – Terminology has nothing to do with radiation susceptibility.
  – Soft Core: IP logic blocks are implemented in the system programmable logic area (user area). They are generally flexible in order to meet user needs.
  – Hard Core: IP logic are embedded in the FPGA device. They have limited flexibility or none at all.
Microsemi RTG4 FPGA and Its Embedded IP Cores

Figure does not show user programmable logic area.

Figure is courtesy of Microsemi
Soft IP Core Insertion Flow

1. Select or Buy IP Core
2. Insert IP core into FPGA programmable logic
3. Synthesize Design to produce gate-level-netlist
4. Place and route gate-level-netlist

HDL Hardware description language

Soft IP can can in the form of HDL or gate-level-netlists.

Create Configuration File
Pros of IP Core Insertion

- IP Cores are very easy to use.
- As an example, a computer system can be designed in minutes by simply pressing buttons within a CAD tool.
- Students are graduating with IP core insertion experience.
- Design development costs less:
  - Lots of complexity with very little effort.
  - Design cycle time.
  - Reusability reduces verification effort.
  - Employees require less expertise, hence less of a paycheck.

For complex, critical applications, the assumptions that IP cores will cost less can be a myth.
Cons of IP Core Insertion in Critical Applications

- IP Cores have limited visibility:
  - Difficult to verify and manipulate.
  - Design might not follow proper design rule protocol (but you will not know).
- If mitigation is required, it can be compromised.
- Design development costs less???:
  - Design cycle time can be elongated because selected user mode is not mainstream. Never used/tested before.
- Reusability can be compromised:
  - Once an IP is custom configured, it is no longer “reusable logic.” For critical application standards, verification effort is increased.
  - Once an IP is inserted into a unique design it is no longer “reusable logic.” For critical application standards, verification effort is increased.
Challenges: IP Core Insertion in Critical Applications

- Beware – pushing a button on a CAD tool can be misleading.
- Does the core follow proper synchronous design methodology?
- How has the design been vetted and verified prior to your usage?
- Research must be performed in order to understand if the IP can reliably be inserted into your design:
  - Timing characteristics – can the IP perform at the missions specified speed?
  - Can the IP core fit into the device with all other necessary logic?
  - Are the I/O of the IP compatible with your device or the other components you have in your device?
  - Does the IP require mitigation?
Challenges: IP Core Verification in Critical Applications

• Design reviews require design to be parsed by a team of specialists.
  – Some IP cores are so complex, they are close to impossible to parse.
  – Some IP cores are in gate-netlist form instead of HDL. They are also close to impossible to parse.
  – Some IP cores are locked and cannot be viewed by any individual.

• Although datasheets are available, users will rely on IP core models and blind testing.

• Point is, because of limited visibility and complexity, IP are hard to verify.

• Enhanced verification techniques exist but still have limitations regarding black box (like) IP.
IP Core Mitigation in Critical Applications: Dual Redundancy (DR) and Triple Modular Redundancy (TMR)

Stop, investigate, note limitations before pushing that CAD BUTTON!!!!!!!
Dual Redundant IP Cores

- There are no correction mechanisms.
- If the DR comparator detects a bad compare, the system stops and action is taken.
- **Pro**: if designed correctly, the system can be masked from IP core failures.
- **Con**: the probability of failure (hardware-reliability or single event upset (SEU)) is at least doubled.
  - Although the system can be masked, system availability is decreased.
  - Depending on the critical application, the reduction in availability can compromise adhering to mission requirements.
How To Insert TMR into A Design:
FPGA User Design Flow

Output of synthesis is a gate-level netlist that represents the given HDL function.

TMR can be written into the HDL. Generally not done because too difficult.

TMR can be inserted during synthesis or post synthesis.

If inserted post synthesis, the gate-level netlist is replicated, ripped apart, and voters + feedback are inserted.
Various TMR Schemes: Different Topologies

Block diagram of block TMR (BTMR): a complex function containing combinatorial logic (CL) and flip-flops (DFFs) is triplicated as three black boxes; majority voters are placed at the outputs of the triplet.

Block diagram of local TMR (LTMR): only flip-flops (DFFs) are triplicated and data-paths stay singular; voters are brought into the design and placed in front of the DFFs.

Block Diagram of distributed TMR (DTMR): the entire design is triplicated except for the global routes (e.g., clocks); voters are brought into the design and placed after the flip-flops (DFFs). DTMR masks and corrects most single event upsets (SEUs).
IP Cores and Block Triple Modular Redundancy: BTMR

- Need Feedback to Correct
- Cannot apply internal correction from voted outputs
- If blocks are not regularly flushed (e.g., reset), Errors can accumulate – may not be an effective technique

3x the error rate with triplication and no correction/flushing

Can Only Mask Errors
Explanation of BTMR Strength and Weakness using Classical Reliability Models

<table>
<thead>
<tr>
<th>Reliability for 1 block ($R_{\text{block}}$)</th>
<th>Reliability for BTMR ($R_{\text{BTMR}}$)</th>
<th>Mean Time to Failure for 1 block ($MTTF_{\text{block}}$)</th>
<th>Mean Time to Failure BTMR ($MTTF_{\text{BTMR}}$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>$e^{-\lambda t}$</td>
<td>$3 e^{-2\lambda t} - 2 e^{-3\lambda t}$</td>
<td>$1/\lambda$</td>
<td>$(5/6 \lambda) = 0.833/\lambda$</td>
</tr>
</tbody>
</table>

Reliability across Fluence: Simplex System versus BTMR Version

- System No TMR
- BTMR System

Operating a BTMR design in this time interval will provide an increase in reliability.

However, over time, BTMR reliability drops off faster than a system with No TMR.

$$\lambda = \frac{\text{Failures}}{\text{Time}}$$
BTMR Bottom Line

• How long does your BTMR system need to operate relative to the MTTF for one of its unmitigated blocks?
• Over time, a BTMR system is less reliable than an unmitigated system.
• Adding more replicated blocks (e.g., N-out-of-M) system will only increase the reliability during the short window near start time. However, overtime, the reliability of an N-out-of-M system will fall faster as M (the number of replicated blocks) grows.
• Unfortunately BTMR is the most common means of TMR used with IP cores. Users are not getting the level of protection that they require.
GAIA is an European Space Agency (ESA) mission that is expected to chart our Galaxy

**Performance Is Lower than Assumed**

- **SCS750 BTMR μPs (Maxwell)**
- **GAIA**

- **PowerPC**
- **PowerPC**
- **PowerPC**

**TMR Mitigation**

- **System Controller**
- **Memory controller**
- **PCI, Timers, Interrupts**
- **DMA, UART (2)**

**Radiation Tolerant FPGA**

- **SDRAM 256MB**
- **EEPROM 8MB**

**Reed Solomon and ECC**

**32 bit 33MHz PCI**

(1) The faulty μP is masked;
(2) System stalls and then is flushed;
(3) Bring system up with all μP synchronized.
DTMR and LTMR Strategies Provide Correction and Hence Increase Availability and Reliability

- Depending on the target FPGA, DTMR or LTMR can be suitable mitigation strategies:
  - LTMR for Microsemi FPGA products (Do not use in SRAM based FPGAs)
  - DTMR for SRAM based FPGA products (e.g., Xilinx).
- Depending on your TMR insertion tool, some IP cores can have LTMR or DTMR inserted during the synthesis process.
- Most tools are still having problems with TMR insertion into IP. This is another reason why BTMR is so popular… it’s simple to implement.
- Warning, there are some IP cores that are black boxes and no tool can insert LTMR or DTMR.

Concerns to be taken into account prior to IP selection.
Acknowledgements

- Some of this work has been sponsored by the NASA Electronic Parts and Packaging (NEPP) Program and the Defense Threat Reduction Agency (DTRA).
- Thanks is given to the NASA Goddard Radiation Effects and Analysis Group (REAG) for their technical assistance and support. REAG is led by Kenneth LaBel and Jonathan Pellish.

Contact Information:
Melanie Berg: NASA Goddard REAG FPGA Principal Investigator: Melanie.D.Berg@NASA.GOV