







Hardened Electronics and Radiation Technology (HEART) 2023 Tutorial - Complexity, Testability and Single Event Effects (SEE): Test and Assurance Considerations

Michael Campola NASA-GSFC and

Ken LaBel SSAI, Inc., JHU/APL, Trusted Strategic Solutions LLC, KAL Electronics for Space michael.j.Campola@nasa.gov, kenneth.a.label@nasa.gov, Kenneth.LaBel@jhuapl.edu

#### Acronyms and Abbreviations

- Atomic Mass Unit (amu)
- AWS: Amazon Web Services
- Bump Plating Photoresist (BPR)
- Chip to Wafer (CtW)
- CL: Confidence Level
- CMOS: Complementary metal-oxide semiconductor
- Commercial Off The Shelf (COTS)
- Complementary Field Effect Transistor (CFET)
- ConOps: Concept of Operations
- continuous wave (CW)
- DDD: Displacement Damage Dose
- Design Technology Co-Optimization/Synthesis Technology Co-Optimization (DTCO/STCO)
- Dynamic Random Access Memory (DRAM)
- EDAC: Error Detection and Correction
- EEEE: Electrical, Electronic, Electromechanical, and Electro-optical
- embedded Dynamic Random Access Memory (eDRAM)
- EMI: ElectroMagnetic Interference
- Extreme Ultraviolet Lithography (EUV)
- Ferroelectric Field Effect Transistor (FeFET)
- Ferroelectric Random Access Memory (FeRAM)
- Ferroelectric Tunnel Junction (FTJ)
- FET: Field-Effect Transistor
- FPGA: Field Programmable Gate Array
- Fully Self Aligned Via (FSAV)
- Grand Accélérateur National d'Ions Lourds (GANIL)

- GSI Helmholtz Centre for Heavy Ion Research (GSI)
- GSN: Goal Structuring Notation
- High Bandwidth Memory (HBM)
- Hi-Rel: High Reliability
- Input/Output (I/O)
- Integrated Circuits (ICs)
- Josephson Junction (JJ)
- Lawrence Berkeley National Laboratories (LBNL)
- Linear Energy Transfer (LET)
- Magnetoresistive Random Access Memory (MRAM)
- MBU: Multi-Bit Upset
- Micro Three Dimensional (M3D)
- MOSFET: Metal-Oxide-Semiconductor Field Effect Transistor
- Nanoelectromechanical Systems (NEMS)
- NASA Space Radiation Laboratory (NSRL)
- Negative Capacitance Field-Effect Transistor (NCFET)
- NESC: NASA Engineering and Safety Center
- NOT-AND (NAND)
- Phase Change Memory (PCM)
- Radiation Hardness Assurance (RHA)
- RDM: Radiation Design Margin
- Redistribution Layer (RDL)
- Resistive Random Access Memory (ReRAM)
- Return on Investment (ROI)
- R-GENTIC: Radiation Guidelines for Notional Threat Identification and Classification
- SEAM: System Engineering and Assurance Modeling
- SEB: Single-Event Burnout

- SEE: Single Event Effects
- SEECA: Single Event Effects Criticality Assessment
- SEFI: Single-Event Functional Interrupt
- SEGR: Single-Event Gate Rupture
- SEL: Single-Event Latch-up
- Self-Aligned Gate Contact (SAGC)
- SET: Single-Event Transient
- SEU: Single-Event Upset
- Single Diffusion Break (SDB)
- Single Event Effect Symposium/Military and Aerospace Programmable Logic Devices Workshop (SEEMAPLD)
- Statistical Variability (SV)
- Structural Simulation Toolkit (SST)
- Random Access Memory (RAM)
- STTR: Small Business Technology Transfer
- Super-steep Slope (SS)
- Texas A&M University (TAMU)
- Three Dimensional (3D)
- Through Silicon Via/Through Mold Via/Through Die Via (TSV/TMV/TDV)
- TID: Total Ionizing Dose
- TMR: Triple Modular Redundancy
- TNID: Total Non-Ionizing Dose
- Tunnel Field Effect Transistor (TFET)
- Vertical Field Effect Transistor (VFET)
- Wafer-To-Wafer (WTW)

#### Outline for this tutorial

- Radiation Hardness Assurance (RHA) fundamentals
  - Iteration over the project lifecycle
  - How assurance and testing work together
  - Types of radiation effects and how they scale with complexity
    - Cumulative: ionizing and non-ionizing dose
    - Instantaneous: single particle effects
- Single Event Effects (SEE) test considerations
- Key analysis parameters to consider after test
  - Ways system architecture can be used to help mitigate radiation effects
  - Parameterizing available information
- Common pitfalls, lessons learned
  - Part database resources and how to use them wisely
  - Radiation tools / resources

#### Natural space radiation environment

#### Solar Maximum / Minimum

- Solar Flares
- Coronal Mass Ejections



- Radiation Belts
- Geomagnetic Storms
- Galactic Cosmic Rays

#### The Natural Space Environment and Single-Event Effects



www.nasa.gov

# Free-Space Particles – The Hazard for SEE: Galactic Cosmic Rays (GCRs) or Heavy Ions

#### Definition

- A GCR ion is a charged particle (H, He, Fe, etc)
- Typically found in free space (outside the earth's magnetic field)
  - Energies range from MeV to GeVs for microelectronics interest
  - Origin: can be created in supernova, but other sources may exist as well
- Important attribute for electronics
  - Energy deposited (lost) by the particle as it passes through a semiconductor material.
  - This is known as Linear Energy Transfer or LET (dE/dX).
  - Heavy ions direct ionization
  - Protons (mostly) indirect ionization (secondary reactions deposit energy)

Testing looks at energy deposition risk based on environment and device sensitivity



#### Radiation Hardness Assurance (RHA) overview

RHA consists of all activities undertaken to ensure that the electronics and materials of a space system perform to their *design* specifications throughout exposure to the mission space environment

(After Poivey 2007) ↑ (After LaBel 2004) →

These are the radiation engineer's functions and objectives that develop over the mission formulation and implementation phases.



#### RHA relationships

- RHA has always been a system-level practice
- Part characterization and application are evidence
- Modeling, test, and analysis work together
- Trades are used to achieve mission goals and meet requirements
- Early involvement is key



### Timing is everything



- Do you know where you are in mission/project phase?
  - Mitigation changes or replacement may not be an option if too late
  - Anomaly resolution testing will look a lot different than screening for SEE signatures
  - Determines your test needs for risk acceptance vs. risk avoidance
- New technologies?
  - Necessitate early life-cycle testing
  - Limited funding: corner cases, limited characterization

### Analysis and testing work together

Radiation testing gives evidence that application and mitigation will meet the assurance needs for a given mission, and it aids in risk quantification and requirements verification.

#### Testing and RHA

- RHA is an iterative process
  - Simplified view ->
  - Can tell you when more testing is necessary
  - Can inform what type of test is most relevant
- Testing plays a crucial role
  - Risk identification
  - Requirements
  - Screening
  - Mitigation strategy
  - Technology insertion
  - Verification
- Testing Impact
  - Not always obvious
  - Informs future decisions
- Test Results can help
  - Device/Design changes
  - Mitigation acceptance














#### RHA's approach to quantifying risk

Environment modeling and transport

Free-field environment

Shielding analysis

Internal environment

**Known Hazard** 

Analysis and test

Physics of failure

Signatures / characteristics of effects

Implementation / application

**Known Risk** 

#### Breaking down the different types of effects

# Ionizing Radiation Effects

Total Ionizing Dose (TID)

Total Non-Ionizing Dose (TNID)

Primarily high-energy protons and heavy ions

Single-Event Effects (SEE)

Non-Destructive

Destructive

### Types of radiation effects – Total Ionizing Dose (TID)

#### Cumulative effect

- Electron-hole pair creation and collection
- Electric field impacts drift and diffusion
- Oxide thickness and manufacturing play a role in technology response
- Interface traps and oxide traps collect charge permanently
- More imperfections result in easier charge trapping
- Residual shift in static operation

#### Processes Involved in TID Damage



F. B. McLean and T. R. Oldham, Harry Diamond Laboratories Tech. Report, 1987. T. R. Oldham and F. B. McLean, *IEEE TNS*, 2003.


### Types of radiation effects – Total Non-Ionizing Dose (TNID)

- Cumulative effect
- Primary knock-on atoms displace lattice and leave damage clusters
- Changing fundamental properties like carrier mobility means that optoelectronics are the most susceptible
- Some damage sites are large enough that they can lead to single-hit failures within component functions (RTS, hot pixels, etc.)



After C. J. Marshall, 1999 IEEE NSREC Short Course.

#### SEE in a p-n junction

- Ions traverse device, depositing energy along their path
- Electron-hole pairs produced
- Deformation of the depletion region if a junction is hit
- Recombination dominates
- Diffusion and drift driven by electrostatics within device
- Dimensions and materials of device are crucial in signature response



Reverse-biased N+/P junction

R.C. Baumann, 2013 NSREC Short Course


### Types of radiation effects – Single Event Effects (SEE)

#### Destructive

- SEL: Latchup
- SEB: Burnout
- SEGR: Gate Rupture
- SEDR: Dielectric Rupture
- SEU: Upsets can become stuck bits

#### Non-destructive

- SET: Transients, can be analog and digital
- SEU: Upsets, can happen in multiple bits/cells (MBU)
- SEFI: Functional Interrupts, for complex devices; the typical category for a response that needs refresh/reset/power-cycle to return to operation
- Non-destructive does not mean non-disruptive




### Scaling and sensitive volumes













### CMOS Technology Trend

For CMOS in general, the scaling of feature size is increasing resilience with respect to dose and **increasing the susceptibility** to single event effects.



P. E. Dodd, M. R. Shaneyfelt, J. R. Schwank and J. A. Felix, "Current and Future Challenges in Radiation Effects on CMOS Electronics," in IEEE Transactions on Nuclear Science, vol. 57, no. 4, pp. 1747-1763, Aug. 2010, doi: 10.1109/TNS.2010.2042613.

NASA Shared Services Center (NSSC) authorized limited use.

# Complexity, Testability and Single Event Effects (SEE): *Test* and Assurance Considerations

Kenneth A. LaBel SSAI, Inc., JHU/APL, KAL Electronics for Space

kenneth.a.label@nasa.gov, Kenneth.LaBel@jhuapl.edu



#### Outline

- Then and Now
  - What's Changed in Last 40 Years
- Why are You Testing?
  - Objectives for a SEE Test
- What is a "Complex" Device
- Know Your Device: Expectations for Your Test Set
  - Start with the Datasheet
  - Review ALL Relevant Information (Do Your Research)
  - Error Signatures, Rates, and Recovery (During a Test)
- Know Your Beam: Picking a Facility and Planning Test Campaign
  - Practical: physical, electrical, ...
  - Beam properties
    - Diatribe: part thinning/deprocessing
- Data Capture and Statistics
  - Test Conditions
  - Test Coverage (geographic, temporal)
  - Event Interference
- Is That Really a Curve?
  - When Data Looks "Weird"
- What Management Needs to Know in the Aftermath
- Summary



#### **Ken's first CPU!**

#### 650x Processor

- 8 um feature size (not a typo), ~1975
- 8-bit CPU
- Up to 14 MHz
- 64 KB RAM
- 256 bytes stack
- No I/O ports
- 28 or 40-pin DIP

# THEN AND NOW

#### Back Then...

- Devices were simple
  - Transistors
  - Memory Arrays (4 kb SRAM!), 8 bit CPUs, and so forth...
  - High speed was 10 MHz operation
- Technologies were large and mostly silicon
  - >0.5 um (some >2.0um) CMOS feature size
  - GaAs was emerging; RH was silicon on sapphire (SOS)
- Device packaging
  - Planar
  - Ceramic and a little plastic
  - Through-hole packages (i.e. Dual Inline Packages (DIPs))...

#### And Now...

- Devices are not simple (though "glue" is still needed)
  - FPGAs, multi-core SOCs (heterogeneous)
  - >>Gbit Memories (with built-in voltage conversion and microcontrollers)
  - Extreme resolution or operating speeds and integration (single devices replacing a whole card of devices from a decade or two earlier)
- Technologies are
  - <10nm CMOS feature size
  - Proliferation of widebandgap (power, RF)
  - Fins and silicon-on-insulator (SOI) are in!
  - Rad hard = by design (RHBD)
- Device packaging
  - Mix of planar (old school) and multidimensional (2.5D/3D) packaging

For SEE testing – it was easy to access the die (delidding) with limited SEE signatures (homogeneous devices)

## Complexity: Device and Packaging



http://images.dailytech.com/nimage/4621\_21476.jpg

Many materials and structures Courtesy of Daniel Fleetwood, IEEE NSREC 2020 Short Course



**Even More Materials and Stacking** Courtesy of Doug Sheldon and Eric Suh, JPL

For SEE testing – Getting beam to sensitive portions of the device and knowing what happens to the ion as it traverses the device is challenging

# WHY ARE YOU TESTING?

# Single Event Effects (SEEs)

- An SEE is caused by a single charged particle as it passes through a semiconductor material
  - Heavy ions (GCR and solar)
    - » Direct ionization
  - Protons (trapped and solar >10 MeV) or neutrons for sensitive devices
    - » Indirect Ionization
    - » Nuclear reactions for electronics
    - » Optical systems, some newer electronics, etc are sensitive to direct ionization (peak ~ 1MeV protons)
- When it affects electronics
  - If the LET of the particle (or secondary) is greater than the amount of energy or critical charge required, an effect may be seen
    - » Soft errors such as upsets (SEUs) or transients (SETs), or
    - » Hard (destructive) errors such as latchup (SEL), burnout (SEB), or gate rupture (SEGR)
- Severity of effect is dependent on
  - type of effect and its event signature
  - where (geographic) and when (temporal) it occurs
  - device application/system criticality
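The LET versus critical charge comparison above can be sketched numerically. This is a rough illustration only: it assumes silicon (density 2.33 g/cm^3, about 3.6 eV per electron-hole pair), a constant LET along a straight path, and full collection of the generated charge. The function names and the example critical charge are hypothetical, not taken from the tutorial.

```python
# Assumed material constants for silicon (not from the slides).
SI_DENSITY_MG_PER_CM3 = 2330.0   # 2.33 g/cm^3 expressed in mg/cm^3
EV_PER_PAIR = 3.6                # mean energy to create one electron-hole pair
ELECTRON_CHARGE_C = 1.602e-19    # elementary charge in coulombs


def deposited_charge_fc(let_mev_cm2_per_mg, path_length_um):
    """Charge (fC) generated along a straight path, assuming constant LET."""
    path_cm = path_length_um * 1e-4
    energy_mev = let_mev_cm2_per_mg * SI_DENSITY_MG_PER_CM3 * path_cm
    pairs = energy_mev * 1e6 / EV_PER_PAIR
    return pairs * ELECTRON_CHARGE_C * 1e15


def may_upset(let_mev_cm2_per_mg, path_length_um, q_crit_fc):
    """True if the generated charge exceeds a (hypothetical) critical charge."""
    return deposited_charge_fc(let_mev_cm2_per_mg, path_length_um) > q_crit_fc


if __name__ == "__main__":
    # Roughly 10 fC per micron for an LET of 1 MeV-cm^2/mg in silicon.
    print(f"{deposited_charge_fc(1.0, 1.0):.2f} fC")
    print(may_upset(let_mev_cm2_per_mg=5.0, path_length_um=0.5, q_crit_fc=2.0))
```

Real devices collect only part of this charge, and the collection depth depends on the technology, so treat the result as an order-of-magnitude check rather than a prediction.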

Destructive event in a COTS 120V DC-DC Converter



Courtesy NASA

# Reliability and Availability – The Basis for Mission Requirements

- Definitions
  - Reliability (Wikipedia)
    - » The ability of a system or component to perform its required functions under stated conditions for a specified period of time.

Will it work for as long as you need?

- Availability (Wikipedia)
  - » The degree to which a system, subsystem, or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at an unknown, *i.e.*, a random, time. Simply put, availability is the proportion of time a system is in a functioning condition. This is often described as a mission capable rate.

Will it be available when you need it to work?

- Combining the two drives mission requirements:
  - Will it work for as long as and when you need it to?
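To make the availability definition concrete, here is a minimal sketch of how an SEE rate might feed an availability estimate. It assumes every event costs a fixed recovery time and that events never overlap; the rate and recovery time in the example are made-up numbers, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def availability(events_per_day, recovery_s):
    """Fraction of time the function is usable.

    Assumes each event removes `recovery_s` seconds of operation and that
    events are rare enough not to overlap.
    """
    downtime_per_day = events_per_day * recovery_s
    return max(0.0, 1.0 - downtime_per_day / SECONDS_PER_DAY)


if __name__ == "__main__":
    # Hypothetical: one functional interrupt every 10 days, 60 s to recover.
    print(f"{availability(events_per_day=0.1, recovery_s=60.0):.6f}")
```

The same event rate can be acceptable or unacceptable depending on when the outage occurs, which is why the mission requirement combines reliability and availability.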

# SEE is a Derived Requirement from Availability and Reliability – Objectives Vary By SEE Test Type

#### Examples

- Mission specific testing (or application)
  - » Go/No-go: do you see an event at a given LET or not?
  - » SEE rates for availability/reliability for that mission environment
  - » Event signature capture for mission mitigation design, ...

#### Generic Test

- » A product qualification test for a Mil/Aero product.
- » Usually provides worst-case information for destructive events, but limited information for non-destructive (corner-cases/nominal): May require additional application-specific testing for missions.

#### Characterization

- » Technology or architecture research
- System/Assembly Level (or System on a Chip SOC,...)
  - » Mitigation validation
  - » Dominant failure mode identification,...



# Not All SEE Testing is Done with the Same End Goal



Point is that ALL tests have a requirement and should be planned accordingly



# SO, WHAT IS A COMPLEX DEVICE?

## Complexity Comes from a Mix of Considerations

- Devices have evolving functionality, performance, and materials
  - Multi-technology (e.g., integrated optics)
  - New architectures (gate all around GAA, nanowires,...)
  - Will increased integration ever stop? AI, robotics,...
    - » Keyword: heterogeneous
- Technologies: silicon and ?
  - Only a few electrons needed to switch states
  - Use of SiGe, graphene, carbon nanotubes, ultra widebandgap,...
- Device packaging
  - Integration, integration
- Systems...



Figure 5.9: Drivers and technologies for better power, performance, area, and cost Scaling™ (courtesy of Robert Clark, Tokyo Electron)

#### https://www.src.org/about/dec



Nhanced-semi.com

# **KNOW** YOUR DEVICE

# For SEE Testing, Start with a Datasheet

- In all truth, for initial planning, using the functional block diagram is a good start
  - Review each functional block
  - Determine potential SEE types
    - » Upset (SEU), transient (SET), stuck bit, ...
  - Estimate error propagation and signatures
- Next step is to figure out data capture
  - How will you observe the event?
  - Considerations for event recovery

It's important to understand limitations of data capture as well as "dominant effects" (usually, the large physical blocks within a device like memory arrays).

The occurrence of dominant effects may hide (mask) other effects during a test run due to the accelerated nature of the beam.



https://www.xilinx.com/support/documentation/data\_sheets/ds190-Zynq-7000-Overview.pdf

### Sample block analysis and post-test recommendations

| Chip Area          | SEE Issue | Possible SEU Mitigation |
|--------------------|-----------|-------------------------|
| Config. Memory     | Single and multiple bit errors corrupting circuit operation, causing bus conflicts (current creep), etc. | Scrubbing; partial reconfiguration; partitioned design |
| Config. Controller | Improper device configuration can occur if hit during configuration/reconfiguration | Partitioned design; multiple chip voting (redundancy by using multiple devices) |
| CLB                | Logic hits and propagated upsets caused by transients | Triple modular redundancy (TMR) (or Xilinx TMR – XTMR); acceptable error rates |
| BRAM               | Memory upsets in user area | TMR; Error Detection and Correction (EDAC) scrubbing |
| Half-latches       | Sensitive structure used in configuration/routing | Removal of half-latches from design |
| POR                | SEUs on POR can cause inadvertent reboot of device | Multiple chip voting (redundancy by using multiple devices) |
| IOB                | SEUs can cause false outputs to other devices or inputs to logic | Leverage immune config. memory cell; evaluate input SET propagation |
| DCM                | Can cause clock errors that spread across clock cycles | TMR; temporal TMR |
| DSP                | Hard IP that is unhardened that can cause single event functional interrupts (SEFIs) or data errors | TMR; temporal TMR |
| MGT                | Gigabit transceivers. Hits in logic can cause bursts or SEFIs. Otherwise, bit errors in data stream | TMR; protocol re-writes |
| PPC                | Hard IP that is unhardened. SEFIs are prime concern | TMR or software task redundancy |
| SEL                | Higher current condition that is potentially damaging | No mitigation other than substrate addition (epi); circumvention techniques possible |

LaBel, GOMAC 2007

#### But What Else Can I Use to Pre-Plan SEE Tests?

- The idea is to review the factors that can affect SEE for the device under test (DUT) through data diving (aka similarity)
- The table illustrates some characteristics that may affect SEE sensitivity and signature types (though significance varies by device)
- The idea is to find part info (foundry, technology,...) and similarity data (family, architecture, ...) and utilize for estimates for beam parameters (ions/LETs/flux) and test system operation, aka data capture (event signatures, rate of capture,...)

| Characteristics | Descriptions |
|-----------------|--------------|
| Foundry | Manufacturer of the active semiconductor portion of the device. Example: GlobalFoundries. The "same" product built at different foundries may have significantly different SEE characteristics. |
| Process | Technology and specific fab process within a foundry/manufacturer. Ex., bipolar technology built on XKQD process. May eliminate or add some SEE concerns. |
| Feature Size | Geometric transistor/cell size or similar. How big individual targets are for SEE ion strikes. More of a cross-section than threshold issue. |
| Wafer/lot/package | Potential known variance by lot, wafer, etc. of a product. Usually not a dominant contributor to threshold/cross-section, but has been observed. |
| # of transistors/cell/etc | # of potential targets for ion strikes. Usually more of a cross-section than threshold concern. |
| Die size | Target area for SEE risk. Usually affects cross-section more than threshold. |
| Family | Is there any known SEE sensitivity/data on a specific manufacturer's product family? |
| Architecture | Is there related information on any parts with similar architecture? Consider, for example, Buck Regulator architecture for power conversion. Types of SEE event signatures may be gleaned. |
| Functional Blocks/IP | Have devices with similar IP been tested (or perhaps partial on device of interest)? |
| Operating characteristics (frequency, voltage, etc) | How much do the specific operating conditions affect the SEE response? Simple examples: dV for transients in an op amp or frequency for SET capture in a shift register string. Application specific test needs versus data found. |
| Other | Specific device types or technologies may have additional considerations. |

## **To Clarify**

- We do these things to
  - Estimate Error Signatures that the test set needs to capture,
  - Anticipate Event Rates to set data capture rate capabilities and beam flux, and,
  - Maximize efficiency (minimize time lost) for Event Recovery.
    - » The point is to return to a known state in a deterministic manner after the event to allow the test run to continue.
    - » Keep in mind that you'll need to factor in the beam time on/off to normalize results.
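The planning points above (event rate versus capture capability, and normalizing for time lost to recovery) can be sketched as follows. The sketch uses the standard relation cross-section = events / fluence and assumes the beam stays on during recovery, so ions arriving while the DUT cannot register events are excluded from the fluence. All names and numbers are illustrative assumptions, not a facility or test-set API.

```python
def max_flux(est_cross_section_cm2, max_events_per_s):
    """Highest flux (ions/cm^2/s) that keeps the expected event rate within
    what the test set can capture, given an estimated cross-section."""
    return max_events_per_s / est_cross_section_cm2


def live_time_cross_section(n_events, flux, run_time_s, recovery_s_per_event):
    """Cross-section (cm^2) using only the time the DUT could register events.

    Assumes constant flux and a fixed recovery time per event during which
    the beam is still on but events go unobserved.
    """
    live_time_s = run_time_s - n_events * recovery_s_per_event
    if live_time_s <= 0:
        raise ValueError("recovery time consumes the entire run")
    return n_events / (flux * live_time_s)


if __name__ == "__main__":
    # Hypothetical: similarity data suggests ~1e-5 cm^2; capture handles 10 events/s.
    print(f"max flux ~ {max_flux(1e-5, 10.0):.3g} ions/cm^2/s")
    # Hypothetical run: 100 events in 600 s at 1e4 ions/cm^2/s, 1 s recovery each.
    print(f"sigma ~ {live_time_cross_section(100, 1e4, 600.0, 1.0):.3g} cm^2")
```

If the facility shutters the beam during recovery and reports delivered fluence directly, use that reported fluence instead of the live-time estimate.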



# Let's be Very Clear

## Each of the available heavy ion facilities is different

- Kinetic energies
- Ions available
- Beam control (flux, etc...) and reporting
- Vacuum vs open-air test fixture (and time to change DUT)
- Beam structure
- Cabling
- Ion/energy tune capability and time to change
- Target room interlock systems and time to enter/exit
- And so on...

It is incumbent on the test team to be familiar with the chosen test facility (and their resources)

A pre-test visit is HIGHLY recommended

#### Sample Considerations for Using a Test Facility

#### Particle

- Test energies/ions
- Dosimetry/particle detectors
- Uniformity
- Particle range
- Spot size/collimation
- Test levels
  - » Flux and fluence rates
  - » Beam stability
- Particle localization
- Beam structure (pulsed vs. continuous)
- Secondary particles

#### Practical

#### Technical

- Mechanical/mounting
- Cabling/feedthroughs (Ethernet, Wi-Fi, ...)
- Power
- Ancillary test equipment location (in vault or user area)
- Test-specific issues
  - Thermal
  - Speed/performance
  - Test conditions

#### Logistics

- Contracts/purchase
- Safety rules (patients first); personal dosimeters?
- Shipping/receiving
- Staging/user areas
- Operator model
- Activated material storage

Normally test groups bring multiple samples of the same device (statistics/backup) and multiple devices to a test campaign

### Kinetic Energy Matters – Trade Space

- The two prime things we care about are
  - Penetration range (testability): Y-axis
  - LET coverage: X-axis
- The figure at right shows the trade space that higher energy (penetration) equates to lower LETs

What is the LET after passing through the device at the active region?



Courtesy Sivertz/BNL

#### Natural Space Environment – Heavy Ion Coverage:

#### Plenty of penetration to cause SEE



Courtesy of Vanderbilt https://creme.isde.vanderbilt.edu/



#### Typically:

- higher LETs: destructive events
- lower LETs: soft commercial devices

#### Selecting a Facility – Kinetic Energy Matters



Micron's proprietary CMOS-under-Array technique constructs the multilayered stack over the chip's logic, packing more memory into a tighter space and shrinking 176-layer NAND's die size, yielding more gigabytes per wafer.

Courtesy of Micron, https://www.eetimes.com/micron-leapfrogs-to-176-layer-3d-nand-flash-memory/#

## This is a notional figure.

**Key is ensuring that** you select proper energy regime based on sufficient penetration of the ion to reach sensitive/active portions of the device. **Device physical** material cross-sectioning and modeling (SRIM, for example) of the "stackup" is required.

#### If High Energy isn't Available or Higher LETs are Needed:

#### Remove Material

- Delidding/deprocessing/thinning of the device may be necessary to ensure adequate penetration range to sensitive portions of the device
  - Be aware this is a destructive process and whoever performs the deprocessing should be aware that a fully functional device after deprocessing is the goal





Two examples of deprocessing yield failures: cracks (top) and waffling (bottom).
LaBel, GOMAC 2007

#### I'm Going to Need a Bigger Beam!

- Consider System SEE Testing as a two-step process
  - 1. Test of devices to identify error signatures/dominant event types
    - » Utilize information for device selection and to design SEU tolerance into the system
  - 2. Test of the system to evaluate design/mitigation performance (keeping in mind that it is an ACCELERATED test versus space particle rates)
    - » In essence, this is using the beam as a fault injector
- Step 1 treats the test as we're used to: irradiate a single IC at a time
- Step 2, however, has options
  - Inject faults into an individual device/module at a time or
  - Increase beam size to irradiate entire assembly (or portion thereof)
    - » Currently, NSRL is the only domestic facility with this capability







# DATA CAPTURE AND STATISTICS



# There's Usually an Application Specific Nature to a SEE Test

- Lots of possible test modes, conditions, patterns, etc. as seen in the figure
- One needs to take care that the test encompasses even your worst-case mission application of the device

#### Can we test anything completely?



#### Commercial 1 Gb SDRAM

68 operating modes; operates to >500 MHz; Vdd 1.8V external, 1.25V internal

#### Sample Single Event Effect Test Matrix

#### full generic testing

| Amount | Item                                |
|--------|-------------------------------------|
| 3      | Number of Samples                   |
| 68     | Modes of Operation                  |
| 4      | Test Patterns                       |
| 3      | Frequencies of Operation            |
| 3      | Power Supply Voltages               |
| 3      | Ions                                |
| 3      | Hours per Ion per Test Matrix Point |

66096 Hours

2754 Days

**7.54** Years

and this didn't include temperature variations!!!
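The totals above are simply the product of the matrix entries. A short sketch reproduces the arithmetic and makes it easy to see how quickly the matrix grows when another factor is added. The dictionary keys are just labels for the table rows, and the temperature factor in the last step is a hypothetical addition.

```python
from math import prod

# Values from the sample test matrix.
matrix = {
    "samples": 3,
    "modes": 68,
    "patterns": 4,
    "frequencies": 3,
    "voltages": 3,
    "ions": 3,
}
HOURS_PER_POINT = 3  # hours per ion per test matrix point


def total_beam_hours(factors, hours_per_point):
    """Beam hours needed to visit every combination in the matrix."""
    return prod(factors.values()) * hours_per_point


total_hours = total_beam_hours(matrix, HOURS_PER_POINT)
total_days = total_hours / 24
total_years = total_days / 365

if __name__ == "__main__":
    print(total_hours, "hours =", total_days, "days = about",
          round(total_years, 1), "years")
    # Hypothetical extra factor: three temperatures triples the total.
    with_temperature = dict(matrix, temperatures=3)
    print(total_beam_hours(with_temperature, HOURS_PER_POINT), "hours")
```

Against a 12 hour beam run, this is why the matrix has to be bounded by the application rather than covered exhaustively.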

Test planning requires much more thought in the modern age as does understanding of data collected (be wary of databases).

Only so much can be done in a 12 hour beam run – application-oriented

Scaled CMOS Test Challenges - Presented by Kenneth A. LaBel, GOMAC Conference, Orlando, Fl 3/22/07


# Concept for a "Generic" SEE Test

Your mileage might vary (this is just a concept!)

Key is bounding!

Step 2 in figure is **corner cases/nominal** as representative generic study recommendations

#### **Product Integrity SEE Test (PISEET)**

General concept is along the lines of the package integrity demonstration test plan (PIDTP)

#### Step 1:

 Test relevant test structure to determine specific issues such as temperature, angle, voltage, SET propagation, etc.... If no test structure available, document test factors required based on technology and architecture (else, lots more testing).

#### Step 2:

 Test product in 3 ways: minimal/no operation, mid-level resource loading, maximum resource loading (or as high as practicable). Test plan would require review and approval (like with packaging version...).

#### Step 3:

 Provide caveat that this is not a product qualification or guarantee and that specific applications/risk postures may require additional testing

Beam characteristics also to be discussed

To be presented by Kenneth A LaBel at the virtual JEDEC (originally Joint Electron Device Engineering Council), September 16, 2020.

#### Test Coverage – How Much Beam and Ions are Enough?

- A quick sidebar/reiteration
  - SEEs are a function of where the energy is deposited (geometric consideration) and when the energy is deposited vs circuit operation (temporal consideration)
  - Geometric is easy to understand: either a transistor is hit or not. Classic example is a bit flip in a static memory array
  - Temporal is a bit more complicated. The classic example here is a clocked latch: whether a transient is captured depends on when the charge is deposited vs the "sensitive time window" of the clock edge (i.e., when that transient would propagate to a change of state or when it wouldn't)
    - » This was a simplistic example. Consider the question of when some transistor gets an ion hit in a system-on-a-chip (SOC) versus the myriad of potential operations happening
- The discussion revolves around **reasonable** statistical coverage for geometric and temporal concerns and a few other factors...
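A very simplified way to picture the temporal consideration for the clocked-latch example: if strikes arrive at random times and only those landing inside the sensitive window near a clock edge are captured, the captured fraction scales with window length times clock frequency. This is a sketch under that assumption alone (one window per clock period, zero-width transient); the window and clock values are hypothetical.

```python
def capture_fraction(window_s, clock_hz):
    """Fraction of randomly timed strikes that land in the sensitive window,
    assuming one window of length `window_s` per clock period."""
    return min(1.0, window_s * clock_hz)


def strikes_for_expected_captures(n_captures, window_s, clock_hz):
    """Strikes on the node needed, on average, to see `n_captures` captures."""
    return n_captures / capture_fraction(window_s, clock_hz)


if __name__ == "__main__":
    # Hypothetical latch: 100 ps sensitive window, 100 MHz clock.
    frac = capture_fraction(100e-12, 100e6)
    print(f"captured fraction ~ {frac:.3f}")
    print(f"strikes for 10 captures ~ "
          f"{strikes_for_expected_captures(10, 100e-12, 100e6):.0f}")
```

In a real device the transient has its own width and the window varies with operating state, so the geometric hit count alone understates the fluence needed for temporal coverage.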

# Let's Start at the Beginning

# What's All This Fluence Stuff, Anyhow?



#### Fluence is:

 The number of particles impinging on the surface of a device during a single ion beam test run normalized to a square centimeter. Denoted F.

#### It is NOT:

- Cumulative fluence: the sum of all individual fluence levels for all beam runs (usually only for a given ion, energy, and angle).
- Effective fluence: beam run fluence multiplied by cos(θ), where θ is the angle of incidence measured from the surface normal.
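
To make the distinction between these three quantities concrete, here is a minimal Python sketch. It assumes the common convention that effective fluence is the run fluence multiplied by cos(θ), with θ measured from normal incidence. The function names and the run values are hypothetical.

```python
import math

def effective_fluence(run_fluence, angle_deg):
    """Run fluence (particles/cm^2) scaled by cos(theta) for a tilted beam.

    Assumption: theta is measured from the surface normal.
    """
    return run_fluence * math.cos(math.radians(angle_deg))

def cumulative_fluence(run_fluences):
    """Sum of individual run fluences (same ion, energy, and angle)."""
    return sum(run_fluences)

# Hypothetical per-run fluences for three beam runs, in particles/cm^2
runs = [1.0e6, 1.0e6, 5.0e5]

print("fluence of run 1:    ", runs[0])
print("cumulative fluence:  ", cumulative_fluence(runs))
print("effective at 60 deg: ", effective_fluence(runs[0], 60.0))
```

Keeping the three values as separate quantities in test logs avoids mixing a per-run F with a summed or angle-corrected one when cross-sections are computed later.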



# What's My Number?

#### The Challenges



- There are four basic considerations for determining fluence levels:
  - Geometry:
    - The number of potentially sensitive nodes or transistors in the device (statistical node coverage).
  - Operation (and propagation):
    - The dynamic operation of the device under test (statistical state and error propagation coverage).
  - Sample size:
    - The number of samples of the device being used in the system (statistical system coverage).
  - POF and (more) statistics:
    - The environment exposure and particle kinematics (i.e., what happens when a particle strikes the semiconductor).
- Note, for dynamic operations we are often looking not only at measuring a cross-section, but determining as many possible error signatures as reasonable.
  - A simple example is the range of transients induced in an amplifier.

To be presented by Kenneth A. LaBel at the Single Event Effects (SEE) Symposium and the Military and Aerospace Programmable Logic Devices (MAPLD) Workshop, La Jolla, CA, May 19-22, 2014.

### Geometric Coverage

#### Gee, I'm a Tree!

- This is the simplest of the challenges to discuss. So consider,
  - If a memory device under test (DUT) has a billion bits (Gbit), how many random particle strikes on the die surface are required to cover a sufficient number of potentially sensitive bits in order to obtain good statistics?
    - 1%?, 10%?, 50%?, 100%?
  - Ask yourself, what is the objective?
    - Mean distribution?
    - Corner cases?
  - Suggest 10% at a minimum, but…
    - Remember there's timing involved (more to come next)...
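
As a rough feel for the numbers, the sketch below estimates how many random strikes it takes to hit a given fraction of bits at least once. It assumes every strike lands on exactly one bit chosen uniformly at random (no dead area, no multi-bit hits), which is a simplification. The function names are illustrative only.

```python
import math

def fraction_hit(strikes, n_bits):
    """Expected fraction of bits hit at least once after `strikes` random hits.

    Uses 1 - exp(-strikes / n_bits), a close approximation for large n_bits.
    """
    return -math.expm1(-strikes / n_bits)

def strikes_for_coverage(target_fraction, n_bits):
    """Number of random strikes needed to reach the target coverage fraction."""
    return -n_bits * math.log1p(-target_fraction)

n_bits = 1_000_000_000  # 1 Gbit device
for target in (0.01, 0.10, 0.50):
    needed = strikes_for_coverage(target, n_bits)
    print(f"{target:.0%} coverage needs about {needed:.3g} strikes")
```

Under these assumptions, 10% coverage of a Gbit array takes on the order of 1e8 strikes on the array. 100% coverage is not reachable with a finite number of random strikes, which is one reason to decide on the objective first.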



#### This is a figure depicting ion beam particle interaction with a target at a cyclotron

Courtesy of Rod Nave http://hyperphysics.phy-astr.gsu.edu/hbase/nuclear/imgnuc/crosec.gif


# Time is NOT on Our Side

- Test Like You Fly (TLYF) isn't quite what you think it is
  - It's using a representative application that provides the appropriate information for the actual flight utilization.
  - Remember that ground testing is an accelerated test (i.e., particle rates are far higher than during the mission) and the test setup needs to accommodate this complication. (kudos M. Berg)
  - More on this shortly.

#### **Dynamic Operation Constraints**



- State space issues: Assume that a particle strikes a specific location (sensitive node). What can happen?
  - An error can occur immediately,
  - An error can occur at an undetermined time (and/or location) later, or
  - Nothing.
- Why? Let's look at that Gbit memory.
  - How long might it take to cycle through the device memory space?
     Maybe a minute or so? Is it a simple form of propagation?
  - What if I'm writing over the memory space? Is it possible to clear errors by re-write and never detect them?
- Take, for example (courtesy Melanie Berg), a 64-bit counter.
  - There are 2<sup>64</sup> states.
  - At an operational frequency of 50 MHz (20 nsec per state), it takes over 300 billion seconds to cover all states.
    - Not happening during a beam run.
    - Key is understanding the error signature space and propagation effects...
       (ask Melanie about "Test Like You Fly" not always best).
  - Remember, each state has the same random chance of taking a hit.
    - Consider a truly complex device like a system on a chip.
- Operating state coverage (statistics), and error signatures.
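
The state-traversal arithmetic for an N-bit counter can be sketched as below. This assumes one state per clock cycle and a free-running counter; the function name is hypothetical. At 50 MHz the sketch gives roughly 86 seconds for 32 bits and roughly 3.7e11 seconds for 64 bits, so only the wider counter lands in the "hundreds of billions of seconds" range quoted above.

```python
def seconds_to_cover_all_states(n_bits, clock_hz):
    """Time to step once through every state of an n-bit counter.

    Assumption: the counter advances exactly one state per clock cycle.
    """
    return (2 ** n_bits) / clock_hz

CLOCK_HZ = 50e6  # 50 MHz, i.e. 20 ns per state

for width in (16, 32, 48, 64):
    seconds = seconds_to_cover_all_states(width, CLOCK_HZ)
    print(f"{width}-bit counter: {seconds:.3g} s to visit all states")
```

Even when full traversal fits in a beam run, each state is still only sampled briefly, so the practical goal remains understanding error signatures and propagation instead of exhaustive state coverage.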


# How Many Devices to Test

- Most tests use 3-5 samples of a device
- Not all samples will necessarily be "fully tested"; instead, aim for sufficient "common condition" test points across samples (homogeneity determination)

#### (Sample) Size Matters



- Besides the usual discussion of statistical relevance of samples from a single wafer lot, consider what the test results will be applied to.
  - How many samples in the flight application are being used?
    - There's a big difference between flying two samples of a device and one thousand!
    - Outlier results are important when device is being used extensively. [1]
- It's also important to grasp the idea of limiting cross-section (i.e., no events observed).
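
To illustrate why the number of flight samples matters, the sketch below scales a per-device event rate to a system with many devices. It assumes independent devices with identical, constant (Poisson) event rates; the rate and mission length are hypothetical values, not data from any test.

```python
import math

def prob_at_least_one_event(rate_per_device_day, n_devices, mission_days):
    """Probability that at least one device sees an event during the mission.

    Assumption: events are Poisson with the same constant rate on every device.
    """
    expected_events = rate_per_device_day * n_devices * mission_days
    return -math.expm1(-expected_events)

RATE = 1.0e-6        # hypothetical events per device-day
MISSION_DAYS = 365   # hypothetical one-year mission

for n in (2, 1000):
    p = prob_at_least_one_event(RATE, n, MISSION_DAYS)
    print(f"{n:>4} devices: P(at least one event) = {p:.4f}")
```

With these made-up numbers the probability goes from well under 1% for two devices to roughly 30% for one thousand, which is why outlier parts and rare signatures deserve more attention when a device is used extensively.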



How important is knowing outliers in SEE testing?

[1] K.A. LaBel, A.H. Johnston, J.L. Barth, R.A. Reed, C.E. Barnes, "Emerging Radiation Hardness Assurance (RHA) Issues: A NASA Approach for Space Flight Programs," IEEE Trans. Nucl. Sci., Vol. 45, No. 6, pp. 2727-2736, Dec. 1998.


# The Mission Matters, But...

#### **Application Environment**

- Rule #1: Ground irradiation is a confidence test and not a precise risk definition process.
  - The test is being performed to "bound" a problem. In other words,
    - Test fluence levels are not meant to be the same as what a device will be exposed to, but to provide confidence that the risk will be less than X of occurring.
    - Remember, X can be based on a limiting cross-section when no events have been observed
      - Though not likely true, assume that the next particle that hits the DUT causes an event, so that the limit of the cross-section is ~1/F.
  - It is important to remember that a test fluence of two to ten times a mission predicted fluence only goes so far in reducing risk.
    - Higher levels should be considered (keeping in mind total dose concerns at the DUT level) for better risk reduction.
    - If a mission proton fluence (of energies of interest) is 10<sup>9</sup>, what does a test to 10<sup>10</sup> buy?
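
The limiting cross-section idea can be sketched with Poisson statistics. This assumes events are independent and Poisson-distributed, that zero events were observed, and that fluences are in particles/cm<sup>2</sup>. The function names are hypothetical. Under those assumptions the ~1/F rule of thumb corresponds to roughly 63% confidence, and a 95% bound is about 3/F.

```python
import math

def limiting_cross_section(test_fluence, confidence=0.95):
    """Upper bound on cross-section (cm^2) when zero events were observed.

    Zero-event Poisson bound: sigma < -ln(1 - CL) / F.
    """
    return -math.log(1.0 - confidence) / test_fluence

def mission_events_bound(test_fluence, mission_fluence, confidence=0.95):
    """Upper bound on the expected number of mission events per device."""
    return limiting_cross_section(test_fluence, confidence) * mission_fluence

MISSION_FLUENCE = 1.0e9  # particles/cm^2, from the example above

for test_fluence in (2.0e9, 1.0e10, 1.0e11):
    bound = mission_events_bound(test_fluence, MISSION_FLUENCE)
    print(f"test to {test_fluence:.0e}: expected mission events < {bound:.3f} (95% CL)")
```

Read this way, a clean test to 10<sup>10</sup> against a 10<sup>9</sup> mission fluence bounds the expected event count at about 0.3 per device at 95% confidence, which may or may not be enough depending on criticality and how many devices fly.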


#### Ions, LETs, Angles, Energy - Planning

- Depending on the Mission Requirement, the upper end of the LET spectra used for test may vary
- The figure illustrates a selection of ions and angles to vary LET and get full coverage during testing
- Energy is another variable to modify test LET
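
A minimal sketch of how tilt angle stretches the LET coverage of a single ion follows. It assumes the usual cosine law, effective LET = LET / cos(θ), which holds only for thin sensitive volumes and when the ion still has enough range at the tilted path length. The ion LET values are hypothetical placeholders, not facility data.

```python
import math

def effective_let(normal_incidence_let, angle_deg):
    """Effective LET (MeV*cm^2/mg) at a tilt angle from the surface normal.

    Assumption: thin sensitive volume, so path length scales as 1/cos(theta).
    """
    return normal_incidence_let / math.cos(math.radians(angle_deg))

# Hypothetical normal-incidence LETs for three beam choices
ion_lets = {"ion A": 3.0, "ion B": 10.0, "ion C": 28.0}
angles = (0.0, 45.0, 60.0)

for ion, let in ion_lets.items():
    points = ", ".join(f"{effective_let(let, a):.1f}" for a in angles)
    print(f"{ion}: effective LET at {angles} deg = {points}")
```

Plotting the resulting points on one axis shows where gaps remain between ions, and where a different ion or beam energy is needed instead of a steeper angle.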



Courtesy Megan Casey, NASA

# Measured Data on Complex Device - Caveat

## Modeling Empirical Cross-Sections from Beam Experiments





#### Empirical cross sections are not pure:

- P<sub>gen</sub>: physics, sensitive region, basic mechanisms. Generally what our models target. Probability that an ion strike will generate an SET/SEU
- P<sub>Effect</sub>: design, operation, frequency: Incorporates design dependent topology and frequency as a transfer function (H(s)). Given P<sub>gen</sub>, what is the probability that the system will be disturbed?
- P<sub>observe</sub>: test system and test conductor. Probability that the system disturbance is observed. Goal is to capture and observe every event with P<sub>observe</sub> = 1

#### What is the goal of the experiment?

- If attempting to measure P<sub>gen</sub> (perhaps to compare to a model or perform basic mechanism research): P<sub>Effect</sub> and P<sub>observe</sub> must approach 1.
- If attempting to apply mitigation and measure its efficacy: P<sub>Effect</sub> should approach 0 and P<sub>observe</sub> must approach 1.
- No one test-type and analysis fits all.
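
One way to picture the three terms is as a simple product, sketched below. Treating them as independent probabilities that multiply is an assumption made here for illustration; the slide does not state a formula, and the function name and example values are hypothetical.

```python
def measured_event_probability(p_gen, p_effect, p_observe):
    """Probability that an ion strike shows up as a recorded event.

    Assumption: the three stages are independent, so probabilities multiply.
    """
    for name, p in (("p_gen", p_gen), ("p_effect", p_effect), ("p_observe", p_observe)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return p_gen * p_effect * p_observe

P_GEN = 0.02  # hypothetical generation probability per strike

# Basic-mechanism experiment: design and test system should be transparent.
print(measured_event_probability(P_GEN, p_effect=1.0, p_observe=1.0))

# Mitigation experiment: a good mitigation drives p_effect toward zero.
print(measured_event_probability(P_GEN, p_effect=0.01, p_observe=1.0))

# Leaky test system: events are generated and propagate, but half are missed.
print(measured_event_probability(P_GEN, p_effect=1.0, p_observe=0.5))
```

The third case is the dangerous one: the measured number looks like a lower device sensitivity when it is really a test system limitation.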

Courtesy of Melanie Berg, SPACER2

#### Ken's key takeaway:

The capabilities of the test system need to be included in the interpretation of complex data sets.

This is especially true for those test devices with a large number of operational states and IP blocks (processors, FPGAs, SOCs) and cases where some events are missed due to another event "crashing" the device.

Remember flux(ground test) >> flux(space).

#### Observability and Capture –

- Start with a complex modern multi-million transistor device with multiple embedded and soft IP blocks, like the one at the right
- Ions are randomly impacting across the entire device (unless localization is done)
  - Any area may be "hit" at any time
- Operationally, not all areas of the device are active at one time nor are able to be interrogated "instantaneously" by a test system
- The "lag time" between the test system observability and when the particle actually impacted the device may cause either
  - Incorrect measurements of fluence to the event or
  - Masked events
    - Area 1 has an ion event but has not yet been interrogated by the test system
    - Area 2 has an event that crashes the device and area 1 event never gets observed

#### Test Level Issues

- One of the difficult challenges in testing any modern complex device (processor, FPGA, SOC,...) is:
  - Events that "crash" the device occur so readily that providing "traditional" 1E7 SEL fluence levels can be a challenge
  - In other words, if SEFIs keep crashing the device, will we:
    - Be able to obtain sufficient fluence levels for confidence?
    - Mask potential SEL events or other SEU events?
  - The higher the blue screen of death (BSOD) rate, the harder it is to reach SEL test fluence levels
- Diatribe: high current ≠ SEL...
  - Be aware that there are a myriad of reasons (mostly circuit related SEUs) that cause increases in current consumption – BIG CHALLENGE!

Will it take ~100 test runs to get to an effective fluence of 1E7 if each SEFI crashes the device?
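
A back-of-the-envelope version of this question: if every SEFI ends the run, the average fluence accumulated per run is about 1/σ<sub>SEFI</sub>, so the number of runs needed scales as target fluence times SEFI cross-section. The sketch assumes a constant SEFI cross-section and that each run ends at the first SEFI. The 1e-5 cm<sup>2</sup> value is a hypothetical chosen to reproduce the ~100 runs in the question, and the function names are illustrative.

```python
def mean_fluence_per_run(sefi_cross_section_cm2):
    """Average fluence (ions/cm^2) accumulated before a SEFI ends the run."""
    return 1.0 / sefi_cross_section_cm2

def expected_runs(target_fluence, sefi_cross_section_cm2):
    """Approximate number of runs needed to accumulate the target fluence."""
    return target_fluence / mean_fluence_per_run(sefi_cross_section_cm2)

TARGET_FLUENCE = 1.0e7  # ions/cm^2, the traditional SEL test level

for sigma in (1.0e-7, 1.0e-6, 1.0e-5):
    runs = expected_runs(TARGET_FLUENCE, sigma)
    print(f"SEFI cross-section {sigma:.0e} cm^2 -> about {runs:.0f} runs")
```

Each restart also costs reboot and reconfiguration time, so the beam-time impact grows faster than the run count alone suggests.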



Fig. 8. SEFI cross-section as a function of ion effective LET for four different data rates and 5 paths through the switch. The solid line is a fit to the data using a Weibull function.

Stephen Buchner, et al, "Characteristics of Single-Event Upsets in a Fabric Switch (ADS151)" https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1442463

# IS THIS REALLY A CURVE?



# So, This is a Pretty Ideal Curve of Results

- Should be relatively easy to draw a Weibull curve for fitting (rate prediction)
- Complex Devices rarely look this neat and clean
  - Results are usually NOT homogeneous
- The following slide is a top-level view of why data might look weird (and is by no means a complete list)
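
For reference, the Weibull form commonly used for cross-section versus LET fits can be sketched as below. The four-parameter form (saturation cross-section, onset LET, width, shape) is the widely used convention and is assumed here; the parameter values are hypothetical and not a fit to the data shown.

```python
import math

def weibull_cross_section(let, sigma_sat, let_onset, width, shape):
    """Four-parameter Weibull cross-section (cm^2) at a given LET.

    sigma(L) = sigma_sat * (1 - exp(-((L - L0) / W) ** s)) for L > L0, else 0.
    """
    if let <= let_onset:
        return 0.0
    return sigma_sat * (1.0 - math.exp(-(((let - let_onset) / width) ** shape)))

# Hypothetical fit parameters for illustration only
params = dict(sigma_sat=1.0e-4, let_onset=2.0, width=20.0, shape=1.5)

for let in (1.0, 5.0, 20.0, 60.0):
    sigma = weibull_cross_section(let, **params)
    print(f"LET {let:5.1f} MeV*cm^2/mg -> sigma = {sigma:.3e} cm^2")
```

When data from a complex device refuse to follow a single curve like this, it is often a sign that several mechanisms or IP blocks with different onsets are being lumped together.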



#### Weird Data is a Talk Unto Itself

#### But Consider, for example

- Device complexity and layout
  - Variety and complexity of error signatures: bit errors, power cycles, SEFIs, SELs, stuck bits, ...
  - Rare events: in other words, a less dominant SEE event type/signature. Usually, it's a small physical IP or circuit portion of the overall device.
- Facility/ion issues
  - Did the operator give you Neon or Nitrogen?
  - Is there a secondary ion being mixed with the prime ion?
  - Noise: magnetic, RF, electrical, ...
  - Flux rate: did you design your data capture to meet expected events/sec?
- Materials in the Device
  - As per the figure, the prime ion can interact with materials and cause a higher LET secondary (indirect ionization) event


Image courtesy of Vanderbilt University



# WHAT MANAGEMENT NEEDS TO KNOW IN THE AFTERMATH

#### Management Rarely Cares About Technical Details

While a well-documented test report is essential for configuration control (and trust me, even if the test is for "one" mission, someone will want it for another purpose later), the KISS method is usually best with management. They don't want a lot of numbers.

My personal responses tend to be similar to

- "It passed the go/no-go criteria"
- "The event rate is well below AVAILABILITY requirement"
- "We need to discuss with the design team mitigation or alternate options" (not one they want to hear)
- "Design team reviewed and the events are already accounted for in circuit operation (mitigation)", and so on...

Of course, it will depend on the management and test objective



# FINAL RECOMMENDATION AND CAVEAT

#### Create Your Own Checklist

- Create a priority approach of test objectives based on:
  - Device operating modes, voltage levels, frequencies, ...
  - Device physics
    - » Angles, ions, energies, ...
    - » Beam characteristics
- An early description of the checklist approach





# Are Current SEE Test Procedures Adequate for Modern Devices and Electronics Technologies?

Kenneth A. LaBel
Co-Manager,
NASA Electronic Parts and Packaging (NEPP) Program
NASA/GSFC
ken.label@nasa.gov
301-286-9936
http://nepp.nasa.gov

Lewis M. Cohn, Defense Threat Reduction Agency Ray Ladbury, NASA/GSFC

https://radhome.gsfc.nasa.gov/radhome/papers/HEART08\_LaBel\_pres.pdf

## And Finally...

## Radiation and SEE Testing are a "Black Art"

- This talk provided some of the reasons that it is
- It also really provided the limitations of ANY complex device SEE test

Experience and working with someone experienced is not a cure-all, but really does help

 Understand that even someone experienced is not an expert in all types of devices and SEE testing (wide-bandgap power vs SOC, for example)

Feel free to reach out if you have any questions









### Break













Complexity, Testability and Single Event Effects (SEE): Test and *Assurance* Considerations



Michael Campola (NASA-GSFC)

#### So you found some data? There are limitations

- Radiation testing is destructive
- At best you are flying parts from a well sampled/tested wafer run
- Test data applicability is paramount

## Caveat on Applying Information

 The relationship between your device and the precision of information available may impact what you might infer for testing



DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited

This is again notional and may vary by event type as well as by part type, technology, etc. Lower precision might indicate more conservatism in test planning (i.e., start at lower LETs, etc.)

#### Intentional test results can increase fidelity

- Bounding allows for engineering trades to be explored
- Complex devices need targeted application-based test campaigns



#### How do we assure something that is complex?

Figure: assurance flow for complex devices, progressing from Technology Feasibility (show-stoppers) through Preliminary Characterization and Architectural Demonstration to System Assurance. Activities are grouped by who drives them: the radiation effects community; the design community (intentional test design); and the environment and end user or project/program.

NASA Electronic Parts and Packaging (NEPP)

#### How do you plan to mitigate?



Without information on the flight design, we don't know if we are in or out of bounds. This isn't just a parts selection question.

#### Primary system-level steps to mitigating for radiation



- Shield for TID/TNID and tolerate parametric drift; redundancy is only relevant if parts degrade more slowly when off (this is not the norm)
- Avoid destructive SEE at all costs and avoid unknown, untested parts; this is the parts selection concern
- Anticipate non-destructive SEE signatures for a given family of devices; this is a circuit/system design concern
  - Filtered power supplies
  - Redundant computers, hardened FPGA designs
  - EDAC on memories
  - Watchdog timers and autonomous resets
  - Power limiting to susceptible devices
  - Identify the risks, explore the possible consequences
  - Be able to power-cycle part/board/box if you don't know

# Damage is a two-fold problem

- Dose shows up as you'd expect: wear-out mechanism (cumulative)
  - many damage sites or trapped charges accrue over time
- Single events show up as random failures-in-time (instantaneous)
  - one particle with sufficient energy deposition in the right location



# Redundancy may or may not solve the issue



# Mitigation techniques

| Mitigation Techniques | TID | DDD | SEE | Charging |
|-----------------------|-----|-----|-----|----------|
| Part Selection        | X   | X   | X   |          |
| Material Selection    |     |     |     | X        |
| Shielding             | X   | X   | (X) | X        |
| Operating Parameters  | X   | X   | (X) |          |
| CONOPS                | X   | X   | X   | X        |
| Circuit Design        | X   | X   | X   |          |
| EMI Design            |     |     |     | X        |
| TMR                   |     |     | X   |          |
| EDAC                  |     |     | X   |          |
| Watchdog              |     |     | X   |          |
| Cold Spare            | (X) | (X) | (X) |          |

Adoption of mitigation techniques occurs throughout the lifetime of the satellite

# Deciding if you need to mitigate at all

#### **Error-Functional**

- High number of SEE signatures allowable
- Design may inherently be indifferent to SEE signatures with mitigation in place or robust design practices
- Nuisance or manageable function impacts (e.g., filtered transients, error detection and correction on memories) beyond part responses
- No action needed

#### Error-Vulnerable

- Low number of SEE signatures tolerable
- Design may require function for a small window of availability or spend very little time in the susceptible state
- Mitigation needed in order to be reclassified as error-functional (e.g., SEFI of Flash, multi-bit upsets)
- Ground or autonomous operations must be anticipated

#### **Error-Critical**

- SEE signature not allowable
- Disruption of function identified as a single point of failure, or design cannot continue to perform after SEE
- Mitigation needed in order to be reclassified as error-vulnerable (e.g., destructive SEL, accumulation of many errors, boot image corrupted due to error accumulation, SEFI that requires ground intervention or a box-level reset waiting on ground)
- Anomaly review needed or loss of mission

# Mitigating with system architecture

#### **Hardness Assurance**

Severity Assessment (Device Technology + MEAL + ConOps)

Carry anticipated error collection and impact at higher level (e.g., ConOps, Contingency, FMECA, or WCA)

Remove Susceptible Components Within Function



# Criticality, Availability, Operability, Reliability

- Functional Analysis
  - Identify critical functions
  - Determine subsystems and components/structures across function
- Determine Criticality
  - Critical
  - Vulnerable
  - Functional
- Evidence & Trades
  - Test data
  - Mitigation or Maintenance



# **Example System/Function**

## Power Management Integrated Circuit

- Simple example would be current sense and switching on a **BiCMOS** process

## System on Chip

- Complex device, multi-function
- Highly scaled CMOS, mixed analog and digital signals

## Flash Memory

- Dense storage (floating gate + CMOS)
- Complex memory management and interface circuitry



# **Example System/Function States**

#### Safe-State

- Power Management does not output supply voltage

#### Loading Boot Image

- All devices powered
- Read operation from Flash Memory

#### Operations

- System on Chip in heavy usage
- Flash memory powered but no read/write


## SEE Susceptibilities of Functional States

#### Safe-State

- Error accumulation could corrupt boot image

#### Loading Boot Image

- Interrupts in the System on Chip or memory during the load could invalidate the boot image
- Supply voltage dropouts

#### Operations

- Bad commands could lead to hangs or locked states
- Corrupt data, packet loss, etc.
- Supply voltage dropouts



www.nasa.gov

# Device SEE Susceptibilities

#### Safe-State

- Power Management IC susceptible to SET, SEU, SEL
- Flash Memory storage cells susceptible to SEU

#### Loading Boot Image

- Power Management IC susceptible to SET, SEU, SEL
- Flash Memory susceptible to SEFI
- System on Chip susceptible to SEFI, SEU

#### Operations

- Power Management IC susceptible to SET, SEU, SEL
- System on Chip susceptible to SEFI, SEU




# Scaling Based on Informative Inputs

| Device, circuit, or system | Power Management IC | Severity scale (1-10) |
|----------------------------|---------------------|-----------------------|
| Factor                     | Detail              | Severity of Factor    |
| Technology                 | BiCMOS              | 5                     |
| Device Complexity          | Low                 | 2                     |
| SEE Type                   | SET, SEU            | 5                     |
| Functional Analysis        | Supply voltage dropout to other devices downstream | 10 |

0-1 scaling of mission impact during usage state (design rationale/influence). Rows follow the factor order above.

| Factor              | SAFE: State Duration = 1 month | SAFE: Temperature State = Low | BOOT: State Duration = 5 minutes | BOOT: Temperature State = Warm | OPS: State Duration = 6 months | OPS: Temperature State = Hot |
|---------------------|------|------|-----|-----|---|---|
| Technology          | 0.8  | 0.25 | 0.9 | 0.5 | 1 | 1 |
| Device Complexity   | 1    | 1    | 1   | 1   | 1 | 1 |
| SEE Type            | 1    | 0.25 | 1   | 0.5 | 1 | 1 |
| Functional Analysis | 1    | 1    | 0.1 | 1   | 1 | 1 |

| Factor              | SAFE: Criticality = Low | SAFE: Duty Cycle = Low | BOOT: Criticality = High | BOOT: Duty Cycle = On all the time | OPS: Criticality = High | OPS: Duty Cycle = On all the time |
|---------------------|------|------|---|---|---|---|
| Technology          | 0.25 | 0.25 | 1 | 1 | 1 | 1 |
| Device Complexity   | 0.25 | 0.25 | 1 | 1 | 1 | 1 |
| SEE Type            | 0.25 | 0.25 | 1 | 1 | 1 | 1 |
| Functional Analysis | 0.25 | 0.25 | 1 | 1 | 1 | 1 |

Totals: SAFE = 0.89, BOOT = 7.75, OPS = 22
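
The totals above can be reproduced with a short calculation. The rule used here (each factor's severity multiplied by all of its 0-1 scalings for a state, then summed over factors) is inferred from the numbers on the slide and is not stated there explicitly, so treat it as an assumption. Variable and function names are hypothetical.

```python
import math

# Severity of each factor for the Power Management IC (1-10 scale)
severity = {
    "Technology": 5,
    "Device Complexity": 2,
    "SEE Type": 5,
    "Functional Analysis": 10,
}

# 0-1 scalings per state, in the order:
# (state duration, temperature state, criticality, duty cycle)
scaling = {
    "SAFE": {
        "Technology": (0.8, 0.25, 0.25, 0.25),
        "Device Complexity": (1, 1, 0.25, 0.25),
        "SEE Type": (1, 0.25, 0.25, 0.25),
        "Functional Analysis": (1, 1, 0.25, 0.25),
    },
    "BOOT": {
        "Technology": (0.9, 0.5, 1, 1),
        "Device Complexity": (1, 1, 1, 1),
        "SEE Type": (1, 0.5, 1, 1),
        "Functional Analysis": (0.1, 1, 1, 1),
    },
    "OPS": {factor: (1, 1, 1, 1) for factor in severity},
}

def state_total(state):
    """Sum over factors of severity times the product of that factor's scalings."""
    return sum(
        severity[factor] * math.prod(scaling[state][factor])
        for factor in severity
    )

totals = {state: state_total(state) for state in scaling}
for state, total in totals.items():
    print(f"{state}: {total:.2f}")
```

Keeping the scalings in one structure like this makes it easier to apply the same rationale consistently across devices and to update the totals as more information arrives.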

# Consistency is Key to Success


Scaling can be provided based on rationale and assumptions:

- Calculated/predicted upset rate during window or phase
  - Availability requirements
  - Non-impact
- Environment/Temperature dependence of SEE mechanism
  - Solar particle event or nominal
  - Elevated temp and Latchup
  - Stability of signal
- Device/circuit/system operation susceptibility to SEE
  - Read only mode
  - Sleep state
  - Duty cycle

Will be iteratively refined with more information and fidelity

Figure: callout of the BOOT-state scaling columns (State Duration = 5 minutes: 0.9, 1, 1, 0.1; Temperature State = Warm: 0.5, 1, 0.5, 1) overlaid on the SAFE/BOOT/OPS severity scaling tables.

# Aggregation Allows for Optimization

# Focus Resources

| Signature                                          | SAFE | BOOT  | OPS  |
|----------------------------------------------------|------|-------|------|
| Flash Errors accrued during safe state             | 28   | X     | X    |
| Flash SEFI                                         | X    | 16.6  | X    |
| SET/SEU dropout of supply voltage                  | 0.89 | 7.75  | 22   |
| SEL of Power Mgmt IC                               | 0.97 | 10.25 | 27   |
| SoC SEFI crash or hang in need of reinitialization | X    | 21.3  | 38   |
| SoC SEU bad data                                   | X    | 20.3  | 22.8 |

Will be an iterative solution with additional information: This can also be done for the internal blocks of a complex device

## Recommendations based on perceived risks

#### Safe-State

- Periodic scrubbing of Flash so that you don't end up with more errors than you can correct (Error-Functional)
- Reporting the rate needed for scrubbing may require a measured cross-section (Error-Vulnerable)

#### Loading Boot Image

- Before loading the boot image, successfully power cycle the Flash (Error-Vulnerable)
- SoC must be able to detect a bad image
- Must have a cross-section for the SoC in order to ensure successful operations

#### Operations

- Power Management IC is susceptible to SET, SEU, and SEL: anticipate this, and consider the time hit of reinitializing the SoC if needed (Error-Vulnerable)
- System on Chip is susceptible to SEFI; therefore the SoC must have a watchdog to continue functioning (Error-Vulnerable)
- SEL for Power Management IC handled above the subsystem level



# Common pitfalls, lessons learned

- Thinking radiation is one number to meet
  - Dose profile behind different amounts of shielding also depends on the type of incident radiation
  - SEE with low-LET susceptibilities can benefit from some shielding, but higher-LET particles will still be present
  - Bringing radiation engineering in late to the design process is not a good idea
- Tight tolerance in application
  - Not considering the dynamic environmental conditions
  - Derating is your friend
- Overly complex mitigation doesn't solve the problem
  - Verification of mitigation very well could require testing, and more money
  - Additional susceptibilities introduced into reliability overall
- Don't forget about other environment driven failures
  - Charging / Corrosion
  - Temperature
- Heritage? What heritage?
  - Part to part variation, lot to lot variation
  - Better predictor for dose performance if you have part fidelity
  - Not very good rationale for SEE






## Recent NASA Guidelines

- Avionics Radiation Hardness Assurance (RHA) Best Practices (NESC-RP-19-01489)
  - Covers TID, TNID, and SEE
  - Development of new NASA technical standard for RHA to be released
- Application to COTS Electronics
  - Radiation effects issues with COTS parts are the same as with others
  - Guidance on robust methods to handle unit-to-unit variability
  - Guidance on test and evaluation to help address COTS testing challenges
  - Single-Event Effects Criticality Analysis

## Recent NASA Guidelines

- Recommendation on Use of Commercial-Off-The-Shelf (COTS) Electrical, Electronic, and Electromechanical (EEE) Parts for NASA Missions (NESC-RP-19-01490)
  - See both Phase I & II
- Highlighted finding:
  - F-4: There is a lack of consensus within NASA on the perception of risk using COTS parts for safety and mission critical application in spaceflight systems. It varies from feelings of "high risk" when part-level MIL-SPEC/NASA screening and space qualification are not fully performed to "no elevated risk" when sound engineering is used, and part application is understood.

## Attribution for existing content

- A lot of this content has been previously put together
  - Radiation Effects & Analysis Group (REAG) members: Megan Casey, Ken LaBel, Ray Ladbury, Jonny Pellish, Ted Wilcox, Mike Xapsos, and others
  - Outside help: Jet Propulsion Lab (JPL), Radiation Test Solutions (RTS), Univ. Tennessee Chattanooga (UTC), and others
- You can find those resources readily in NASA Technical Reports Server (NTRS) by searching for:
  - Texas A&M University (TAMU) Cyclotron Facility Bootcamp
  - NASA Space Radiation Lab (NSRL) Radiation Test Workshop
  - NASA Electronic Parts and Packaging Electronics Technology Workshop (NEPP ETW)
  - SEE/MAPLD
- NASA Engineering & Safety Center (NESC) Academy has video content of radiation 101



## Radiation tools out there (free)

- SmallSat / System Architecture
  - R-Gentic https://vanguard.isde.vanderbilt.edu/RGentic/
  - SEAM https://modelbasedassurance.org/
- Rate Calculations
  - CRÈME https://creme.isde.vanderbilt.edu/
- Environments and Transport
  - SPENVIS https://www.spenvis.oma.be/
  - OMERE http://www.trad.fr/en/space/omere-software/
  - OLTARIS https://oltaris.nasa.gov
  - SRIM http://www.srim.org/

michael.j.campola@nasa.gov

# THANK YOU