Mitigating The Effects Of Radiation On Advanced Automotive ICs

An automated, systematic approach to identifying design susceptibility to single event upset (SEU) through structural and static analysis.


The safety considerations in an automotive IC application are similar to those in other safety-critical industries, such as the avionics, space, and industrial sectors. ISO 26262 is the state-of-the-art safety standard guiding the safety activities and work products required for electronics deployed in an automotive system. ISO 26262 requires that a design be protected from the effects of radiation-based events that have the potential to violate a safety goal. In automotive applications, radiation-based events fall under the term random hardware faults, which are unpredictable throughout the operational life of a vehicle. Companies remain continually challenged to deliver feature-rich products on time and on budget, while simultaneously ensuring high availability and correct operation under all conditions. This demand for high availability is driving the need for more robust verification of random hardware faults. Faults can be caused by many different mechanisms, but within the context of this article the term fault refers to these random hardware faults.

Current approach and its drawbacks

The diagram below outlines a generic ASIC development flow in grey. When developing an ASIC compliant with ISO 26262, there are additional phases, which are shown in blue.

Figure 1. ASIC development flow modifications for automotive.

Traditionally, expert judgement has been the technique commonly deployed to identify failure modes due to random hardware faults. The creation of a Failure Modes, Effects, and Diagnostic Analysis (FMEDA) is both a common approach and the resulting work product when estimating failure modes. An FMEDA uses the mission environment, failure rate data, and target technology to calculate design susceptibility to faults.
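
To make that bookkeeping concrete, below is a minimal Python sketch of an FMEDA-style roll-up, combining per-block failure rates (in FIT) with claimed diagnostic coverage to estimate a residual failure rate. The block names, FIT values, and coverage figures are purely illustrative placeholders, not data from any real analysis.

# Illustrative FMEDA-style roll-up (hypothetical blocks and numbers).
# Each entry: raw failure rate in FIT (failures per 10^9 device-hours)
# and the diagnostic coverage (DC) claimed for its safety mechanism.
fmeda_rows = [
    {"block": "cpu_core",   "fit": 120.0, "dc": 0.99},   # lockstep pair (assumed)
    {"block": "sram_ctrl",  "fit": 80.0,  "dc": 0.997},  # ECC (assumed)
    {"block": "dma_engine", "fit": 45.0,  "dc": 0.90},   # CRC on data path (assumed)
    {"block": "glue_logic", "fit": 15.0,  "dc": 0.0},    # unprotected
]

total_fit = sum(r["fit"] for r in fmeda_rows)
residual_fit = sum(r["fit"] * (1.0 - r["dc"]) for r in fmeda_rows)
overall_dc = 1.0 - residual_fit / total_fit

print(f"total FIT    : {total_fit:.1f}")
print(f"residual FIT : {residual_fit:.2f}")
print(f"overall DC   : {overall_dc:.4f}")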

There are a number of drawbacks associated with depending solely on expert-driven fault analysis:

  • Not repeatable
  • Not exhaustive
  • Does not scale well
  • Difficult for third-party IP, legacy IP, or machine-generated code

In the best case, a failure to accurately predict failure modes and effects sets up a late-cycle verification activity that results in pre-silicon test failures. In the worst case, the lack of fault protection leads to a run-time scenario in production that results in a safety goal violation and potential hazards to persons or property.

Once failure modes are identified, designers harden the design based on this initial expert-driven analysis. The protection can come in the form of hardware or software. Common implementations include error-correcting codes (ECC) on memories, redundancy on design blocks, and cyclic redundancy checks (CRC) on data paths. The remainder of this article refers to all of these protection implementations generically as safety mechanisms.

The last step in the fault mitigation workflow verifies the design is sufficiently protected from faults.

Fortunately, a new breed of verification technologies and a three-phase methodology have emerged to assist experts in analyzing, protecting, and testing a design for random fault detection and mitigation effectiveness. This three-phase methodology provides a systematic pathway to protecting a design. The remainder of this article details the methodology, which uses automation to systematically protect a design from faults and to prove through metrics that this protection has been fully accomplished.

Phase 1: Fault analysis

Augmenting expert analysis with automated structural analysis of the fault protection and the logic it protects provides a higher level of confidence and reduces the iterations required to develop a robust design. Such solutions enable engineers to accurately estimate metrics such as fault coverage, also referred to as diagnostic coverage (DC), which indicates how much of the design is protected from faults. There are many advantages to using this more automated, structural approach over traditional, manual fault analysis. Some are summarized below:

  • Automated and repeatable
  • Exhaustive
  • More scalable
  • Works on third-party or legacy IP (a deep understanding of IP is not required)

Fault analysis can be subdivided into two activities. The first activity is an initial design assessment, which validates the accuracy of expert judgement and of metrics such as failures in time (FIT) and DC. Structural connectivity information is used to identify areas of the design that can be impacted by faults and, correspondingly, how effective the fault protection logic is at mitigating them. This bottom-up approach provides a higher level of accuracy than a top-down, expert-driven analysis. Instance and end-to-end analysis are the two types of analysis used to calculate FIT and DC.

End-to-end analysis estimates the fault coverage of end-to-end protection mechanisms covering multi-cycle logic. Using this technique, the fault coverage is calculated for all gates and state elements between the generation and check points of protection mechanisms. Data path parity and CRC are common examples of multi-cycle protection mechanisms.


Figure 2: End-to-end structural analysis.
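
As a toy illustration of the end-to-end principle (not of the analysis tool itself), the Python sketch below generates a CRC at the start of a data path, flips a single bit in flight, and catches the mismatch at the check point. The polynomial and payload are arbitrary choices made for this example.

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 over the payload (polynomial chosen only for illustration)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

# Generation point: append the CRC to the payload before it enters the data path.
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF])
protected = payload + bytes([crc8(payload)])

# A single event upset somewhere along the multi-cycle path.
corrupted = bytearray(protected)
corrupted[1] ^= 0x04  # flip one bit

# Check point: recompute the CRC at the end of the path.
received_payload, received_crc = bytes(corrupted[:-1]), corrupted[-1]
fault_detected = crc8(received_payload) != received_crc
print("fault detected:", fault_detected)  # True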

Instance analysis estimates the effectiveness of hardware protection mechanisms that guard single instances, such as modules or flip-flops. Using this technique, the fault coverage is estimated for the protection mechanisms covering state elements, modules, and their localized cones of logic. Three common examples of instance-level protection are flip-flop parity chains, full module-level duplication, and memory ECC.


Figure 3: Instance structural analysis.
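
A similarly simplified sketch of instance-level protection: a parity bit captured when a register is written flags a later single-bit upset when the register is read back. The register width and the bit that flips are arbitrary choices made for this example.

def parity(value: int, width: int = 32) -> int:
    """Even parity over a fixed-width word."""
    return bin(value & ((1 << width) - 1)).count("1") & 1

# Write: store the register value together with its parity bit.
reg_value = 0x1234_5678
reg_parity = parity(reg_value)

# A single event upset flips one stored bit.
upset_value = reg_value ^ (1 << 13)

# Read: recomputing parity and comparing against the stored bit flags the corruption.
fault_detected = parity(upset_value) != reg_parity
print("fault detected:", fault_detected)  # True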

For example, in a mission-critical block where ECC is already implemented on an internal FIFO memory, instance analysis will indicate the maximum achievable fault coverage that the ECC can provide for that memory.


Figure 4: End-to-end structural analysis.

In the event the initial coverage assessment identifies gaps in fault protection, a mitigation strategy must be derived. This strategy typically consists of modifying existing fault protection hardware or adding new protection.

The second activity is safety exploration. The objective of exploration is to identify a mitigation strategy that provides the desired level of fault protection while simultaneously meeting power and area requirements. When deciding on a safety strategy, an engineer has many safety mechanism choices, each with its own level of effectiveness and its own impact on power and area. Understanding these varying efficacies and impacts is critical in guiding the exploration of different mitigation strategies. Some fault mitigation techniques can only detect errors, while others are more advanced and can also correct single-bit errors. Table 1 summarizes this point across a handful of implementations.


Table 1: Fault protection mechanism summary.

During this step, engineers perform a series of “what-if” experiments to understand the impact of different fault mitigation strategies on power, area, and fault coverage. This exploration is performed without modifying the design, allowing multiple analyses to run in parallel.
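
The sketch below mimics that kind of what-if loop in Python. The candidate mechanisms echo those named in this article, but every coverage, area, and power number is an invented placeholder used only to show the mechanics of filtering candidates against a coverage target and ranking them by cost; a real flow would take these figures from the analysis tool.

# Hypothetical what-if exploration: all numbers are illustrative placeholders.
candidates = {
    "parity on registers":    {"dc": 0.90,  "area": 1.02, "power": 1.01},
    "FSM duplication":        {"dc": 0.99,  "area": 1.08, "power": 1.06},
    "end-to-end CRC":         {"dc": 0.97,  "area": 1.04, "power": 1.03},
    "full block duplication": {"dc": 0.999, "area": 2.00, "power": 1.95},
}

target_dc = 0.97

# Keep only strategies that meet the coverage target, then rank by area cost.
viable = {name: c for name, c in candidates.items() if c["dc"] >= target_dc}
best = min(viable, key=lambda name: viable[name]["area"])
print("candidates meeting target:", sorted(viable))
print("lowest-area choice:", best)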

Examining Figure 4 above, it is clear that additional coverage is required on the unprotected mission-critical blocks. Exploration is performed and an implementation is identified.


Figure 5: Fault exploration using register parity, FSM duplication, and end-to-end CRCs.

The outcome of exploration is a clear understanding of the design enhancements required to meet the safety requirements.

Phase 2: Fault protection

Fault protection circuitry comes in a variety of flavors, each with its own level of effectiveness in detecting faults. Protection mechanisms are typically bucketed as either fail-safe or fail-operational. Fail-safe mechanisms are capable of detecting a fault but do not guarantee correct operation through it. Fail-operational mechanisms are capable of both detecting and correcting random hardware faults. Fail-operational protection typically involves redundancy and incurs higher resource utilization, but provides the added benefit that the design continues to function correctly through the fault.
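
The difference between the two categories can be sketched with a duplicate-and-compare checker (fail-safe: it flags a mismatch but cannot say which copy is correct) versus a triple modular redundancy voter (fail-operational: a bitwise majority vote masks a single faulty copy). This is a behavioral illustration only, with made-up values, not an implementation of any particular safety mechanism.

def duplicate_and_compare(a: int, b: int):
    """Fail-safe style: two copies can flag a mismatch but cannot tell which is correct."""
    return a, a != b  # (output, fault_detected); the output cannot be trusted when flagged

def tmr_vote(a: int, b: int, c: int):
    """Fail-operational style: majority vote masks a single faulty copy."""
    voted = (a & b) | (a & c) | (b & c)  # bitwise majority of the three copies
    fault_detected = not (a == b == c)
    return voted, fault_detected

golden = 0b1011_0101
faulty = golden ^ 0b0000_1000  # one copy hit by an upset

print(duplicate_and_compare(golden, faulty))  # mismatch flagged, output untrusted
print(tmr_vote(golden, faulty, golden))       # correct value recovered, fault flagged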

Traditionally, fault protection logic has been inserted manually, requiring engineers to modify the design code by hand. The result can be disjointed and inconsistent, and it requires rework as designs move to different process technologies that demand different levels of protection. A few of the drawbacks of this manual approach are summarized below:

  • Not scalable to large designs
  • Tedious to change protection schemes
  • Error prone and inconsistent
  • Difficult to perform on third-party IP or machine-generated code

After insertion, logical equivalence checking between the original and the enhanced design must be performed to ensure that no functional deviation has been introduced.

Phase 3: Fault verification

Fault analysis can prove that a design is protected once protection mechanisms are in place, but it is also important to prove that the design operates correctly in a faulted state. Fault verification must therefore be performed by simulating the design’s behavior in a faulted state to ensure correct operation. One simple example is a CRC error triggering a recovery mechanism: even if the CRC logic is implemented correctly, the recovery mechanism that causes the retransmission of a packet could itself be in error. Fault verification closes the loop by ensuring that the design is robust and will operate correctly through faults.

The primary objective of fault verification is to inject faults into a design, propagate them in simulation, and determine whether each fault is detected.


Figure 6: Detected vs. undetected fault classification.
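
A highly simplified Python model of such a campaign is shown below: single-bit upsets are injected into a handful of toy registers, and each fault is classified as detected when a parity checker flags it or undetected when it escapes. Which registers are protected, their widths, and their values are assumptions made only for this sketch.

# A toy "design" state: some registers are parity-protected, some are not (assumed).
registers = {
    "ctrl_reg":   {"value": 0xA5, "protected": True},
    "status_reg": {"value": 0x3C, "protected": True},
    "scratch":    {"value": 0x7E, "protected": False},
}

def parity(v: int) -> int:
    return bin(v).count("1") & 1

# Parity bits captured at write time for the protected registers.
stored_parity = {n: parity(r["value"]) for n, r in registers.items() if r["protected"]}

detected, undetected = 0, 0
for name, reg in registers.items():
    for bit in range(8):                       # fault list: one SEU per register bit
        faulty_value = reg["value"] ^ (1 << bit)
        if reg["protected"] and parity(faulty_value) != stored_parity[name]:
            detected += 1                      # safety mechanism flags the upset
        else:
            undetected += 1                    # fault escapes the checker

print(f"detected: {detected}, undetected: {undetected}")
print(f"fault coverage: {detected / (detected + undetected):.2%}")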

Using the results of fault injection, the overall fault coverage is calculated by comparing faults detected by protection mechanisms to the entire fault state space.


Figure 7: Calculating fault coverage from fault classification.
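
In terms of the two-way classification above, this calculation reduces to a simple ratio, with N_detected the number of injected faults flagged by a safety mechanism and N_undetected the number that escape:

\mathrm{DC} = \frac{N_{\text{detected}}}{N_{\text{detected}} + N_{\text{undetected}}}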

It is not uncommon for designs to have millions of potential fault nodes, so every available means must be used to reduce the fault list to a minimal set of fault nodes and then leverage automation to inject faults as efficiently and effectively as possible. Purpose-built fault simulators have been developed to address these challenges. To achieve maximum performance, three levels of concurrency are deployed. First, faults are injected using a concurrent fault injection algorithm that provides parallelism within a single thread. Multithreading and multicore execution add another level of fault injection concurrency. Lastly, fault injection jobs are further distributed across a larger machine grid. Fault management oversees the job distribution and the coalescing of the resulting data.
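
To illustrate the idea of distributing injection jobs (not how any commercial fault simulator is built), the sketch below farms independent fault injection jobs out to a pool of worker processes and coalesces their pass/fail results. The parity-protected word and the eight-entry fault list are placeholders.

from concurrent.futures import ProcessPoolExecutor

def parity(value: int) -> int:
    return bin(value).count("1") & 1

GOLDEN = 0xA5                       # placeholder protected word
GOLDEN_PARITY = parity(GOLDEN)

def inject_and_classify(bit: int) -> bool:
    """One fault injection job: flip one bit and report whether parity flags it."""
    return parity(GOLDEN ^ (1 << bit)) != GOLDEN_PARITY

if __name__ == "__main__":
    fault_list = list(range(8))                 # one job per candidate fault node
    with ProcessPoolExecutor() as pool:         # stand-in for a larger machine grid
        results = list(pool.map(inject_and_classify, fault_list))
    print(f"detected {sum(results)} of {len(results)} injected faults")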

Conclusion

The evolving size and complexity of designs demand the use of more automated and scalable techniques in the development of ICs used in automotive applications. Manual processes for developing and validating safety mechanisms do not scale to very large designs, are not conclusive or exhaustive, and are not easily repeated. Formal verification technology and automation provide a much more effective, higher-quality, and repeatable process for fault analysis, protection, and verification. This approach arms project teams with the information they need to create the next generation of fault-tolerant designs.

For more information, download our whitepaper “Rethinking Your Approach to Radiation Mitigation.”


