Reducing Your Fault Campaign Workload Through Effective Safety Analysis

To meet safety standards, designs must detect random faults and fail safely, while designers still account for the power and area impact of safety features.

As the automotive industry strives for greater levels of autonomous functionality, ICs will become integral in virtually every vehicle system. Companies previously embedded in non-safety-critical markets are transitioning current technologies to the growing and rapidly evolving automotive market. These companies will face the unfamiliar challenges associated with having to enhance their IP to satisfy automotive safety requirements. This is no trivial task, as the size and complexity of these technologies have grown exponentially. ISO 26262, the state-of-the-art standard for developing functionally safe products, details three high-level areas that challenge project teams:

  • Lifecycle Management: Ensuring adequate safety processes are deployed during product development
  • Systematic Faults: Ensuring the design operates correctly
  • Random Faults: Ensuring the design fails safely in the presence of unpredictable faults

While lifecycle management and systematic fault analysis bring their own unique challenges, random faults receive much of the attention and are the focus of the remainder of this article. To ensure random faults don’t affect silicon functionality and place humans at risk of injury, designs must include protection that detects faults and fails safely.

Historically, the automotive industry has addressed random faults using a combination of tools and expert judgement. Failure modes were identified using top-down, expert-driven judgement; the design was enhanced to protect against those failure modes; and fault injection was performed to prove that the design fails safely when a failure occurs. Fault injection, commonly referred to as a fault campaign, produces the safety metrics that are required during assessments and audits. The ability to attain the required safety metrics depends directly on the accuracy of the safety analysis performed via expert-driven judgement.

With the increase in IC complexity, expert-driven safety analysis is no longer an effective approach. Automotive ICs are simply too large and complex to expect a human to fully comprehend all failure paths. Failure to perform accurate, upfront safety analysis leads to costly iterations throughout the safety workflow.


Figure 1: Functional safety random fault workflow comparison

The overarching goal of safety analysis is to fully understand the susceptibility of a design to random hardware failures and detail the additional steps required to achieve the desired safety metrics, as defined by the Automotive Safety Integrity Level (ASIL) target. A systematic approach to safety analysis will result in a more optimal safety architecture, a high level of efficiency during the fault campaign, and a higher probability of success in meeting the safety metrics.

Safety Analysis – Initial Safety Assessment
The initial safety assessment achieves two critical objectives. It validates top-down, expert-driven estimation, and it calculates the achievable level of safety by estimating the safety metrics for safety mechanisms already present in the design. In contrast to previous expert-driven methods, the safety assessment performs static structural analysis of the design to achieve a higher level of accuracy. This bottom-up approach consists of both instance and end-to-end analysis techniques. Instance analysis estimates the diagnostic coverage of safety mechanisms protecting state elements, modules, and their localized cones of logic, whereas end-to-end analysis traces cones of influence to estimate the diagnostic coverage of end-to-end safety mechanisms covering multi-cycle logic.


Figure 2: Instance structural analysis identifies coverage on safety-critical design structures


Figure 3: End-to-end structural analysis identifies coverage on safety-critical design structures

For example, in a safety-critical block where error-correcting code (ECC) is already implemented on an internal FIFO memory, instance analysis will indicate the maximum achievable diagnostic coverage that safety mechanisms can provide for that memory.


Figure 4: Instance analysis of ECC on a FIFO

In the event the initial safety assessment indicates the achievable safety metrics do not meet the safety target, elementary failure data highlights the diagnostic coverage holes and prioritizes the design structures that require additional safety enhancement. Using this information, safety architects are empowered to explore enhancement options to achieve the desired safety target while taking into account power and area requirements. This task is referred to as safety architecture exploration.
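The prioritization described above can be illustrated with a minimal sketch. It assumes every fault is safety-related and approximates the single-point fault metric (SPFM) as one minus the ratio of undetected failure rate to total failure rate. The structure names, FIT values, and coverage figures are hypothetical; only the SPFM targets per ASIL come from ISO 26262-5.

```python
from dataclasses import dataclass

# SPFM targets per ASIL, as given in ISO 26262-5.
SPFM_TARGET = {"B": 0.90, "C": 0.97, "D": 0.99}


@dataclass
class Structure:
    name: str
    fit: float  # hypothetical raw failure rate in FIT
    dc: float   # estimated diagnostic coverage, 0.0 to 1.0

    @property
    def residual_fit(self) -> float:
        # Failure rate left undetected by the safety mechanism.
        return self.fit * (1.0 - self.dc)


def spfm(structures):
    # Simplified metric: assumes all faults are safety-related.
    total = sum(s.fit for s in structures)
    residual = sum(s.residual_fit for s in structures)
    return 1.0 - residual / total


def coverage_holes(structures):
    # Largest undetected contribution first.
    return sorted(structures, key=lambda s: s.residual_fit, reverse=True)


# Hypothetical block: only the FIFO is already protected by ECC.
design = [
    Structure("fifo_mem (ECC)", fit=40.0, dc=0.99),
    Structure("ctrl_fsm", fit=10.0, dc=0.0),
    Structure("cfg_regs", fit=20.0, dc=0.0),
    Structure("datapath", fit=30.0, dc=0.60),
]

metric = spfm(design)
meets_asil_b = metric >= SPFM_TARGET["B"]
holes = coverage_holes(design)
```

In this made-up design the estimated metric falls well short of even the ASIL B target, and the ranking puts the unprotected configuration registers first, which is where enhancement effort would start.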

Safety Analysis – Safety Exploration
Safety exploration is critical in successfully hitting the safety targets after safety insertion and safety verification are completed. The goal of safety exploration is to identify the optimal safety architecture and safety mechanisms. Safety mechanisms come in a variety of flavors, each with its own level of effectiveness in detecting random hardware faults. Typically, safety mechanisms are bucketed as either fail-safe or fail-operational. Fail-safe mechanisms are capable of random fault detection. Fail-operational safety mechanisms are capable of correcting random hardware faults; they typically incur a higher resource utilization (power, performance, area) and are required to attain the most stringent safety targets (designated as ASIL D).

Engineers have a suite of hardware safety mechanism choices, such as:

  • Flip-flop parity, duplication, and triplication
  • Finite state machine protection
  • ECC and triple modular redundancy
  • Module-level lockstep and triplication
  • End-to-end parity and cyclic redundancy checks

In safety exploration, engineers perform a series of “what-if” experiments to understand the impact of different safety mechanisms on power, area, and safety metrics, such as diagnostic coverage. This exploration is performed without modification to the design, allowing for multiple parallel analyses.

Examining the example above, it’s clear that additional coverage is required on unprotected blocks. The safety architect will explore the usage of different safety mechanisms and evaluate the resulting achievable diagnostic coverage and impact on power, performance, and area.


Figure 5: Safety exploration using register parity, FSM duplication, and end-to-end cyclic redundancy checks (CRC)

Once the optimal architecture is identified, engineers use the resulting knowledge to insert safety mechanisms into the design.
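The "what-if" exploration described above can be sketched as a small search over candidate mechanisms. This is an illustrative model only, not the method of any particular tool: the block names, failure rates, coverage values, and area costs are all hypothetical, and the metric is the same simplified coverage ratio that treats every fault as safety-related.

```python
from itertools import product

# block -> (hypothetical FIT, {mechanism: (diagnostic coverage, area cost)})
# Area cost is in arbitrary, made-up units.
BLOCKS = {
    "cfg_regs": (20.0, {"none": (0.0, 0),
                        "parity": (0.90, 200),
                        "duplication": (0.99, 2000)}),
    "ctrl_fsm": (10.0, {"none": (0.0, 0),
                        "fsm_duplication": (0.99, 500)}),
    "datapath": (30.0, {"none": (0.0, 0),
                        "e2e_crc": (0.99, 800)}),
    "fifo_mem": (40.0, {"ecc": (0.99, 0)}),  # already in the design
}


def explore(blocks, target):
    """Return all mechanism combinations meeting the target, cheapest first."""
    names = list(blocks)
    total_fit = sum(fit for fit, _ in blocks.values())
    feasible = []
    # Each combo holds one (mechanism, (dc, area)) choice per block.
    for combo in product(*(blocks[n][1].items() for n in names)):
        residual = sum(blocks[n][0] * (1.0 - dc)
                       for n, (_, (dc, _area)) in zip(names, combo))
        area = sum(a for _, (_dc, a) in combo)
        metric = 1.0 - residual / total_fit
        if metric >= target:
            choice = {n: mech for n, (mech, _) in zip(names, combo)}
            feasible.append((area, metric, choice))
    return sorted(feasible, key=lambda t: t[0])


best_area, best_metric, best_choice = explore(BLOCKS, target=0.97)[0]
```

Because the design itself is never modified, each combination is just a row in a table, which is what makes running many such experiments in parallel practical.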

Safety Analysis – Fault List Generation
The final component of safety analysis is the generation of the fault list, a list of design nodes where faults are injected and their effects classified. The initial fault list is automatically generated using the same structural analysis techniques utilized during the safety assessment step and represents the full fault state space. Once generated, a series of fault optimization tasks reduce the fault list to a minimal problem set.


Figure 6: The three steps in the fault list optimization flow

The first optimization identifies the logic contained within the safety-critical cone of influence, eliminating logic outside the cone that cannot affect the safety goals. Next, using the same structural analysis algorithms deployed during safety analysis, the fault list is further optimized using safety mechanism-aware analysis, trimming the list to contain only faults that contribute directly to diagnostic coverage. Lastly, fault collapsing is performed to remove any logically equivalent faults. For example, a stuck-at 0 fault on the output of an AND gate is equivalent to any of its inputs being stuck-at 0, so these faults can be represented by a single fault, reducing the number of fault nodes for the gate.
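Two of these optimizations, cone-of-influence pruning and fault collapsing, can be sketched on a toy netlist. The netlist, net names, and helper functions below are hypothetical, the safety mechanism-aware step is omitted, and collapsing is applied only to inputs that drive a single gate, where the equivalence with the output fault holds.

```python
from collections import Counter

# Hypothetical netlist: output net -> (gate type, input nets).
GATES = {
    "n1": ("AND", ["a", "b"]),
    "safe_out": ("OR", ["n1", "c"]),
    "dbg_out": ("AND", ["d", "e"]),  # cannot reach the safety-critical output
}

# Input value that forces the gate output to that same value.
CONTROLLING = {"AND": 0, "OR": 1}


def cone_of_influence(gates, outputs):
    """Walk backward from the given outputs and collect every net reached."""
    seen, stack = set(), list(outputs)
    while stack:
        net = stack.pop()
        if net in seen:
            continue
        seen.add(net)
        if net in gates:
            stack.extend(gates[net][1])
    return seen


def collapse(gates, faults):
    """Drop input stuck-at faults equivalent to a retained output fault."""
    fanout = Counter(i for _, ins in gates.values() for i in ins)
    redundant = set()
    for out, (kind, ins) in gates.items():
        value = CONTROLLING.get(kind)
        if value is None or (out, value) not in faults:
            continue
        for net in ins:
            if fanout[net] == 1:  # equivalence only holds without fanout
                redundant.add((net, value))
    return faults - redundant


nets = set(GATES) | {i for _, ins in GATES.values() for i in ins}
full = {(net, v) for net in nets for v in (0, 1)}   # stuck-at 0 and 1
cone = cone_of_influence(GATES, ["safe_out"])
pruned = {f for f in full if f[0] in cone}
collapsed = collapse(GATES, pruned)
```

In this toy case the stuck-at 0 faults on `a` and `b` are represented by the stuck-at 0 fault on `n1`, matching the AND gate example in the text.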

Even with the optimization and reduction techniques defined above, the fault list may still be unmanageable. In this instance, statistical fault sampling can further reduce the scope of the fault campaign. Likelihood-weighted random sampling drastically reduces the fault list while providing confidence that the design is safe from random hardware faults.
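To give a feel for the reduction sampling can offer, the sketch below uses a standard finite-population sample-size formula with a chosen error margin and confidence level. This is an assumption made for illustration, not necessarily the method used by any given tool. It also draws the sample uniformly rather than by likelihood weighting, to keep the example short. The function names and default parameters are hypothetical.

```python
import math
import random


def sample_size(population, margin=0.01, z=2.576, p=0.5):
    """Faults to inject for a given error margin and confidence level.

    z=2.576 corresponds to roughly 99% confidence; p=0.5 is the
    conservative choice when the true failure proportion is unknown.
    """
    n = population / (
        1.0 + margin ** 2 * (population - 1) / (z ** 2 * p * (1.0 - p))
    )
    return math.ceil(n)


def sample_faults(faults, margin=0.01, z=2.576, seed=0):
    """Uniform sample without replacement (weighting is not modeled here)."""
    n = min(len(faults), sample_size(len(faults), margin, z))
    return random.Random(seed).sample(faults, n)


# A hypothetical one-million-entry fault list shrinks by well over 90%.
n_required = sample_size(1_000_000)
subset = sample_faults(list(range(1000)))
```

The required sample grows very slowly with the size of the fault list, which is why sampling pays off most on the largest designs.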

Conclusion
Bottom-up safety analysis is important in reducing the number of iterations throughout the workflow. It validates expert-driven analysis estimations, guides engineers on the additional safety mechanisms to be deployed, and reduces the scope of the fault campaign to a minimal workload.

For more information and relevant papers, please visit our Mentor Safe page.


