Fault Simulation Reborn

A once indispensable tool is making a comeback for different applications, but problems remain to be solved.

Fault simulation, one of the oldest tools in the EDA industry toolbox, is receiving a serious facelift after it almost faded from existence.

In the early days, fault simulation was used to grade the quality of manufacturing test vectors. That task was replaced almost entirely by scan test and automatic test pattern generation (ATPG). Today, functional safety is causing the industry to dust off the cobwebs and add a few new tricks to the fault simulation toolbox. But not everything is working smoothly yet.

Fault simulation is now being used for three independent applications. It continues to be used for some manufacturing test applications. It is also being used to measure the quality of the functional testbench so that product quality can be increased. The third application, and the one driving the resurgence of fault simulation, is the need to find out whether a running design can detect and recover from failure, thus ensuring safety of operation.

Before looking into the application areas, a short review of the primary technology associated with fault simulation may be useful. There are two major pieces: the fault model and the simulation technology.

At the heart of every fault simulator is the ability to take the fault-free design and compare it to a design in which a single fault has been injected. If the two designs produce the same result, the fault is not detected by those vectors, but if they produce different results, the vectors are capable of detecting that fault. Faults that produce unknown behavior, or result in an X value, are considered potentially detected.
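
The good-machine versus faulty-machine comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-gate netlist, the node names and the helpers `simulate` and `fault_grade` are all hypothetical, and the sketch runs one fault at a time without X-value handling, which is far simpler than what a production fault simulator does.

```python
from itertools import product

# Hypothetical circuit: n1 = a AND b, out = n1 OR c.
NETLIST = [
    ("n1", "AND", ("a", "b")),
    ("out", "OR", ("n1", "c")),
]
INPUTS = ("a", "b", "c")
OPS = {"AND": lambda x, y: x & y, "OR": lambda x, y: x | y}


def simulate(vector, fault=None):
    """Evaluate the circuit; `fault` is (node, stuck_value) or None."""
    values = {}

    def drive(node, value):
        # An injected stuck-at fault overrides the value the node would take.
        if fault is not None and fault[0] == node:
            value = fault[1]
        values[node] = value

    for name, bit in zip(INPUTS, vector):
        drive(name, bit)
    for node, op, (x, y) in NETLIST:
        drive(node, OPS[op](values[x], values[y]))
    return values["out"]


def fault_grade(vectors):
    """Return (all stuck-at faults, the subset the vectors detect)."""
    nodes = list(INPUTS) + [node for node, _, _ in NETLIST]
    faults = [(n, v) for n in nodes for v in (0, 1)]
    detected = {
        f for f in faults
        if any(simulate(v) != simulate(v, f) for v in vectors)
    }
    return faults, detected


faults, detected = fault_grade([(1, 1, 0)])
print(f"one vector: {len(detected)}/{len(faults)} faults detected")

faults, detected = fault_grade(list(product((0, 1), repeat=3)))
print(f"exhaustive: {len(detected)}/{len(faults)} faults detected")
```

With the single vector `(1, 1, 0)` only the stuck-at-0 faults on `a`, `b`, `n1` and `out` change the output. The exhaustive vector set detects all ten faults in this toy circuit.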

Fault model
The fault model describes the faults that are to be injected into the design. This is not the same as the faults that may physically exist in a design. The fault model is considered to be good if there is a good correlation between high fault coverage and low defect escapes. The earliest fault model was the “stuck-at” model, where every node in a gate-level design can have two faults: stuck-at-1 and stuck-at-0. It was shown that if all of the stuck-at faults were detected, then the vector set used would be good at finding all of the manufacturing defects in a chip.

Times have changed. The stuck-at model is no longer adequate, even for manufacturing test. Timing faults, transitory faults and other fault models have become necessary. In addition, various markets are searching for higher-level fault models that can be applied at the register transfer level (RTL) or even higher levels of abstraction.

As designs are fabricated on smaller geometries, faults that were unlikely to happen in the past become more likely, such as slow devices that need at-speed testing to be detected.

In addition, functional safety is also looking at an operating design. “If the design is up and running, we need to consider temporal faults,” says Adam Sherer, verification product management director at Cadence. “Temporal faults can happen at any time and may be a soft error, such as a bit flip from alpha particle decay. Then there are the permanent faults, which may be a transistor that has broken down and creates a stuck-at fault while the device is operating.”

So what would a fault model for transient faults look like? “The higher echelons of the ISO 26262 standard recommend that there is a need to support transient faults,” says David Hsu, director of product marketing for simulation products at Synopsys. “These are either single event upsets or single event toggle type of faults. These are not relevant to the manufacturing aspect and probably the biggest class of faults for functional safety and security.”

The number of stuck-at faults is correlated to the number of gates, but for transient faults there is an arbitrary number. “You can pick some number of faults for every clock cycle in the design,” adds Hsu. “We have heard customers say that a conservative number for relevant transient faults could go into tens of millions or hundreds of millions.”
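
To see why the transient fault count is open-ended, consider a rough sketch in which every (clock cycle, state bit) pair is a candidate bit flip, and a fault list is built by picking a fixed number of bits per cycle. The function names, the uniform random selection and the example sizes below are assumptions made for illustration. They are not drawn from any tool or from the ISO 26262 text.

```python
import random


def transient_fault_space(state_bits, clock_cycles):
    """Count candidate bit flips: one per (cycle, bit) pair."""
    return state_bits * clock_cycles


def pick_transient_faults(state_bits, clock_cycles, per_cycle, seed=0):
    """Choose `per_cycle` distinct bits to flip in each clock cycle."""
    rng = random.Random(seed)  # seeded so the fault list is reproducible
    return [
        (cycle, bit)
        for cycle in range(clock_cycles)
        for bit in rng.sample(range(state_bits), per_cycle)
    ]


# Hypothetical block: 10,000 state bits simulated for 10,000 cycles.
print(transient_fault_space(10_000, 10_000))       # 100000000 candidates
print(len(pick_transient_faults(10_000, 100, 3)))  # 300 faults actually chosen
```

Unlike a stuck-at list, the size of this list depends on how long the test runs and on an arbitrary per-cycle choice, not just on the size of the design.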

That represents a problem, and nobody has yet managed to find a small enough fault model that encapsulates this and shows good correlation with fault detection in the field.

Fault simulation algorithms
Fault simulation is resource-intensive. At a minimum, you need to simulate a good machine and a faulty machine. “We are already exceeding the run time available for the logic simulator,” says Hsu. “And then on top of that you add the stuck-at fault model, and that’s probably 2.5 times the number of gates for the fault list size. So if you have a million-gate block, then you have 2.5 million faults.”

There are three primary simulation mechanisms: concurrent, parallel and distributed. Many people confuse the terms parallel and distributed, but they have very distinct meanings when applied to fault simulation.

The concurrent fault simulation algorithm, developed in the early 1970s, simulates the fault-free circuit and any part of the faulty circuit where the fault creates different signal states compared to the fault-free circuit. It has the advantage of being highly efficient, but the disadvantage of requiring large amounts of memory if faulty circuit activity does not re-converge with the fault-free design, or if the fault is not detected for a large number of clock cycles.

“The concurrent algorithm has remained largely unchanged,” says Sherer. “What has changed is that the algorithm has been brought onto a compiled code simulation engine. That works for stuck-at faults, but does not work well for temporal faults, mainly because the memory grows too fast. For temporal faults, you may have to run tens of thousands or millions of clock cycles, meaning there is a lot more data to be saved. So the concurrent engine tends to contain a single fault per run just because of the memory requirement.”

Another technique from the past was parallel fault simulation. For logic that could be expressed in a single bit, logic 0 and 1, faulty circuits could be packed into the bits of an integer. So on a 64-bit machine, 64 fault circuits could be evaluated at the same time just by doing integer arithmetic.
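
The bit-packing idea can be illustrated with a small Python sketch. It assumes a hypothetical two-gate netlist and a 64-bit word in which bit 0 carries the fault-free machine and each remaining bit carries one faulty machine. The function `parallel_simulate` is illustrative only and handles two-valued logic, with no X or Z states.

```python
WIDTH = 64                 # machines evaluated per word
MASK = (1 << WIDTH) - 1

# Hypothetical circuit: n1 = a AND b, out = n1 OR c.
NETLIST = [("n1", "AND", ("a", "b")), ("out", "OR", ("n1", "c"))]
INPUTS = ("a", "b", "c")


def parallel_simulate(vector, faults):
    """Return the faults detected by `vector`, evaluating all at once.

    Bit 0 of every word is the good machine; bit i (i >= 1) is the
    machine with faults[i - 1] injected.
    """
    assert len(faults) < WIDTH, "one bit is reserved for the good machine"
    force0, force1 = {}, {}  # node -> bits to clear / bits to set
    for i, (node, stuck) in enumerate(faults, start=1):
        table = force1 if stuck else force0
        table[node] = table.get(node, 0) | (1 << i)

    words = {}

    def drive(node, word):
        # Apply stuck-at-1 and stuck-at-0 only in the affected bit lanes.
        words[node] = (word | force1.get(node, 0)) & ~force0.get(node, 0) & MASK

    for name, bit in zip(INPUTS, vector):
        drive(name, MASK if bit else 0)
    for node, op, (x, y) in NETLIST:
        # One bitwise operation evaluates the gate for every machine.
        word = words[x] & words[y] if op == "AND" else words[x] | words[y]
        drive(node, word)

    out = words["out"]
    good = MASK if out & 1 else 0  # replicate the good output to all lanes
    diff = out ^ good
    return [f for i, f in enumerate(faults, start=1) if (diff >> i) & 1]


all_faults = [(n, v) for n in ("a", "b", "c", "n1", "out") for v in (0, 1)]
print(parallel_simulate((1, 1, 0), all_faults))
```

Each gate costs one bitwise operation regardless of how many lanes are in use, which is where the speedup comes from. The word width caps the number of faults per pass.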

“For simulating ATPG patterns, parallel fault simulation is a fantastic technology,” says Hsu. “It means you’re able to do these relatively well contained vectors and do them in parallel. However, for the more general class of fault simulation, what we found was that it didn’t scale. You needed the concurrent algorithm to move beyond 64 faults in parallel on a single process.”

Because of these limitations, the trend today is to run faults singly. “This starts to make distributed fault simulation look more attractive because the faults can be spread over a large number of machines,” adds Hsu. “This also makes the process more amenable to hardware acceleration.”
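
A minimal sketch of the distributed approach: the fault list is split into independent jobs, and each job runs its faults one at a time. Here a thread pool stands in for a compute farm, and `simulate_one` is a hypothetical callable that returns whether a single fault was detected. A real flow would dispatch jobs to separate machines through a job scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(faults, workers):
    """Deal faults round-robin into `workers` roughly equal job lists."""
    return [faults[i::workers] for i in range(workers)]


def run_distributed(faults, simulate_one, workers=4):
    """Run `simulate_one(fault) -> bool` for every fault, one job per worker."""

    def run_job(job):
        # Inside a job, faults are simulated singly, one after another.
        return [(fault, simulate_one(fault)) for fault in job]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        job_results = list(pool.map(run_job, partition(faults, workers)))

    # Merge the per-job results back into a single fault -> detected map.
    return {fault: hit for job in job_results for fault, hit in job}


# Stand-in for a real single-fault simulation: even-numbered faults "detected".
results = run_distributed(list(range(10)), lambda f: f % 2 == 0, workers=3)
print(sum(results.values()), "of", len(results), "faults detected")
```

Because single-fault runs share no state, the jobs need no communication until the results are merged.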

Back in the 1980s, Zycad produced special-purpose hardware that could do fault simulation, but this company went by the wayside when fault simulation went into decline. Nobody has yet seen the economic opportunity for a modern-day equivalent.

Manufacturing test
Manufacturing test was the initial reason for fault simulation’s existence, and in the early 1980s it was tester companies that pioneered much of what became known as Electronic Design Automation. “Around 1998, ATPG swept into the test industry,” says Sherer. “It was able to reach test coverage closure much faster than traditional fault simulation could. When you’re reaching 98% closure rate on a 200,000 gate design, the number of faults undetected is not that big and is something you might be able to get a waiver on, or test manually.”

But today, a design block may run into the millions of gates and 2% becomes a much larger number. “The proportion of the designs that are applicable for scan test is not 100% and I think for many applications it’s actually reducing,” says Hsu. “Therefore functional patterns are still needed for the non-scan portions of the design.”

The primary areas that are not covered by scan include peripheral logic, high-speed logic and the growing analog mixed-signal content.

Time on the tester is expensive. “One of the problems associated with manually created tests is they may not be the most efficient,” says Stephen Sunter, engineering director for mixed-signal DFT at Mentor Graphics. “By measuring the coverage of each test in a suite, we can see which are unnecessary, or assess alternative simpler, quicker tests before validating them in production.”

Functional verification
While much of the manufacturing test process has been automated, many questions remain about the quality of a functional verification testbench, especially when IP is being delivered as pre-verified content. A technique was developed to measure testbench quality called mutation coverage.

“Mutation coverage is the injection of a fault into the design and then you see if the verification testbench would detect it,” explains Roger Sabbagh, vice president of application engineering at Oski Technology. “Most coverage metrics are based on stimulus or the controllability of the design by that stimulus. Did my stimulus get the design into the particular state such that some part of the design would get exercised? What has been lacking are observability metrics. If there was a bug in a certain part of the design, would we know it and would the checkers detect it? Fault simulation coupled with mutation coverage tells us about that.”
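
The observability point can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not any vendor's flow: the 4-bit saturating adder, the three hand-written mutants and the two checkers are all hypothetical. The same stimulus is run against each mutant, once with a checker that compares against the expected result and once with a checker that only range-checks the output.

```python
def design(a, b):
    """Reference design: 4-bit saturating add."""
    total = a + b
    return 15 if total > 15 else total


# Hypothetical mutants, each injecting one small functional change.
MUTANTS = {
    "drop saturation": lambda a, b: a + b,
    "saturate too early": lambda a, b: 15 if a + b > 13 else a + b,
    "add becomes or": lambda a, b: min(a | b, 15),
}


def strong_check(a, b, result):
    return result == min(a + b, 15)  # compares against the expected value


def weak_check(a, b, result):
    return 0 <= result <= 15         # only checks the output range


def killed_mutants(stimulus, check):
    """Return the mutants for which at least one check fails."""
    # The unmutated design must pass, or the testbench itself is wrong.
    assert all(check(a, b, design(a, b)) for a, b in stimulus)
    return sorted(
        name for name, mutant in MUTANTS.items()
        if not all(check(a, b, mutant(a, b)) for a, b in stimulus)
    )


stimulus = [(1, 2), (7, 7), (9, 9)]
print(killed_mutants(stimulus, strong_check))  # all three mutants
print(killed_mutants(stimulus, weak_check))    # only 'drop saturation'
```

The stimulus exercises every mutant in both runs. Only the checker differs, which is the gap that stimulus-based coverage metrics do not expose.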

This extends into formal verification as well. “With formal you can write a bunch of checkers or assertions and they could all be proven,” adds Sabbagh. “But you could be missing an assertion or they may not be of good enough quality to catch all of the bugs. Mutation coverage helps to fill these holes.”

Safety
The biggest reason for the resurgence of fault simulation has been safety. “When you execute faults in a test application, you want to inject signals on the inputs and determine if a fault can be detected on the outputs,” says Sherer. “Functional safety is about being able to demonstrate that internal detection systems are functioning properly and that correction systems are doing their job.”

This involves constant monitoring of the design. “You have to be able to detect that a fault has occurred, determine if it is dangerous and take an appropriate action,” he explains. “That implies two stages of analysis. The first is the traditional fault propagation to a point of detection. The detection point is likely to be internal to the chip. Then you need to make sure that the response systems, be they software-based or hardware-based, detect the error and produce an appropriate safety acknowledgement or signal within a specified period of time.”
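
The two stages Sherer describes can be sketched as a classification over the results of individual fault runs. The record fields, the category labels and the cycle-based deadline below are hypothetical choices made for illustration. They are not terminology from ISO 26262 or from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultRun:
    injected_at: int            # cycle at which the fault was injected
    detected_at: Optional[int]  # cycle an internal checker flagged it, if any
    alarm_at: Optional[int]     # cycle the safety signal was raised, if any


def classify(run, deadline_cycles):
    """Stage 1: did the fault reach a detection point?
    Stage 2: did the response arrive within the allowed time?"""
    if run.detected_at is None:
        return "undetected"
    if run.alarm_at is None:
        return "detected, no response"
    if run.alarm_at - run.injected_at > deadline_cycles:
        return "response too late"
    return "handled in time"


runs = [
    FaultRun(injected_at=100, detected_at=104, alarm_at=110),
    FaultRun(injected_at=100, detected_at=104, alarm_at=900),
    FaultRun(injected_at=100, detected_at=None, alarm_at=None),
]
for run in runs:
    print(classify(run, deadline_cycles=50))
```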

“Fault simulation has become almost a requirement of the ISO 26262 standard,” says Hsu. “Not that it says, ‘I shalt do fault sim,’ but it is one of the strongly recommended, and actually very popular, ways that automotive customers are trying to do random defects validation. Being able to contemplate those kinds of things modeled as random faults and then inject those faults into the design and simulate those faults—that’s exactly the methodology that’s become very popular.”

Today, the ISO standard requests high fault coverage at the gate level. “So if you were living up to the letter of the standard you would be doing it at the gate level,” continues Hsu. “Going forward, there are advantages to go to a higher level of abstraction, especially for things like transient faults.”

But traditional fault simulation may not be the best answer. “Diagnostic coverage verification for ISO 26262 qualification is performed not on the full design, as it is for manufacturing test, but rather on specific design sections under specific constraints,” says Jorg Grosse, product manager for functional safety at OneSpin Solutions. “For example, the CPU on an SoC might be the focus of diagnostic coverage, constrained by switching off the debug logic during operation. Because of these operating conditions and the fact that it is difficult to come up with a good set of test patterns, the traditional fault simulation process is inadequate on its own, leaving the user with many faults in unknown states.”

Grosse points out that formal verification has a role to play here. “Using formal as a platform, a new class of fault analysis solutions can quickly identify the status of faults that traditional fault simulators cannot easily handle or where the stimulus is insufficient. This relieves engineers of a lengthy, tedious and error-prone manual analysis phase that accompanies these fault simulation runs.”

Fault reduction
We have already mentioned the size of the fault lists, but is it necessary to run fault simulation on every fault? “We can apply formal methods to do fault reduction,” says Sherer. “But the space is still new and there is a wide range of requirements coming from the customer base. We can also apply statistical analysis and after running a few faults, extrapolate that to make a statement such as the likelihood of any fault within an area of the logic being propagated or detected.”

How large does that sample have to be? “Simple random sampling (SRS) was used for digital circuits 20 years ago, but that approach requires simulating a thousand or more faults to achieve a reasonable confidence interval for estimated coverage,” says Sunter. “Likelihood-weighted random sampling (LWRS) was first presented at ITC ’14 and it reduces the number of faults simulated by about 4X for a realistic distribution of defect likelihoods.”
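
As a rough check on the "thousand or more" figure for simple random sampling, the sketch below uses the textbook normal approximation for a binomial proportion. That approximation and the 95% confidence level (`z = 1.96`) are assumptions made here. The sketch does not implement LWRS, and the function names are illustrative.

```python
import math


def coverage_interval(detected, sampled, z=1.96):
    """Estimated coverage and half-width of its confidence interval."""
    p = detected / sampled
    half_width = z * math.sqrt(p * (1 - p) / sampled)
    return p, half_width


def sample_size(half_width, p=0.5, z=1.96):
    """Faults to simulate for a given half-width (p=0.5 is the worst case)."""
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


p, hw = coverage_interval(detected=900, sampled=1000)
print(f"coverage {p:.1%} +/- {hw:.1%}")
print(sample_size(0.03), "faults for +/-3 points in the worst case")
```

Under these assumptions, 900 detections out of 1,000 sampled faults gives an estimate of 90% with a margin of a little under two percentage points. A worst-case margin of three points needs just over a thousand faults. Both figures are independent of the size of the full fault list.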

This relies on having a solid fault model to begin with. Today, deciding which faults to inject may be based on experience. “As much as possible, be smart about how you identify the faults that you’re interested in,” says Hsu. “Then understand what you should expect in terms of runtime. You need a calculated strategy in managing your faults. You need to use all of the static testability measures to make sure that your fault list is pruned as much as you possibly can. If you have redundant faults there’s no need to simulate them because they won’t help at all.”

Conclusion
If fault simulation becomes a necessary tool to ensure safety, better tools will be needed. The industry has already had to rethink the simulation flow because of performance issues, and emulation has risen to take on a large part of the load. Fault emulation is clearly a necessity, along with a fault model that can show exactly how large the problem is. It also needs to provide a qualitative measure of the fault detection capabilities in a design. Without these in place, the only solution is one that involves heavy redundancy within the design.
