When Bugs Escape

The ability to find bugs has not kept up with the growing complexity of systems. Bugs are more likely to end up in products than ever before.


Bugs are a fact of life, and they always have been. But verification methodologies may not have evolved fast enough to keep up with the growing size and complexity of systems.

The types of bugs are changing, too. Some people call these corner cases. Others call them outliers. Still another group refers to them as simulation-resistant superbugs.

In markets such as automotive, the notion of bugs is evolving. Designs must not only be resilient to random faults, but also be able to detect and recover from systematic faults.

Some product categories are being asked to last a lot longer than traditional consumer devices, and they are being built to be upgradeable in the field. This changes the notion of verification from a static task conducted during the design phase into a task performed over the lifetime of the product. How are the tools evolving to address these challenges?

Fig. 1: Why bugs need to be caught early. Source: Intrinsix

Where we are today
Verification methodologies have evolved over time. “Today, the industry does layered verification,” states Chirag Gandhi, director of verification for Arteris IP. “It starts at the unit level and there are assumptions on the inputs and outputs at the unit level. It then goes to the sub-system and full-system level. As many layers are defined in the hierarchy as needed. It can go to emulation where it can exercise the use-case that the chip is intended for. The next level above that is silicon, where you have the actual silicon and you do testing before you ship the product. You try to cover as many cases as possible to make sure that you don’t have corner-case bugs.”

That may not be enough, however. “Outlier bugs are flaws that can only be exposed by exercising the design outside its normal flow or operating parameters,” points out Harry Foster, chief scientist for verification at Mentor, a Siemens Business. “A corner case bug is one example of outliers, which are byzantine in nature and often require a complex sequence of events to occur before they are exposed. The problem with outliers is that it can be extremely complex to set up the appropriate scenarios to expose them. Hence, they are often identified late in the verification/validation cycle and potentially can have significant impact on the project schedule. If the outlier is an architectural flaw, the impact could be devastating.”

The industry does not have the tools to tackle them. “Pseudo-random stimulus generation is unlikely to trigger such a precise sequence,” points out Tom Anderson, technical marketing consultant for OneSpin Solutions. “In addition, hand-written directed tests may not exist since the verification team can’t think of every possible corner-case situation and verify it explicitly.”

All the contributors for this article were unified in one aspect of this problem. “We have demonstrated that most of the bug escapes are tied to parallelism,” says Craig Shirley, CEO for Oski Technology. “We have gone through published errata on some chips and we look and classify the bugs. It is amazing how parallelism-related bugs rise to the top in terms of being the most common.”

Parallelism creates temporal uncertainty within systems. “Timing is becoming increasingly important and is transactional in nature,” adds Adam Sherer, product management group director in the System & Verification Group of Cadence. “This could be asynchronous between clocking domains, or it is temporal in terms of the arrival of signals based on transactions. So we are looking at transaction resilience to events.”
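The point about parallelism can be made concrete with a toy sketch (plain Python, not any vendor's tool): two "threads" each perform a non-atomic increment, and exhaustively walking every legal interleaving shows which schedules lose an update. In this tiny example most interleavings fail; in a real design the failing schedules are a vanishingly small fraction of an astronomically larger space, which is exactly why random simulation misses them.

```python
from itertools import permutations

def simulate(schedule):
    """Run one interleaving of two 'threads' (A and B), each doing a
    non-atomic increment of x: load into a register, then store reg+1."""
    x = 0
    reg = {"A": 0, "B": 0}    # per-thread register
    step = {"A": 0, "B": 0}   # next operation index per thread
    for tid in schedule:
        if step[tid] == 0:    # load
            reg[tid] = x
        else:                 # store the incremented value
            x = reg[tid] + 1
        step[tid] += 1
    return x

# Exhaustively enumerate every legal interleaving of the two threads.
schedules = sorted(set(permutations("AABB")))
results = [simulate(s) for s in schedules]
print(results)  # [2, 1, 1, 1, 1, 2] -- four of six interleavings lose an update
```

Only the two schedules where one thread finishes before the other starts give the correct count of 2; everything in between silently drops an increment.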

There are other reasons why these are becoming more common. “It is hard to get into buggy states in simulation because you don’t really know what those states actually are,” says Rajesh Ramanujam, product marketing manager for NetSpeed Systems. “What are the temporal scenarios that cause a system to be buggy? Today we live in an integration world and every SoC contains a number of IPs from third parties. Nobody knows how the IPs work internally. There is a lack of transparency about how an IP is built.”

Roger Sabbagh, vice president of applications engineering at Oski Technology, defines simulation-resistant superbugs: “Functional bugs that are resistant to detection during simulation and even emulation. They arise because extreme corner scenarios are required to activate and detect them. This task has proven too much for traditional functional verification methods to tackle, so superbugs are typically initially discovered in silicon — and sometimes by the end customer.”

There are challenges with extending existing methodologies to cover some of these issues. “These bugs are generally concurrent in nature,” notes Foster. “Unfortunately, humans have a difficult time reasoning about concurrency, which means that creating a coverage model or set of directed tests to expose this class of bugs is generally ineffective.”

How we got here
Moore’s Law is ending for most application areas. “Without the ability to rely on gains from scaling, designers have to get more creative,” says Vigyan Singhal, founder of Oski Technology. “The only way in which people have been creative with design is by adding parallelism and by improving power through clock gating and managing activity in the design. Those are the primary reasons why simulation-resistant superbugs have come about, and it is a minefield. Customers are finding more of them today than previously.”

Most of the existing verification tools focus on functional errors. “Ten years ago verification engineers only cared about functional errors, and there was little focus on performance,” says Gandhi. “Today, it is part of the verification planning phase. You have to look at the interactions between units to see if it causes problems that result in performance bugs.”

And this is not just a problem for hardware. “In a single-core environment or single-IP environment, people are familiar with the software interactions and the hardware processes,” explains Kevin McDermott, vice president of marketing for Imperas. “But in modern chips there will be multiple processors, different architectures with different interactions, and this is what people have to focus on today. How are these systems going to communicate and interact with each other? They also must have a structure should something happen that is not planned or expected. How does the system accommodate errors or faults, or unavailability of expected services?”

Utilizing formal technology
One area that has developed to address the challenge is formal verification. “Back in the days when systems were much simpler, simulation was good enough to explore the space and find the bugs,” says NetSpeed’s Ramanujam. “Then people realized that the state space was becoming bigger and they needed the help of emulation to run more cycles. And then systems got even bigger and it made people go to formal verification, which is mathematically proving that the RTL actually functions correctly against the spec.”

Formal also allows different kinds of issue to be found. “A lot of customers say they have had deadlocks with previous generations,” adds Ramanujam. “Deadlocks mean you are hosed. Because the space is so huge, trying to make sure it is deadlock-free is almost impossible, especially when doing monkey testing, which involves brute-force running of many simulations. Even at the end of it you do not know that you have hit every scenario. Formal makes the job a lot easier.”

There are other types of bugs that are much better suited to formal. “Resource deadlocks, system hangs, or compromised security,” says OneSpin’s Anderson. “Only formal techniques can predictably find outlier bugs during pre-silicon verification. By exhaustively analyzing the chip design against a set of assertions specifying intended behavior, formal verification can set up every possible corner-case scenario to see if any bugs are revealed. When the analysis stops finding bugs, formal tools can prove that no further bugs exist. This level of certainty is impossible to obtain from simulation, emulation, or even testing of chip prototypes in the lab.”
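The exhaustive analysis described above can be illustrated with a toy explicit-state model checker. This is a hypothetical sketch, not any commercial engine: two processes acquire two locks in opposite order and then release them, and a breadth-first search over the complete state space either reaches a state where neither process can move or proves no such state exists.

```python
from collections import deque

# Two processes acquire two locks in opposite order, then release both:
# the classic deadlock that a lucky simulation run may never hit.
ORDER = {0: ("a", "b"), 1: ("b", "a")}

def bump(pc, pid):
    """Advance one process's program counter."""
    return tuple(p + 1 if i == pid else p for i, p in enumerate(pc))

def successors(state):
    pc, owner = state                     # owner: frozenset of (lock, pid)
    held = {lock for lock, _ in owner}
    out = []
    for pid in (0, 1):
        if pc[pid] < 2:                   # acquire the next lock, if free
            lock = ORDER[pid][pc[pid]]
            if lock not in held:
                out.append((bump(pc, pid), frozenset(owner | {(lock, pid)})))
        elif pc[pid] == 2:                # release everything this process holds
            rest = frozenset((l, p) for l, p in owner if p != pid)
            out.append((bump(pc, pid), rest))
    return out

def find_deadlock():
    """Breadth-first search of the complete state space; return a state
    where no process can move but the work is unfinished, else None."""
    init = ((0, 0), frozenset())
    seen, queue = {init}, deque([init])
    while queue:
        state = queue.popleft()
        succ = successors(state)
        if not succ and state[0] != (3, 3):
            return state                  # deadlock found
        for nxt in succ:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None

print(find_deadlock())  # pc (1, 1): each side holds one lock, waits for the other
```

Because the search visits every reachable state, a `None` result would be a proof of deadlock freedom for this model, which is the kind of certainty simulation cannot offer. Real formal tools face state spaces far too large to enumerate explicitly and rely on symbolic techniques instead.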

But formal has limits. “Formal can theoretically find every possible bug, but it may take orders of magnitude more time,” says Gandhi. “Formal can be used at the unit level, but you often can’t use it on bigger blocks just because the depth to figure out any bug can be too difficult and may take a lifetime to complete. What can be done today is to use models that represent RTL. In these models, we can abstract out a lot of things that cause depth problems. This can decrease the amount of time it would take to verify the architecture. We also use formal on small blocks, especially in areas such as deadlock avoidance. It is a lot better to use formal for these tasks.”

Adding Portable Stimulus
Accellera has just released version 1.0 of the Portable Stimulus Standard (PSS). “The new portable stimulus standard can help uncover outlier bugs too,” says Foster. “For example, we have applied classification machine learning to our PSS graph-based technology tool to enable the actual targeting of scenarios not yet verified. This technique explores the coverage space more efficiently and productively than traditional constrained-random simulation. We also have applied data mining technology to extend the application of Portable Stimulus beyond verification by collecting and correlating transaction-level activity to characterize performance parameters. For example, fabric routing efficiency and bandwidth, system-level latency, cache coherency, arbitration efficiency, out-of-order execution, and even opcode performance. The key to finding outlier bugs requires sufficient cross-exploration between the design and input space and that is where constrained-random, formal technology, and PSS graph-based intelligent stimulus can help.”

PSS will also provide the ability to create more complex tests during system-level verification. “Once a bug is found and isolated, it can still be difficult to understand and to create a test that we can actually debug to figure out how to fix it and add it to the regression suite so that it doesn’t break again,” says Larry Melling, director for product management and marketing at Cadence. “We are talking about many processors doing operations concurrently, stressing the buses to be able to hit those timing problems. This isn’t a methodology and approach that’s working for everyone today, but many are certainly looking to apply the technology to see if we can improve upon the current situation.”

Bug escapes
Even with the addition of new tools and technologies, bugs will escape. “When you have a bug sighting, especially in silicon, you don’t always know if it is a bug or not,” says Gandhi. “You have to try to find the cause for the observation. There are features that should be available in-chip, such as scan chains, a way to read data out of important structures inside the chip, some form of trace debug or performance monitors. Using this information, you try and figure out if there is a bug.”

On-chip debug and awareness will become much more important. “People are recognizing that we have devices with multiple processors sitting there, so we can use the processor in-situ to help figure out what is happening,” says McDermott. “You need to know the traffic on internal buses, and if different events occur that could be unrelated to the area in which you are trying to find a bug, that traffic is a factor that has to be observed. There are multiple approaches to it, virtual platforms and real hardware.”

Heisenberg’s uncertainty principle can come into play when trying to track down problems that have a temporal aspect to them. “You have to be running at full speed,” says Rupert Baines, CEO for UltraSoC. “Speed also matters because system-level problems only show up occasionally. On-chip instrumentation can provide an abstraction that is not about bits and bytes, it is not about JTAG messaging, but provides a system-level view. They can identify protocol performance, transaction-level statistical performance and help people identify anomalies. Once you have the area of problem, you can zoom into lower levels of detail.”

When a bug has been isolated, it needs to be fixed. “Once it has been found, you try and find workarounds,” says Gandhi. “This could be a software or hardware workaround, such as chicken bits, or doing something just a little bit differently, such as adding a small amount of time. While this could affect functionality or performance, in many cases it may not be important.”

This assumes that the product can be updated. “This is the new style of software development that people have adopted particularly for IoT,” says McDermott. “The old days of, ‘Build a product, test it rigorously, ship it, and you are done,’ does not apply anymore. People are familiar with field updates, and with embedded always-on, always-available devices. Continuous updates are par for the course.”

Automotive is adding to the demands. “The detection and correction mechanisms that are designed into the product will depend on the safety impact from either a systematic failure (such as a corner case that was never identified during functional verification) or a random hardware failure,” adds Foster. “But you can see that the goal is not really to detect an outlier per se, but to detect a failure that would have an impact on a safety goal.”

And with the smallest of geometries, transient bugs also may need to be detected. “If an alpha particle flips a memory bit, the chip must be able to detect and possibly correct for this error,” points out Anderson. “Formal tools can consider a range of possible faults, both transient and permanent, and analyze their effects on the design.”
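In hardware, such transient flips are typically handled by error-correcting codes. As a software sketch of the idea (a textbook Hamming(7,4) code, not tied to any tool mentioned here), four data bits are protected by three parity bits, and the recomputed parity checks form a syndrome that points directly at any single flipped bit:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword
    (positions 1..7; parity bits at positions 1, 2, and 4)."""
    c = [0] * 8                       # index 0 unused, positions 1..7
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def hamming74_correct(code):
    """Recompute the parity checks; the syndrome is the 1-based position
    of a single flipped bit (0 means no error). Returns (data, syndrome)."""
    c = [0] + list(code)
    syndrome = 0
    for p in (1, 2, 4):
        parity = 0
        for i in range(1, 8):
            if i & p:                 # position i is covered by parity bit p
                parity ^= c[i]
        if parity:
            syndrome += p
    if syndrome:
        c[syndrome] ^= 1              # flip the faulty bit back
    return [c[3], c[5], c[6], c[7]], syndrome

data = [1, 0, 1, 1]
code = hamming74_encode(data)
corrupted = list(code)
corrupted[4] ^= 1                     # "alpha particle" flips position 5
print(hamming74_correct(corrupted))   # ([1, 0, 1, 1], 5) -- data recovered
```

Memory ECC in real chips uses wider SECDED variants of this scheme, but the principle is the same: the fault is detected and corrected in-line, and the formal analysis described above can then check that every modeled fault is indeed caught.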

Systems increasingly are being deployed in which the hardware can be updated. “Today, updates are primarily firmware,” says Sherer. “But it could be an FPGA update. Then it is a hardware update, as well. There is a looming complexity that will need a solution.”

Machine learning adds another layer of complexity. “With ML, you can verify the system and how it is intended to work,” adds Ramanujam. “But what if it has a neural network that involves learning? The system may not react in the same way to a given stimulus today compared to yesterday. It learns by itself. How do you verify a system that is learning?”

This will require some new capabilities. “What you are describing is a system in which you see a failure under a certain set of operating conditions for that system in the field,” he says. “How do you replay that entire environment in a debuggable toolset? It is not as simple as just rerunning a test. We will need some ability to load the state of the system into a debuggable environment, because debug is dependent on the learned state. Then it turns into a security problem.”

Conclusion
The nature of verification has changed, and it will continue to change. “Traditional methods work well for sequential blocks, but the game has changed,” says Singhal. “You have to step back and restart verification. What do you do when silicon comes back? That is too late. You have to plan verification differently.”

Singhal suggests that you have to start with the design process. “When you architect the design or make decisions about the blocks and microarchitecture, you must understand concurrency versus sequentiality as a first-class decision point.”

Ramanujam notes this also requires a philosophy of not putting the bugs in. “We have a very modular architecture and a simple and elegant way of keeping the design space a lot smaller by using a few routers, switches, and bridges. We built the interconnect and can control exactly what the states are, and we can verify that state space in its entirety. This is why we do not just supply them as LEGO blocks and allow the user to build an interconnect.”

Many aspects of product development, verification and maintenance have to be rethought. While some important pieces already are in place, no methodology yet exists to pull everything together.


