Utilizing More Data To Improve Chip Design

Different ways of collecting, analyzing and applying that data to improve efficiency and reliability.


Just about every step of the IC tool flow generates some amount of data. But certain steps generate a mind-boggling amount of data, not all of which is of equal value. The challenge is figuring out what’s important for which parts of the design flow. That determines what to extract and loop back to engineers, and when that needs to be done in order to improve the reliability of increasingly complex chips and reduce the overall time to tapeout.

This has become essential in safety-critical markets, where standards require tracking of data analysis. But beyond that, analyzing data can help pinpoint issues in design and manufacturing, including issues stemming from thermal and other physical effects to detecting transient anomalies and latent defects. Along with that data, new techniques for data analysis are being developed, including the use of digital twins, artificial intelligence and machine learning.

In logic verification flows, the data from testbench code coverage is analyzed exhaustively to show which parts of the RTL version of designs have been exercised, and to what extent they have been tested. This data analysis is critical to the logic verification flow. Other areas of the IC design flow generate a lot of data that traditionally has not been analyzed as exhaustively because it simply provides an indication of whether chips are functioning as expected. That is changing, however.
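To make that concrete, the sketch below shows one way per-test coverage data might be rolled up across a regression to flag under-exercised RTL blocks. The block names, counts and 90% threshold are invented for illustration, and the merge here is a simplification of what commercial coverage databases actually do.

```python
# Minimal sketch: rolling up per-test code-coverage results from a regression
# into a single view of which RTL blocks remain under-exercised.
# Block names, numbers, and the 90% threshold are illustrative only.

from collections import defaultdict

# Hypothetical per-test coverage reports: block -> (covered lines, total lines)
regression_reports = [
    {"alu": (880, 1000), "fetch": (400, 1200), "lsu": (950, 1000)},
    {"alu": (910, 1000), "fetch": (700, 1200), "lsu": (960, 1000)},
]

def merge_coverage(reports):
    """Keep the best covered-line count seen per block (an optimistic merge;
    real tools union line-by-line coverage bitmaps rather than counts)."""
    merged = defaultdict(lambda: [0, 0])
    for report in reports:
        for block, (covered, total) in report.items():
            merged[block][0] = max(merged[block][0], covered)
            merged[block][1] = total
    return merged

def under_exercised(merged, threshold=0.9):
    """Return blocks whose best observed line coverage is below the threshold."""
    return {b: c / t for b, (c, t) in merged.items() if c / t < threshold}

if __name__ == "__main__":
    merged = merge_coverage(regression_reports)
    for block, pct in sorted(under_exercised(merged).items()):
        print(f"{block}: {pct:.0%} line coverage -- needs more tests")
```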

“With the increasing emergence of digitalization, autonomy and AI in just about everything, all data is growing in importance,” said Joe Sawicki, executive vice president of IC EDA at Mentor, a Siemens Business. “The growing number of safety standards also require new levels of tracking processes and mandate documentation of the data generated by all these processes and tools. Increasingly these standards call for extensive testing at all levels of development—from IP to block to chip to PCB to ECU to entire electronic system to end product electrical/mechanical—and that test data needs to be documented.”

It also needs to be looked at in context, which is where some of the new tools become critical. In automotive and some medical devices, for example, one system may need to fail over to another. This is where AI and concepts such as digital twins fit in, because these devices need to be tested in virtual scenarios running real software before the products are manufactured and tested in the real world.

“All that needs to be tested, and the methods of testing and results need to be documented,” Sawicki said, noting that to achieve this, integrated solutions are essential not only to enable design and verification teams to test and collect data at the system level, across engineering disciplines, and as a digital twin, but also to use that data to produce more robust next-generation products and even more efficient ways to manufacture those products. “Having more data, and sharing that data between tools, will enable the emerging generation of AI/ML-powered tools to be better trained to perform tasks faster and with greater accuracy—not just for the tools in the traditional IC EDA flow, but for all of the tools involved in building better, more complete digital twins.”

Big data problems and approaches
Complicating all of this is the growing complexity and size of chips, and the amount of data being generated by everything from layout to regressions. That data needs to be read and analyzed, but it also requires an actionable plan.

First and foremost is the sheer magnitude of the data, so the order of the day is first, read the data; second, analyze it; and third, turn it into analytics or actionable decisions. “What’s the value of all this data and the reports?” asked Vic Kulkarni, vice president and general manager of the Semiconductor Business Unit at ANSYS. “That’s a common question. For example, it’s very common to have between 21 billion and 25 billion transistors for AI chips, GPUs and CPUs. If you look at any new startups doing AI chips for ADAS, for crypto, for any decision-making, some of the convolutional neural networks are all in that range. These typically have 100 billion-plus connected regions/nodes, with each of those having multiple connections. As a result, there are 1 trillion pure RC components on some of these ADAS-type chips.”

Driving this analysis from the technical side is the multi-physics of power and signal integrity, reliability, as well as electromagnetic effects. “A multi-physics approach is required over time right from the good old analog transistor days, to micron, submicron, nanometers, and now to 7, 5, 4, 3nm,” Kulkarni said. “Moore’s Law has continued, although at 3nm it’s questionable, but all of these effects have started to become so critical that more and more physics gets added. Multi-physics effects start adding up as technology shrinks.”

Ansys has added machine learning on top of data management to identify false violations in its electromigration and circuit reliability technologies, which examine what patterns will cause reliability or aging issues. “Determining what data to manage and how to analyze it is tricky,” he said. This is particularly true for automotive applications, where LIDAR, radar and cameras all produce streaming data. Efficiently managing and analyzing this data may require different types of processors used in combination, including GPUs, CPUs and FPGAs, he noted.
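As a rough illustration of that kind of ML-assisted triage (not ANSYS’s actual implementation), a classifier trained on violations from past signed-off designs could score new electromigration reports by how likely they are to be real. The features, labels and model below are hypothetical.

```python
# Hedged sketch of using ML to triage reported electromigration/IR violations.
# Features, labels, and model choice are illustrative, not any vendor's method.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features per reported violation:
# [current_density_ratio, wire_width_um, toggle_rate, temperature_C]
X_train = np.array([
    [1.4, 0.05, 0.30, 105.0],
    [0.9, 0.12, 0.05,  85.0],
    [1.1, 0.04, 0.45, 110.0],
    [0.8, 0.20, 0.02,  70.0],
])
# Labels from past signed-off designs: 1 = real violation, 0 = false/waived
y_train = np.array([1, 0, 1, 0])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score new violations so engineers review the most likely real ones first.
new_violations = np.array([[1.3, 0.06, 0.40, 108.0],
                           [0.85, 0.15, 0.03, 75.0]])
probs = model.predict_proba(new_violations)[:, 1]
for i, p in enumerate(probs):
    print(f"violation {i}: estimated probability of being real = {p:.2f}")
```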

Combining different technologies is becoming more common for analyzing everything from hardware and software performance to how they adhere to safety standards.

“We started with the identification of parallelism, but found that’s not really the issue for new types of applications,” said Max Odendahl, CEO of Silexica. “There may be enough concurrency and enough parallelism. The question is more at the system level—how to combine the dynamic data with what already has been written, and how that fits into a combined model. Engineering teams don’t need more parallelism. They need to figure out what’s going on in the system. What happens if it doesn’t work? Do they need more parallelism? Do they need better scheduling? Is a wider bus needed? Does the system need mapping of the threads to specific cores because it’s too dynamic? What’s the root cause of bugs? They have no clue. All they do is capture a huge amount of data. They can look at Excel charts of a thousand lines, but it doesn’t really tell them anything.”

What is needed is a more focused way to combine the data that has been captured based on application monitoring or dynamic analysis, and then relate this back to the logical software architecture to combine them into a unified model, Odendahl said. “What can I do with this? I can check that the way I implemented it is actually how it’s specified. Or if somebody started hacking away, and management says it’s a product, now I suddenly need to do all kinds of validation tests, capturing all of that data to show it actually makes sense. That’s what we are hearing on a daily basis. Once I have this connection, I can get a lot of insights and try to find the root cause. Or it may be there are two runs with the same algorithms, but they behave differently. If all I’m looking at is an Excel chart, that’s really hard to find out. But if I have a source code connection, I can tell that in Line 12 there was a recursive algorithm one time. It went 10 iterations, but based on the new data, it did 200 iterations, and that’s why we think it looks different. I can only do this if I connect those two worlds together.”
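A simplified sketch of that connection between dynamic data and source code might look like the following, which counts trace events per source location and flags locations whose behavior diverges between two runs. The trace format and locations are invented and do not represent Silexica’s model.

```python
# Rough sketch: connect dynamic trace data back to source locations so that
# two runs of the same algorithm can be compared, as described above.
# The trace format, file names, and divergence factor are invented.

from collections import Counter

# Hypothetical traces: one (source_file, line, event) tuple per loop iteration
# of a recursive routine, mirroring the 10-vs-200 iteration example above.
run_a = [("filter.c", 12, "iteration")] * 10
run_b = [("filter.c", 12, "iteration")] * 200

def iterations_per_location(trace):
    """Count dynamic events per source location."""
    return Counter((f, line) for f, line, _ in trace)

def diff_runs(trace_a, trace_b, factor=2.0):
    """Flag source locations whose dynamic behavior diverges between runs."""
    a, b = iterations_per_location(trace_a), iterations_per_location(trace_b)
    for loc in sorted(set(a) | set(b)):
        ca, cb = a.get(loc, 0), b.get(loc, 0)
        if max(ca, cb) > factor * max(min(ca, cb), 1):
            print(f"{loc[0]}:{loc[1]} -- run A: {ca} events, run B: {cb} events")

diff_runs(run_a, run_b)
```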

What’s different
In the traditional verification space, the scope of verification always has been viewed as a holistic process. But the definition of holistic has changed as chips increasingly are more tightly integrated into systems, and as systems interact with other systems.

“For the design and verification team, it’s really about predictability in the schedule, trying to understand if they are being as productive and efficient as they possibly can be, and whether they are targeting an acceptable level of quality for the delivery of their product,” said Larry Melling, product management director in the System & Verification Group of Cadence. “Verification is a stack, and the stack consists of the execution of the verification processes and different types of engines. The next layer, which consists of layers of abstraction, includes whether I can do this faster if I use a more abstract representation of something, and it’s equivalent enough for these cases. It’s kind of balancing the engines with abstraction to say I can get more performance by offering some level of abstraction. Then at the top level, even though we talk a lot about coverage and everything else, when you talk to users it’s all about bugs.”

In this space, verification management has progressed from coverage-driven to metric-driven, which is the mainstream today. This will evolve into data-driven verification, which requires a different way of looking at the verification problem.

“It’s learning to ask the right questions of the data and then visualizing those data sets in a way that says, ‘I see something here,’” Melling said. “If we can collect that information, and get this information, and do that analysis, then we can basically tell it what the next step is, as opposed to having an engineer in the middle of the loop all the time. It’s definitely an iterative process, but the good news is with this stuff it can move quickly once you get the engine built. So it’s doing the data collection with distributed data systems, such that you can query distributed data and create interesting data sets in a flexible manner. As that happens, we should start to see more rapid progress in terms of analytics and closed-loop feedback kinds of approaches.”
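One small example of asking questions of the data in the way Melling describes: once regression results from distributed runs land in a common store, even simple queries can surface failure clusters worth investigating. The columns and values below are hypothetical.

```python
# Minimal sketch: pull regression results into one frame and look for
# failure clusters. Columns, tests, and signatures are hypothetical.

import pandas as pd

results = pd.DataFrame({
    "test":      ["t1", "t2", "t3", "t4", "t5", "t6"],
    "block":     ["lsu", "lsu", "fetch", "lsu", "alu", "fetch"],
    "status":    ["fail", "fail", "pass", "fail", "pass", "pass"],
    "signature": ["cache_evict", "cache_evict", "", "cache_evict", "", ""],
})

# Failure rate per block: a crude first question to ask of the data.
fail_rate = (results.assign(failed=results.status.eq("fail"))
                    .groupby("block")["failed"].mean()
                    .sort_values(ascending=False))
print(fail_rate)

# Most common failure signature: hints at a single root cause to chase first.
print(results.loc[results.status.eq("fail"), "signature"].value_counts().head(1))
```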

Others agree. “In the verification space, analytics data and metrics are used both to make better progress on a particular project and to improve tools for future projects,” said Dominik Strasser, vice president of engineering at OneSpin Solutions. “In fact, analytics and metrics are at the very heart of modern functional verification. Coverage metrics and verification results are the keys to knowing what has been verified, highlighting gaps in verification, and determining when the process is done. Metrics and results from formal analysis can be integrated with those from simulation and emulation with the assistance of new commercial tools. Results can be back-annotated to the original verification plan, and coverage results can be used to determine a minimal set of simulation tests. These steps converge on full verification and functional sign-off more quickly.”
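The test-minimization step Strasser mentions can be approximated with a greedy set-cover pass over coverage results, sketched below with made-up tests and coverage points; commercial tools use more sophisticated ranking than this.

```python
# Sketch: use coverage results to pick a minimal set of simulation tests via
# a simple greedy set-cover heuristic. Tests and coverage points are made up.

def minimize_tests(coverage_by_test):
    """Greedily select tests until no test adds new coverage points."""
    remaining = set().union(*coverage_by_test.values())
    selected = []
    while remaining:
        # Pick the test covering the most still-uncovered points.
        best = max(coverage_by_test, key=lambda t: len(coverage_by_test[t] & remaining))
        gained = coverage_by_test[best] & remaining
        if not gained:
            break
        selected.append(best)
        remaining -= gained
    return selected

coverage_by_test = {
    "smoke":       {"p1", "p2"},
    "random_1":    {"p2", "p3", "p4"},
    "random_2":    {"p3", "p4"},
    "corner_fifo": {"p5"},
}
print(minimize_tests(coverage_by_test))  # ['random_1', 'smoke', 'corner_fifo']
```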

During formal verification runs, additional metrics can be gathered on which formal engines are most effective for different types of designs and assertions, Strasser pointed out. “Modern formal tools have many such engines. Experts may wish to control them directly, but most users just want the best possible results (proofs and bugs) as quickly as possible.”

And now, with machine learning, the algorithms that select which engine to run get better and better based on what is found to work best. This, in turn, speeds up the verification process on a given design. But the real win is that over time the collective wisdom from multiple projects improves the formal tools themselves, Strasser said.
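A toy version of that engine-selection idea is shown below: record which engine converged fastest on past properties, then order engines accordingly for new properties. The engine names and features are invented for illustration and do not describe OneSpin’s implementation.

```python
# Hedged sketch of learning-based formal engine selection: order engines by
# past success on similar properties. Engine names and features are invented.

from collections import defaultdict

# Historical results: (property_kind, design_size_bucket) -> engine that won
history = [
    (("safety", "small"), "bdd"),
    (("safety", "small"), "bdd"),
    (("safety", "large"), "ic3"),
    (("liveness", "large"), "k_induction"),
    (("safety", "large"), "ic3"),
]

wins = defaultdict(lambda: defaultdict(int))
for features, engine in history:
    wins[features][engine] += 1

def rank_engines(features, default=("bdd", "ic3", "k_induction")):
    """Try engines in order of past success for similar properties."""
    seen = wins.get(features)
    if not seen:
        return list(default)
    ranked = sorted(seen, key=seen.get, reverse=True)
    return ranked + [e for e in default if e not in ranked]

print(rank_engines(("safety", "large")))   # ['ic3', 'bdd', 'k_induction']
print(rank_engines(("liveness", "small"))) # falls back to the default order
```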

Closing the loop
While lots of data is captured during design and verification, what’s more important is capturing data from the real world and feeding it back into the verification process.

This is all about determining how software is really running on the hardware—with the right insight, said Rupert Baines, CEO of UltraSoC. “There are software tools, such as sampling profilers, but they are flawed. They are intrusive. They only see some activity because they are sampling. They introduce their own dependencies (heisenbugs), and crucially they affect performance—quite significantly. In contrast, hardware-based analytics are real-time and offer exhaustive coverage. They do not affect behavior or performance. They also give insight into things software cannot see—interconnects, memory controllers and caches. The latter two are famously complex and infamously responsible for a lot of performance issues. And as we know, caches can be responsible for a lot of security problems too, so analytics that can detect them can be invaluable.”

Understanding the interaction between software and hardware is critical. “Software performance depends on hardware,” Baines stressed. “Software integration and verification is now one of the biggest single costs in SoC development, and it is getting worse with multi-core designs. Surprisingly often, subtle aspects of hardware design can have major impacts on software performance and behavior. Integration and verification to detect those, optimize performance, detect security issues, etc., is very hard. There is a need for analytics that gives real-time, non-intrusive, system-level insights.”

Targeting solutions
Still, data on its own is just data. “For data to become useful it needs to be contextualized,” said Stelios Diamantidis, product marketing director of the design group at Synopsys. “When you start thinking about the problem and contextualizing the data, it really always starts with what problem you are trying to solve. And that’s where we are. We’re beginning to understand a little better, and a little more systematically, how challenges and data can be combined to make data more meaningful. If we just look at the raw data, it is overwhelming in the context of design and verification, where most of the problems we’re trying to solve are NP-complete and hence very difficult to contain and address.”

Even where data is available, the usefulness of that data varies greatly. “The availability of the right data—not just the amount of data—is crucial to increase predictability, reliability and improve design and verification outcomes,” said Tim Whitfield, Arm’s vice president of strategy. “Arm has implemented a standardized framework (schema) to collect the right data across all projects and engineering flows. This allows for much wider and accelerated application of data science approaches due to easier transfer of outcomes across projects and general availability of suitable historic data for training and tuning. We see benefits of leveraging data in power, performance, coverage, reduced testing efforts, failure analysis, verification, etc.”
 
Whitfield noted that Arm has created dedicated teams to bootstrap machine learning. “One of the key focus areas is embedded intelligence and online machine learning flows, in which adaptive algorithms learn from new data 24/7 and automatically update engineering flows to improve outcomes. Data collection is standardized across groups and projects.”
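Purely as an illustration of what a standardized record plus an online update might look like (not Arm’s actual schema or flow), consider the following sketch, which defines a common result record and a running baseline that is updated as each new data point arrives.

```python
# Illustrative sketch: a standardized record for collecting results across
# projects, plus a trivial online (incremental) baseline update.
# Fields, names, and values are hypothetical, not Arm's actual schema.

from dataclasses import dataclass

@dataclass
class FlowRecord:
    project: str
    flow: str          # e.g. "synthesis", "simulation", "formal"
    metric: str        # e.g. "runtime_s", "coverage_pct", "power_mw"
    value: float

class OnlineMean:
    """Incrementally updated baseline that new results are compared against."""
    def __init__(self):
        self.n, self.mean = 0, 0.0
    def update(self, x: float) -> float:
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

baseline = OnlineMean()
for rec in [FlowRecord("p1", "simulation", "runtime_s", 3600),
            FlowRecord("p2", "simulation", "runtime_s", 3000)]:
    print(f"{rec.project}: {rec.value}s, running mean {baseline.update(rec.value):.0f}s")
```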

Conclusion
A common goal among all chipmakers and EDA companies is to capture a wide range of data per job and per flow, which can be fed back to improve the efficiency and effectiveness of engineering flows. But it also is increasingly important to think about data in relation to much more than just the system it sits in. This goes well beyond the creation and collection of data. It also extends to policy, security and privacy.

“We deal with a lot of very, very large companies, and they go beyond the most critical level of care about their IP,” said David Hsu, product marketing director of the verification group at Synopsys. “So in the sense of having a strategy or a coordinated view of how we deal with this, it’s always going to be evolving. But it can’t be viewed as, ‘Some parts of the company are going to do this, some tools are going to do this, some other groups are going to do this.’ That’s a no go. The way we want to look at this is actually less about the data and more about what needs to be answered even though customers come in and say, ‘There’s all this AI stuff that’s really cool. What are you guys doing there?’ We feel that’s really not the right question to ask. The right question to ask is, ‘I’ve got this horrendous challenge in X, Y or Z, and there doesn’t seem to be any algorithmic way to solve it because it’s beyond an NP-complete problem. What can you do? This is going to kill me either now or next year, please help.’ Those are the kind of areas that are pertinent to what we’re talking about now.”


