Thoroughly Verifying Complex SoCs

With so many variables, verifying complex, heterogeneous designs is a task with many tentacles.


The number of things that can go wrong in complex SoCs targeted at leading-edge applications is staggering, and there is no indication that ensuring these chips function as expected is going to get any easier.

Heterogeneous designs developed for leading-edge applications, such as 5G, IoT, automotive and AI, are now complex systems in their own right. But they also need to work in conjunction with other systems in predictable and provable ways for an expected lifetime and under variable conditions. And this is where things get really complicated, because chipmakers need to pay attention not only to how the design functions, but also to ensure the device is safe, secure, and trusted. It must operate as intended even under the most adverse conditions and be immune from unwarranted or unexpected interference, whether unintended or malicious.

“Multi-billion gate chips and heterogeneous platforms are so complex, containing millions of connections, that simulation is inefficient and insufficient,” said Rob van Blommestein, head of marketing at OneSpin Solutions. “What’s needed is exhaustive technology that can handle the massive number of connections. If we’re looking at AI applications that utilize floating point units, the verification of these designs must take into account how to effectively verify floating-point hardware, including ensuring the complex IEEE 754 floating-point standard is being met.”
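
Setting aside the full IEEE 754 surface (rounding modes, exception flags, subnormal handling), a useful first check is comparing a floating-point unit against a known-good reference on bit-exact terms. The sketch below is a minimal illustration of that idea: `fpu_add` is a hypothetical stand-in for the design under test, and since Python floats are IEEE 754 binary64, the host arithmetic serves as the golden model.

```python
# Minimal sketch: checking a hardware FPU model against IEEE 754 semantics.
# "fpu_add" is a hypothetical stand-in for the design under test; Python
# floats are IEEE 754 binary64, so host arithmetic is the golden reference.
import math
import random
import struct

def bits(x: float) -> int:
    """Bit-level view of a binary64 value, so -0.0 != +0.0 and NaNs compare."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def fpu_add(a: float, b: float) -> float:
    return a + b  # placeholder for the DUT; replace with the real model

special = [0.0, -0.0, math.inf, -math.inf, math.nan, 2**-1074, 1.5]
cases = [(random.uniform(-1e300, 1e300), random.uniform(-1e300, 1e300))
         for _ in range(10_000)]
cases += [(a, b) for a in special for b in special]

for a, b in cases:
    got, ref = fpu_add(a, b), a + b
    # NaN payloads may legitimately differ; require both-NaN, else exact bits.
    assert (math.isnan(got) and math.isnan(ref)) or bits(got) == bits(ref), \
        f"mismatch: {a} + {b} -> {got!r}, expected {ref!r}"
print("FPU add matches IEEE 754 binary64 on all sampled cases")
```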

For heterogeneous environments that are used in industries where functional safety and security are paramount, it is critical that companies understand the different types of faults and how they impact the design.

“Many of these designs must also comply with stringent safety standards, such as ISO 26262 or DO-254,” van Blommestein said. “Designers must contend with these requirements, as well. And when it comes to security and trust, formal approaches can be used. Verifying if a design is secure and trusted requires proving the absence of additional design logic and the absence of modifications to the design flow. The design must be proven to be fully compliant with its outlined specification and certification.”

Typically, SoCs for wireless, IoT and AI applications are built hierarchically, using IP blocks that may be developed internally by teams or purchased from vendors.

“Consider a case where the top-level SoC integration team is informed by an IP development team that the release of the IP they are about to tape out has a critical bug and a new release is available,” said Amit Varde, director, product management at ClioSoft. “Perhaps the bug is not so critical for this SoC and they are clear to tape out, or perhaps this warrants an upgrade and re-verification of IP used in the SoC. Either way, using a design management system that ties in to a bug tracking system is critical.”
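
As a minimal illustration of the tie-in Varde describes, the sketch below cross-references an SoC’s IP bill of materials against a bug tracker before tape-out. The names and data shapes here are hypothetical assumptions for illustration, not any particular design management system’s API.

```python
# Hypothetical sketch: before tape-out, cross-reference every IP release in
# the SoC's bill of materials against open bugs in the tracker. All names
# and data shapes here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Bug:
    ip: str
    release: str
    severity: str          # e.g. "critical", "major", "minor"
    fixed_in: str | None   # release that fixes it, if any

def tapeout_blockers(bom: dict[str, str], bugs: list[Bug]) -> list[Bug]:
    """Return critical bugs affecting the exact IP releases in the BOM."""
    return [b for b in bugs
            if b.severity == "critical" and bom.get(b.ip) == b.release]

bom = {"ddr_phy": "2.1", "pcie_ctrl": "4.0"}
bugs = [Bug("ddr_phy", "2.1", "critical", fixed_in="2.2"),
        Bug("pcie_ctrl", "4.0", "minor", fixed_in=None)]

for b in tapeout_blockers(bom, bugs):
    print(f"BLOCKER: {b.ip} {b.release} has a critical bug"
          + (f" (fixed in {b.fixed_in})" if b.fixed_in else ""))
```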

To be considered thorough, verification must include a variety of technologies and methodologies, which in turn requires a broad mix of verification tools.

“The process may start early with behavioral system-level verification before actual design implementation,” said Bipul Talukdar, director of applications engineering in North America at SmartDV. “It eventually follows through with block-level simulation and block-level formal property verification, connectivity verification, X-propagation verification, and clock-domain crossing verification for designs with multiple clock domains. The final steps include RTL and gate-level full-chip verification, full-chip emulation, system boot on an emulator for early software testing, and FPGA-based post silicon validation.”
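
To make one of those steps concrete, here is a minimal behavioral sketch of a clock-domain-crossing structure, a two-flop synchronizer, with 'X' standing in for the unknown value a flop can capture when the asynchronous input changes near its clock edge. It illustrates the failure mode CDC verification targets; it is not a tool flow.

```python
# Behavioral sketch of a two-flop CDC synchronizer. The first flop may go
# metastable ("X") when the async input changes near the clock edge; by the
# next edge it has settled to 0 or 1, so the second flop never exposes X.
import random

class TwoFlopSync:
    def __init__(self):
        self.ff1 = 0
        self.ff2 = 0

    def clock(self, async_in, changed_near_edge: bool):
        # ff2 captures ff1's settled value; a metastable ff1 resolves randomly.
        self.ff2 = self.ff1 if self.ff1 != "X" else random.choice([0, 1])
        self.ff1 = "X" if changed_near_edge else async_in

sync = TwoFlopSync()
data = 0
for cycle in range(20):
    near_edge = random.random() < 0.2   # input toggles close to the clock edge
    if near_edge:
        data ^= 1
    sync.clock(data, near_edge)
    # The receiving domain only ever sees a clean 0 or 1.
    assert sync.ff2 in (0, 1), "X leaked past the synchronizer"
print("two-flop synchronizer kept X out of the receiving domain")
```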

Complicating matters, heterogeneous SoCs require even more verification for thermal, mechanical and packaging effects at the system level. To ensure full coverage, as well as cost and time efficiencies, complex SoCs should follow a standardized, integrated verification flow that includes coverage-based quantitative metrics for signoff, Talukdar added.
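
A coverage-based signoff gate can be as simple as the following sketch: aggregate per-block metrics and refuse signoff until every one clears its target. The blocks, metrics and thresholds are illustrative assumptions, not a standard.

```python
# Sketch of a coverage-based signoff gate. The per-block coverage numbers
# and the targets below are illustrative assumptions.
coverage = {
    "cpu_cluster": {"line": 99.1, "toggle": 96.4, "functional": 97.8},
    "ddr_ctrl":    {"line": 98.7, "toggle": 94.9, "functional": 92.3},
    "noc":         {"line": 99.8, "toggle": 97.2, "functional": 99.0},
}
targets = {"line": 99.0, "toggle": 95.0, "functional": 95.0}

holes = [(blk, m, v) for blk, ms in coverage.items()
         for m, v in ms.items() if v < targets[m]]

if holes:
    for blk, metric, value in holes:
        print(f"HOLE: {blk}.{metric} = {value:.1f}% < {targets[metric]:.1f}%")
else:
    print("coverage signoff: all metrics meet targets")
```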

Size, applications and standards
Size plays an important role in all of this. AI chips are so large that they often have to be stitched together because they exceed the limits of a reticle. In contrast, an IoT chip typically is tiny and less redundant, with a heavy emphasis on external communication.

In addition, many of these chips are highly customized. Standards are just beginning to filter into the 5G market, which should help speed up the design process and add some predictability. But in other markets, standards may exist for functions such as I/O, but not for other parts of a design. In AI, for example, the technology is so new that algorithms are in a state of almost constant churn.

“For 5G, a lot of people are saying, ‘We are standardized in a way,'” said Jean-Marie Brunet, senior director of marketing for the Emulation Division at Mentor, a Siemens Business. “Yes, we are, but there is still some evolution to be had that forces a flexibility in the design implementation and the design verification and validation. Also, what is very different in these large systems is not only that they are big, but they also require a tremendous amount of flexibility in the environment — flexibility in the ability to scale very quickly in size.”

In AI, he explained, you move from a small cluster or small entity to a huge entity very quickly. “When you’ve verified that the core algorithm works, it’s a matter of scaling the algorithm to very large systems to speed up the process. They scale very quickly, so you have to have flexibility in that scaling operation.”

What is also a bit different for those large, complex chips is that system software content becomes very, very important. “For 5G, people think it’s only a communications protocol with more bandwidth, so they need to check that they can put this capacity of communication through the bandwidth line and everything works fine,” he said. “But we’re starting to see communication where a Linux boot or Android boot is required. This means you need to look at the interaction of your end software application on your hardware. That is okay if the design is small. If the design is very big, it’s complicated. For automotive, it’s the same thing. You hear about automotive-grade Linux and embedded operating systems in the car that are either very different or centralized around Android or Linux. Because those complex chips are supposed to process a lot of sensor data, they also have to react to an operating system so that software can be programmed.”

Shifting left
In the past, Shift Left generally was confined to verification, particularly for large chips used in servers. Usage of the concept is now spreading. In fact, Nokia has said publicly that its biggest challenge in the 5G space is Shift Left. How do you Shift Left an entire 5G ecosystem, and what does that mean in practice?

“It means that the software and the hardware need to be co-verified, and it’s very complex when the systems are very big,” Brunet said. “Automotive is the same thing. If you look at the Tesla announcement of the FSD (full self-driving) chip, it was extremely interesting. Elon Musk basically said this is the best computer chip in the world. He should have said this is the best computer chip in the world for Tesla software. If you look at his presentation, it’s really about designing custom hardware that works best with the desired end application software. So yes, that is fantastic for Tesla cars, but it’s completely unusable for another car and completely unusable for another application. It’s the same thing with these very large systems. They are customized in hardware with the end software needs in mind, and that interaction is challenging when you have billions and billions of transistors. Today, the largest system that customers are using our emulator for is around 6 billion to 8 billion gates. That’s a lot of gates. When you have machine learning/AI chips and you need to verify that the software works, it’s really complex to verify and validate that it does the right thing.”

But whether it is 5G or automotive or AI, there are similarities. It’s still about verification and validation, the latter of which is really about software content and having major OS platforms that need to be able to boot. “With Linux and Android, that means the programmability of the boot code has to be verified so you have correlation between the software and the hardware. And these designs are big, so you can’t run that with simulation, for example, because it’s too slow. You have to go to hardware-assisted platforms, either emulation or a large farm of FPGA prototypes. Those have different value and benefits. But the key thing is really that interaction of hardware and software, when the software needs are growing tremendously while the hardware is huge,” Brunet said.

Verifying AI chips
There also are some significant differences. An AI chip is highly redundant, because performance relies heavily on parallel processing, which makes verification more akin to that of a processor than of a typical SoC.

“For AI, you verify what’s going on in the chip,” said Frank Schirrmeister, senior group director for product management and marketing at Cadence. “And you’re verifying like in a processor. You’re verifying structurally, such that, ‘I’m computing all these coefficients in whatever CNN I have implemented.’ But that’s structural verification, just like processor verification is very different from SoC verification.”

AI is like a moving target, though. “You need to optimize your CNN implementation toward your data sets a little bit, so you verify structurally that this processor works and it computes the coefficients and all that in the way I really want to do it,” Schirrmeister said. “But you don’t actually verify the function of the network itself. That’s a completely new area. The thing which is so curious about AI implementations is that I have this big chip with AI in it, with a set of processes in it, and I verify structurally that they can all talk to each other. If the bits flip, things go into a safe state and all that good stuff. But why the coefficients are the way they are really depends on the training data sets. This is why everybody refers to the data sets as the new gold or the new oil in this world. Verifying that the training set is the correct one is a whole separate question in itself, and verifying that the AI does not actually do bad things while it’s running — basically that it doesn’t misbehave — is very hard to do. You never can train all the cases you’ll ever have. That’s just the nature of AI. But you now can trick the AI in terms of what’s actually happening out there, and that is a very curious and interesting thing from my perspective that we’ll have to deal with as vendors.”
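
A minimal sketch of the structural check Schirrmeister describes: the question is not whether the network’s coefficients are right, but whether the hardware computes the same numbers a golden model does. Here `dut_conv2d` is a hypothetical stand-in for the accelerator interface, wired to the reference itself so the sketch runs standalone.

```python
# Structural verification sketch: compare the accelerator's convolution
# output against a golden reference model. "dut_conv2d" is a hypothetical
# placeholder for the real hardware interface.
import numpy as np

def ref_conv2d(img: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Plain valid-mode 2-D correlation as the golden reference."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.int64)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

def dut_conv2d(img, k):   # placeholder for the real DUT interface
    return ref_conv2d(img, k)

rng = np.random.default_rng(0)
for _ in range(100):
    img = rng.integers(-128, 128, size=(8, 8), dtype=np.int64)
    k = rng.integers(-8, 8, size=(3, 3), dtype=np.int64)
    assert np.array_equal(dut_conv2d(img, k), ref_conv2d(img, k))
print("accelerator matches the golden convolution on all random cases")
```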

All of the major EDA companies point to challenges in verifying and validating AI chips and systems.

“The scalability of the verification approach, and what I call multi-level verification in heterogeneous applications like the AI chips that we see today, is just ever more important to conquer,” said Gordon Allan, product manager for Questa simulation at Mentor. “You need to verify the parts, the connections, and the protocols within the chip, as well as the whole chip. But those are very separate layers of the verification strategy. At each level there’s a desire to provide a platform for the next level. It has to be scalable, and this is where simulation-based technologies and emulation technologies can both come into play at different levels of the chip.”

Allan pointed to Graphcore’s mega-complicated Colossus AI chip as an example. “That device is so complex that you cannot possibly keep it all in your head all at once. You need a team working at different levels, and with tools that work well together across those different parts of the flow. For us in EDA, all of our focus on providing tools that have common standards, common APIs, common coverage, common debug — all of that is coming into play in these kinds of designs.”

Multi-level verification is essential when there are arrays of AI processors, as well as supervisor processors and other components. These all have to be verified individually and together.

“Typically, when we work with verification teams on methodology, we talk about verifying the processor unit, verifying the internal functionality of that first,” said Allan. “Then, the next layer for these kinds of scalable devices is to verify what’s called a Unit X. We verify the internals of X first, and then we verify the interface between one X and another X. Typically it’s a pipeline interface. If it’s deep in the IP block and communicates with an adjacent IP block, that is a verification challenge in itself because it’s highly pipelined, and that’s where the bugs normally lie. For example, this could include different protocol activity going back and forth on that interface. Any kind of array of processing elements like an AI device requires absolutely robust, 100% confidence in that interface before you can scale up. After we’ve verified the internals of X and verified an X-to-X interface, we can say we’ve verified that unit, and we can then make a model of it. That model can be used as a virtual platform for higher-level verification, for example, getting the software engineers started on some of their layers of the SoC.”
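
As a sketch of what checking that X-to-X interface might look like, the scoreboard below asserts that whatever one unit sends emerges from the adjacent unit in order and within a bounded latency. The structure is assumed for illustration; it is not Mentor’s methodology.

```python
# Scoreboard sketch for a pipelined X-to-X interface: items must come out
# in order and within a bounded number of cycles. All structure is assumed.
from collections import deque

class InterfaceScoreboard:
    def __init__(self, max_latency: int):
        self.in_flight: deque = deque()   # (payload, send_cycle)
        self.max_latency = max_latency

    def sent(self, payload, cycle: int):
        self.in_flight.append((payload, cycle))

    def received(self, payload, cycle: int):
        assert self.in_flight, f"cycle {cycle}: unexpected item {payload!r}"
        expect, sent_at = self.in_flight.popleft()
        assert payload == expect, \
            f"cycle {cycle}: got {payload!r}, expected {expect!r}"
        assert cycle - sent_at <= self.max_latency, \
            f"cycle {cycle}: {payload!r} took {cycle - sent_at} cycles"

# Drive it with a toy 3-stage pipeline between unit A and unit B.
sb = InterfaceScoreboard(max_latency=5)
pipe = deque([None, None, None])
for cycle in range(50):
    item = f"pkt{cycle}"
    sb.sent(item, cycle)
    pipe.append(item)
    out = pipe.popleft()
    if out is not None:
        sb.received(out, cycle)
print("X-to-X interface: ordering and latency checks passed")
```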

The next layer involves management and performance, which frequently are concerns in SoCs with multiple elements.

“Having verified each element, we now need to verify those orthogonal concerns, and they’re the usual SoC things: interrupts, exceptions, low-power modes, even something as simple as reset,” Allan said. “We’re now seeing customers wanting to cram more into their silicon. They’re moving away from resettable flip-flops to non-resettable ones, and that gives them a verification challenge, so we’ve been working with customers on reset verification in that kind of design. That’s just one example of larger-scale concerns across the chip that we need to look out for in these complex devices. Then we get into the SoC level, which is really where our emulation solutions come into play, as well as our ability to have virtual platforms based on those building blocks. It sounds easy to say verify this, verify that, but each one of those is a project in its own right. By verify we mean plan, create the testbench, create the stimulus, debug, measure the coverage, close the coverage. And those are areas where we’ve worked to optimize productivity. Whether it’s providing PSS or graph-based stimulus, that’s a useful way to optimize the stimulus creation, for example.”
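
To illustrate the graph-based stimulus idea Allan mentions, the sketch below treats legal test sequences as random walks through a graph of operations, so the generator can only produce orderings the specification allows. The graph itself is invented for illustration.

```python
# Graph-based stimulus sketch: edges define which SoC-level operation may
# legally follow which, so every generated sequence is legal by construction.
# The operations and edges here are illustrative assumptions.
import random

graph = {
    "reset":     ["boot"],
    "boot":      ["config", "low_power"],
    "config":    ["dma_xfer", "irq_test", "low_power"],
    "dma_xfer":  ["dma_xfer", "irq_test", "low_power"],
    "irq_test":  ["dma_xfer", "low_power"],
    "low_power": ["boot"],          # wake up and start again
}

def gen_sequence(length: int, seed: int) -> list[str]:
    rng = random.Random(seed)
    node, seq = "reset", ["reset"]
    for _ in range(length - 1):
        node = rng.choice(graph[node])
        seq.append(node)
    return seq

for seed in range(3):
    print(" -> ".join(gen_sequence(8, seed)))
```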

Verifying 5G chips
5G adds its own unique set of complexities, in part because the technology has never been implemented on a mass scale and in part because the market is a combination of new technology that is backward compatible with older technologies.

“First, you have a software stack that’s really complicated for the 5G protocol, so you have to verify it thoroughly to make sure you actually have a 5G device,” said Johannes Stahl, senior director of product marketing at Synopsys. “Companies are doing this with all means available, from virtual platforms to hybrid platforms to emulation. Second, there are higher data rates, so you want to be very efficient in implementing them. Here, people are implementing specific accelerators that become part of their hardware architecture, which again need to be verified with a lot of 5G protocol elements. And then, if you want to really do this thoroughly, you can’t actually do it without real 5G stimulus, and writing 5G stimulus manually is simply not possible. Writing 5G stimulus by taking your favorite Matlab model might be a starting point, but in the end, without driving actual 5G traffic into the design, you can’t do verification. What some developers do here is connect their design under test to a physical 5G tester and, with various adaptations, connect it to emulation.”

While emulation is important there, Stahl said prototyping is at least as important because so much traffic has to be run in the communication system that it should be done at the highest possible performance. “Emulation typically comes in when you integrate the 5G modem into an application processor and then bring everything together. That goes beyond what you do in prototyping. But for the actual core modem block functions, people do them in prototyping and connect those to testers.”
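
Stahl’s point about hand-written stimulus is easy to see even in a toy example. One simplified OFDM symbol stream already involves QAM mapping, an IFFT per symbol and cyclic prefixes, and real 5G NR layers numerology, channel coding and framing on top, which is why teams drive designs from testers or Matlab-class models rather than writing stimulus by hand. The sizes below are toy values, not NR-conformant parameters.

```python
# Toy OFDM stimulus sketch (not a real 5G NR waveform): QPSK-map random
# bits onto subcarriers, IFFT each symbol, prepend a cyclic prefix.
import numpy as np

rng = np.random.default_rng(1)
n_sc, n_sym, cp_len = 64, 14, 16        # toy sizes, not NR-conformant

# Map random bits to QPSK symbols on each subcarrier.
bits = rng.integers(0, 2, size=(n_sym, n_sc, 2))
qpsk = ((1 - 2 * bits[..., 0]) + 1j * (1 - 2 * bits[..., 1])) / np.sqrt(2)

# One IFFT per OFDM symbol, then prepend the cyclic prefix.
time_syms = np.fft.ifft(qpsk, axis=1)
with_cp = np.concatenate([time_syms[:, -cp_len:], time_syms], axis=1)
waveform = with_cp.reshape(-1)

print(f"{n_sym} OFDM symbols -> {waveform.size} complex samples of stimulus")
```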

And then the real world application of that technology needs to be layered on top.

“Pre-silicon, you can do block level verification very successfully,” said Gadge Panesar, CTO of UltraSoC. “You can put that together and verify the hardware at the system level. But then you layer the software above that and then you add on top external influences, because the entire thing needs to function in the real world.”

That’s a particularly difficult problem when you’re talking about systems that are “hyperconnected,” he said, because the real-world stimuli the system sees can come from anywhere within the entire connected realm. It’s even worse with 5G, which is designed to connect not just phones, but lots of different devices that have their own QoS, bandwidth and, increasingly, security requirements.

“In terms of wireless, and particularly cellular, in practice we’ve been taking an ‘in-life verification’ approach for quite a long time. Most parts of the cellular infrastructure are deployed, then routinely subjected to firmware updates. This is the case both for core components, which are now heavily software-based, and for edge and premise equipment, which is often updated over the air. The same is true for wired broadband infrastructure,” Panesar continued.

Conclusion
What all of this points to is that real-world, real-time behavioral data needs to be gathered from these heavily distributed systems as a matter of course, with real-world data looped back into the design process for software updates and future chips. And then all of that has to be checked again to see whether it worked as planned.

A classic example of how difficult this has become involves a wireless modem deployed in the field. To improve performance, it was updated with an improved channel estimation algorithm, which in turn caused a series of unforeseen timing issues between the processors within the main SoC under certain channel conditions. That led to multiple cache misses across the processors, which eventually broke the entire system. The whole scenario was unforeseeable without running the system in the real world — and today’s systems are even more complex and interdependent than this relatively simple example.
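
A sketch of the kind of in-field feedback loop this implies: stream per-core cache-miss counters off the device and flag excursions against a baseline, so an anomaly like the one above surfaces quickly. The threshold of mean plus four standard deviations is an illustrative choice, not a standard, and the counter values are invented.

```python
# In-field monitoring sketch: flag cache-miss excursions against a baseline
# so a post-update anomaly surfaces quickly. All numbers are illustrative.
import statistics

baseline = [1200, 1150, 1310, 1275, 1190, 1240, 1330, 1205]   # misses/ms
mean = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
limit = mean + 4 * sigma

live_samples = [1260, 1290, 1245, 5120, 4980, 1230]   # spike after update
for t, misses in enumerate(live_samples):
    if misses > limit:
        print(f"t={t}: cache-miss anomaly ({misses} > {limit:.0f}), "
              f"feed trace back to design/verification teams")
```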


