Hybrid Emulation Takes Center Stage

Complex chips require a multitude of verification platforms working in sync, and that’s where the challenges begin.


From mobile to networking to AI applications, system complexity shows no sign of slowing. These designs, which may contain billions of gates, must be validated, verified, and tested, and it’s no longer possible to simply throw the whole thing into a hardware emulator.

For some time, emulation, FPGA-based prototyping, and virtual environments such as simulators have given design and verification teams options when it comes to making sure their designs function properly. Now, because of highly competitive market pressures and system complexity, these technologies are being brought together in a variety of new ways to tackle the enormity of the system verification challenge.

The place of emulators and RTL simulators in electronic design reaches back decades. “With RTL simulators, people are checking that the designs they’re going to fabricate function correctly,” said Simon Davidmann, CEO of Imperas. “They model it in Verilog, they get out RTL, they run it through a Verilog simulator, and it runs unbelievably slowly—maybe hundreds of hours to get almost nowhere in one simulation. It takes weeks, and nothing happens because it’s too slow.”

Years ago, someone had the idea to run the Verilog in dedicated hardware rather than on an x86 CPU. Today, Cadence, Mentor, and Synopsys emulators take Verilog RTL in and run it. These machines can handle billions of gates, but they cost millions of dollars and are considered a precious resource among hardware teams.

“As engineering teams try to get software up and running on their RTL before they commit to the silicon, they run into the problem that it takes too long to get to the interesting bits,” Davidmann said. “For example, let’s say there is a multi-core processor running SMP Linux in a hardware box with a complex graphics subsystem or a visual pattern recognition system for mobile. This type of system has a tremendous amount of RTL, so the engineers want to boot the operating system to get to the interesting bit, where the hardware accelerators come to life. The problem is that booting a Linux or an Android is billions and billions of instructions. To get that to boot, where the microprocessor is RTL in the hardware emulator, takes a day of execution on the emulator—literally 24 to 48 hours to get to the part where the interesting bit of hardware comes to life.”
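The arithmetic behind those boot times is easy to reproduce. Here is a back-of-envelope sketch; all figures are illustrative assumptions rather than measurements:

```python
# Back-of-envelope estimate of time-to-boot in full-RTL emulation
# versus a fast instruction-accurate model. All figures below are
# illustrative assumptions, not vendor measurements.

BOOT_INSTRUCTIONS = 100e9   # assume ~1e11 instructions to boot Linux/Android
CPI = 1.0                   # assume roughly one cycle per instruction

emulator_clock_hz = 1.5e6   # emulators run at roughly 1-2 MHz
fast_model_ips = 500e6      # assume a fast model executes hundreds of MIPS

emulator_hours = BOOT_INSTRUCTIONS * CPI / emulator_clock_hz / 3600
fast_model_minutes = BOOT_INSTRUCTIONS / fast_model_ips / 60

print(f"Full-RTL emulation boot: ~{emulator_hours:.0f} hours")
print(f"Fast-model boot:         ~{fast_model_minutes:.1f} minutes")
```

With these assumed figures the boot takes most of a day in the emulator but only minutes on the fast model, which is exactly the gap hybrid emulation exploits.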

This is where hybrid emulation comes into play. When the engineering team wants to take the processor RTL out of the emulator and run it in a simulator, it relies on having a model of the processor. These can include fast and accurate models from Arm, MIPS, and open-source models based on the RISC-V ISA, among others.

These models have existed for some time. What is changing is that engineering groups are trying to use emulators on larger chunks of RTL, and they are running into the capacity wall of a hardware emulator. So the engineering team tries to work out which bits of RTL to leave out of the emulator, and an obvious candidate is the third-party processor, which is pre-verified and doesn’t need to be emulated.

If the processor were simulated as a fast model instead, it would run a lot faster. So part of the motivation in moving to a hybrid emulation approach is to get more capacity for the rest of the RTL in the emulator. As the RTL gets bigger, there may not be enough space for the processor, so it can be simulated as a fast model on a virtual platform, which frees up space in the emulator. AMD, Intel, Qualcomm, Nvidia, Samsung, and STMicroelectronics all are invested in this approach, according to multiple industry insiders.

Using hybrid emulation
At the highest level, hybrid emulation is defined as running part of the system on a host server, with the other part in emulation or prototyping hardware. Case in point: Samsung worked with Synopsys on a hybrid scheme to migrate software development, which had been done post-silicon, to a pre-silicon method that would scale and that software developers would adopt. The company first tried emulation, which had the advantage of not requiring hardware changes and of letting the software run unmodified on the model. The problem was that it was too slow, so Samsung swapped over to a hybrid scheme.

“Was this simple? No, it was not,” said Johannes Stahl, senior director of product marketing at Synopsys. “We went through a learning process over quite some time on both the Samsung side and the Synopsys side, and part of the learning process was to bring the teams together that were involved—the people who created the emulation model, the people who used the hybrid on the Samsung side, the software guys, and Synopsys as technology provider helping with all the methodology and flow.”

That appears to have paid off. For the last three years, Samsung has been using hybrid emulation in production design flows to run Android pre-silicon—to bring up Android, go through the boot sequence, and make sure all the services come up and all the peripherals on the mobile SoC are being exercised correctly. After booting the OS, a few applications are run to see how the system performs.

Samsung uses prototyping, simulation, virtual environments, and emulators from all of the providers, with the configuration depending on the specifics of each project, according to several sources.

The good and bad of hybrid emulation
The definition of hybrid emulation is still flexible, given the variety of use cases and technologies involved.

“When you have cores that are processors by nature, you don’t have the full RTL, and you are using an abstract model that runs faster, it is hybrid,” said Jean-Marie Brunet, director of marketing for the emulation division at Mentor, a Siemens Business. “It could be co-emulation or co-simulation. You can accelerate from an emulator box to a server sitting on the side of the box, have transactors, etc. That could be done hybrid or non-hybrid. The real definition of hybrid is, ‘I don’t have everything at the RTL level, but I’m using some abstract model, which can be an Arm fast model or a QEMU model.’ As soon as I’m using a model that is a representation of the core, then it is hybrid.”

Brunet sees a massive increase in usage of hybrid emulation for a variety of reasons. “First, they don’t have the RTL of the core, they have an abstract model,” he said. “By virtue of having an abstract model, they run much, much faster. When you run a complex mobile chip on a hybrid emulation where there are multiple cores, you run on hybrid an order of magnitude faster in clock speed than you can run everything else. In traditional emulators, when you run something, you are between 1 and 2 megahertz, but when you run on hybrid, you are running between 40 and 50 megahertz. That’s the value of it. And since you run faster, you can put more cycles in it and therefore run more software application type of things.”

Still, there is no free lunch. “As soon as you run faster, it’s because you’re running something that is not RTL,” Brunet said. “So you need to have a model that is a pretty good representation of the functional behavior of what you’re abstracting, but that runs fast enough to be usable at that type of clock speed requirement.”

Typical users of hybrid emulation include design teams trying to run application software on a design—everything from boot code all the way up to Linux and Android running some application. “When you run that level of sophisticated software stack on the hybrid emulation netlist, you need to adjust drivers—you basically need the hardware and the software to talk to each other. The hardware at this point is not the full RTL, so the application, the drivers, and the bus functional models all have to work with each other. And when everything works together like this, you get good speed,” Brunet explained.

Hybrid emulation is used very early in mobile device design for architectural exploration—to determine whether the architecture is right for the software that will run on it, and whether the behavior is right from a power perspective. “With a hybrid emulation setup, we are able to see the software providing the stimulus to the design, and because it is running in a hybrid environment, it runs much faster. I can profile the power consumption of the software on the design and perform fast trend analysis at that level. When the behavior is determined to be correct for the software and the corresponding hardware, the design moves slowly but surely to RTL. At that point you have less hybrid and more RTL on the emulator, so you run slower but become far more accurate. This allows the design to gradually converge toward an environment where hybrid is no longer needed, since ultimately you want full RTL emulation,” he added.

In effect, hybrid emulation mixes the RTL and transaction-level abstractions as part of a verification stack. “The intent of hybrid emulation is to allow you to take away the items which you do not need to run at full accuracy in emulation, and run them faster, with two main intents,” said Frank Schirrmeister, senior group director, product management & marketing at Cadence. “One is what Tony Smith at Arm calls Time-To-Point-of-Interest (TTPI), which is essentially booting up Android or booting up Linux or any heavier operating system to the point at which the interesting portion starts—to simply get to that point faster.”

The second intent of hybrid emulation is to run the actual test software faster, he said.

“In presentations from Nvidia at DAC and other conferences, what they essentially have done is keep the GPU in the emulator, because a GPU is traditionally hard to simulate given its parallel nature,” Schirrmeister said. “So it can’t really be abstracted like processor models. As such, the GPU is kept in the emulator at full accuracy. For the rest, such as the processor subsystem which talks to it, you build a virtual platform. The key piece is the connection between the two worlds. If you would synchronize every clock cycle you would slow it down to the slowest component in the system, which in this case is the emulator. You have to bear in mind the emulator runs at around 1 to 2 megahertz, so it’s slow in comparison. While it’s still better than pure RTL simulation, it’s much slower than a virtual platform. So the idea is to have a flexible synchronization mechanism between the two [Cadence calls this smart memory], which is a cache to communicate between the virtual world and the emulation RTL world. The virtual world at transaction level and the emulation world at the signal level/RTL level communicate through this memory and are allowed to run independently until, for instance, the hardware needs to access a memory region, a slice of the memory, which the software has manipulated recently. In order to do that, there must be fast synchronization between the memories to basically transfer the data between the virtual world and hardware world, and keep them both in sync. But until that happens, they can run independently.”

Fig. 1: Types of hybrids. Source: Cadence
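That flexible synchronization can be pictured as page-level dirty tracking over a shared memory image. The following is a toy sketch under that interpretation—the class names, the page granularity, and the one-directional tracking are all illustrative assumptions, not Cadence’s smart-memory implementation:

```python
# Toy sketch of a "smart memory" style synchronization between a
# transaction-level virtual platform and a signal-level emulator.
# Names and structure are illustrative assumptions, not Cadence's design.

PAGE = 4096

class SharedImage:
    """Backing store plus per-page dirty tracking."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.dirty_by_sw = set()   # pages the virtual platform has written

class VirtualSide:
    """Runs freely at transaction level; writes mark pages dirty."""
    def __init__(self, image):
        self.image = image
    def write(self, addr, data):
        self.image.mem[addr:addr + len(data)] = data
        for page in range(addr // PAGE, (addr + len(data) - 1) // PAGE + 1):
            self.image.dirty_by_sw.add(page)

class EmulatorSide:
    """Keeps a local copy; pulls in software-dirtied pages only on demand."""
    def __init__(self, image):
        self.image = image
        self.local = bytearray(len(image.mem))
        self.syncs = 0
    def read(self, addr, n):
        pages = range(addr // PAGE, (addr + n - 1) // PAGE + 1)
        stale = [p for p in pages if p in self.image.dirty_by_sw]
        if stale:                  # synchronize only when actually needed
            self.syncs += 1
            for p in stale:
                self.local[p*PAGE:(p+1)*PAGE] = self.image.mem[p*PAGE:(p+1)*PAGE]
                self.image.dirty_by_sw.discard(p)
        return bytes(self.local[addr:addr + n])
```

In a real system the tracking would be bidirectional, since the hardware side also writes memory the software later reads. The point of the design is that both sides run freely whenever no recently touched region crosses the boundary, and the synchronization cost is paid only at those crossings.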

Mix and match
The earlier in a design cycle that teams can begin to look at the software architecture and the underlying hardware being defined, the more informed their decisions can be.

Silexica starts with software analysis of an application, then performs high-level performance estimation based on SHIM models (currently being standardized in IEEE), according to Luis Murillo, vice president of engineering at Silexica. “Next, we move into lower levels of abstraction by generating annotated software models to perform architecture exploration in Synopsys’ Platform Architect (PA Ultra) simulation framework. Once the design has been more defined, a silicon emulation system can be used for chip prototyping. Starting at higher levels of abstraction enables architects to begin making informed architectural and application decisions very early in the design process, and helps avoid costly architectural changes later in the design cycle.”

One of the more interesting things about hybrid emulation is the variety of ways different technologies are pulled together.

Aldec set out to solve a major problem hardware and software engineers face when collaborating on the verification of their respective design parts—specifically, that hardware is typically verified out of context with the software, and vice versa, said Zibi Zalewski, general manager of the hardware division at Aldec. “For example, software engineers verify on models of hardware, which can be very expensive and which run slowly. On the premise that the SoC or ASIC under development is likely to contain an Arm core, we simply said, ‘Let’s share the one in the Xilinx Zynq UltraScale+ FPGA on one of our high-end boards—a platform which itself supports the co-development of hardware and software with our hardware emulation platform.’”

The sharing was facilitated by an FMC Host2Host bridge and negates the need for the hardware team to purchase RTL code for the Arm cores, according to Aldec. Prototype-level clock speeds also were achieved, which benefits the software team. For example, Linux running on a hard Mali-400 GPU core booted within minutes as opposed to hours, which would be the case under a modeling scenario.

In addition, the hardware emulation platform allowed access to a number of interfaces and I/Os with corresponding drivers for accessing resources that will be embedded alongside the SoC/ASIC under development, as well as interfacing with the outside world.

The future of hybrid emulation
Looking ahead, hybrid emulation is expected to become more dynamic. “We have requests from people who would like to do what used to be possible with Carbon, called ‘swap and play,’ where you essentially start with the fast model, then you swap into the cycle-accurate model, and then everything is accurate after you’ve reached the time to point of interest. We are getting questions as to whether we can do this with emulation,” Schirrmeister said. “It’s nontrivial to do, because you have the Arm fast model at a higher level of abstraction. It doesn’t model the pipeline. You don’t really have the same content internally within the processor model, although it’s functionally doing the same thing. But the internal timing is calculated differently than in the cycle-accurate model. That’s why the fast model is fast—that’s what was taken away from the cycle-accurate, Carbon-type model. There’s probably a way to do things in one direction once, but switching back and forth is the next step.”
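Conceptually, the one-direction handoff looks like checkpointing the fast model’s architectural state and seeding the cycle-accurate model with it. A hypothetical sketch, in which every class and method name is invented for illustration:

```python
# Hypothetical sketch of a one-way "swap and play" handoff: run a fast
# model to the point of interest, capture its architectural state, and
# resume in a cycle-accurate model. Interfaces are illustrative
# assumptions; real fast models expose far richer state (MMU, CSRs, etc.).

from dataclasses import dataclass, field

@dataclass
class ArchState:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)

class FastModel:
    """Instruction-accurate: no pipeline, no internal timing detail."""
    def __init__(self):
        self.state = ArchState()
    def run_until(self, point_of_interest_pc):
        # ...executes billions of instructions quickly (elided)...
        self.state.pc = point_of_interest_pc
    def checkpoint(self):
        return ArchState(self.state.pc, list(self.state.regs))

class CycleAccurateModel:
    """Models the pipeline, but can only be *seeded* with architectural state."""
    def __init__(self):
        self.state = ArchState()
        self.pipeline = []          # micro-architectural state starts empty
    def restore(self, snap):
        self.state = snap
        self.pipeline.clear()       # cannot be reconstructed from a fast model

fast = FastModel()
fast.run_until(0x8010_0000)         # reach the point of interest quickly
accurate = CycleAccurateModel()
accurate.restore(fast.checkpoint()) # one-way handoff; reverse is the hard part
```

The sketch also shows why back-and-forth swapping is harder than one-way: the cycle-accurate model’s micro-architectural state—pipeline, caches, predictors—has no counterpart in the fast model, so it has to be rebuilt through warm-up rather than copied.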

Another thing that needs to be taken into account going forward is coherency, because if there are different items now talking to each other, coherency comes into play. That means engineers need to be aware of which questions can be answered with any given hybrid emulation setup and which ones cannot.

Also on the near-term horizon for hybrid emulation are AI applications, Stahl noted. “AI is similar to the networking space in that, in many situations, there is an Intel server driving data into an AI chip within the data center, predominantly. Many AI chips look simply like a big compute array with a lot of memory, and on one end of the chip is PCIe input. What customers do in the AI space is put a huge chunk of chip into an emulator, billions of gates, connect it through a hybrid technology that represents the x86 and the guest operating system, whatever they want to run on this virtual host. That is very important for them because they want to mimic the actual programming and driving of the data into the AI chip. That’s going to continue to grow.”
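The AI setup Stahl describes—a virtual x86 host driving data over PCIe into an emulated compute array—can be sketched at transaction level like this (every name and the toy computation are illustrative assumptions):

```python
# Minimal sketch of a hybrid AI-verification setup: a virtual x86 host
# drives data through a transaction-level PCIe link into an AI chip
# modeled (here trivially) as one big compute array behind a PCIe port.
# All names and the toy "compute" are illustrative assumptions.

class PcieTransactor:
    """Transaction-level bridge between virtual host and emulated DUT."""
    def __init__(self, dut):
        self.dut = dut
    def dma_write(self, buffer):
        return self.dut.ingest(buffer)   # one transaction, not pin wiggles

class EmulatedAiChip:
    """Stands in for billions of gates of compute array in the emulator."""
    def ingest(self, buffer):
        # Toy stand-in for the accelerator's work: sum the input tensor.
        return sum(buffer)

class VirtualHost:
    """Runs the guest OS and the real driver stack (elided here)."""
    def __init__(self, link):
        self.link = link
    def run_inference(self, tensor):
        return self.link.dma_write(tensor)

chip = EmulatedAiChip()
host = VirtualHost(PcieTransactor(chip))
result = host.run_inference([1, 2, 3, 4])  # host programs and drives the chip
```

The value of the arrangement is that the real driver and guest OS run unmodified on the virtual host, while only the accelerator itself consumes precious emulator capacity.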

Still unresolved, though, is how hybrid emulation will work with AI, machine learning and deep learning chips, which use completely different architectures than traditional SoCs.

“If you have an Arm core, you can do hybrid,” said Mentor’s Brunet. “This is a traditional model. But how do you do hybrid emulation when the architecture is different? The answer is that not every methodology and not every emulator is equipped efficiently to do that.”

Brunet cites three key factors. “Scalability, because AI/ML scales very quickly in size. Therefore you need something that scales with no degradation of performance. Next, there must be virtualization, because those new AI/ML/DL designs rely on benchmark performance, like MLPerf, Caffe, etc., and everything here is virtual. There’s no ICE, so virtualization is key. Third, determinism is needed. This means if I run a sequence of tests one day at 5 o’clock, and I run it the next day at a different time, I will have the same behavior. And if I compile the design, modify it only a little bit, and re-compile, my compilation needs to reproduce relatively the same logical cloning, and things like this. Determinism in this environment is very important, and not every emulator is the same with determinism.”

At the bottom line, what drives hybrid is the huge amount of software content, with software determining more and more of the overall functionality of the system. How all of it plays out in the hybrid emulation verification space remains to be seen. System architectures are changing, and so are the tools required to build, verify, and debug them. But how these various options are put together may be as individual as the companies and the designs themselves. So far there is no single roadmap, and that opens up all sorts of possibilities, both good and bad.
