Requirements For Datacenter-Ready Emulation

Demand for more efficient emulation utilization is growing.


It’s time to look at the latest trends in emulation and to review some of the key requirements for making it datacenter-ready. Specifically, I will look at virtualization of external interfaces as well as emulation throughput, in particular the allocation of jobs to emulators.

One overarching trend in verification lies in the connection of the engines in what Jim Hogan has dubbed the continuum of verification engines (COVE). Emulation must be part of that continuum, and I recently outlined some key connections in my post “Top 15 Integrating Points in the Continuum of Verification Engines”. The key connection points that I outlined between emulation and simulation included simulation acceleration, simulation/emulation hot-swap, virtual platform/emulation hybrids, UPF/CPF low-power verification, and coverage merge. I also outlined the need to connect to implementation flows in “Why Implementation Matters to System Design and Software”, focusing on dynamic power analysis connected to our Joules power estimation flow.

Then there is emulation’s neighboring engine, FPGA-based prototyping. Its main use model is software development, but because bring-up is a concern, we have been pioneering “multi-fabric compilation,” which uses the same front-end flow to bring up designs in both engines. This flow lets FPGA users trade off fast bring-up through automation against higher prototyping speed through manual optimization. With that optimization, FPGA-based prototyping can also be used for verification regressions that don’t require the debug capabilities offered by emulation or simulation.

While FPGA-based prototyping is typically focused on a single project, the key trend in emulation is multi-project use for system-on-chip (SoC) design, covering everything from verification of IP, sub-systems, and SoCs to the actual system, including both hardware and software. It’s about executing thousands of verification workloads of different sizes and lengths. Emulation is really a compute resource, and that is what the industry means when it talks about moving emulation into the datacenter, a trend we started back in 2010 when we debuted the Palladium XP platform as a “verification computing platform”, a term that, while copied by others multiple times, has never quite stuck.

Figure 1: Emulation Throughput Loop

The key trend here is really about “emulation throughput” as a combination of compilation, allocation, execution (the actual run), and debug. I wrote about this a little while ago in “Towards A Metric To Measure Verification Computing Efficiency”, using an example from AMD. Figure 1 above shows a more generic version of that loop.

Compile is well understood. Processor-based emulation excels here, compiling at 70MG per hour on a single workstation, because it avoids the FPGA place-and-route that typically requires server farms and can take days for each turn. The actual run is dominated by execution speed, typically in the MHz range; FPGA-based emulation systems sometimes claim higher speed in exchange for manual optimization. Debug is key, and processor-based emulation again excels here because, in contrast to FPGA-based emulation systems, it does not slow down when debug mechanisms are inserted.
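As a rough illustration of the compile and run terms in this loop, the Python sketch below turns the 70MG-per-hour compile rate into hours and converts a cycle count and clock speed into run time. It is a back-of-the-envelope model under stated assumptions, not a Palladium tool: the function names, the 3.6-billion-cycle test, and the 1MHz speed are hypothetical, and the allocation and debug stages are not modeled.

```python
COMPILE_RATE_MG_PER_HOUR = 70.0  # processor-based compile rate quoted in the text


def compile_hours(design_mg: float, rate_mg_per_hour: float = COMPILE_RATE_MG_PER_HOUR) -> float:
    """Hours to compile a design of design_mg million gates on one workstation."""
    return design_mg / rate_mg_per_hour


def run_hours(cycles: float, speed_mhz: float) -> float:
    """Hours of emulation needed to execute `cycles` design clock cycles at speed_mhz."""
    seconds = cycles / (speed_mhz * 1e6)
    return seconds / 3600.0


def compile_and_run_hours(design_mg: float, cycles: float, speed_mhz: float) -> float:
    """Compile plus run time for one pass; allocation wait and debug are NOT included."""
    return compile_hours(design_mg) + run_hours(cycles, speed_mhz)


if __name__ == "__main__":
    # Hypothetical example: a 150MG SoC running 3.6 billion cycles at 1MHz.
    print(f"compile: {compile_hours(150):.2f} h")     # 150 / 70 -> 2.14 h
    print(f"run:     {run_hours(3.6e9, 1.0):.2f} h")  # 3,600 s -> 1.00 h
```

With these inputs the sketch prints about 2.14 hours of compile time and 1.00 hour of run time.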

Allocation is one key differentiator that really makes emulation datacenter-ready. A typical queue of 1,000 verification workloads may comprise 500 IP blocks of about 10MG each, 300 sub-systems in the 70MG range, and 200 SoCs with synthesizable testbenches in the 150MG range. Each workload needs to be compiled, may differ in length, and then needs to be executed, which is where the granularity of the emulation system plays a key role.

Figure 2 shows how the Palladium XP platform, with its domain granularity of 4MG, can handle jobs sized from 1MG to 512MG. The x-axis shows the job size, and the y-axis shows the utilization and the number of parallel jobs. Capacity is marked green when it is actually used to run a workload, red when it is “locked” because a domain is not fully utilized (a 5MG design uses two domains and leaves 3MG in the second domain unused), and yellow when it is available for other jobs.

Figure 2: Palladium XP utilization for various workload sizes
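The domain arithmetic behind the red “locked” areas can be sketched in a few lines of Python. This assumes, for illustration only, that a job always occupies whole 4MG domains, that any free domains can be combined for a job, and that all jobs in a run are the same size; the function names and the 512MG system capacity in the example are hypothetical, the latter chosen to match the 1MG to 512MG job range rather than taken from a documented Palladium XP configuration.

```python
import math
from typing import Optional

DOMAIN_MG = 4.0  # Palladium XP domain granularity quoted in the text


def domains_needed(job_mg: float, domain_mg: float = DOMAIN_MG) -> int:
    """Whole domains a job occupies; a partially used domain is not shared."""
    return math.ceil(job_mg / domain_mg)


def locked_mg(job_mg: float, domain_mg: float = DOMAIN_MG) -> float:
    """Capacity allocated to a job but left unused (the red area)."""
    return domains_needed(job_mg, domain_mg) * domain_mg - job_mg


def parallel_jobs(system_mg: float, job_mg: float,
                  domain_mg: float = DOMAIN_MG,
                  max_parallel: Optional[int] = None) -> int:
    """Number of equal-sized jobs that fit at once, optionally capped."""
    total_domains = int(system_mg // domain_mg)
    fit = total_domains // domains_needed(job_mg, domain_mg)
    return fit if max_parallel is None else min(fit, max_parallel)


if __name__ == "__main__":
    print(domains_needed(5), locked_mg(5))               # 5MG example from the text: 2 3.0
    print(parallel_jobs(512, 1), parallel_jobs(512, 5))  # hypothetical 512MG system: 128 64
```

The optional max_parallel argument models a hard cap on concurrent jobs; with max_parallel=32, the 1MG case tops out at 32 jobs instead of 128.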

The bottom line is that for small jobs, the number of workloads that can execute in parallel is very high, and for larger jobs plenty of open capacity remains for smaller workloads. Now imagine, in contrast, emulation systems that cap the number of parallel workloads at 9 or 32, with granularities of 60MG and 16MG, respectively, for the same configuration. Allocation, and with it workload efficiency, is much lower. Granularity is key: because of it, the example set of 1,000 workloads can be executed several times faster.
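To see how much granularity alone matters, the Python sketch below applies the 4MG, 16MG, and 60MG granularities to the example queue of 1,000 workloads and reports what fraction of the allocated capacity does useful work. It only captures rounding loss: it assumes every job is rounded up to whole domains and ignores the caps on parallel jobs, job lengths, and scheduling order, so it illustrates one ingredient of the throughput difference rather than reproducing the overall speedup claim. The function names are hypothetical.

```python
import math

# Example queue from the text: (number of workloads, size in MG)
QUEUE = [(500, 10), (300, 70), (200, 150)]


def allocated_mg(size_mg: int, granularity_mg: int) -> int:
    """Capacity reserved for one job when it is rounded up to whole domains."""
    return math.ceil(size_mg / granularity_mg) * granularity_mg


def packing_efficiency(queue, granularity_mg: int) -> float:
    """Fraction of allocated capacity actually used by the designs in the queue."""
    demand = sum(count * size for count, size in queue)
    allocated = sum(count * allocated_mg(size, granularity_mg) for count, size in queue)
    return demand / allocated


if __name__ == "__main__":
    for g in (4, 16, 60):
        print(f"{g:>2}MG granularity: {packing_efficiency(QUEUE, g):.1%} of allocated capacity is used")
```

For this queue, the sketch reports about 96.6%, 87.5%, and 54.9%: the 56,000MG of total demand occupies 58,000MG, 64,000MG, and 102,000MG of capacity, respectively.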

Besides the very specific physical requirements that datacenters impose on emulation systems (temperature, power consumption, rack size, etc.), a key requirement is remote accessibility, commonly referred to as virtualization. Remote user access has never been an issue for the Palladium XP platform, and most of its users have already set up what one would call a “virtual private cloud” for emulation. With our quick-cycle program, we even host remote emulation access from our facilities for some customers.

Everything is easily self-contained within the datacenter (and remotely accessible) as long as the peripherals to which the design under test connects are modeled in simulation (simulation acceleration) or are part of a synthesizable testbench. (I reviewed all the options for connecting real-world data to emulation a while back in “When to Virtualize, When to Stay in the Real World”.)

A key requirement in the context of virtualizing peripherals is that customers absolutely need both a transaction-level interface for fully virtual peripherals and connections to real components, i.e., access to the rate adapters we call SpeedBridge adapters. The former is focused on software development and allows injection of specific errors, while the latter is focused on full fidelity and real-world stimulus. Both are remotely accessible because even the rate adapters are host-connected. The key difference is speed. We have seen situations in the networking space in which acceleration with a virtualized system environment was more than 10,000X faster than pure RTL simulation. While this is impressive, the actual “real” physical connection was even 1,300X faster than that!

The move of emulation into the datacenter began in 2010 with the introduction of the Palladium XP platform as a compute resource we called a “verification computing platform.” Even better workload allocation and virtualization of all external connections, even the physical ones, will contribute to the next step: fully datacenter-ready emulation.