Partitioning Becomes More Difficult

Exploding gate counts, multiple domains, and hardware/software content are making it tougher to verify that designs will work as planned.

The divide-and-conquer approach that has been the backbone of verification for decades is becoming more difficult at advanced nodes. There are more interactions between different blocks and features, more power domains, more physical effects to track, and far more complex design rules to follow.

This helps explain why the number of tools required on each design—simulation, prototyping, emulation, verification and validation—continues to rise, and why there is so much focus by EDA companies on improving the performance of those tools. But the overriding problem is how to ensure that the sum of the individual parts—memories, third-party IPs, processors and accelerators—continues to meet an increasingly demanding set of specs across a wide variety of applications and process nodes. And to make matters worse, all of this has to be done in a fixed market window, which requires a tremendous amount of forethought and planning because a large SoC rarely fits into a single simulation run.

So what’s the starting point? “Usually new designs can be divided at least into the software and hardware parts,” said Zibi Zalewski, general manager of the hardware division at Aldec. “This began with ASICs, but we can see the same with new FPGA devices, as well. The chips contain processor and hardware functions, and to verify and simulate them the design team must first partition between the software and hardware. This is only getting more challenging.”

This has broad implications for how designs are architected and verified, going well beyond a simple “shift left” approach.

“With lower nodes, engineering teams are partitioning the entire system and essentially saying, ‘I’m going to have two separate chips, one for my main high-performance engines, and the second for my low-performance engine that doesn’t quite need to be produced over and over again,'” said Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “There is also an interconnect between them. This is a problem engineering teams are faced with when they have to emulate their design using an FPGA. FPGAs are great for ASIC emulation. You don’t need to necessarily burn millions of dollars to understand whether your system is performing, whether you have any functional bugs, or whether you have any performance bugs. But the difficulty is mapping the entire SoC onto a single FPGA. Therefore it is partitioned into multiple FPGAs.”

The key is being able to determine where to cut. “You want to make sure whenever there’s a lot of connectivity or a lot of communication between different engines that it’s in a single FPGA partition, and then have clean boundaries that you can cut and extend across different FPGAs,” said Mohandass. “There are commercial tools available that say, ‘Here are the three or four different cutlines that let you easily partition and put it into different FPGAs. Here is the approximate gate count so that you can understand if you can fit the entire design into one specific FPGA or if you need two.’ Especially if the SoC has a concurrent subsystem, it gets really, really difficult.”
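
As a rough illustration of that cutline search, the design can be treated as a connectivity graph, with blocks grouped so that heavily communicating blocks share an FPGA while each FPGA stays under a gate budget. The sketch below is not any particular vendor’s tool; the block names, gate counts and traffic weights are hypothetical.

```python
# Hypothetical sketch: group heavily connected blocks onto the same FPGA,
# subject to a per-FPGA gate budget. Block names, gate counts (in millions
# of gates) and edge weights (relative inter-block traffic) are made up.

GATE_BUDGET = 12  # assumed capacity per FPGA, in millions of gates

blocks = {"cpu": 6, "interconnect": 3, "dsp": 5, "video": 7, "io": 2}

# (block_a, block_b, traffic_weight): higher weight = more communication
edges = [
    ("cpu", "interconnect", 10),
    ("interconnect", "dsp", 8),
    ("interconnect", "video", 4),
    ("video", "io", 1),
]

def greedy_partition(blocks, edges, budget):
    """Place blocks one at a time, preferring the FPGA that already holds
    the neighbors they talk to most, as long as the gate budget allows."""
    fpgas = []  # each entry: {"blocks": set of names, "gates": total gates}
    # Visit the most heavily connected blocks first.
    order = sorted(blocks, key=lambda b: -sum(w for a, c, w in edges if b in (a, c)))
    for blk in order:
        best, best_traffic = None, -1
        for fpga in fpgas:
            if fpga["gates"] + blocks[blk] > budget:
                continue  # this FPGA is full
            traffic = sum(w for a, c, w in edges
                          if (a == blk and c in fpga["blocks"])
                          or (c == blk and a in fpga["blocks"]))
            if traffic > best_traffic:
                best, best_traffic = fpga, traffic
        if best is None:                      # nothing fits -> open a new FPGA
            best = {"blocks": set(), "gates": 0}
            fpgas.append(best)
        best["blocks"].add(blk)
        best["gates"] += blocks[blk]
    return fpgas

for i, fpga in enumerate(greedy_partition(blocks, edges, GATE_BUDGET)):
    print(f"FPGA {i}: {sorted(fpga['blocks'])} ({fpga['gates']}M gates)")
```

Real partitioning tools weigh much more than this—timing, I/O pin limits, clock domains—but the underlying tension between gate budget and cut traffic is the same.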

There are two main strategies here involving partitioning. One is to simulate as much of the design as possible in a single run. The other is to break it up into different pieces so it can run faster. Which approach is taken depends on whether the team is trying to do functional verification or performance verification, and often both approaches are required somewhere in the design flow.

“In cases where customers are developing their own IP, they have their own secret sauce and they’re writing it for the first time,” Mohandass said. “They’re worried about functionality, and how that chip works. In that case, what they can do, for example, is instead of running their interconnect at 512 bytes, they can run the interconnect at 64 bytes. Obviously that’s going to give 1/8 the performance, but it’s functionally accurate. This lets them have their entire SoC in one FPGA, and it lets them run billions of cycles in a clean way so that they can test functionality. In that case you tend to run it in one single FPGA, but there are other cases where you’re trying to do performance simulation. A classic example is hyperscale networking or hyperscale storage. You’re either successful or not. Whether you’re trying to run 100 Gigabit Ethernet traffic or you’re trying to process something at a very high line rate, performance is pretty much as critical as functionality. You cannot afford to chop your design into any bit. You want to represent your design in its full glory. And obviously you can’t put it in one FPGA, so you split it and put it across multiple FPGAs.”
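
The 1/8 figure falls straight out of the width reduction. A quick back-of-the-envelope check makes the trade-off concrete; the 50 MHz clock below is an assumed figure, while the 512-byte and 64-byte widths come from the example above.

```python
# Back-of-the-envelope bandwidth comparison for a narrowed interconnect.
# The 50 MHz clock is an assumed figure; the 512-byte and 64-byte data-path
# widths come from the example quoted in the article.

clock_hz = 50e6          # assumed FPGA/emulation clock frequency
full_width_bytes = 512   # production interconnect data path
narrow_width_bytes = 64  # reduced data path that fits in one FPGA

full_bw = clock_hz * full_width_bytes      # bytes per second at full width
narrow_bw = clock_hz * narrow_width_bytes  # bytes per second when narrowed

print(f"Full width : {full_bw / 1e9:.1f} GB/s")
print(f"Narrowed   : {narrow_bw / 1e9:.1f} GB/s")
print(f"Slowdown   : {full_bw / narrow_bw:.0f}x")  # 8x, i.e. 1/8 the performance
```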

Things get even trickier when it comes to simulation at lower nodes, and how to partition the design, because of the increasing heterogeneity and additional coherency.

“If you’re trying to find coherency bugs using simulation, you need to run it for extremely long cycles,” he said. “The way coherency works is even if you’ve messed up and there is a bug at time zero, by the time the bug actually manifests itself in any meaningful error, it’ll be time 2 million, and in simulation you can only collect statistics for a certain number of cycles. You can’t say, ‘Hey, great, I have this particular bug in the 2-millionth cycle, let me go analyze it.’ You try to go backward. You have to go all the way to zero. It gets really, really difficult, so we have a concept called a GCT (global coherency tracker). Even as you’re running simulation, it says, ‘Yep, you’ve hit your bug in cycle 2 million, but this particular error was introduced at time 12, and that’s where you need to look to see where your bug is.’ So partitioning and simulation gets interesting. But it also gets difficult. And that’s where the verification methodology and verification collateral from the SoC and all of the different IPs play a big role.”
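
NetSpeed’s global coherency tracker is proprietary, but the underlying idea—record the cycle at which a coherence invariant first breaks so that a failure surfacing millions of cycles later can be traced back to its origin—can be sketched in a few lines. The cache model and names below are simplified assumptions, not NetSpeed’s implementation.

```python
# Minimal sketch of the idea behind a global coherency tracker: remember the
# cycle at which a coherence invariant is first violated, so a failure that
# only manifests millions of cycles later can be traced back to its origin.
# The single invariant checked here (at most one Modified copy of a line)
# is a deliberate simplification.

class CoherencyTracker:
    def __init__(self):
        self.owners = {}            # address -> caches holding the line Modified
        self.first_violation = {}   # address -> cycle where the invariant first broke

    def observe(self, cycle, cache_id, address, state):
        """Call once per snooped transaction during simulation."""
        holders = self.owners.setdefault(address, set())
        if state == "M":
            holders.add(cache_id)
        else:
            holders.discard(cache_id)
        # Invariant: at most one cache may hold a line in Modified state.
        if len(holders) > 1 and address not in self.first_violation:
            self.first_violation[address] = cycle

    def report(self, address):
        cycle = self.first_violation.get(address)
        return (f"line {address:#x}: invariant first broken at cycle {cycle}"
                if cycle is not None else f"line {address:#x}: no violation seen")

# Usage: the bug "manifests" late, but the tracker points back to cycle 12.
gct = CoherencyTracker()
gct.observe(12, cache_id=0, address=0x1000, state="M")
gct.observe(12, cache_id=1, address=0x1000, state="M")    # second Modified copy
gct.observe(2_000_000, cache_id=1, address=0x1000, state="I")
print(gct.report(0x1000))   # -> invariant first broken at cycle 12
```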

Planning for a break-up
But this isn’t a simple decision anymore. It requires an extensive amount of planning.

“Big picture, the obvious reasons to partition designs for simulation are like any engineering problem,” said Gordon Allan, DVT product marketing manager at Mentor, a Siemens Business. “You divide the problem up into small solvable pieces. Simulation, by definition, is a software activity that has limitations on capacity, and you want to get the results in a productive manner. So we split the design up into chunks. One of the things we do in verification is plan ahead and deliberately verify pieces of the design from the bottom up. We will change our approach as we continue on our way up towards the full design.”

That doesn’t necessarily imply partitioning, however. “It’s like being intentional about the partitioning ahead of time,” said Allan. “That’s where traditional approaches to verification and reuse come into play, where we start with block-level verification of a self-contained piece of design that we could verify in detail, and then we go up a level to the whole chip or perhaps to an intermediate subsystem of the chip, and again draw the boundaries of that partitioning along the lines that represent the spec of the end product. For example, you could have an end product, which is the video and graphics functionality, and there might be three or four blocks that you could verify within that subsystem. Then you bring them together, they operate together as a whole, and you might verify that subsystem.”

The reason for this multiple-step approach is that the verification approach can be different at each level, Allan noted. “There are things that you verified at the low level that you can now take as a given one level up, and that lets you do some optimizations in the verification, such as swapping out pieces of the detail for a more abstract version of the same functionality that you already verified. That’s a technique that’s been used for decades to streamline the verification process. And partitioning is a good way to describe the process of that planning regardless of how you’re going to verify this complex chip.”
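
A minimal sketch of that substitution, with invented block names, is below. In a real flow the swap happens between RTL and a transaction-level or behavioral model inside the simulator; the point is only that both models present the same interface, so the higher-level environment does not care which one is plugged in.

```python
# Sketch of swapping a verified detailed block for an abstract model one
# level up. The block name and interface are invented for illustration; in
# practice the substitution is between RTL and a transaction-level model.

from abc import ABC, abstractmethod

class VideoScaler(ABC):
    """Common interface both models honor, so the subsystem environment
    does not care which one is plugged in."""
    @abstractmethod
    def scale(self, frame: list, factor: int) -> list: ...

class DetailedVideoScaler(VideoScaler):
    """Stand-in for the cycle-accurate block already verified at block level."""
    def scale(self, frame, factor):
        # Imagine pipeline stages, line buffers and stall handling here.
        return [pixel for pixel in frame for _ in range(factor)]

class AbstractVideoScaler(VideoScaler):
    """Functional model used at subsystem/chip level: same result, no timing."""
    def scale(self, frame, factor):
        return [pixel for pixel in frame for _ in range(factor)]

def run_subsystem_test(scaler: VideoScaler):
    assert scaler.scale([1, 2, 3], 2) == [1, 1, 2, 2, 3, 3]

run_subsystem_test(DetailedVideoScaler())   # block-level regression
run_subsystem_test(AbstractVideoScaler())   # faster runs one level up
```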

If a chip combines video, networking, audio and memory under one protocol or interface, all of the blocks need to be verified together at some point. “What are the aspects of the interconnect, and what is the interplay between those subsystems, like dataflow and interrupt flow across those multiple subsystems? At each level of verification, you’re asking a very simple question: ‘What are we verifying?’ In some respects, it’s the most important question to ask. What are we verifying at the block level? We’re verifying behaviors of the block. At the chip level, we’re not doing that block-level verification anymore. We’re verifying interconnectivity, low-power concerns across the chip, data and interrupt situations that happen across the chip—those kinds of things. It’s the things that bind the modules together, pathways across the chip that perhaps go from A to B to C and then back to A again, that have to be verified at the chip level. It’s the same for low power and connectivity,” Allan said.
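
Those A-to-B-to-C-and-back pathways are typically checked with scoreboards that follow a tagged transaction through the chip. A minimal sketch of the idea, with invented block names and a deliberately simplified path, might look like this:

```python
# Minimal sketch of a chip-level pathway check: tag a transaction where it
# enters the path and confirm it returns to its origin having passed through
# the expected blocks. Block names and the path itself are invented.

EXPECTED_PATH = ["A", "B", "C", "A"]   # e.g. CPU -> interconnect -> DMA -> CPU

class PathScoreboard:
    def __init__(self):
        self.seen = {}   # transaction id -> list of blocks it has visited

    def observe(self, txn_id, block):
        """Monitors at each block report the transaction as it passes."""
        self.seen.setdefault(txn_id, []).append(block)

    def check(self, txn_id):
        path = self.seen.get(txn_id, [])
        assert path == EXPECTED_PATH, f"txn {txn_id} took {path}"

sb = PathScoreboard()
for block in ["A", "B", "C", "A"]:      # monitors report in order of traversal
    sb.observe(txn_id=7, block=block)
sb.check(txn_id=7)                      # passes: the round trip completed
```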

Further, no matter how a chip is partitioned, there will be a resulting boundary between those partitions. That requires much more focus on that boundary to make sure that subsystems play well together.

Understanding workloads
Traditionally, planning for partitioning—particularly to ensure that a design actually fits on an FPGA-based prototype—was done using spreadsheets. That has become increasingly unwieldy at each new node.

“They would add up the numbers such that, ‘We have this set of tasks, they use so much time, and this is roughly how many MIPS/instructions per cycle we need to execute for each task,’” said Tim Kogel, application engineering manager for virtual prototyping in the verification group at Synopsys. “‘We know how long it takes and how much bandwidth they require.’ Then they try to add it up. Now the problem is that with the increased complexity of SoC designs, that doesn’t work anymore.”
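
That spreadsheet bookkeeping looks something like the sketch below: sum each task’s compute and bandwidth demand and compare it against the SoC’s capacity. The task names and numbers are invented. Kogel’s point is that these static sums ignore contention and dynamic interaction between tasks, which is exactly what breaks at today’s complexity.

```python
# Spreadsheet-style budget: list each task's compute and bandwidth demand,
# add them up, and compare against the SoC's capacity. Task names and all
# numbers are invented for illustration.

tasks = [
    # (name, required MIPS, required memory bandwidth in GB/s)
    ("video_decode",   800, 2.0),
    ("audio_pipeline", 100, 0.2),
    ("ai_inference",  1500, 4.5),
    ("networking",     400, 1.5),
]

soc_mips_capacity = 3000   # assumed aggregate CPU/accelerator MIPS
soc_bandwidth_gbps = 8.0   # assumed DRAM bandwidth in GB/s

total_mips = sum(mips for _, mips, _ in tasks)
total_bw = sum(bw for _, _, bw in tasks)

print(f"MIPS      : {total_mips}/{soc_mips_capacity} "
      f"({total_mips / soc_mips_capacity:.0%} utilized)")
print(f"Bandwidth : {total_bw:.1f}/{soc_bandwidth_gbps:.1f} GB/s "
      f"({total_bw / soc_bandwidth_gbps:.0%} utilized)")

# The static sums say nothing about bursts, contention or task interaction,
# which is why this style of planning breaks down for complex SoCs.
```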

But how to approach designs depends on what the engineering team is trying to accomplish and the methodologies developed by chipmakers. Juergen Jaeger, product management director at Cadence, said one of the keys is to start with the core functionality of the design.

“The core functionality typically is to bring up the processing core or CPU cores together with the memory subsystem and maybe one key external interface, like JTAG or a UART, so they can access it,” Jaeger said. “This, of course, goes through a lot of simulation at the block level. Then they do verification at the subsystem level, many times in parallel. At some point they integrate everything, which includes running software. For performance reasons, the full system often is done either on an emulator or an FPGA-based prototype system. Prior to that, you need to do a lot of verification and simulation with a simulator, so that the subsystems are pretty stable before you expose them to the higher-performance hardware-assisted verification platforms.”

This daunting process usually begins with debugging the design, but very quickly engineering teams are challenged to add in software and validate that the software interacts in the appropriate way with the hardware. “It’s no longer just a question of RTL or design debug,” Jaeger said. “You really have to deal with this whole system—hardware, software, firmware, bring-up and validation—with all of its complications. Still, teams really follow the structure or the boundary of their designs when they partition the design for verification or just for simulation. And then they really go through a structured or partitioned simulation approach. But at some point they integrate everything together into the whole design or the total functionality.”

Beyond just determining the best place to partition, engineering teams will also re-run a simulation with a different partition. Jaeger noted this is because no matter how thoroughly, carefully and comprehensively subsystems or blocks are verified, the more difficult-to-identify issues are often on the boundaries between blocks. “That is where you get into the scenarios to combine different blocks together, look at how they play together, and what the potential issues are when you do that.”

Thinking big and small
So where exactly do you draw the partitioning lines, and when? That isn’t always obvious. The elements in a large SoC require different tools—a virtual platform for software modeling, RTL simulators for hardware modules, plus debugging tools—and one way to integrate those separate domains is to co-simulate the virtual platform and the RTL simulator, which ties together the verification platforms for the partitioned elements of the design, said Aldec’s Zalewski.

Running in the background, however, is the clock to get a design to market on time and on budget. That’s where speed of simulation comes into play, and if the design is large and involves complex blocks and subsystems, it has to be partitioned into smaller pieces.

“Partitioning of the design modules to simulate separately becomes a necessity, not a choice,” Zalewski said. “Even module-level testing might require hardware acceleration solutions to improve the speed of simulation. In such cases, testbench reusability is most important for seamless migration between the simulator and accelerator. For big designs, the natural way to improve the simulation process is the emulation path, where your focus is to partition into emulator hardware.”
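
The structural idea behind that testbench reuse is to keep the test itself independent of what is executing the design, and to confine the simulator- or accelerator-specific plumbing to a thin transactor layer. The sketch below shows only that structure; the class names are invented, and real flows implement the transactors in SystemVerilog with SCE-MI-style infrastructure rather than Python.

```python
# Sketch of a reusable testbench structure: the test talks to the design only
# through a transactor interface, so swapping the simulator backend for an
# accelerator backend does not touch the test itself. Both backends here are
# loopback stubs; real transactors would drive the RTL or the emulator.

from abc import ABC, abstractmethod

class DutTransactor(ABC):
    """Interface the testbench uses, regardless of what runs the design."""
    @abstractmethod
    def send(self, packet: bytes) -> None: ...
    @abstractmethod
    def receive(self) -> bytes: ...

class SimulatorTransactor(DutTransactor):
    """Would drive the RTL simulator; stubbed as a loopback here."""
    def __init__(self):
        self._queue = []
    def send(self, packet):
        self._queue.append(packet)
    def receive(self):
        return self._queue.pop(0)

class AcceleratorTransactor(DutTransactor):
    """Would stream transactions to emulator hardware; also stubbed."""
    def __init__(self):
        self._queue = []
    def send(self, packet):
        self._queue.append(packet)
    def receive(self):
        return self._queue.pop(0)

def run_test(dut: DutTransactor):
    """The actual test: identical for both backends."""
    dut.send(b"\x01\x02\x03")
    assert dut.receive() == b"\x01\x02\x03"

run_test(SimulatorTransactor())     # early, slower, fully visible runs
run_test(AcceleratorTransactor())   # the same test on the accelerated platform
```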

But at the end of the day, there is no single way of accomplishing these tasks, and no right way to partition everything. And as design complexity continues to increase at each new node, those lines become even blurrier.

Related Stories
Prototyping Partitioning Problems
Gap widens between increasing design complexity and FPGA capabilities, making this a lot harder than it used to be.
Partitioning For Power
Emphasis shifts to better control of power when it’s being used, not just turning off blocks.
Tech Talk: TCAM
How to save power and reduce area with ternary content addressable memory.


