Partitioning Drives Architectural Considerations

Experts at the Table, part 2: Biggest tradeoffs in partitioning.


Semiconductor Engineering sat down to explore partitioning with Raymond Nijssen, vice president of system engineering at Achronix; Andy Ladd, CEO at Baum; Dave Kelf, chief marketing officer at Breker; Rod Metcalfe, product management group director in the Digital & Signoff Group at Cadence; Mark Olen, product marketing group manager at Mentor, a Siemens Business; Tom Anderson, technical marketing consultant at OneSpin; and Drew Wingard, CTO at Sonics. What follows are excerpts of that discussion. To view part one, click here.

SE: What are some of the technology and business drivers for partitioning?

Kelf: I was in a meeting not that long ago with a big baseband company that was designing an acceleration chip. They were doubling the number of basebands they were putting on this big base station, and when they increased the capacity of a chip like that, they found all kinds of opportunities for partitioning that weren't there before. So not only were they moving the software around, but they started saying, 'Well, actually, we need two modulators but four turbo error correction pieces,' etc. They were moving those blocks around on these basebands, which they were now able to do. And then, having moved the hardware around, they were back in the software and could change the protocol stack to make this happen, adding just one processor instead of two to run it. So increasing the capacity of something often leads to repartitioning, even of existing blocks, and that leads to a huge verification problem, too.

Nijssen: Partitioning means to separate. You cut something in two or three, and every time you do that you introduce a hurdle, an optimization horizon. Maybe the two pieces together are not as optimal as if you had been able to put it all together as one. Every time, we also have to study the performance impact, the implementation impact, and the testability impact. The list of criteria is very long and very multifaceted, but there are also a lot of contradictions in there: things that, for the purpose of testability, should be partitioned this way, while for the purpose of implementability, with the best area, they need to be partitioned that way. For purposes of power, I need to partition it yet another way. So this is now a multi-variable, multi-objective optimization problem. In the end, after you've made some choices, you have to work with the IPs that are available to you, because you don't start with a blank sheet. You can't cut straight through the middle, because they probably would no longer work if you tried to do that. Or you couldn't verify them. So we have to partition systems in yet another way. One thing that has kept me busy for the last few years is the case where I have an ASIC I want to build, and I cannot afford to put everything in an FPGA because it would be too big, not fast enough, use too much power, and so forth. But I can't put it all in an ASIC because I need some flexibility, like the baseband example, where there are standards that just keep changing. If you try to build that out as an ASIC, that would be really bad news, because you'd probably find your ASIC is out of date by the time it comes out. So there's another partitioning question in terms of what kind of flexibility I need: do I now embed some FPGA logic in my ASIC to make sure I have the flexibility, so I don't have to re-spin my ASIC when somebody changes the standard? Or in the case of bitcoin mining, if someone changes the hashing algorithm and I've made an ASIC out of it, what do I do?
If you keep it programmable, or at least partially programmable, then you can adapt to those things. So now not only do we have to think about all the aspects that were just mentioned, we also have to make sure the ASIC can be future-proofed by adding programmable areas. That means I sometimes have to carve out a piece, from the CPRI (Common Public Radio Interface) block, for example, to keep that flexible. But now my whole design changes, and I need to re-verify it, because I may have had to insert some flops for pragmatic purposes that weren't there before; otherwise I can't close timing. What it comes down to is that partitioning is an iterative process, where sometimes you have to go back and change one block because another block was changed. It's not an isolated decision. It's not as if, once you cut through the middle, everything becomes totally independent. If there are cross-dependencies between the various partitions, that makes it a really interesting problem.
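The multi-variable, multi-objective tradeoff described above can be sketched as a simple weighted scoring over candidate cuts. Everything here is an invented assumption for illustration: the candidate names, the normalized per-objective costs, and the weights are placeholders, not output from any real partitioning tool, and a real flow would explore a Pareto front rather than a single weighted sum.

```python
# Hypothetical sketch of comparing candidate partitionings against several
# competing objectives (area, power, testability, timing). All names and
# numbers are illustrative assumptions; lower normalized cost is better.

def score(candidate, weights):
    """Weighted sum of normalized objective costs for one partitioning."""
    return sum(weights[k] * candidate[k] for k in weights)

candidates = {
    "cut_A": {"area": 0.7, "power": 0.4, "testability": 0.2, "timing": 0.5},
    "cut_B": {"area": 0.4, "power": 0.6, "testability": 0.5, "timing": 0.3},
    "cut_C": {"area": 0.5, "power": 0.5, "testability": 0.4, "timing": 0.5},
}

# The weights encode which objective matters most for this project; change
# them (e.g. emphasize testability) and a different cut wins.
weights = {"area": 0.3, "power": 0.3, "testability": 0.2, "timing": 0.2}

best = min(candidates, key=lambda name: score(candidates[name], weights))
print(best)  # with these made-up numbers, cut_B scores lowest
```

The point of the sketch is the contradiction Nijssen describes: no single candidate minimizes every objective, so the "best" cut is entirely a function of the weights chosen.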

Anderson: Partitioning for flexibility is another dimension. And yes, having the FPGA option is also true. But even with traditional hardware-software tradeoffs, that’s also one of the factors, and the reasons you choose to implement something in software or in the microcode, let’s say, would be the flexibility. If the bus standard is not finalized, you want to get your chip out, and maybe they’re going to tweak things a little bit so put some of that in software. But yes, having the FPGA option to actually have flexible hardware, that’s something fairly recent and kind of interesting.

SE: What’s the No. 1 reason to partition designs? And does it help with verification and reuse?

Wingard: Divide and conquer. It’s about managing complexity.

Metcalfe: Using IP reduces the complexity a lot. It’s pre-verified, it’s already there, you know it’s going to work. It reduces the complexity of what you are changing. So if you’re using a core, that core has been used by a lot of customers. It’s pre-verified. You know what those profiles are going to be.

Wingard: You don’t have to do reuse in order to get a benefit from divide and conquer. The world’s most famous microprocessor design company does massive partitioning as they refine the different pieces which make up that microprocessor. They break it all up into pieces, and they’re very careful about the boundaries between those things. Some of it’s for verification and some of it’s because they’ve got hundreds of millions, or billions of lines of legacy code they have to run and be compatible, etc. So, it’s not about reuse. It’s about trying to manage a team of thousands of engineers working on a project and have some belief that you’re going to be able to stick to the schedule.

Anderson: And power and performance and all of the other things just mentioned. There are about six different vectors here, all of which can be helped by good partitioning.

Nijssen: Then there's granularity. Right now we're routinely doing building blocks that are so much larger than, say, 10 years ago, blocks that back then certainly would have been partitioned for all sorts of reasons. Now we say it's more efficient or quicker, or the tools can handle it, so just glue it all together and do it as one block. The granularity of the partitioning is also changing all the time, and it is moving fairly quickly.

Metcalfe: That's a really important point because, from the physical perspective, partitioning is costly. As soon as I insert a partition, I have to have I/O. I have to put pins on it. I have to freeze the block, so it's not free. Granularity is really key to the system designer, because they don't want thousands of small blocks unless there's a really good reason. Sometimes there is, but then you need an enormous design team to implement those thousands of things. And you don't want anything too big, because then at run time these things can start becoming a constraint. So there's a very narrow sweet spot when you are partitioning, and it involves many considerations. Granularity is a big one.

SE: What about performance then? Are there tradeoffs to be made that will improve performance?

Wingard: To me, the most fundamental thing is the cost benefits that come from many forms of integration. It's certainly everything we call SoC. Those benefits come from pooling memory, from sharing an expensive resource such as off-chip memory. It's getting more functionality out of the same relative cost of memory. So that's inherently a performance tradeoff. You are always balancing the cost the system can afford against the performance that memory needs to deliver in order to serve the applications you want to target. It's an inherent tradeoff. An architect who doesn't consider that as part of their partitioning choices is setting themselves up for failure. In the early days of SoCs, and some of us are old enough to remember that, there were plenty of people who thought this was just an ASIC with a microprocessor added to it. Those people ended up recognizing this is different. This isn't about just achieving a certain frequency on all of the flip-flops in the design. There is a system performance requirement that you didn't have to worry about before, and it really comes from pooling memory.

Ladd: That’s a key job of an architect. If he has to increase performance, he’s going to have to trade off something. He’s going to increase power, he’s going to have to maybe go to a new core. It’s going to be more expensive or it’s going to be more area, and there will be some tradeoffs that have to be made. There’s always a path to get more performance, but what are you willing to trade off to get it.

Metcalfe: And you go through a lot of what-if scenarios. Let’s say you have to integrate a few megabytes of memory into the design. There are a couple of options. I can try to keep it on-die in my ASIC. So 8 Mbytes is probably still doable within a reasonable cost. But at some point you may consider putting it outside as an RLDRAM (reduced latency DRAM) or something like that. What impact does that have on the performance? What impact does it have on the cost? Let’s say you just ran out of pins on the package, so that’s the reason I consider this. There are a lot of what-if scenarios that are noncontinuous. It’s not a smooth optimization landscape. It’s very erratic. If I change ‘this,’ then all the other things have to change, too, or my system performance goes down the drain, or the cost goes up, or the board designer is not going to like me anymore. So there’s a lot of these very erratic things that make it very unsuitable for software optimization. As I was listening to the discussion, I was thinking, why isn’t there some software tool that does this for us?
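The noncontinuous, erratic nature of these what-if scenarios can be sketched as enumeration over a handful of discrete options rather than smooth optimization. Every figure below is an invented placeholder, not vendor data: the latencies, costs, and pin counts exist only to show how a hard constraint (here, a pin budget) can abruptly knock out whole branches of the design space.

```python
# Illustrative what-if enumeration for placing a few megabytes of buffer
# memory: on-die SRAM versus external RLDRAM or DDR. All numbers are
# made-up assumptions; the point is the discrete, non-smooth cost landscape.

options = [
    # (name, latency_ns, relative_cost, extra_package_pins)
    ("on_die_sram",     2, 4.00,  0),
    ("external_rldram", 15, 1.50, 60),
    ("external_ddr",    45, 0.80, 80),
]

pin_budget = 50        # pins left on the package in this scenario
latency_bound_ns = 20  # system-level latency requirement

# A hard constraint (pins) eliminates options outright; among survivors
# that also meet latency, pick the cheapest.
feasible = [o for o in options if o[3] <= pin_budget]
viable = [o for o in feasible if o[1] <= latency_bound_ns]
best = min(viable, key=lambda o: o[2]) if viable else None
print(best[0] if best else "no feasible option; repartition")
```

Change the pin budget to 70 in this sketch and the cheaper external RLDRAM suddenly becomes the winner, which is exactly the "if I change this, all the other things change too" behavior Metcalfe describes.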

Kelf: This brings up the whole verification discussion. What is verification of these things? You've got to get the thing working. That's the basic level. There's safety, and there are all these different parameters you've got to check with verification before the chip goes out. One of the biggest is performance, for the reasons everyone has been saying. When you run these verification tests at the SoC level, a lot of it is profiling, just to make sure that you haven't got any funny bottlenecks you didn't expect. Is some block here blocking something over there? Are the communication paths working correctly? So the verification task is changing a lot. The architect sets up what they think is the right partition at that level, and then that has to be validated from a performance, profiling and power point of view using verification before the chip goes to fabrication. This is now a major part of SoC verification, and it's a huge problem. A lot of companies will find, when they run these verification tests, that the thing works but does some weird thing, and they have to go back and redesign a major part of the device.

Wingard: That's what we've been doing for the last 20 years, and it's becoming more obvious that it's a problem. Our approach was to attack it with architecture. Every product we've ever built has been based on hardware virtual channels, because of what they allow us to do. You can mix the higher-priority, guaranteed-throughput traffic with the best-effort stuff and have perfect knowledge about how they're going to interact. So our view is that we want SoC-level performance verification to be something that can be done by rote. We don't want to be discovering things there. We want it designed in.

SE: But that doesn’t always happen; people don’t always take that approach. This is why partitioning is such a big deal, correct?

Wingard: You're right, there are always ways to hurt yourself: you can pay me now, or you can pay me later. If you look at the user interface we present to our customers, it starts off by creating a performance test bench. That's the first thing you do, because what we learned early on is that we can automate hookup, that's fine, but there was no way of reasoning about the chip without a performance model. You couldn't size a buffer, you couldn't set up an arbiter, you couldn't do anything. You needed the performance model in order to drive the choices. So from our perspective, it's what you should do first, not last.

Kelf: You definitely do it first, but then you go back and validate it. If you relied on that end validation to get the performance right, it would be awful, and no one is going to do that. But that last bit of validation is about finding some weird corner case, for example where the processor is driving three things at the same time when it shouldn't be, and that just loads down the whole system. All of a sudden you're on a cell phone trying to send a photograph while also typing a text message, and the interaction blows the whole thing apart in a way you didn't expect. This is the bigger problem. And as SoCs get bigger and more complex, those kinds of scenarios are getting tougher as well.

Olen: When it comes to always trying to get higher performance, what the EDA vendors have encountered is that for 10 years or more, we were able to produce faster processing engines and faster simulators, all three of [the largest EDA tool suppliers], by riding on the backs of the microprocessor companies, who were cranking up clock speeds. If you look at the trend data over the last three, four or five years, that isn't happening anymore, so how do you continue to get higher performance in the verification world? We partition. We're all going down that path right now of figuring out how to parallelize, and what we're hearing from our customer base is that the way to make it actually work is clever design partitioning; you can't just take any general-purpose design and magically run it on a parallelization engine. There are GPUs here, CPUs there. There are different ways to put the test bench here and the design there. So we're partitioning to try to get more performance, because we're not getting faster clock speeds. You either have to do things in parallel, or you have to go to a higher level of abstraction, which is not really more performance; it's just calculating less to make the same decisions. Everything in verification is driven by partitioning, even with design IP. And if you didn't have the ability to change your partitioning during the process, you'd have no differentiation, because everybody would be buying the same standard third-party PCIe controllers, Ethernet MACs and AMBA architectures. So everyone's looking at doing certain things in software, or changing this or changing that.

Ladd: That drove the multicore design philosophy. When people stopped getting an increase in clock speed every year, they had to go to multicore.

Olen: And that, in turn, drove the whole test bench automation industry, because behavioral studies have shown that humans can't do multithreading in their heads. You had to go to test bench automation to be able to drive multiple sources of stimulus synchronously or asynchronously; you couldn't just write directed tests anymore.

Nijssen: That leads to another trend, heterogeneous computing, where you see more and more different, specialized blocks in the design. Previously you might have tried to do it with a CPU, but now you have an H.265 block as a separate block, and you have different kinds of accelerators, be it GPUs or FPGAs, in your system, and somehow the system partitioning, from the top down, needs to map the different tasks to the best blocks. If the task partitioning is hardened, that makes the performance analysis even more complicated: 'If I have a CPU that's running this task, and now it needs to do a matrix multiply, let's say 1000 x 1000 x 1000, am I going to keep that on my CPU, or do I pay the penalty for transferring it to this FPGA block that's going to do it for me and then transfer the whole shebang back to the CPU and continue?' That analysis is incredibly complicated, and you need very carefully crafted testbenches with realistic vectors and use models to make sure that it works well not only in the average case but also in the worst case. And with such a complex system, where all the different players are different, how do you know there still won't be some scenario in which these blocks collide or hit some kind of deadlock?
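The offload decision Nijssen describes can be framed as break-even arithmetic: compute time on the CPU versus transfer time plus compute time on the accelerator. All throughput and link-speed numbers below are invented assumptions chosen only to show the structure of the analysis; real systems also have launch latency, contention, and worst-case effects that this sketch deliberately omits.

```python
# Sketch of the CPU-vs-accelerator offload decision for a dense n x n
# matrix multiply. Every rate here is a made-up illustrative assumption.

def cpu_time_s(n, cpu_gflops=10.0):
    # Dense n x n matmul costs roughly 2*n^3 floating-point operations.
    return (2 * n**3) / (cpu_gflops * 1e9)

def offload_time_s(n, accel_gflops=200.0, link_gbytes_s=4.0, bytes_per_elt=4):
    # Ship two input matrices over the link, bring the result back,
    # then add the accelerator's compute time.
    transfer = 3 * (n * n * bytes_per_elt) / (link_gbytes_s * 1e9)
    compute = (2 * n**3) / (accel_gflops * 1e9)
    return transfer + compute

n = 1000
keep_on_cpu = cpu_time_s(n) <= offload_time_s(n)
print("keep on CPU" if keep_on_cpu else "offload to accelerator")
```

Because compute grows as n^3 while transfer grows as n^2, offload wins for large matrices under these assumptions but loses for small ones; the break-even point shifts with every change to the link speed or accelerator rate, which is why the analysis needs realistic vectors rather than a single average case.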

