Partitioning Drives Architectural Considerations

Experts at the Table, part 3: Systems of subsystems; heterogeneous systems; re-partitioning.


Semiconductor Engineering sat down to discuss partitioning with Raymond Nijssen, vice president of system engineering at Achronix; Andy Ladd, CEO at Baum; Dave Kelf, chief marketing officer at Breker; Rod Metcalfe, product management group director in the Digital & Signoff Group at Cadence; Mark Olen, product marketing group manager at Mentor, a Siemens Business; Tom Anderson, technical marketing consultant at OneSpin; and Drew Wingard, CTO at Sonics [Sonics was acquired by Facebook in March 2019]. What follows are excerpts of that discussion. To view part one, click here. To view part two, click here.

SE: When it comes to complex systems in which all the players are different, how do you know there won’t be a scenario in which multiple system components hit a deadlock?

Metcalfe: That’s a really important point because that happens at the system-on-chip level also. So many times I hear people say, ‘What we need on this system-on-chip is a 3 GHz CPU. That will solve all our problems,’ but as soon as you have a 3 GHz CPU, the problem often transfers somewhere else. If you can’t get data on and off that CPU, it’s going to be idling a lot of the time; maybe a GPU architecture would be better. This fixation on high-speed CPUs is only one part of the partitioning problem. You can make one part of the partition very, very quick, but if nothing else keeps up with it, it’s not going to do much, so balancing the partitions relative to one another is equally important. You don’t want one thing doing something very fast and everything else not able to keep up.
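
Editor’s sketch: the point about a fast CPU starved of data can be put as a back-of-the-envelope bound, in which sustained throughput is capped by the slower of the compute rate and the data delivery rate. The function names and all numbers below are hypothetical and are not taken from the discussion.

```python
def sustained_ops_per_sec(peak_ops_per_sec, bytes_per_sec, bytes_per_op):
    """Upper bound on sustained throughput for a block that must move
    bytes_per_op bytes of data for every operation it performs."""
    data_limited = bytes_per_sec / bytes_per_op
    return min(peak_ops_per_sec, data_limited)


def utilization(peak_ops_per_sec, bytes_per_sec, bytes_per_op):
    """Fraction of peak compute that the data path can actually feed."""
    sustained = sustained_ops_per_sec(peak_ops_per_sec, bytes_per_sec, bytes_per_op)
    return sustained / peak_ops_per_sec


# Hypothetical system: 4e9 bytes/s of data bandwidth, 4 bytes needed per operation.
slow_cpu = utilization(1.0e9, 4.0e9, 4)  # 1 GHz-class core
fast_cpu = utilization(3.0e9, 4.0e9, 4)  # 3 GHz-class core, same data path

print(f"1e9 ops/s core utilization: {slow_cpu:.2f}")
print(f"3e9 ops/s core utilization: {fast_cpu:.2f}")
```

Under these assumed numbers the faster core delivers no more work than the slower one; it only spends more of its time idle.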

SE: In a heterogeneous situation, how is that different from a non-heterogeneous situation in terms of the actual partitioning that needs to be done and the test benches that are needed? How do you approach that as an architect?

Nijssen: It’s totally different in the sense that it’s just so much more complex. Let’s say you could make do with one processor, one operator that can do it all. Then I need to know that one, and I need to know it really well, but then I can reason about my system and how it’s going to behave. But if every player on the field responds in a different way, then for every one that you add, the complexity is not additive, it’s multiplicative. It gets so multidimensional that it becomes really difficult to reason about performance, or about the integrity of the system, the robustness, and quite frankly, the correctness of the system.

SE: So then, what do you do about that?

Wingard: The earlier SoCs were all heterogeneous, and they were heterogeneous by necessity. It wasn’t that people didn’t know that they could put down a set of processors to attack the problem; it wasn’t an option. The die would have been too big; it wouldn’t have been able to achieve the cost or whatever. Some of our earliest chips were the original digital TV chips and that transition had an enormous amount of energy processing elements. I had to be able to do MPEG and I had to be able to put out pretty pictures. I had to have a display control, and each one of those had its own image processing pipelines associated with it, and they had to be stitched together some way. Because it was video, the data sets were too big to keep on chip, and so they all had to beat the crap out of memory. They had all those problems. Did they have elegant solutions? Heck no, but what architects are good at is abstraction, and so you have to come up with sufficiently conservative assumptions about the behavior. If you make worst-case assumptions, most of these chips don’t work. So the architect has to make sufficiently conservative assumptions about what the actual behavior is going to be, and they provision their system based upon those. And then yes, they come back and take a look at it at the end. But none of these people had the luxury of having software that ran on these chips before they were back in the lab; that software wasn’t ready for a long time afterwards. Were there surprises? Of course there were surprises, but the best architects figured out which things to be concerned with. And I don’t think that fundamental model changes much, but now we’ve got hierarchies of systems of subsystems of [subsystems], and of course it gets more complex. What I think we lacked then, and lack now, is a semi-formal way of describing these interactions. So I would like to have the equivalent of static timing analysis for performance. I would like to have static performance analysis.

I don’t think the math is that hard. I think it’s getting the models out of the creators of the different components, some of which will be IP vendors, and some of which would be a person on your design team who can actually describe, ‘I work okay if I’m going to generate traffic that looks a bit like this, and as long as I get responses that look something like that, then I’m okay.’ And then you could automatically validate that that’s true when you build the blocks, and then I could build a system-level performance model that would be formally correct.
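
Editor’s sketch: one way to picture the ‘static performance analysis’ idea is that each block publishes a traffic contract (what it generates, what response it tolerates) and a checker validates the composition without running any workload. Everything below is an assumption made for illustration: the `TrafficContract` fields, the block names, the numbers, and in particular the latency bound, which presumes a plain round-robin arbiter, one outstanding request per block, and a fixed service time. It is not a description of any tool mentioned in the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficContract:
    name: str
    demand_mb_per_s: float  # bandwidth the block promises not to exceed
    max_latency_ns: float   # worst response latency the block can tolerate


def check_system(contracts, capacity_mb_per_s, service_ns):
    """Return a list of violations; an empty list means the static check passes.

    Assumed model: one shared target, round-robin arbitration, one outstanding
    request per block, fixed service time. A request then waits for at most
    every other block once, so the latency bound is N * service_ns.
    """
    violations = []
    total = sum(c.demand_mb_per_s for c in contracts)
    if total > capacity_mb_per_s:
        violations.append(("bandwidth", "system", total, capacity_mb_per_s))
    worst_latency_ns = len(contracts) * service_ns
    for c in contracts:
        if worst_latency_ns > c.max_latency_ns:
            violations.append(("latency", c.name, worst_latency_ns, c.max_latency_ns))
    return violations


# Hypothetical blocks and numbers.
blocks = [
    TrafficContract("video", demand_mb_per_s=800, max_latency_ns=200),
    TrafficContract("cpu", demand_mb_per_s=400, max_latency_ns=100),
    TrafficContract("dma", demand_mb_per_s=200, max_latency_ns=500),
]

tight = check_system(blocks, capacity_mb_per_s=1600, service_ns=40)
relaxed = check_system(blocks, capacity_mb_per_s=1600, service_ns=30)
print("service 40 ns:", tight)
print("service 30 ns:", relaxed)
```

The same contracts could also be checked against each block in isolation, which is the ‘automatically validate that that’s true when you build the blocks’ half of the idea.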

Ladd: Would this be transactional based queuing models that you could use for your performance?

Wingard: The problem with queuing models is that the performance characteristics of dynamic RAM are address-pattern dependent, so you get very different behaviors, not just based upon the number of transactions, but on what addresses they use. If the three of us are all trying to generate addresses that all pound on the same bank of memory, the performance is going to be lousy.

Ladd: Couldn’t you add that congestion into the queuing model?

Wingard: You could, if you could understand it, but again, you’d have to get in there.
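
Editor’s sketch: the address-pattern dependence described in this exchange can be shown with a toy open-row DRAM model, in which the same number of transactions costs very different amounts depending on which banks and rows the addresses land in. The address mapping, bank count, and cycle costs are illustrative assumptions, not figures for any real memory device.

```python
def dram_cycles(addresses, num_banks=4, row_bytes=1024, hit_cycles=4, miss_cycles=20):
    """Total cycles for a stream of accesses under a toy open-row policy.

    Assumed mapping: consecutive row-sized chunks rotate across banks.
    An access to the row already open in its bank is a cheap hit; any
    other access must open a new row and pays the miss cost.
    """
    open_row = {}  # bank index -> row currently open in that bank
    cycles = 0
    for addr in addresses:
        chunk = addr // row_bytes
        bank = chunk % num_banks
        row = chunk // num_banks
        if open_row.get(bank) == row:
            cycles += hit_cycles
        else:
            cycles += miss_cycles
            open_row[bank] = row
    return cycles


# 64 transactions each, so a count-only queuing model would treat them alike.
streaming = [i * 64 for i in range(64)]                # walks linearly through memory
thrashing = [(i % 2) * 4 * 1024 for i in range(64)]    # two rows fighting over bank 0

print("streaming cycles:", dram_cycles(streaming))
print("thrashing cycles:", dram_cycles(thrashing))
```

A queuing model could absorb this as an address-dependent service time, but only if the address behavior of each initiator is known, which is the difficulty raised above.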

Kelf: This is where Portable Stimulus fits in. Portable Stimulus has been billed as the portability of emulation and simulation stuff, and that’s all fine, but it’s actually rubbish. What it is, is trying to supply what you just said, I think: almost a semi-formal way of describing real, high-powered SoC scenarios at a level where you can drive tests, including queuing. But don’t just think of it in terms of a queuing model. Think of it in terms of a full scenario model that handles the control and the data at the same time, looking a lot at memory interaction, addresses, cache coherency, and all these kinds of things at the SoC level. That allows you to really wring out the SoC by driving C tests and transactions without having to have an operating system running on that thing, or even bare-metal real software tests. You can think of it as a substitute, almost, for the real software running on this thing, by providing scenario models that generate software tests and hardware tests at the same time and wring out the SoC in a full sort of model. That’s what we’re trying to do.

Olen: It’s called ‘Portable Stimulus,’ but that’s probably one of the most misleading labels, because what it actually is is a declarative specification of behavior from which clever tools can generate stimulus. It is not stimulus itself. It’s actually a BNF (Backus–Naur form)-based, semi-formal description of behavior, and it actually allows partitioning, because you could go off and write your Arm core description, you could go off and write your PCI Express transaction generator, and I can write a USB interface. We could all partition our verification challenge and test all of those things, and then we could bring them all together at the SoC level, and then you could write a high-level C code generator at the top that controlled all of this, so none of us has to multithread.
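
Editor’s sketch: the distinction between a declarative behavior description and the stimulus generated from it can be illustrated in a few lines. The description below only states which actions must precede which; a small generator then enumerates every legal ordering as a candidate test. This is plain Python written for illustration, not Portable Stimulus syntax, and the action names are hypothetical.

```python
from itertools import permutations

# Declarative description: each action lists the actions that must finish first.
# Nothing here says in what order independent actions run.
SCENARIO = {
    "dma_write": [],
    "cpu_config": [],
    "usb_send": ["dma_write", "cpu_config"],
    "check_crc": ["usb_send"],
}


def legal_schedules(scenario):
    """Yield every ordering of the actions that respects the declared dependencies.

    Brute force over permutations, which is fine for a handful of actions;
    a real tool would use constraint solving rather than enumeration.
    """
    for order in permutations(sorted(scenario)):
        position = {name: i for i, name in enumerate(order)}
        if all(position[dep] < position[name]
               for name in order for dep in scenario[name]):
            yield order


schedules = list(legal_schedules(SCENARIO))
for schedule in schedules:
    print(" -> ".join(schedule))
```

Descriptions written separately for separate blocks could be merged into one such dictionary at the SoC level, which is the partitioning benefit described above.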

Nijssen: One of the difficulties of individual partitions is that their behavior is becoming more dynamic these days. The cache example is a very good one. The timing and performance behavior of this block depends on what was running on different CPUs; it’s workload dependent. Now, in my verification, do I have to integrate all possible interactions of the programs that might be running on a GPU, CPU, FPGA, whatever, and that might be stepping on the same cache line, or where one basically makes the whole cache dirty and affects the performance of the other ones that had certain QoS guarantees to deliver? The problem is that partitioning itself presupposes that I can separate them out, that they live in their own world, that I can do my thing there, verify them, implement them or whatever, and then put them together and have a system that’s going to deliver its performance. I think the thing we’re pointing out is that this is not true. They interact, and they change each other’s behavior or performance in a very drastic way, so partitioning itself is almost a misnomer, or at least sets the wrong expectation that you can just separate these blocks into different partitions. There are going to be surprises if the customer changes their workload. You thought they were going to put this chip in one thing, one way today, and tomorrow they’re going to use it in something else, or some other customer comes along, or some hacker tries a DoS attack. Those are all situations where people want to have performance guarantees. In datacenters, for example, they want 100% of Ethernet throughput. No matter what the packet size scenarios are, you have to guarantee 100% regardless of what else might be going on. How do you do that with all these systems whose behavior is now becoming more and more interdependent?

Anderson: The promise of Portable Stimulus, it’s not the stimulus that’s portable, it’s the model. One of the things I’ve observed as this technology has started to catch on and be used more widely is that the partition people use for the model often doesn’t bear much resemblance to the actual design and that has some advantages. You have a little more flexibility in the way that you’re approaching top level verification and assessing performance and power and all of the other things we’ve talked about. You have to have some model to run to get those answers. You probably could have your architecture C model, which is also probably pretty different than the partition. This is one layer down, able to do more detailed analysis. It’s still not completely tied to the physical partitioning. That’s a level of abstraction that has a lot of value in the Portable Stimulus space.

Kelf: With the partitioning of the verification, you do have these individual tests that go with individual blocks. But when it comes to doing this whole SoC test, you might use a checker on one block to make sure it’s still running right, but those SoC tests are very different, so now you’re modeling at the SoC level. You can’t really partition that up. You’ve got to think of a scenario that runs across the whole thing, and actually what Tom described is almost exactly the sweet spot and where these things are being used the most, which is looking at cache coherency and performance across these partitions. Instead of trying to figure out all the individual tests and find all the individual corner cases, you basically describe the high-level model of the whole thing and then let the tool work this model in lots of different ways, find lots of different tests, and run them on an emulator so you can run many of them and try to hit one of those nasty situations.

SE: How does all of this look moving forward, practically speaking?

Nijssen: This is getting so complex. Subsystems within subsystems within subsystems, each influencing each other’s behavior. There’s no way you can run emulation on all the different workloads that your customer’s customer’s customer might someday run. The takeaway of this: you’re going to get it wrong even if you did everything right, or so you thought. Maybe this is because it was just not doable; you can’t have this whole universe of interactions going on. So the question is, what do you do? You’ve made your silicon, and you can’t wait until all possible scenarios and combinations are verified with emulation or other modeling techniques, no matter how strong. You have to release your product, and you’re going to get it wrong. What do you do? The problem is with the partitioning, because these effects spill over between partitions. How do you make your system flexible enough that, after silicon goes out, you will be able to make that change after you learn, after new workloads become available that weren’t even available when you were designing, or even specified in your market requirements document when you got it? The question is how you add that flexibility to your system so you can adapt to this changing environment.

Wingard: As a NoC provider, we’ve had to do that forever. The different scenarios that Dave talks about are something that’s front and center to our customers who have to build multi-mode devices and so we have programmability in our arbitration systems, in our security systems, and things like that, so that you can optimize for modes, but you also can put in margin and decide how you’re going to allocate that margin as the actual software is running on the actual system on the actual board. That’s something we’ve had to worry about a lot.
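
Editor’s sketch: programmable arbitration, in the sense of weights that software can change after silicon ships, can be illustrated with a small weighted arbiter. The credit-based scheme below is a generic smooth weighted round-robin chosen for illustration; it is not a description of Sonics’ arbitration logic, and the initiator names and weights are hypothetical.

```python
class WeightedArbiter:
    """Toy arbiter whose bandwidth shares are set by reprogrammable weights."""

    def __init__(self, weights):
        self.weights = dict(weights)  # initiator name -> integer weight

    def program(self, name, weight):
        """Model a register write that changes one initiator's share."""
        self.weights[name] = weight

    def grants(self, slots):
        """Grant sequence for `slots` slots, assuming every initiator always requests.

        Each slot, every initiator earns credit equal to its weight; the one
        with the most credit wins and pays back the sum of all weights.
        """
        total = sum(self.weights.values())
        credit = {name: 0 for name in self.weights}
        sequence = []
        for _ in range(slots):
            for name, weight in self.weights.items():
                credit[name] += weight
            winner = max(credit, key=lambda name: credit[name])
            credit[winner] -= total
            sequence.append(winner)
        return sequence


arbiter = WeightedArbiter({"video": 3, "cpu": 1})
video_mode = arbiter.grants(4)

arbiter.program("cpu", 3)  # a new workload shows up after tape-out
balanced_mode = arbiter.grants(4)

print("video-heavy mode:", video_mode)
print("after reprogramming:", balanced_mode)
```

Leaving some weight unallocated at design time is one way to read the ‘margin’ mentioned above: it can be handed out later, once the real software is running.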

Metcalfe: Partitioning also has an effect on schedule. We talked earlier about different types of partitioning; from an implementation point of view, designing a chip with 64 of the same thing on it is really different to designing a chip with 64 different things. When the designer makes that partition decision, they’re often saying, ‘I could have this really weird hardware to do this, but if I just do it in a CPU, I can get this done a little quicker.’ That is one of the constraints people face on a daily basis in terms of driving the partitioning decisions.

Wingard: Actually that’s something we find pretty interesting, which is at the top level of these designs, those partitioning choices can be more or less rigid based upon the design flow that the back end team has chosen to use, and we’ve seen examples — because we’re chip people — of, ‘we want everything to be optimized,’ including schedule. As a result, we’ve run into situations where we’ve gone to significant effort to try to build a single on-chip network fabric that can logically span the whole die, but it’s physically going to be partitioned into many different pieces. When we run into a flow that imposes early restrictions where the pin list can’t change, we find a completely different class of results than when we end up with a flow that allows it. I’m fascinated to know, and I imagine you prefer customers to pick flows where flexibility is encouraged because I’ve never seen a compile zero pin partitioning that survived. The existence proof is there that it doesn’t fundamentally work, but still there’s people who seem to want to keep doing it. It kind of surprises me that that would be the preferred route.

Metcalfe: The methodology is very important and that’s very much customer and design specific. But again, coming back to the marketing requirements, you can design anything you want, if you have enough time, but marketing guys are going to say, ‘I need this thing done by the end of the year. What compromise do I need to make to make it happen at the end of the year?’ Well, you’d better not partition that way, because if you partition that way we’re not gonna have it ready by the end of the year. That schedule component comes from forces well outside of the engineering discipline, but it’s equally important in terms of delivery.

Anderson: There’s a lot of stuff that comes in from outside. One thing I wanted to touch on, which has come up a couple times, is re-partitioning. You do your best job, know there’s going to be some iteration, think you’ve got it locked down, then you’re really late in the project and something comes along to make you have to go back and rethink it. Maybe it’s that the competition has a new feature you’ve got to add or the silicon vendor comes back and says they were wrong about some new process. How do you deal with that? What is most likely to screw you up at that stage, and then how do you deal with it?

Wingard: We built our products around generators. My RTL people don’t code RTL, they code Python that generates RTL, and we did that specifically wrapped in the EDA environment because as someone who had integrated big chips before I knew that those things were there. What we measure is how many minutes does it take to get back to where I thought I was yesterday and so we tried to build that technology.
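
Editor’s sketch: the generator approach, Python that emits RTL rather than hand-written RTL, might look like the following in miniature. The function, module name, and the choice of a simple pipeline register are hypothetical and far smaller than any production generator; the point is only that a late change becomes a parameter change followed by regeneration.

```python
def gen_pipeline(name, width, stages):
    """Return Verilog source for a `stages`-deep, `width`-bit pipeline register."""
    if width < 1 or stages < 1:
        raise ValueError("width and stages must both be at least 1")
    msb = width - 1
    lines = [
        f"module {name} (",
        "  input  wire clk,",
        f"  input  wire [{msb}:0] d,",
        f"  output wire [{msb}:0] q",
        ");",
    ]
    # One register per pipeline stage.
    for i in range(stages):
        lines.append(f"  reg [{msb}:0] s{i};")
    lines.append("  always @(posedge clk) begin")
    # Stage 0 samples the input; each later stage samples the previous stage.
    for i in range(stages):
        source = "d" if i == 0 else f"s{i - 1}"
        lines.append(f"    s{i} <= {source};")
    lines.append("  end")
    lines.append(f"  assign q = s{stages - 1};")
    lines.append("endmodule")
    return "\n".join(lines)


rtl = gen_pipeline("pipe3", width=32, stages=3)
print(rtl)
```

If a re-partition late in the project requires an extra stage to close timing across a new boundary, the change is `stages=4` and a rerun, which is the ‘minutes to get back to where I was yesterday’ metric described above.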

Olen: From the partitioning point of view, what we’ve dealt with, I think all of us, is that system architects tend to focus on design intent, and partitioning is largely implementation. Not only, but it’s largely implementation. So there’s that bridge between intent and implementation, and thankfully, that bridge keeps all of us employed.
