Experts At The Table: Hardware-Software Co-Design

Second of three parts: Disjointed design schedules; approximately timed modeling; the limits and benefits of RTL, emulation and FPGA prototypes; general-purpose vs. highly specific processors in SoCs.


By Ed Sperling
System-Level Design sat down to discuss hardware-software co-design with Frank Schirrmeister, group marketing director for Cadence’s System and Software Realization Group; Shabtay Matalon, ESL market development manager at Mentor Graphics; Kurt Shuler, vice president of marketing at Arteris; Narendra Konda, director of hardware engineering at Nvidia; and Jack Greenbaum, director of engineering for advanced products at Green Hills Software. What follows are excerpts of that conversation.

SLD: How often is co-design really warranted?
Greenbaum: The case for power and performance may be more common than you think. Looking at the products from Nvidia, is a Tegra processor a standard product or an SoC? You put a Tegra in an automobile running an IVI (in-vehicle infotainment) system and maybe an instrument cluster. That instrument cluster gets prototyped on a desktop. Then someone gets the idea of taking that high-polygon-count design and putting it in a car. Now you have a cost you have to optimize. If I can’t do it with today’s Tegra, can I do it with the next one? You’re not going to know. That’s one advantage of co-design. You can know before the sand gets melted. Even if your application can be implemented in software, co-design that lets you know at what price point you can do it is very valuable.
Konda: We are delivering a solution today that is not just silicon. It used to be just silicon 10 years ago. But today it’s a complex piece of hardware and highly complex software. These two things have to come together. First and foremost, we have to make sure the design is in a semi-working condition before bringing in the software team to develop their code. To do that, we cannot wait for emulation or an FPGA. These bits and pieces of a design where we have multiple cores, multiple processors in that SoC (20 to 25 processors, components and interfaces) create a very complex device. RTL is not available all the time. Some parts of the design are in RTL, some parts are in C models. As soon as we have a little bit of confidence we want to encourage the software team to come in. That’s very early in the design cycle.
Shuler: Why do you wait so long? Why don’t you do it as soon as you get the initial requirements?
Konda: We have been doing that for a number of years, but it’s still not a full-fledged solution. In our case, we have 30% to 40% of the design modeled at the very beginning and software teams are already working on that model. But it is not the entire SoC. They are working on the GPU or CPU portion of the SoC. How do we model all of these peripherals? That’s not there yet.

SLD: There are two trends unfolding in IC design. One involves a general-purpose processor, where you may leverage one or more cores and only a specific amount of memory. The second is a very specialized processor where you may run a specific application. How does co-design deal with these different approaches?
Matalon: One issue is really validating the spec. The ideal solution is to capture your specification as early as possible as an executable specification, where you don’t just generate UML diagrams but simulate it dynamically to represent the conditions under which the specification is working. This is a level above implementation. That’s very important. If you go one level down and start doing partitioning in LT mode, you can’t really evaluate the tradeoffs. It’s good to refine the specification, but you can’t validate that the performance, bandwidth and power are there. In my view, the key is to first focus on architectural exploration to make sure you have the right performance and power. For that you need an approximately timed model that allows you to do an evaluation of performance and power for your standard processor, for your specialized processor, for multiple cores, for all the combinations and topologies. It cannot be too low-level, so that you can use it to do power/performance evaluations, do a power budget, and, if you need to, shut it off and run other parts very fast. That’s where I see the ideal solution. Some customers are using it and many customers are not. You can’t wait for the RTL. It’s too late. You can do co-design, co-validation through all the stages of implementation, but from a design perspective for these types of designs you have to start above RTL.
Shuler: When we’re talking co-design, it’s a people problem, not a technology problem. You don’t see kick-offs where there are hardware and software people in the same room, or even where the architects and verification people are in the same room. The semiconductor vendor is responsible now. When you think about it, the real customers of the semiconductor companies are the software vendors.
Greenbaum: Absolutely. And very few semiconductor companies recognize this.
Matalon: It’s not as bad as that, but it’s not yet the prevalent methodology. Co-validation is already quite entrenched, because emulation, acceleration and virtual prototypes are really co-validation. The co-design—evaluating the performance and power—is still at the early stages.
Greenbaum: The big difference is between software drivers written for verification or validation and those written for real applications. The semiconductor vendors that are doing the worst job of delivering a full platform of silicon and software don’t understand the difference. They’re delivering verification code, but when you try to use it in a software environment it rolls over and dies very quickly. There is a spectrum of companies that get it, and if you look at the acquisitions in the embedded software arena (Cavium acquiring MontaVista, Intel acquiring Wind River, the in-house Linux teams that are pervasive and Mentor with the Nucleus product) we’re seeing the recognition there. But only the top vendors are there today.
Schirrmeister: There is always the Yin and Yang in here. We have polar opposite trends. There is the generalization of the processor, which is meant to not shoot yourself in the foot. In the embedded space you have Java-based applications. Those development environments are built in a way that is very abstract. On the other hand there are highly specialized processors enabling highly specialized applications—highly specialized hardware with an abstraction layer and then the application development environment built on top of it. Now, going back to the models, if you had the model generator where you just talk to it and it creates the AT model, that would be the perfect environment. As a practical matter, what chipmakers are looking for is the ability to mix and match. The AT model is great to represent some of those effects, such as area, power and performance. But in the next version, the question might be slightly different. That makes it very hard to build those AT models.
Shuler: We have to do all three in addition to RTL. We have cycle-accurate, loosely timed and approximately timed. You never know what people will need.
Schirrmeister: As a practical matter, there are a lot of people using AT models. But in parallel people are taking the appropriate model for each part of the system and hooking them together. So you have emulation or rapid prototyping for the pieces that are already stable, which is where IP re-use comes in. You may not have to rebuild them as an AT model. Having a processor model of the next big.LITTLE chip to execute the software, and being able to analyze the performance of the subsystem that does more computing, allows you to create the right mix. Would the perfect environment be to have AT models for everything? Absolutely. Can you practically build AT models for everything? It may not be possible all the time.
Matalon: AT models are ideal because the other models are too slow or don’t have sufficient information. I disagree that it is difficult to build them. We automate that from simple definitions. The challenge we see sometimes is that the functional model doesn’t exist. You have a design that is very complex and now you want to build a functional description that is equivalent to RTL. To wrap it with timing and power can be fully automated. Even the entire platform can be fully automated. But when you have a complex design and you want a functional abstract model, you have to write it yourself. If you put the RTL on an emulator and connect it to the rest of the models, you are missing some of the capabilities for evaluating power or performance and you’re using RTL again, so what’s the point?
Konda: On the models front, it’s true you will not be in a position to provide models for the entire SoC. And to expect a functional model from an EDA vendor is not realistic. In an SoC environment, we have a number of interfaces and devices that get attached to the SoCs that are standard specs. At Nvidia we design our own CPUs. We also use ARM CPUs. We have a GPU. If you look at where these teams are, the CPU guys are in Santa Clara, the video guys are in Shanghai, and the simple interfaces like an SD card are in India. To realize the SoC model there is no common platform that pulls all these things together. The CPU and GPU guys are forced to develop a high-level C model. They start doing their work earlier in the cycle because they are forced to. Each team is doing whatever they have to do. But bringing all of these things together to create an SoC is the biggest missing piece. If you run a Facebook or a Twitter application, power and performance are key. So how do we estimate the power consumption? We cannot do this in one platform. With emulation it is too slow. With an FPGA, by the time the FPGA starts working the chip has already come back from the fab. It is a mix of parts of the design on an emulator or an FPGA, which is a real model, and then you hook up the rest of the design that gives a good approximation of the real system. It may not be highly accurate, but if you can estimate power and performance plus or minus 10% that’s still great.