Experts At The Table: Making Software More Energy-Efficient

Second of three parts: Providing enough information to software teams; modeling realistic scenarios; optimizing subsystems; changing attitudes among software developers; real optimization from further down in the stack.

popularity

By Ed Sperling
Low-Power Engineering sat down to discuss software and power with Adam Kaiser, Nucleus RTOS architect at Mentor Graphics; Pete Hardee, marketing director at Cadence; Chris Rowen, CTO of Tensilica; Vic Kulkarni, senior vice president and general manager of Apache Design, and Bill Neifert, CTO of Carbon Design Systems. What follows are excerpts of that conversation.

LPE: How do you get around the fact that there isn’t enough information available to the software team?
Kaiser: You could profile.
Neifert: You can profile to a point. A high-level virtual platform is too abstracted. It doesn’t have the concept of cycles in there. You need a level of accuracy with sufficient instrumentation. Often it’s not just doing any one single task. It’s what happens when these tasks intersect. What happens when you’re watching a video on your phone, talking to another person and someone calls on another line? It’s power drain happening at the same time, and you don’t necessarily test that from a system perspective.

LPE: Where are we starting to see the most energy consumed by software? Is it at the embedded level or further up the stack?
Kaiser: It can be anywhere.
Rowen: I don’t know how you can really separate power dissipation of software from hardware. You have to look at different subsystems. What’s the baseband subsystem doing versus the imaging subsystem versus the audio subsystem versus the graphics subsystem? Each one of those is going to be some compound of hardware and software issues. You can look at the independent worst-case scenario for each of them. Your first job is to make sure you have good characterization of what’s going on in each subsystem. Then you get to these interesting interactions. If you’re playing back a video you know you’re not doing maximum download on your wireless connection. Or when you’re recording a video, you know that something else is not happening. People have been forced to move from simple subsystem-by-subsystem worst-case analysis to looking at the whole interaction. It’s largely because they can get to a smaller worst-case number than if they didn’t consider scenarios.
Kulkarni: We found that with worst-case scenarios it’s easier to manage the power and to do hardware-software co-simulation, but within the subsystem itself there are so many different modes of operation that co-simulation gets even more interesting. It’s one level of power or energy reduction if you shut off the subsystem, but below that level how do you optimize that subsystem? You’re running software applications, which have to be co-simulated with the hardware. Relative accuracy becomes critical, although not necessarily the absolute accuracy. So how do you generate the testbench? How do you create power patterns? Selecting the critical energy-consuming patterns becomes a challenge. It’s one thing to model or create instrumentation for your software application, but you need a meaningful set of vectors for power consumption. Most of the functional testbenches are useless from a power consumption point of view. Looking at finite-state machines gets to be more and more critical. From the software application, how do you translate that into finite state machines that are control registers, which will then be translated into RTL? And then software is managing all of that. With one mobile phone application we worked with three different vendors for IP models, SystemC, an OSCI simulator and then a five-minute talk time. It would have taken about three months if the customer had not created higher-level models and energy-consuming signals out of that whole environment running together.

LPE: So the hardware guys are worried about power, but the software guys aren’t even thinking about it. How do we change that? Is putting up an ammeter enough?
Neifert: The ammeter is certainly a start. It’s a lot better than what they have today on the software side. We always have talked about concurrent engineering, and more and more processes get applied to that. This is just the next application. The first key is to provide a mechanism, then leverage it across everything and make sure that mechanism is as accurate as possible. But even a relative number is essential. Does this setting take 20% more? Give engineers good tools and they’ll figure out how to apply them.
Rowen: It’s the same as with a video game. If you give someone real-time feedback on the effect of what they’re doing, and they know what the green zone is doing versus the red zone, they’re amazingly effective at getting the needle down into the green zone and keeping it there.
Hardee: It’s not just optimization, especially at the higher levels of the stack. It’s more a case of, ‘The power consumption in this model is worse than the previous model. Something is wrong with the software. Fix it.’ And then the software guy goes off and finds the routine that is polling the modem way more often than is necessary and preventing it from going into sleep mode. It’s those gross errors that are being found when something goes wrong. Are we using the right capability, the right parallelism, the right pipelining or the right architectural facet of the platform to run the right piece of software? Those decisions are usually way down the stack. That’s something the operating system and the drivers have to understand for the various system calls that are going on. When you get into those lower levels of software, you need an accurate model of the platform that can start to tell you the energy usage you’ll get with those various selections to optimize further down the stack. With the application, it’s as simple as looking for what’s keeping something on when it should be off. For true optimization, you’re looking lower down the stack. You could hit the same problem with FPGA prototypes. You can run a decent portion of real-time and you can run some vectors, but what’s your characterization? You’re mapped to a prototype that doesn’t bear any relationship to real silicon. You need activity plus the characterization with enough compute power to run deep, real system modes.
Kulkarni: You need to turn this whole problem on its head. Why do you have to run Facebook versus YouTube versus GPS software on the same processor design? Why not create a Facebook processor rather than running it on a general-purpose processor? People are writing software applications ranging from medical imaging to health care to whatever else you need, and then tuning the hardware to that. And there will be multicore hardware implementations where it makes sense.
Rowen: That’s absolutely the case. One of the fundamental dynamics to emerge is that as power has become so much more important, people have begun to look at power as the ultimate goal and figuring out how everything else serves that goal. If that means you’re going to build a processor around an application, rather than the other way around, you’ll do it if it saves meaningful amounts of power. There are two key elements. We’ve talked about, if you can measure power that can help you make decisions about one processor versus another. The other angle changes the nature of the processor itself. You want processors where what software you run matters to power. That isn’t an obvious characteristic. A lot of people say that as long as every instruction dissipates power then that’s all you have to worry about. All you have to do is go find the one that consumes the most power and you beat down that one as much as possible. But you’re going to spend very little of your time running the worst-case instruction. You’re going to be running a mix of things. And even within your worst-case task, you’re not going to run your worst-case instruction all the time. You need internal mechanisms for clock gating, power gating logic reduction, so the difference between the lowest-power instruction compared with the highest-power instruction is no more than a factor of 10. If you’re running a lightweight mix, that will use an order of magnitude less power than something that does 128 multiplies in a single cycle. By having this big dynamic range you reduce average power and you make software matter. The programmer has implicit or explicit control over what instructions to use, so they can determine how much power to dissipate. You really need to provide people with energy feedback.
Hardee: Having those processor architectures that match the task is highly critical. But you only get a handful of programmers able to use that unless you have the compiler technology to match. You have to be able to automate and not leave it to the individual programmer to choose which instructions to use. The compiler has to be able to compile for performance versus power, just as you are with synthesis constraints in hardware, and it’s going to need to help me through automation to do the right thing.
Kaiser: Yes, we do need feedback. That needs to be there real time, if possible, and it should be better than an ammeter. You need to be able to graph it and correlate it to what’s running in the system, so when software engineers see a spike they need to know. But there’s another issue. Hardware provides a lot of knobs. The guy writing the algorithm is going to use them as little as possible. He will use those settings unless you tell him what those knobs do and why he needs to move them. Software engineers have no reason to change them. If the 128-matrix multiplication works, then they’re done. It’s functional. Power has been an afterthought for years and years.