Second of three parts: Specific processors, better utilization of chip real estate, more accurate measurements, and better management of data.
Low-Power Engineering sat down with Barry Pangrle, solutions architect for low-power design and verification at Mentor Graphics; Cary Chin, director of technical marketing for low-power solutions at Synopsys; Vic Kulkarni, general manager of the RTL business unit at Apache Design Solutions; Matt Klein, principal engineer for power and broadcast applications at Xilinx; and Paul van Besouw, president and CEO of Oasys Design Systems. What follows are excerpts of that conversation.
LPE: What’s the best practice for dealing with power in complex chips?
Van Besouw: In today’s chips, with 50 million or 100 million gates, it’s inconceivable that the whole chip is functioning at the same time. You have to make decisions at the architectural level as to what’s on and off and which voltage islands to use. It’s another level of complexity. We’re just scratching the surface of trying to manage that. It’s one thing if you’re doing it on a small design. It’s another if you’re doing it on a complex SoC. There’s still a lot of headroom.
Chin: There are two separate things we’re dealing with. One is high-level complexity. Everyone agrees you have more leverage at the architectural level. But if you look at dynamic power optimization, a lot of what we can do at the implementation level we can do better if we understand which modes of operation we’re trying to optimize for. If you’re trying to optimize your silicon, even your physical implementation, for a specific case, then you can optimize for that case very well. If we can take our optimization and weight it in the direction we want, whether that’s the mode the design is in 75% of the time or the maximum operating condition or anything else, understanding those things has become very important. We have ways of driving our tools with activity constraints. We can put in specific vectors. All of these things can lead to different implementations, both logically and physically, of the actual circuit. Circuits can be optimized for certain conditions. But the problem I’m seeing today is that even in the few cases where we have information about switching activity, there’s no context to tell us, ‘This is the switching activity for this particular mode, with these pieces powered down and these pieces running.’ We need to add that kind of information. That will make today’s tools work much better.
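To make the mode-weighted optimization Chin describes concrete, here is a minimal sketch in Python; the modes, usage profile, candidate names, and power numbers are all hypothetical:

```python
# A minimal sketch of mode-weighted power evaluation. The modes, usage
# profile, candidate names, and power numbers are all hypothetical.

# Fraction of time the device spends in each operating mode.
usage_profile = {"idle": 0.60, "talk": 0.25, "gaming": 0.15}

# Estimated power (mW) of two candidate implementations, per mode.
candidates = {
    "impl_a": {"idle": 2.0, "talk": 120.0, "gaming": 450.0},
    "impl_b": {"idle": 5.0, "talk": 100.0, "gaming": 380.0},
}

def expected_power(per_mode_mw, profile):
    """Weight each mode's power by the time spent in that mode."""
    return sum(profile[mode] * per_mode_mw[mode] for mode in profile)

for name, per_mode in candidates.items():
    print(name, round(expected_power(per_mode, usage_profile), 1), "mW")

best = min(candidates, key=lambda c: expected_power(candidates[c], usage_profile))
print("best for this profile:", best)
```

With a different usage profile, a different implementation wins, which is exactly why the context attached to switching activity matters.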
LPE: Doesn’t that make it harder to devise a derivative chip because you’re optimizing for each market?
Chin: Not really. Imagine if you had a device that ran twice as long just for gaming. Or you have a phone that gives you three weeks of talk time if you don’t do any gaming. This is an opportunity from the design side and the manufacturing side. In an era of dark silicon, where we have more gates than we can power at any given time, we can customize that hardware. Let’s push in the direction of specific implementations that allow you to optimize for all these applications. It’s also a way for people to differentiate their products.
Kulkarni: We almost need to reverse the paradigm of design. Why should we have one embedded processor architecture and use it for various applications such as YouTube or e-mail or music? We should do it in reverse. What kind of architecture does Facebook need? There could be a ‘processor-Facebook’ and a ‘processor-YouTube.’ The stimulus is different, the power consumption is different, and stimulus is becoming more critical for power consumption. You essentially are taking your VCD (value change dump) set and analyzing your function’s power based on it. The real problem is power, but we end up searching for solutions wherever a VCD set happens to be available. So now you can start talking about power budgeting. What will happen to the power grid design? Will you create EMI problems or electromigration problems? Will it blow up while you’re using a Facebook application?
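Stimulus-driven analysis of the kind Kulkarni describes boils down to weighting each net’s switched capacitance by how often the stimulus toggles it. A minimal sketch, with hypothetical nets and numbers; a real flow would read the toggle rates out of a VCD or SAIF file produced by simulation:

```python
# A minimal sketch of stimulus-driven dynamic power estimation. The nets,
# capacitances, and toggle rates are hypothetical; a real flow would read
# the toggle rates out of a VCD or SAIF file produced by simulation.

VDD = 0.9      # supply voltage in volts
FREQ = 1.0e9   # clock frequency in Hz

# Per-net switched capacitance (farads) and switching activity
# (average transitions per clock cycle) for a given stimulus.
nets = {
    "alu_out":   (2.0e-15, 0.30),
    "cache_bus": (8.0e-15, 0.12),
    "ctrl_fsm":  (0.5e-15, 0.05),
}

# Classic dynamic power model: P = sum(alpha * C * Vdd^2 * f) over nets.
p_dyn_watts = sum(alpha * c * VDD**2 * FREQ for c, alpha in nets.values())
print(f"dynamic power for this stimulus: {p_dyn_watts * 1e6:.2f} uW")

# Re-running with the activity of a different workload (a different VCD
# set) yields a different number, which is Kulkarni's point.
```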
Chin: Today you can carry around a navigation system in your car that communicates back into the cloud in real time so other people know to avoid traffic congestion. Wouldn’t it be great if we could use that same technology to understand which modes in the chip are being used? That’s exactly what we want to know. Here is the switching activity that millions of people are generating. We could build that today with EDA tools that can run on your phone. We don’t have that information. We need to customize based on usage, all the way up to the application, so we really can have a Facebook processor.
Pangrle: A lot of the architects have been focused on performance. It varies from application area to application area, but the guys doing processors with standard instruction sets know what they need to cover and how to get more performance. They need to extend what they’re doing to include both performance and power.
Chin: And they need the analysis for power. We do really good analysis for performance these days, but for power it’s a lot more nebulous. And to a certain extent we do performance analysis statically, but in the power realm it depends on what you’re executing. At some point, when you’re averaging across too many cycles, you lose the information you need. Today it’s still important to look at dynamic capabilities, especially with power budgeting. What people want to know is how much resolution they need. You need to know where all the peaks are. Power today reminds me of dynamic simulation 20 years ago, when people looked at the timing of specific long paths.
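Chin’s point about resolution is easy to demonstrate: the same power trace reports very different peaks depending on the averaging window. A minimal sketch with a synthetic, purely illustrative trace:

```python
# A minimal sketch of how the averaging window hides power peaks.
# The per-cycle trace is synthetic and purely illustrative.

import random

random.seed(0)
# Mostly-quiet trace (10 mW baseline) with rare 210 mW bursts.
trace = [10 + (200 if random.random() < 0.02 else 0) for _ in range(10_000)]

def peak_at_resolution(trace, window):
    """Peak power after averaging over fixed windows of `window` cycles."""
    means = [sum(trace[i:i + window]) / window
             for i in range(0, len(trace) - window + 1, window)]
    return max(means)

print("overall average:", sum(trace) / len(trace))
for window in (1, 10, 100, 1000):
    print(f"peak at {window}-cycle resolution:",
          round(peak_at_resolution(trace, window), 1))
```

At cycle resolution the bursts are visible; averaged over a thousand cycles they all but disappear, even though the overall average never changes.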
Pangrle: Being able to look at different modes is important. But if everything comes in at the same activity level, you can’t distinguish between modes or optimize for them. Being able to capture, for each mode, the activity associated with that mode is a big help. The things that can be done at an architectural level will have an impact downstream, even in regard to the tools needed to get the job done. This is like multicorner-multimode in logic synthesis or place-and-route. You need to make sure you’re not just meeting performance, timing and power at one operating point. These designs operate across multiple voltages. You need to make sure you’re safe across all the process corners as well as all the operating modes. That’s having an impact on the downstream tools. I like the idea of having a different processor for a different application.
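The multicorner-multimode flow Pangrle mentions amounts to verifying every (corner, mode) combination against its budgets. A minimal sketch, with hypothetical corners, modes, budgets, and a stubbed-out analysis step:

```python
# A minimal sketch of a multicorner-multimode (MCMM) check: every
# (corner, mode) scenario must meet timing and power budgets. The
# corners, modes, budgets, and the stubbed analysis are hypothetical.

from itertools import product

corners = ["ss_0.81V_125C", "tt_0.90V_25C", "ff_0.99V_-40C"]
modes = ["sleep", "normal", "turbo"]
budgets = {"min_slack_ps": 0.0, "max_power_mw": 500.0}

def analyze(corner, mode):
    """Stand-in for signoff timing and power analysis of one scenario."""
    return {"slack_ps": 12.0, "power_mw": 340.0}  # fake, fixed results

failures = []
for corner, mode in product(corners, modes):
    result = analyze(corner, mode)
    if (result["slack_ps"] < budgets["min_slack_ps"]
            or result["power_mw"] > budgets["max_power_mw"]):
        failures.append((corner, mode, result))

print("scenarios checked:", len(corners) * len(modes))
print("violations:", failures if failures else "none")
```

The cost is that the scenario count multiplies, three corners by three modes already means nine full analyses, which is the downstream tool impact Pangrle refers to.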
LPE: It seems that another large EDA company has pitched a similar idea.
Klein: That’s where the FPGA is unique. It is, by definition, meant for an arbitrary number of modes of operation. We can dynamically reprogram sections of the FPGA while other parts are running. You can change functionality based upon what you see coming in. Additionally, because we have a programmable device, we need to build in hierarchical levels of gating that people doing synthesis can take advantage of, or that smart designers who understand all the modes of operation can utilize. We have hierarchical levels of clock gating. You can globally gate off the clocks with a one or a zero. At a regional level you can have multiple clocks to gate off tens of thousands of flip-flops or block RAMs or DSPs at one time. Then we have finer gating at the individual block level. Each of them has its benefits and drawbacks. We also look at whether the contents of a flip-flop will be consumed on the next clock cycle. If not, we can gate it off for that clock cycle only. Knowing the functionality would be helpful for more global analysis, but if we don’t put the hardware capability in there in the first place to gate off locally, regionally and globally, it won’t matter what the software does, because we won’t have the hardware features to take advantage of it.
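In software terms, the hierarchy Klein describes is an AND of enables at each level, with the local enable driven by whether a register’s contents will be consumed on the next cycle. A minimal sketch; the consume pattern is hypothetical:

```python
# A software model of the hierarchical clock gating Klein describes:
# a register's clock toggles only when the global, regional, and local
# enables all agree, and the local enable fires only when the register's
# contents will be consumed next cycle. The pattern below is hypothetical.

def clock_enable(global_en: bool, regional_en: bool,
                 consumed_next_cycle: bool) -> bool:
    """AND together the levels of the gating hierarchy."""
    return global_en and regional_en and consumed_next_cycle

# Eight cycles of one register in an active region; its downstream
# consumer only reads the value on some cycles.
consume_pattern = [True, False, False, True, True, False, True, False]
active = sum(clock_enable(True, True, c) for c in consume_pattern)

print(f"clock active {active} of {len(consume_pattern)} cycles; "
      f"{len(consume_pattern) - active} cycles of toggle power avoided")
```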
Van Besouw: You have to make assumptions. The same functionality may be used in many different modes. That’s interesting, because for one mode you may write completely different RTL. What you generate as an end product may be completely different. It may have different timing and physical constraints.
Chin: And these days that’s not beyond the realm of possibility. You can implement multiple modes because you have much more silicon than you can use. So why not have the Facebook processor as well as the gaming processor on the same chip? You can power up the different sections based upon what you’re doing. In total you can save a lot more power. The tradeoff has always been timing and area. Now it’s timing, power and area, and area is probably third on the list these days. There are a lot of transistors on that chip, and figuring out what to do with them is something we’re having problems with. The best way to control leakage is to shut things down. We’re starting to approach more optimal implementations. It’s the reverse of resource sharing: more and more hardware with specific functions.
Kulkarni: Beyond power analysis, how do you refine the power band? The assumptions you make at RTL almost always get thrown off the moment you go to clock-tree synthesis and place-and-route.
LPE: Meaning that when you take real measurements they’re not accurate?
Kulkarni: Yes, they may be off. So it’s not just the tools. Power budgeting is a set of tools and a methodology for the whole refinement from ESL to RTL to CTS to P&R and power-grid design. The capacitance can go haywire. Within the clock tree, what are the so-called source and leaves? What happens to the mesh clock structure if P&R tools alter it to do timing optimization? You can throw off all the assumptions you made at the RT level for power consumption unless there is a methodology that defines power accuracy, or inaccuracy, at each step. The plus or minus 30% at RTL should come down to 3% to 5% when you are doing final dynamic voltage-drop signoff. The power intent will tell the tools what to do, but CPF and UPF do not tell you how to implement the low-power design.
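The refinement Kulkarni describes can be pictured as a power band that narrows at each stage of the flow. A minimal sketch; the stages mirror the flow he names, but the nominal power and the intermediate error bounds are illustrative, with only the plus or minus 30% and the 3% to 5% target coming from the discussion above:

```python
# A minimal sketch of the power band narrowing through the flow. The
# nominal power and intermediate bounds are illustrative; only the
# +/-30% RTL figure and the 3-5% signoff target come from the text.

nominal_mw = 400.0
stages = [
    ("ESL estimate",    0.50),  # hypothetical early bound
    ("RTL estimate",    0.30),  # the plus-or-minus 30% mentioned above
    ("post-CTS",        0.15),  # hypothetical
    ("post-P&R",        0.08),  # hypothetical
    ("dynamic signoff", 0.04),  # the 3% to 5% target
]

for stage, err in stages:
    low, high = nominal_mw * (1 - err), nominal_mw * (1 + err)
    print(f"{stage:16s} {low:6.0f} to {high:6.0f} mW  (+/-{err:.0%})")
```

Whether each hand-off actually narrows the band, instead of silently invalidating the RTL-stage assumptions, is exactly the methodology question Kulkarni raises.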