Experts at the Table: Power and performance issues at the most advanced nodes.
Semiconductor Engineering sat down to discuss power optimization with Oliver King, CTO at Moortec; João Geada, chief technologist at Ansys; Dino Toffolon, senior vice president of engineering at Synopsys; Bryan Bowyer, director of engineering at Mentor, a Siemens Business; Kiran Burli, senior director of marketing for Arm's Physical Design Group; Kam Kittrell, senior product management group director for Cadence's Digital And Signoff Group; Saman Sadr, vice president of product marketing for IP cores at Rambus; and Amin Shokrollahi, CEO of Kandou. What follows are excerpts of that discussion. To view part one of this discussion, click here and part three here.
SE: What do engineering teams need to keep in mind as they start dealing with more disaggregation in custom and semi-custom designs?
Burli: From an architecture perspective, when you start designing your power grid for libraries, etc., how do you do that for mobile versus infrastructure, and how do you do it for an NPU versus a CPU? That's completely different. For starters, you need lots of flexibility around a power grid. People tell us they want to pack more into a design, and they want to get to about 80% to 85% utilization. How do you bring in all of that power requirement when everything starts switching immediately? And then we have big data centers that tell us, 'I have a PDP (power distribution panel) and a thermal design power budget, and you guys need to hit that budget. What are you guys going to do to help us get there? We might be running fast at certain workloads, and in certain cases we might be running slower.' This is a massive challenge. If you look at sign-off, typically people say 10% sign-off IR drop is good enough. But that's not what's happening now. People running these designs are seeing well over 10%, and then they say, 'Oh, maybe I need to put more margin somewhere else.' So do you put that margin in and keep adding margin on top and not put in functionality? Or do you start looking for a way to stabilize your power? If you build new circuits on top, how do you stabilize power? Or how do you bring in power much better? Do you use something like adaptive clocking?
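As a rough illustration of the margin trade-off Burli describes, the sketch below shows how an IR-drop sign-off budget translates into usable supply voltage, and how overshooting the classic 10% target eats into the remaining margin. The supply voltage and overshoot figures are assumptions for the example, not numbers from the panel.

```python
# Hypothetical illustration of the IR-drop budgeting trade-off discussed above.
# The supply voltage, sign-off percentage, and overshoot numbers are assumptions
# for the sake of the example, not figures from the panel.

NOMINAL_VDD = 0.75          # volts, assumed advanced-node supply
SIGNOFF_IR_DROP = 0.10      # the classic "10% is good enough" sign-off target
OBSERVED_IR_DROP = 0.14     # what a high-utilization workload might actually see

def usable_vdd(nominal: float, ir_drop_fraction: float) -> float:
    """Supply voltage left at the cell after IR drop."""
    return nominal * (1.0 - ir_drop_fraction)

signoff_floor = usable_vdd(NOMINAL_VDD, SIGNOFF_IR_DROP)
actual_floor = usable_vdd(NOMINAL_VDD, OBSERVED_IR_DROP)

# The gap is what either eats timing margin or has to be recovered by
# techniques such as adaptive clocking or a stiffer power grid.
print(f"Sign-off floor: {signoff_floor:.3f} V")
print(f"Observed floor: {actual_floor:.3f} V")
print(f"Extra margin consumed: {(signoff_floor - actual_floor) * 1000:.0f} mV")
```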
SE: We’re seeing much more focus on hardware-software co-design to improve performance and lower power, but much of this is very specialized to specific markets. How will this kind of approach affect designs?
Geada: One of the fundamental things it does is change the role of EDA. You can't wait until you get silicon to find out if this thing is going to work under all scenarios the software can do and all usage conditions that you envision for this functional part. That means you need software that can deal with this problem as early as possible, and at scale. You can think about putting thousands of sensors on a chip, which is interesting, but by that time you already have a chip. You need to have some envelope bounding the operation of this design to make sure that whatever the software is doing, you've covered all the cases and that it's going to mostly stay in the box. That requires, in our opinion, a different approach to building EDA software. It has to be something that works at scale, that can deal with software-scale vectors, and find interesting conditions in that space. Then it needs to tell you that if your software does a particular thing, then your thermal budget might run out and you may want to think of a different way of cycling through your cores. Or if your software does something at the same time as the SerDes, you're going to have a voltage drop 'over here' that you're going to have to mitigate somehow. You need to see the entire picture all at once and analyze it at scale. You need analytics that can actually tell you, 'What can I do with this? How do I improve my design? How do I make it work under this bounding box of functionality and power and thermal constraints? How do I make it all work, how do I make it yield, and how do I make money?'
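A minimal sketch of the kind of envelope-bounding check Geada describes, reduced here to scanning per-interval power samples from candidate software scenarios against a thermal budget. The scenario names and all numbers are invented for illustration.

```python
# Toy envelope check: flag software scenarios whose sustained power would
# exceed a thermal budget. All scenario data here is invented for illustration.

from statistics import mean

TDP_BUDGET_W = 15.0   # assumed thermal design power budget

# Per-scenario power samples (watts) over equal time intervals, e.g. produced
# by power analysis of software-scale activity vectors.
scenarios = {
    "all_cores_plus_serdes": [14.2, 16.8, 17.1, 15.9],
    "staggered_core_wakeup": [11.0, 12.4, 13.1, 12.7],
    "idle_with_bus_chatter": [6.5, 6.9, 7.2, 6.8],
}

for name, samples in scenarios.items():
    sustained = mean(samples)
    peak = max(samples)
    ok = sustained <= TDP_BUDGET_W
    print(f"{name}: sustained {sustained:.1f} W, peak {peak:.1f} W -> "
          f"{'within budget' if ok else 'EXCEEDS budget, rework scheduling'}")
```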
Toffolon: Complexity is going up everywhere. It's going up in software, and it's going up in interconnects. These interconnects don't look like they did five years ago, when they were part of a hardened design, 'set-and-forget' type of links. These links now include a huge amount of firmware and software, because you need to be able to sense your environment, adapt in real time, and really optimize the performance of the link. For a lot of these serial links now, the bottleneck and the challenges are not primarily analog. They're on the firmware and algorithms side, and you need to optimize those things under various operating conditions. That's a paradigm shift.
Kittrell: We’ve seen that once customers have tested out a chip and it’s in production — especially a multi-core CPU system going into high-performance computing like in the cloud or on the edge — they discover power escapes. They’re burning a lot more power than they thought because they weren’t bringing down a core properly, or they were causing some sort of bus chatter in the midst of switching things around. Then they have to go back and identify that and change their firmware in order to to mitigate this problem. So having the ability to test the firmware, or at least anticipate that capability very early on in the architecture, is very important.
Bowyer: You have to put margin in your architecture. You have to leave configuration in the hardware that you don't think you need today in order to be able to do this stuff. We're seeing a lot of that, where the hardware is over-controllable. There are things you never expect to change, but you put them in there just in case. That's true for AR/VR, AI, 5G — they all start to look like mini-processors with all of this re-routing, reconfiguration, reorganization, just because you're worried something might happen. Everyone would be happier if there was some way to go and simulate all of this in a way that you could really trust, where you have all the power data and you really understand everything and nothing's going to change. But the reality we face today is that the hardware has to be built to be reprogrammed by software to fix all kinds of problems, including power.
SE: Do you have that margin anymore? Once you start getting down into the advanced nodes, that margin starts becoming really sought after by a lot of different resources.
Bowyer: It’s a it’s a very tough problem design problem. But for sure, there’s a lot of margin built in at the architecture level to be able to handle this stuff that maybe is never going to be needed.
Shokrollahi: We are often asked by our customers to minimize firmware. Even with very advanced SerDes, they want some kind of intelligent equalization that does not use firmware. Firmware also can be a security issue.
Sadr: There are interfaces where, for reliability reasons, maybe we opt not to implement a software-based solution. But at least in the broad interfaces that we see, software has crept in over the past 10 years to the point where you now have to put that into your architectural planning. If you're using a microcontroller, software has to be a part of it. Maybe five years ago or so, we started seeing that without software adapting the hardware, the power consumption spread across PVT (process, voltage and temperature) variation was about 30% to 40% relative to nominal cases. It's absolutely a must nowadays to deploy that software overlay to optimize your circuit in order to bring that to something on the order of 10% across PVT. We don't have margins of 30% anymore, so it absolutely has to be a part of your architecture planning — software in conjunction with hardware — for the interfaces that we plan and deal with. Those are becoming a part of the challenge themselves. Microcontrollers were supposed to be a simple thing you put into your infrastructure, but now they're becoming a bottleneck because firmware is not running as fast as your 112 gigabit per second interface. It's running at a much lower speed, but it has to manage something that is adapting very quickly. If you want to beef up that software processing component, now you are burning additional power. So there's a lot of smarts required in the planning of that. It's a firmware element, and it's showing up in every part of the vertical stack for these interfaces. And so security is becoming a concern, as well — how to secure the whole datapath.
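The mismatch Sadr points to — a firmware loop running far slower than the 112 Gbps data path it has to keep tuned — can be sketched as a periodic monitor-and-adjust loop. The monitor model, margin target, step size, and update cadence below are all assumptions for illustration, not a description of any vendor's implementation.

```python
# Hedged sketch of a slow firmware adaptation loop supervising a fast link.
# The link itself runs at line rate in hardware; firmware only nudges a
# setting (here, an equalizer tap weight) every few milliseconds based on a
# monitored margin. Readings and thresholds are invented for illustration.

import random

TARGET_MARGIN = 0.28      # assumed normalized eye-margin target
STEP = 0.02               # assumed equalizer adjustment step per firmware update

def read_eye_margin(tap_weight: float) -> float:
    """Stand-in for a hardware margin monitor (hypothetical model)."""
    return 0.30 - abs(tap_weight - 0.5) + random.uniform(-0.02, 0.02)

tap_weight = 0.30
for cycle in range(20):                  # each iteration ~ milliseconds of firmware time
    margin = read_eye_margin(tap_weight)
    if margin < TARGET_MARGIN:
        # Nudge the tap toward the (assumed) optimum; a real loop would use
        # sign/gradient information from the monitor rather than a fixed direction.
        tap_weight += STEP
    print(f"cycle {cycle:2d}: tap={tap_weight:.2f} margin={margin:.3f}")
```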
SE: And some of these chips are going to be used for longer periods of time, too, right? You have to build these devices securely, deal with aging circuits, and all of this is more complicated, which is boosting the price. And all of this needs to be monitored more than in the past. What’s the solution?
Bowyer: More custom hardware. You can't generalize. You have to be very targeted and specific about what you want to do, and you have to build something that's going to fit. If it wasn't for power, why not just throw a whole grid of CPUs at the problem? That's really what's driving a lot of the custom hardware today — getting the energy down in an efficient way. But then you have to build custom hardware, and that's a big task. It's hard to predict if you're going to succeed. It's much harder now than 10 years ago to know, when you start a design, whether it is really going to work. Especially in AI, we see a lot of companies that get most of the way through their hardware design process and then have to scrap it and try again, because they realize they've blown out the budget, or it's not fast enough, or something is broken in a way they can't just tweak and get to work.
Burli: These devices are going to have a much longer lifecycle, and so you will need some flexibility. People are not going to spend $1,000 on a new smartphone unless they can keep the device for four years. So you need some sort of standardization and an ecosystem that works together. And when you start thinking about 3nm or 5nm, you do need to start thinking about flexibility and how to minimize some of the data movement, so that if there is a new use case tomorrow, that architecture can take care of those kinds of situations as well. You have to think about the architecture in terms of devices working in concert with each other. It can't just be about an NPU by itself. How does it work with the different elements out there? If two years from now a different use case does come up, it can be managed and you can still stay within budget.
King: When you think about software and aging, the software is not something you know enough about at the time you design the chip. In an ideal world, you design the chip to cope with whatever software is run on it, but that's not always feasible. What we're seeing now with a lot more customers is they are using monitoring to allow them to adapt to the software loads that have been put through a chip. We're definitely seeing much more specific designs for given tasks. At Moortec we are working with a couple dozen AI companies, and you'd think that from one to the next you would see some similarities. But every one of them is doing something different. Some of them might not work, but there are an awful lot of reasons why they're doing different chips. A number of these are reticle-sized, but we also work with customers doing AI chips that are very small and very targeted for a specific type of inferencing. But something we are seeing far more of now is that these chips are more application-specific. They're being designed for a specific use case.
Geada: Aging is an interesting topic. We can simulate and predict a little bit of aging, but that really requires data. At 3nm we have maybe a year's worth of data, and it's really difficult to predict how it's going to behave 10 years from now when we haven't accumulated 10 years' worth of data. We can run the physical models, but it's really difficult to predict how all of these very varied designs and architectures are going to behave. This is one of those areas where we're really going to need a combination of in-chip sensors feeding data to a digital twin, which takes all of the data about the behavior of the chip with the current software or the current firmware, and extrapolates a little bit further out to say, 'Is this likely to fail or not?' If a phone fails, it's one thing. But if the pedestrian sensor in a self-driving car fails unexpectedly, that's a really, really big problem. It can't do that. Safety-critical systems need to have a fail-safe failure mode. That is something that a lot of these domains are going to need to start taking into account. The sensors need to tell the chip whether it is close to failure. The software needs to be aware that failure is a possibility. You can't just assume that hardware is always going to work 100% under all conditions, and you need a plan to deal with it and make sure that you simulate those cases and deal with them.
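A minimal sketch of the sensor-to-digital-twin extrapolation Geada describes: fit the trend in monitored degradation (here, ring-oscillator slowdown) and estimate when it crosses a failure threshold, so software can react before the fail-safe point. The readings and threshold are invented for illustration; a real model would be calibrated to the process and the physical aging mechanisms.

```python
# Toy digital-twin style extrapolation from in-chip monitor data.
# Readings (percent slowdown of a ring oscillator vs. fresh silicon) and the
# failure threshold are invented; a real model would be calibrated per process.

failure_threshold_pct = 8.0   # assumed slowdown at which timing margin is gone

# (months in service, measured slowdown %)
history = [(0, 0.0), (6, 0.9), (12, 1.7), (18, 2.6), (24, 3.4)]

# Simple least-squares slope through the history (degradation per month).
n = len(history)
mean_t = sum(t for t, _ in history) / n
mean_d = sum(d for _, d in history) / n
slope = (sum((t - mean_t) * (d - mean_d) for t, d in history)
         / sum((t - mean_t) ** 2 for t, _ in history))

months_to_threshold = (failure_threshold_pct - mean_d) / slope + mean_t
print(f"Estimated degradation rate: {slope:.3f} %/month")
print(f"Projected threshold crossing: ~{months_to_threshold:.0f} months in service")
# A safety-critical system would treat this projection as a trigger to derate
# performance or schedule service well before the crossing, not as a guarantee.
```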