Experts at the Table: What happens when AI chips max out at reticle size?
Semiconductor Engineering sat down to discuss power optimization with Oliver King, CTO at Moortec; João Geada, chief technologist at Ansys; Dino Toffolon, senior vice president of engineering at Synopsys; Bryan Bowyer, director of engineering at Mentor, a Siemens Business; Kiran Burli, senior director of marketing for Arm's Physical Design Group; Kam Kittrell, senior product management group director for Cadence's Digital & Signoff Group; Saman Sadr, vice president of product marketing for IP cores at Rambus; and Amin Shokrollahi, CEO of Kandou. What follows are excerpts of that discussion (part 1).
SE: What are the big challenges and tradeoffs in power and performance at advanced nodes, and with AI at the edge?
King: When we moved to 16/14nm, there were a lot of speed gains, and leakage dropped so far that everybody benefited massively from being able to burn more power doing stuff. As we’ve gone from 7nm down to 5nm, leakage has started to creep back up again — almost to where we were at 28nm. People are now having to balance these things. That said, sizes of die are massive in comparison to what I’ve seen before. AI is requiring really big die. They definitely have different balances to make. It’s not necessarily about whether they want to burn the power. It’s now about how much power can they physically get into a die that big. How do you deliver the power in, and then how do you cool the whole thing?
Toffolon: With the disaggregation of some of these larger chips, and especially in the AI space in terms of being able to scale out the power, the challenges are really on the interfaces and trying to optimize latency, bandwidth and reach. And there are different packaging technologies that enable different types of solutions in that space. That's where we see a lot of our power optimization and power exploration activities going on. It's trying to optimize those die-to-die or chip-to-chip links, and trading off packaging costs against bandwidth and latency around those interconnects.
Kittrell: The big concerns are total power consumption and the heat generated on some of these chips, especially large networking chips. The surprise is that many people didn’t have a handle on what was burning so much power. They needed to get workload information up front in order to do dynamic power optimization. We’ve been focused on leakage power for so long that once we switched over to finFET nodes, the dynamic power took over and that became a big issue. Another concern is bring-up for a multi-core chip. There’s always been mitigation for this in DFT so you don’t kill the chip while you’re testing it with IR drop. That’s becoming a problem with turning on multiple cores at any one time. You can overload the system. You can’t put a capacitor that’s the size of an air conditioner out there to mitigate for di/dt. So there’s got to be smarter solutions for that. On top of all of that, this is an interesting time because of machine learning and AI, which are causing a renaissance in computer architectures. People are coming up with novel functions for domain-specific architectures, but they want to quickly be able to investigate these architectures up front, see what it’s going to look like in silicon, and make rapid tradeoffs at the beginning. But power is the center of focus for most of the customers we’re talking to at 7nm and 5nm.
Bowyer: At the edge, where people are building these new AI processors, it’s going through the same thing that you see with CPUs where people need hardware accelerators. They’re having to build custom hardware to save energy, save power, as you do with any CPU or GPU or whatever processor they’re using. But the real question is, ‘How does the data move?’ You’ve got these huge chips. You need to move all of this data around the chip in an efficient way — in a way that doesn’t burn all your energy or power. There are hundreds of architectures to choose from. With AI, there’s so much research going on that it’s hard to keep up. You couldn’t even read all of the research papers today to know what’s the best architecture. And so most teams are starting out not knowing if they’re going to be able to finish. With high-level synthesis, which is where I’m getting brought in, the teams have realized they’re going to have to build and test something, and then build it again to get it right.
Burli: If you start looking at it from an architecture perspective, what is it that you can fundamentally do differently on an NPU? It is different. It's not the same thing. And thinking about the data flow movement and what you do with that data is extremely important. How can you optimize for that? You need to make sure that you're not replicating, because it's going to be a lot of design, and people are trying to cram a lot more design into a smaller area. And when all of these circuits come on, you need to know what happens to the power and how you get all of the heat that is generated out of there. That is the massive challenge. So for us, it really boils down to what do you do with the architecture? How do you then go back to the foundry and say, 'Alright, we are building these logic libraries and custom memories. How do we make sure that co-optimizes well, not just with the architecture, but even with the process that the foundry has?' That may be 7nm or 5nm. Eventually it all boils down to how you can implement this all really, really well so it all fits together and gets you to the right number.
Sadr: We definitely have a massive amount of data that we have to worry about transporting. If you look at the power budget that is being consumed, on average, on the compute side versus transporting that data, you'll see that we're spending about 70% to 90% of the power in a system to transport that information. So that practically is the umbrella challenge that everybody's dealing with, and it translates to all the latency and bandwidth and reach tradeoffs. What that means for architectural solutions is deciding whether you should transport the data in parallel format or serial format. We have to decide whether a hybrid solution is better, and whether we want to transport more data electrically versus optically. And finally, these are heavily mixed-mode circuits. And when you go from 7nm to 5nm, you're challenged with the fact that mixed-signal and mixed-mode power doesn't scale as well as we would like it to.
Geada: We're seeing a lot more reticle-limited chips. We've hit a fundamental limit in manufacturing. You can't make a chip bigger than the reticle limit. This is why, all of a sudden, there's a whole bunch of interest in wafer-scale integration, 3D-ICs, and things like that. We've run out of space. We have a lot of functionality, but we have no more room to put stuff into. And so once you reach the reticle limit, all of a sudden you have to start doing a whole bunch of additional techniques to try to get functionality out. This is why we're starting to see a lot more application-specific designs. You always pay a cost for a general-purpose architecture. And so now we're starting to see a real big resurgence in application- or domain-specific designs. There's a whole bunch of families of AI chips, because they're all focusing on a slightly different version of the problem. Whether you're doing inferencing at the edge or you're doing cloud-side, big-data, high-performance stuff, they're all looking at the problem in a slightly different way, which ends up in a different architecture. That puts a whole bunch of pressure on design. When everybody was designing a generic CPU, there was a lot of information sharing. That was easy across the industry. With everybody doing their own specific version of the architecture, they have to figure out where the challenges are and where silicon is going to bite you. A lot of people focus on the design side. We tend to focus a lot more on whether the design is going to work for you in the end with the right power budget at the performance you expect. And because of the reticle limits, people are starting to explore additional dimensions. Does the SerDes have to be on-chip, or can it be above on a chiplet? How do we get the best price/performance behavior predictably with reasonable yields? It's difficult to make good analog design at 5nm, but we know how to do that at different nodes. Maybe move some of the components into different nodes and stack them on top. But that just opens up a whole bunch of different problems on how you validate a complex heterogeneous system under all the operating conditions and performance constraints.
Shokrollahi: For us, going from 7nm to 5nm is a nuisance, but we have to do it because our customers expect us to do it. We don't see a whole lot of advantage. Analog circuitry is not going to scale as well, and it costs a lot of money. That's why a lot of customers are coming to us, trying to minimize the amount of silicon they have to do in an advanced node. They want to keep things in the older process nodes and have I/O between the things inside the MCM. Most customers that I talked to said if they did not have to go to a lower process node, they wouldn't do it. They would stick to what they already have. In his original paper, Gordon Moore said it may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected, and that the availability of large functions, combined with functional design and construction, should allow the manufacturer of large systems to design and construct a considerable variety of equipment both rapidly and economically. In that paper, he didn't just forecast Moore's Law. He forecast chiplets using separate die and functions.
SE: So are we going to see 3nm chips as a full chip, or will we see 3nm chiplets connected to other chiplets? And how does that bode for power and performance?
Shokrollahi: The 3nm timeframe may still be too early to see the full emergence of chiplets across the industry. Maybe the next one, whatever the next one is. We do see it, but I don’t think that’s the same timeframe for a rollout of chiplets.
Sadr: For pre-qualified chiplets with massive rollout, that seems correct. But in terms of rolling out chiplets, disaggregation already is starting at both 7nm and 5nm. Chiplets and interface products allow an SoC to do what it does best, handing over the transport of data to a smarter chiplet, like what you see with 5G infrastructure, integrated in a way that's like adding an ADC or DAC. Those technologies don't carry over to the next node quickly, and there are chiplets that already are coming up for that. Maybe they're not in the volume you will see down the road, but it's already starting. We're definitely seeing that trend.
Toffolon: I agree. It's already started. We're clearly seeing active designs where the dies are split, and they're using die-to-die links for interconnectivity. Some commercial devices already are doing that, as well. We're also seeing there is no one-size-fits-all solution. There are chips that are being disaggregated. There are also dies that are being aggregated. And there are cases where designs are scaling, again replicating multiple dies to scale out. Each of those typically goes into a different packaging technology with different reaches and different losses. And they typically require different solutions with very, very different power profiles. It's really dependent on the end application and the packaging technologies, and ultimately the die-to-die interconnect solution that you end up going with.
Kittrell: At Cadence we've had a 2.5D and 3D-IC solution for a long time. We had a little bit of activity on this. But with the advent of 7nm and 5nm, we're seeing a lot more demand for it. If you're going to 7nm or 5nm, a lot of these are going to be for machine learning or networking. The architecture is going to be highly parallelized, and there will be repeated instances all over the design. So as big as the reticle is, they'll fill up that reticle to capacity with their main function to carry the workload. For things like I/O, they would like to have a chiplet with a fast interface to it in the same package, so you don't have to go through the board parasitics. That's a place where we're going to see a lot of innovation in the next couple of years.
Geada: We're dealing with physics. One picosecond is a third of a millimeter. Anything you push further away is expensive. That's why we're seeing all of this immense pressure to put more and more stuff closer and closer to the chip. If that's a series of chiplets off on the vertical axis, they're not nearly as useful, because that's just bringing it into the package, but not necessarily close enough. One of the great things about high-bandwidth memory is that it has physically limited the distance. However, one of the things about the progression from 7nm to 5nm to 3nm is cost. Wide adoption is unlikely in the near term just because of cost. It's getting really expensive to do physical device design at these advanced nodes. There are a lot of things that you have to be careful with. There are a lot of simulations that need to be done to get working silicon. But the reticle limit is no different at 3nm or 5nm or 7nm. The reticle is about an inch. If you want to put more transistors on your design, you have to move down to the next node, just because you're fundamentally limited on how many transistors you can put in the reticle. And the reticles are not changing. We've explored the limits, and right now an inch is about as big a chip as you can make. So the only way to get more functionality is to go down a node, and I don't see that pressure going away. Even with AMD and Intel, they have a whole bunch of 3D-IC stuff. But the chip is already reticle-limited. They just needed to put stuff outside so they could pack in more processing power within the power budget that they have. I don't see those pressures going away anytime soon. And it's the same thing that's leading people to create application-specific designs. It leads to a more challenging back-end design environment. Yeah, there's the optimization problem up front and figuring out whether you have the right architecture, but once you get down into the physical environment, now you have challenges that the EDA industry, largely on the design side, has been ignoring for a while. People do timing signoff assuming that the entire chip is at a single temperature or has a single voltage, or that they have a single process node. That hasn't been true for a while. You have complex stack-ups that have 28nm stacked on top of 5nm with a memory at a different geometry and a different voltage. And that's the reality that we need to deal with these days.
Bowyer: Imagine you're an architect and you're trying to figure out how to distribute your hardware on all of these chiplets. What if you get it wrong? Imagine you get to the end and you realize you've run out of space on one of your chiplets. How do you push this thing out somewhere else? It's really about power and area, and how much you can put onto these for the application. Today we see conservative decisions on how you distribute this stuff. You know everything is going to fit there and it's going to be fine. But once you get into a new chip that nobody's ever built, and you're not exactly sure how big it's going to be, that becomes a much, much harder problem.
Toffolon: A lot of the power is coming from the interconnect. Understanding the power profile of all those links is really important, because typically those types of links are designed for the worst case. Most vendors would quote worst-case power, but the reality is that depending on the process or voltage or temperature, or even the channel itself, there are a lot of hooks in the advanced IP links or serial links to really optimize per link. Being able to understand and model that power at a macro level, not just taking a single number and rolling it up, but really understanding the nominal power profile of the overall solution, is super critical to model up front.
King: We're seeing some customers building chiplet systems where you might have a 5nm die with lots of compute on it, and then sensors might be done on 28nm. But at the same time, the reason customers are going to 5nm and 3nm is because they're already reticle-limited. Otherwise, they wouldn't do a 3nm die. They'd do a 5nm die. Chiplets are not widespread yet, although there are cases where it makes sense. The 3nm market is being driven by high-performance compute and AI, and that means huge dies, and lots of them, in racks in data centers, burning lots of power. One of the challenges we have with that involves the requirement for large amounts of sensing. With signoff, long gone are the days when you sign off your chips at one PVT point, especially if you've got a reticle-sized die. We've been seeing growth in the number of sensors that are put down on these chips over the last two or three generations, from 1 or 2 to hundreds, and we're in conversations now about what we can do nearer to 1,000. That's a lot of data being sent across a lot of die. And they're starting to move from sending data across 1 die to sending it across lots of die, and then making power tradeoffs across a system, which may be a data center or racks of servers within a data center. You're not necessarily running a particular chip at a particular point. You're basically balancing your electricity bill. That's a different angle, perhaps, than what chip designers typically take. But being able to turn things on and off, and basically turn the wick up and down to cope with dynamic shifts in electricity use, is an interesting space.