PPA constraints need to be paired with real workloads, but they also need to be flexible to account for future changes.
Experts At The Table: AI/ML are driving a steep ramp in neural processing unit (NPU) design activity for everything from data centers to edge devices such as PCs and smartphones. Semiconductor Engineering sat down with Jason Lawley, director of product marketing, AI IP at Cadence; Sharad Chole, chief scientist and co-founder at Expedera; Steve Roddy, chief marketing officer at Quadric; Steven Woo, fellow and distinguished inventor at Rambus; Russell Klein, program director for the High-Level Synthesis Division at Siemens EDA; and Gordon Cooper, principal product manager at Synopsys. What follows are excerpts of that discussion. [Here’s part 2 and part 3.]

L-R: Cadence’s Lawley, Expedera’s Chole, Quadric’s Roddy, Rambus’ Woo, Siemens EDA’s Klein, and Synopsys’ Cooper.
SE: Why is it necessary to right-size NPUs for increasingly complex AI networks on devices with tight PPA constraints?
Chole: Compared to SoCs, NPUs are doing orders of magnitude more pure operations. That’s the most important thing. The number of total theoretical operations being performed is orders of magnitude more. As we move toward more modern architectures, which may be diffusion-based or token-based, the amount of memory the NPU needs to access also increases. Whether it is external memory or on-chip memory, that itself is at least 10X more compared to any other applications that can be there. Given these system-level constraints, getting NPUs to enable the applications to the accuracy that they need is an important task.
Woo: What’s interesting about an NPU is that it’s highly parallel. But how much parallelism do you really want? That’s going to have a bearing on things like the kind of model you’re trying to run and how fast you’re trying to get an answer. And these things go across all kinds of markets. At the highest end will be the data center. But for AI to become pervasive, you’ve got to be able to do a lot of that compute down at the edge and endpoints, as well. And those all have different kinds of environmental constraints, so moving the data that you need is important. There’s a range of different kinds of memory solutions that you’re going to see simply because the constraints of each environment are so different. Another thing that is interesting about all of this is that you’d like to have as much SRAM as you possibly can. But it’s just not scaling like you’d like it to, and that’s putting more pressure on the next level of the memory hierarchy, which is the DRAM. And the performance is so dependent on the memory that right-sizing is going to be a lot about not only the power and the performance of the processing, but what you’re going to have to do to supply the data that needs to get in and out of that processor, as well.
Klein: In the high-level synthesis universe, what we’re doing is taking algorithms and creating bespoke implementations of that algorithm using abstract synthesis technology. In terms of why this is all important, it’s really about the energy. How much energy are we going to use up to be able to perform a particular computation, or how fast do we need to do it? Usually, it’s a tradeoff between those two, and right-sizing it is enormously important. If you’re creating a generalized NPU, you’re going to need to be able to support 32-bit floating-point numbers, as well as probably some type of quantized multipliers. If you’re using high-level synthesis to create a bespoke implementation, you can do some interesting things, such as reduce the size of the multipliers to exactly what you need. Rather than making them larger to build in margin, or smaller than the data requires, you can size them exactly as you need. And because the area and the power they consume scale roughly with the square of the operand size, you can get a quadratic effect as you shrink those operators. By creating bespoke implementations, we can go much faster, as well as be much more efficient than a traditional off-the-shelf NPU. It doesn’t mean you can’t use high-level synthesis for creating a traditional TPU. It’s just that most of our customers are using it to create bespoke implementations where that acceleration is used not for generalized neural network execution, but for accelerating the specific one that they’re using.
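To make the quadratic effect Klein describes concrete, here is a minimal sketch. It assumes a simple cost model in which multiplier area and power scale with the square of the operand width, which is a rough approximation for an array multiplier and not a figure from the discussion. The function name and the 32-bit baseline are illustrative choices.

```python
def relative_multiplier_cost(bits: int, baseline_bits: int = 32) -> float:
    """Relative area/power of a bits x bits multiplier versus a baseline.

    Assumption (illustrative only): cost grows with the square of the
    operand width. Real designs deviate from this simple model.
    """
    if bits <= 0 or baseline_bits <= 0:
        raise ValueError("operand widths must be positive")
    return (bits / baseline_bits) ** 2


# Compare a few operand widths against a 32-bit baseline.
for bits in (32, 16, 10, 8, 4):
    ratio = relative_multiplier_cost(bits)
    print(f"{bits:>2}-bit multiplier: {ratio:.4f}x the 32-bit cost")
```

Under this assumed model, halving the operand width cuts the multiplier cost to a quarter, so an 8-bit multiplier comes out at roughly one sixteenth of a 32-bit one. That is why sizing operators to exactly what the network needs can pay off so strongly.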
Cooper: When we look at rightsizing an NPU, we have to look at what else can process an AI workload, and a GPU could do that. So why not just let a GPU do it? If we’re designing an NPU, we need to be an order of magnitude better in power efficiency, better in area efficiency. And then, are you doing an NPU that can do any AI workload, or are you doing the bespoke one? Obviously, there’s some rightsizing involved with that. And as Steve said, if memory is huge, and you know whether you’re going to do large language models or very small vision applications, there are some tradeoffs there. But the big one, for an NPU to exist, you’re probably going to do something that’s power and area efficient compared to a GPU, and you’ve got to start there. It’s still a processor. It still has to be programmable.
Roddy: As the others have noted, the AI acceleration engine in an SoC often is the most compute-intensive, most memory bandwidth-intensive subsystem in the chip. Therefore, it is critical to have the ability to exercise that NPU in a complete system, simulating real code running real workloads to measure expected switching activity and memory traffic. But chip architects are also aware that the benchmark networks at their disposal today are unlikely to be what their users need to run 24 to 36 months later when the chip comes back from the fab. Hence, programmability of a tensor-computation optimized architecture not only helps right-sizing today using current workloads, but it also helps future-proof the chip for when AI workloads change suddenly.
Lawley: Right‑sizing an NPU matters because every product space — whether it’s low‑power audio, sensor fusion, vision, or high‑end generative AI — will have very different compute, memory, power, and area requirements. Not being able to build a right-sized accelerator hurts both efficiency and user experience. In practice, you need the right MAC throughput and the right memory footprint to meet latency and accuracy targets without blowing out PPA. As an IP provider, being able to provide the right dials in the generation phase of the IP is critical.
SE: Let’s talk about the tradeoffs. How does the architect break these down and go from there?
Cooper: Again, it is an NPU, so it’s a processor. That means it has to be programmable. There’s a danger of saying, ‘Back in the CNN days, we just threw the right number of MACs down, and we were good.’ But [for NPUs], it’s not about the multiply accumulates. It’s getting the data in, and then the challenge is, can you design for programmability? Because that’s the tradeoff with the GPU, which you can program a little more easily. With an NPU, since the pace of change for these models is significant, you have to have some level of flexibility in your design. Therefore, software becomes key.
Klein: The hardware design can go back up into the design of the neural network itself. This means you can make tradeoffs at the level of whether to increase the number of channels in this layer and use fewer layers. Or do you use more layers? Are you going to increase the number of bits in your multiplier and use fewer channels? So, you can trade off between the amount of data, the size of the operands, and the number of layers. All of these can have an impact on how that’s going to get executed in hardware and how the accelerator is going to handle that. So it’s not just a hardware implementation. It’s actually a system implementation. There’s also balancing the computation and the communication. You don’t want to have so many multipliers that you can’t get the data into it, and you don’t want your data pipes or your caches so large that you’re wasting that memory. It’s really a balance between the computation and the communication capabilities that you’re bringing into the mix.
Woo: To underscore some of the things Russ just said, you’re getting constraints from two sides. One of them is that you have to think a lot about your use case and the environment that you’re going into, and that’s going to dictate the size and how much power, for example, and other things, such as the kind of thermal solution that you’ll have. On the other side of it, you have to think a bit about the algorithms and the application side of what you’re trying to do. The hard part is that there’s a lot of change going on right now. It seems like every 20 minutes there’s a new paper that comes out about, ‘Hey, I’ve got this great optimization.’ There’s a lot of really good progress, so you’re constantly trying to figure out how to serve the applications that are out there very well, but leave enough flexibility for the next thing. The balance is really important between the resources. It’s part of why you’ll see these curves that people put together, such as arithmetic intensity curves, also referred to as rooflines. What they’re trying to do is figure out, for certain types of neural network applications, how well this architecture will perform. Will it be limited more by compute, or will it be limited more by bandwidth? What are the primary things that really matter? So you’re trying to adjust the design of the architecture to get the right kind of architectural performance, like through rooflines, to make sure that the range of applications you care about will perform really well.
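The roofline idea Woo mentions can be sketched in a few lines. This is a minimal illustration of the standard roofline bound, where attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The accelerator numbers and workload intensities below are hypothetical and do not describe any product discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    peak_ops_per_s: float   # peak compute throughput (ops/s)
    mem_bytes_per_s: float  # sustained memory bandwidth (bytes/s)

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (ops/byte) where compute and bandwidth balance."""
        return self.peak_ops_per_s / self.mem_bytes_per_s


def attainable_ops_per_s(acc: Accelerator, intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(acc.peak_ops_per_s, acc.mem_bytes_per_s * intensity)


def limiting_resource(acc: Accelerator, intensity: float) -> str:
    """Report which resource caps performance at this intensity."""
    return "compute" if intensity >= acc.ridge_point else "bandwidth"


# Hypothetical edge NPU: 4 TOPS peak, 32 GB/s of DRAM bandwidth.
npu = Accelerator(peak_ops_per_s=4e12, mem_bytes_per_s=32e9)

# Hypothetical workloads, described only by their ops-per-byte ratio.
for name, intensity in (("low-reuse layer", 10.0), ("high-reuse layer", 500.0)):
    bound = attainable_ops_per_s(npu, intensity)
    print(f"{name}: {bound / 1e12:.2f} TOPS, limited by "
          f"{limiting_resource(npu, intensity)}")
```

With these assumed numbers the ridge point sits at 125 ops/byte. A workload below that intensity is bandwidth-bound, so adding more MACs would not help it, while a workload above it is compute-bound and gains nothing from a wider memory interface.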
Chole: I’ll try to be like NPU architects in the last nine years. Nine years ago, when you were looking at this problem, you were looking at it as kernels, like you get an operator, you try to optimize the operator the best way you can. Now that technology is not the limitation. Maybe you can put as many MACs as you want, as long as you have the area budget, but maybe not the power budget. The problem is not just solving one kernel. The problem is how to solve the entire network because, as everybody mentioned, it’s an accuracy problem. It’s a bandwidth issue. There are different bottlenecks in the system. Maybe your runtime is not able to keep up with feeding the NPU the data that you need, for instance. So, at a level of optimizing an application, you can’t just look at a kernel and try to solve a kernel. You have to look at the workload. And I’m not saying this is the final workload, that this is the only thing you’re going to run. But you need a representative workload to say what you’re going to optimize for. This is because, as an architect, if I don’t get the tradeoff data, I can’t really make tradeoffs. And I don’t want to get that tradeoff data just from kernels. I want to see, for the entire network, that this is what the tradeoff is in terms of latency, in terms of power, in terms of accuracy. And whenever we say PPA, that’s the hardware view. But for an application, there is another ‘A’ that comes in, the accuracy of the network. So it’s PPAA that becomes very important.
Roddy: I concur with Sharad on the need to look at total system performance and full network accuracy when picking the elements of an AI acceleration solution. Is the entire NPU programmable and flexible? Or is only one component programmable, severely limiting the range of networks and operators that can be supported? Can the embedded programmer, using the completed chip, select different bit precisions and number formats, and any new operator they choose to make layer-by-layer accuracy tradeoff decisions? Or is that embedded AI engineer forced into a limited set of quantization schemes, bit precisions, and limited operator sets? An NPU solution that empowers developers to make their own tradeoffs years down the road will contribute to a longer-lived, more financially successful chip project.
SE: How are those workloads captured and taken into account? How are they represented?
Roddy: Customers evaluating NPUs always start with a set of representative open-source benchmarks, to try to find yardsticks that all competing vendors can present performance data for. But most sophisticated customers realize there are way too many variables present in such data. How was the network quantized? Were operators substituted? Did the vendor inject sparsity? Were classes removed from detectors? Without hands-on evaluation themselves, they can’t be sure to get an apples-to-apples comparison. And once a customer decides to go hands-on with a vendor toolchain, they almost always switch to proprietary models (the ‘real model’) for the analysis. This poses challenges for the NPU vendor because a customer may have networks that are in PyTorch, ONNX, or TensorFlow. Customer networks often have pure Python code or CUDA C++, in addition to graph operators. The successful NPU toolchain needs to enable the end user to import model code in all of those formats and languages and successfully port a model to the target NPU.
Chole: Typically, it’s just a model, like an open-source model, that comes in. Sometimes customers have their own custom models, but initially, they might not even want to share them. We basically get some open-source representative of these models, but that’s just a model definition. What really happens is that the application is a pipeline of models. For example, if you look at vision language models (VLMs), you basically have an encoder, you have the LLM part, and then you have a decoder, and all of these things have to work together to be able to give a final output, this end-to-end pipeline latency. What’s the memory requirement like? Each model will have different memory performance characteristics or different accuracy requirements, so the tradeoff becomes important. And to be able to make that tradeoff, the building blocks for the NPU need to be significantly configurable. This is why the NPU becomes different from any other architecture, like DSPs or GPUs, where each building block needs to be configurable in terms of memory and in terms of precision. It can be FP16, it can be INT8, it can be INT4. The compression schemes to reduce the memory bandwidth come with the quantization. This means each NPU block needs to be very configurable, and at the end, the configuration that we actually go forward with is defined by the workload characteristics, not by single operator characteristics.
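As a rough illustration of why per-block precision matters at the pipeline level, the sketch below totals the weight storage for a three-stage pipeline of the kind Chole describes (encoder, LLM, decoder). The parameter counts and bit widths are hypothetical, and the model deliberately ignores activations, caches, and quantization metadata such as scale factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    params: int       # number of weights (hypothetical values below)
    weight_bits: int  # storage precision per weight


def weight_bytes(stage: Stage) -> int:
    """Bytes needed to store one stage's weights at its chosen precision."""
    return stage.params * stage.weight_bits // 8


def pipeline_weight_bytes(stages: list[Stage]) -> int:
    """Total weight footprint across every stage in the pipeline."""
    return sum(weight_bytes(s) for s in stages)


# Hypothetical pipeline with a different precision for each stage.
mixed = [
    Stage("encoder", params=300_000_000, weight_bits=8),
    Stage("llm", params=3_000_000_000, weight_bits=4),
    Stage("decoder", params=100_000_000, weight_bits=16),
]

# The same pipeline with every stage held at 16 bits, for comparison.
uniform = [Stage(s.name, s.params, 16) for s in mixed]

for label, stages in (("mixed precision", mixed), ("uniform 16-bit", uniform)):
    total_gb = pipeline_weight_bytes(stages) / 1e9
    print(f"{label}: {total_gb:.1f} GB of weights")
```

In this made-up case the mixed-precision pipeline needs 2.0 GB for weights versus 6.8 GB at a uniform 16 bits. Whether the accuracy of each stage survives its chosen precision is a separate question that only evaluation on the full workload can answer.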
Woo: There’s one other thing to mention and be aware of. There are industry efforts to develop more standardized benchmark suites. MLPerf has been one that the industry has looked at as a way to kill two birds with one stone. One is to show you more complete applications. The other is to try to provide some way to do an apples-to-apples comparison. There are still a lot of ways you can optimize within that, such as changing precision and things like that, but it’s an attempt to try and get something out there that is a way to measure different architectures.
Cooper: I would agree with that. I think MLPerf sometimes lags the market, where there are customers that ask for things that are a little bit more forward-looking in terms of the model, but certainly, there’s usually some lag between a paper coming out and somebody trying to implement it, and then MLPerf says, ‘Let’s make that one a benchmark.’ But it’s a good place to start for workloads. Then they break it up into multiple categories of edge, tiny AI, or server class, because there are different classes of NPU use cases.