The best ways to optimize AI efficiency today, and other options under development.
AI is impacting almost every application area imaginable, but increasingly it is moving from the data center to the edge, where larger amounts of data need to be processed much more quickly than in the past.
This has set off a scramble for massive improvements in performance much closer to the source of data, but with a familiar set of caveats — it must use very little power, it must be affordable, and at least some of the processing must happen in a device no larger than a smartphone. While algorithm training will continue in the cloud, the real competition is heating up at the edge due to the high cost of moving large quantities of data over long distances. The shorter the distance, and the more data that can be processed locally, the lower the cost and the faster the time to results.
Realizing those benefits isn’t easy. It requires a much deeper understanding of what type of data is being processed and how it will be used, a scenario that has been playing out in large data centers for the past half-decade as companies such as Google, Tesla, Meta, and others design custom chips for their specific needs. On the edge, it all starts with focusing on the target use cases and defining the necessary features to address them.
“It’s tempting to add features here and there to address other potential markets and use cases, but that often leads to increased area, power, and complexities that can hurt performance for the main applications of the chip,” said Steven Woo, fellow and distinguished inventor at Rambus. “All features must be looked at critically and judged in an almost ruthless manner to understand if they really need to be in the chip. Each new feature impacts PPA, so maintaining focus on the target markets and use cases is the first step.”
The biggest benefit of processing at the edge is low latency. “Edge really shines when a decision must be made in real-time (or near real-time),” said Ashraf Takla, CEO at Mixel. “This ability to make decisions in real-time provides other ancillary benefits. With AI, devices can improve power efficiency by reducing false notifications. Processing at the edge also reduces the chance for a security breach due to transmission of raw data to be processed somewhere else. Where connection costs are high, or when a connection is limited, processing at the edge may be the only feasible or practical option.”
Still, the edge is a broad category, and the rollout of AI — including generative AI — adds even more pressure to create the right mix. Use cases and applications can vary greatly, and they need to be considered very narrowly in the design process.
“Some options include how the chip will be powered, thermal constraints, if it needs to support training and/or inference, accuracy requirements, the environment in which the chip will be deployed, and supported number formats, just to name a few,” said Rambus’ Woo. “Supporting large feature sets means increasing area and power, and the added complexity of gating off features when they aren’t in use. And with data movement impacting performance and consuming large amounts of the power budget, designers need a good understanding of how much data needs to be moved to develop architectures that minimize data movement at the edge.”
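As a rough illustration of why data movement dominates the power budget, the Python sketch below totals compute and memory-access energy for a single inference. The per-MAC and per-byte energy figures, and the traffic counts, are illustrative assumptions only, not measured values for any particular process or product.

```python
# Back-of-the-envelope comparison of compute vs. data-movement energy for one
# inference. The per-operation energies are illustrative placeholders, not
# measured values for any specific node.
ENERGY_PJ = {
    "mac_int8": 0.2,     # assumed energy per 8-bit MAC, in picojoules
    "sram_byte": 5.0,    # assumed energy per byte read from on-chip SRAM
    "dram_byte": 100.0,  # assumed energy per byte read from external DRAM
}

def inference_energy_uj(macs, sram_bytes, dram_bytes):
    """Total energy in microjoules for one inference, given op and traffic counts."""
    pj = (macs * ENERGY_PJ["mac_int8"]
          + sram_bytes * ENERGY_PJ["sram_byte"]
          + dram_bytes * ENERGY_PJ["dram_byte"])
    return pj / 1e6

# Example: the same small model with weights kept on-chip vs. streamed from DRAM.
on_chip = inference_energy_uj(macs=50e6, sram_bytes=2e6, dram_bytes=0)
off_chip = inference_energy_uj(macs=50e6, sram_bytes=0, dram_bytes=2e6)
print(f"weights in SRAM: {on_chip:.0f} uJ, weights in DRAM: {off_chip:.0f} uJ")
```

Even with these crude numbers, pulling the same weights from external DRAM swamps the compute energy, which is why architectures that keep data local win at the edge.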
Making the right choices
Specific details of the major use cases will drive different tradeoffs.
“A set of decisions needs to be made around the amount and type of data communicated to the edge AI processor,” Woo said. “Is the chip receiving only inferencing data, or does it include model updates? Does the chip need to perform full training or fine tuning based on specific data it sees? What other chips and systems is this processor communicating with, and how often? Will there be long periods of inactivity during which the chip enters a deep power-down mode, or will it be on most of the time? The answers to these questions will drive decisions on the compute engine architectures, amount of on-chip SRAM storage, and whether to use external DRAM (as well as the type and capacity).”
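These questions lend themselves to being written down as an explicit requirements record that is reviewed per target use case before any architecture is fixed. The sketch below is a hypothetical checklist structure; the field names and example values are invented for illustration and do not come from any of the companies quoted here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EdgeAIRequirements:
    """Hypothetical checklist mirroring the architecture questions above."""
    workload: str                  # "inference_only", "fine_tuning", or "full_training"
    receives_model_updates: bool   # does the chip accept model updates in the field?
    duty_cycle: float              # fraction of time active (drives deep power-down support)
    on_chip_sram_mb: float         # SRAM sized to keep weights and activations local
    external_dram: Optional[str]   # e.g. "LPDDR5", or None if everything stays on-chip
    peers: List[str] = field(default_factory=list)  # other chips/systems it communicates with

# Example: an always-on sensor hub that only runs inference and sleeps most of the time.
sensor_hub = EdgeAIRequirements(
    workload="inference_only",
    receives_model_updates=True,
    duty_cycle=0.05,
    on_chip_sram_mb=4.0,
    external_dram=None,
    peers=["host_soc"],
)
```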
There are other considerations, as well. “We need to take a step back and optimize the AI architecture first,” said Benjamin Prautsch, group manager for advanced mixed-signal automation in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “Careful co-design of data pre-processing, AI algorithms, and post-processing based on both the application context and the data quality, is essential – and should be treated as model-based. Regarding power consumption, we saw that time-series signals (e.g. audio), analog crossbar designs, or even spiking neural networks can bring significant improvements.”
Another concern is that AI is a fast-changing technology, so in addition to striking the right balance between power, performance, and area/cost, there needs to be flexibility built into designs.
“On one hand, you have something generic like a CPU, which gives you the most development/programming flexibility and future-proofing capabilities, but likely has the largest area and worst power efficiency,” said Amol Borkar, director of product marketing for Tensilica Vision & AI DSPs at Cadence. “On the other end of the spectrum, you have fixed-function hardware accelerators that have the best area and power combination, but almost no post-design flexibility. If there is a change in the spec or workload requirement after the design has taped out, there is almost no way to modify that block without doing a silicon re-spin, which is both time consuming and expensive.”
Chipmakers and systems companies are at all points on this spectrum. “From flexibility to efficiency, one would go from CPU to GPU, then to DSP, NPU, and finally ASIC or RTL,” Borkar said. “This choice is also dictated by market maturity and requirements. For example, ultra-low power always-on (AON) is a fairly hot market, and one would expect you would need a low-end NPU to handle AI for AON. Over time, there is a common realization that there is no need for a lot of flexibility for this segment, as we are not going to encounter 20 to 30 different types of workloads. Rather, AON appears to rely typically on 3 to 5 common AI networks (or their variants). Hence, we are seeing many companies typically deploy solutions that use fixed-function RTL or ASICs to address this need.”
In contrast, ADAS and mobility are rapidly evolving. Requirements are increasing every quarter, moving from hundreds to thousands of TOPS, and so is the number of use cases.
“Today we are relying heavily on transformers, point pillars, etc.,” he said. “In the near future, it will be something new. In this evolving market, selecting RTL would severely hamper longevity and future readiness, so using an NPU or NPU+DSP is the best combination of flexibility, performance, and efficiency.”
Process technology offers another knob to turn for power, performance, area/cost (PPA/C) tradeoffs.
“The easiest way to get better PPA — but notably not C — is to take advantage of both Moore’s Law and Dennard scaling by using the most advanced process node,” explained Jeff Lewis, senior vice president of business development and marketing at Atomera. “But this course has multiple problems. It is typically prohibitively expensive and lacks embedded non-volatile memory, which is typically a must-have for IoT devices. This eliminates nearly all non-planar processes, except some of the oldest finFET processes that have recently added MRAM or ReRAM.”
Because AI requires high performance, anything above 40nm is a stretch. “Given that, the next question is the usage model,” Lewis said. “Is it always inferencing at high speed? Or, more typically, mostly running at a low-intensity ‘monitoring’ mode and only inferencing when something occurs? For the latter, the two most important elements are dynamic voltage and frequency scaling (DVFS) and low-leakage library elements. DVFS provides high performance when it is needed, and then lowers supply voltages and operating frequency when it isn’t. Low-leakage library elements matter because embedded SRAMs, in particular, consume a substantial proportion of the system power. Both of these interact. DVFS is limited by how low it can drop supply voltages. This is almost always determined by the minimum voltage at which the SRAM operates or, if in standby, at which the SRAM will still retain data but not read or write — known as Vmin. These minimum voltages are determined by the statistical variation among the bit cells. Lowering Vmin requires a reduction in mismatch — and the primary contributor is the random dopant fluctuation in each bit cell. Since power is proportional to voltage squared, lowering Vmin from, say, 0.7V to 0.5V will cut the SRAM power in half.”
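Lewis’ closing arithmetic is easy to verify. The snippet below simply applies the power-scales-with-voltage-squared relationship he cites, using the voltages from his example; the baseline power is normalized rather than a real measurement.

```python
def scaled_power(p_nominal, v_nominal, v_scaled):
    """Dynamic and SRAM power scale roughly with the square of supply voltage."""
    return p_nominal * (v_scaled / v_nominal) ** 2

# Example from the quote: dropping Vmin from 0.7 V to 0.5 V.
p_at_0v7 = 1.0  # normalized baseline
p_at_0v5 = scaled_power(p_at_0v7, 0.7, 0.5)
print(f"relative SRAM power at 0.5 V: {p_at_0v5:.2f}x")  # ~0.51x, roughly half
```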
There are other approaches, with sparse acceleration just one example.
“The edge is a big place. Customers care about all sorts of things—performance, capability, efficiency, price, reliability, ease-of-use, ease-of-integration, etc.—but in vastly different orderings and priorities. We could be talking about training AI locally on a desktop or deploying AI inference on a hearing aid—very different use cases and power profiles. Or we could also be talking about running AI in a throw-away party favor or in a multimillion-dollar piece of industrial machinery—very different cost profiles,” said Sam Fok, CEO at Femtosense. “In any case, Femtosense’s thesis is that for AI edge inference, sparse acceleration helps, and by a lot. Exploiting sparsity—the zeros in the parameters and activations of an AI model—allows customers across the board to unlock large AI and its performance without the headaches of large silicon when the hardware supports it. In addition to the usual good architecture ideas, our hardware includes custom instructions and memory formats for sparse math acceleration—do not store zeros (space), do not pull them out of memory (time and energy), and do not operate on them (time and energy). Sparse acceleration saves space, time, and energy with margins that grow as AI models scale, and there is no doubt that scale matters.”
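A software analogue makes the mechanics concrete: store only the nonzero weights along with their positions, and the multiplies for the zeros never happen. The compressed-row-style sketch below is a simplified illustration of that idea; a real accelerator implements it in hardware with custom instructions and memory formats, as Fok describes.

```python
def to_sparse(weights):
    """Keep only nonzero weights and their column indices (one compressed row)."""
    values, cols = [], []
    for i, w in enumerate(weights):
        if w != 0.0:          # zeros are never stored...
            values.append(w)
            cols.append(i)
    return values, cols

def sparse_dot(values, cols, activations):
    """...never fetched from memory, and never multiplied."""
    return sum(w * activations[c] for w, c in zip(values, cols))

# Example: a 75%-sparse weight row does a quarter of the multiplies.
weights = [0.0, 0.8, 0.0, 0.0, -0.3, 0.0, 0.0, 0.1]
activations = [1.0] * 8
values, cols = to_sparse(weights)
print(sparse_dot(values, cols, activations))  # 0.6, using 3 multiplies instead of 8
```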
Analog to the rescue?
While partitioning and network connectivity are important architectural choices, there is an ongoing discussion about whether the largest architectural contributor to edge PPA efficiency could be analog neural networks. But are analog neural networks appropriate for the bulk of AI accelerators?
“To the degree that people are building AI accelerators, and they’re doing it digitally, high-level synthesis is one of the technologies that they could use to create that accelerator,” said Russell Klein, program director for high-level synthesis at Siemens EDA. “But every time I look at what’s being done in the analog space, the benefits are just so enormous that it’s really hard to look at it and say, ‘Nah, we need to keep going digital.’ When we create a neural network, we’re effectively creating a model of a biological neuron. It’s got some number of inputs that are going to be coming into it. The biological neuron is going to be taking voltages that come in, and it’s either going to be amplifying some of those voltages or diminishing some of those voltages. And if things get diminished enough, it will disconnect from whatever was feeding it because it says, ‘Oh, that input is no longer worthwhile.’ And then it sums all of them up, and fires up an output. That entire process is entirely analog. So we’re now going to try and model this, and not model just one neuron, but we’re going to model many thousands, tens of thousands, maybe even billions of them, and pull them all together. And what we’re effectively doing is creating a digital simulation of that analog process.”
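Stripped to its essentials, the digital simulation Klein describes is a multiply-accumulate per input followed by an activation function. A minimal sketch of that digital neuron model, with an assumed sigmoid activation:

```python
import math

def digital_neuron(inputs, weights, bias=0.0):
    """Digital model of one neuron: weighted sum of inputs, then an activation.
    Every term in the sum is a multiply that, in floating point, costs tens of
    thousands of transistors to implement as a digital multiplier."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

print(digital_neuron([0.5, 0.2, 0.9], [1.2, -0.7, 0.3]))  # ~0.67
```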
That isn’t very efficient, though, Klein noted. “It seems like it would be so much more efficient just to implement the circuitry as analog, because a 32-bit floating point multiplier is going to have tens of thousands, maybe hundreds of thousands of transistors, and a transistor is a multiplier. If we’re looking at voltages, they’re not binary ones and zeros. We’ve got a gate to the transistor, and based on that voltage going up and down, the voltage on the emitter either has a greater range or a smaller range, so it’s going to be a multiplier or a divider. And if we can take tens of thousands of transistors and reduce the function to just one transistor, that’s going to be an enormous area savings — and enormous energy savings. If you take all the multipliers and you have them become single transistors, then use Ohm’s Law to calculate the output of these, you’re just going to be wiring the outputs together. Then you feed that into an activation function. An activation function likewise is a single transistor. You now have an analog representation of that neuron that is a handful of transistors and some connections, not tens of thousands of transistors that perform a single multiplication.”
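The analog version computes the same weighted sum physically rather than logically: inputs become voltages, weights become conductances, and the current summed at a shared node is the dot product (I = Σ G·V). The snippet below only illustrates that relationship conceptually; it is not a circuit model, and in practice negative weights require differential pairs and the devices bring noise and variation of their own.

```python
def analog_neuron_current(voltages, conductances):
    """Ohm's-law view of the same neuron: each weight is a conductance, each
    multiply is a single device passing I = G * V, and the 'adder' is just the
    wiring that joins the outputs so the currents sum at one node.
    (A real circuit would need differential pairs for negative weights.)"""
    return sum(v * g for v, g in zip(voltages, conductances))

# Same weighted sum as the digital neuron above, but nothing here is a logic gate.
print(analog_neuron_current([0.5, 0.2, 0.9], [1.2, -0.7, 0.3]))  # 0.73
```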
This makes sense in theory, but it’s a lot harder to make it all work. “Analog neural nets have always been a somewhat esoteric niche of the machine learning world,” said Steve Roddy, chief marketing officer at Quadric. “There are no ‘popular’ training tools for building analog neural networks. 99+% of the data scientists doing ML research work in the digital domain and use common tools like PyTorch and TensorFlow (and others) to create their innovative architectures. Very little development of novel networks occurs for analog solutions other than those built by the analog compute researchers. So while analog NN fans focus on novel materials for phase change resistors, or compute-in-memory analog multipliers, the real breakthroughs in data science (e.g., LLMs) all occur in the traditional digital domain of GPUs and GPNPUs and ML accelerators.”
There is already significant friction in the system when taking a new ML network trained in the cloud (infinite compute, high-precision floating point data representation) and converting it back to the real world for deployment in edge devices. Roddy differentiates “real world” from data “born” in devices as integers. “Think RGB data read off a CMOS sensor. There are no CMOS sensors that generate floating point outputs. Images used in training have all been converted to float in the data center. That conversion to integer format (quantization) for use in an edge device is an art that isn’t fully automated. Taking that one huge step further to convert to some form of analog format is likely a bridge too far for the vast majority of practitioners. Thus, instead of a compute solution that can run hundreds of different digital NN models, an analog NN deployment is likely to be a very narrowly tailored, single-function solution that was meticulously handcrafted — almost the textbook definition of ‘niche.’”
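At its simplest, the quantization step Roddy describes is an affine mapping of a float range onto an integer grid defined by a scale and zero point. The sketch below shows only that basic mapping; production flows layer calibration, per-channel scales, and accuracy-recovery techniques on top, which is where the ‘art’ comes in.

```python
def quantize(x, scale, zero_point, bits=8):
    """Affine quantization of a float value onto a signed integer grid."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the integer range

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original float value."""
    return (q - zero_point) * scale

# Example: weights observed in [-1.0, 1.0] mapped onto int8.
scale, zero_point = 2.0 / 255, 0
w = 0.37
q = quantize(w, scale, zero_point)
print(q, dequantize(q, scale, zero_point))  # 47 and ~0.369
```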
Prem Theivendran, director of software engineering at Expedera, agrees. “How many people are using analog AI chips out there? At this moment, zero. Look at traditional environments like ASIC development, like digital space has been used for CPUs, GPUs, all kinds of custom ASICs. And nowadays people say, ‘Hey, you know I can save a lot of power by just using analog.’ Maybe we tried using analog for other market segments, but we didn’t really deploy it. And now we’re trying to deploy it for AI. Why AI specifically? It’s just because of these MAC units. And to be honest, when you say analog, the whole thing is not analog. Only the MAC — the compute part — is analog. Everything else is digital. So only 50% of the chip is analog. I have friends in this space who are keeping their fingers crossed it will work. These chips are coming back to the lab and they don’t know how it will perform with temperature variations because it’s very sensitive. It is good technology, but 90% of the problems we’re facing right now are on the software side. When you come into hardware, you’re talking about PPA, you’re talking about power efficiency, TOPS per watt. We need to handle that in a different way. Then, the big question is whether you can run ChatGPT. Can you run large LLM models? To run ChatGPT or LLMs you need a lot of memory. Can you work with a lot of memory at the same time? Can you switch between jobs? We’re still trying to figure out how to do analog compute. We’re not even extending that into the bigger realm of what AI requires at this point.”
The future
At least for the immediate future, optimizing PPA/C for an AI edge accelerator or processor comes down to fundamentals, not novel approaches like analog neural networks.
Marc Swinnen, director of product marketing at Ansys, said the best PPA is still going to come from bespoke digital silicon. “If you have a facial recognition chip that’s going to be built into millions of cameras and that’s going to be used for security cameras all over the world, you can significantly get much better PPA for the amount of power and the cost if you build a custom chip to do exactly that — run your algorithm and interface with your camera exactly the way you want it. That will give you better PPA than using an off-the-shelf solution that you would program through a generic firmware or software solution, or even a generic AI chip that’s not necessarily built for your algorithm. The problem is that making a chip is expensive. Do you have the volume to recoup that NRE for actually designing a chip yourself? If yes, bespoke silicon can give you the better solution. But you need the volume to be able to make up the large expense of designing it yourself.”
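The volume question is a break-even calculation: the non-recurring engineering cost has to be recovered through per-unit savings over the off-the-shelf alternative. The numbers below are placeholders chosen purely to show the shape of the tradeoff, not real program costs.

```python
def breakeven_volume(nre, off_the_shelf_unit_cost, custom_unit_cost):
    """Units needed before a custom chip's NRE is paid back by per-unit savings."""
    savings_per_unit = off_the_shelf_unit_cost - custom_unit_cost
    if savings_per_unit <= 0:
        return float("inf")  # custom silicon never pays off on cost alone
    return nre / savings_per_unit

# Example with assumed numbers: $20M NRE, $12 off-the-shelf vs. $4 custom per unit.
print(f"{breakeven_volume(20e6, 12.0, 4.0):,.0f} units")  # 2,500,000 units
```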
When it comes to opportunities to reduce power within the design, these are still at the high levels of abstraction — the system and architectural level. “As you move down to the later stages of the design flow, those opportunities diminish,” said William Ruby, product management director for the EDA Group at Synopsys. “You have more opportunity to do automatic optimization within implementation and synthesis tools, for instance, but now you’re more constrained to a smaller percentage improvement. An interesting thing about the power saving potential curve is that it is not a smooth curve from architecture to sign-off. There’s an inflection point at synthesis: before the design is actually mapped to gates in the target technology, you have many more degrees of freedom. Then you hit that inflection point and things drastically drop off. What do we need to do to mitigate the thermal issues? You need to improve energy efficiency to lower power.”
Related Reading
AI Accelerator Architectures Poised For Big Changes
Design teams are racing to boost speed and energy efficiency of AI as it begins shifting toward the edge.
AI Races To The Edge
Inferencing and some training are being pushed to smaller devices as AI spreads to new applications.