How much do we pay for a system to be programmable? It depends upon who you ask.
Programmability has fueled the growth of most semiconductor products, but how much does it actually cost? And is that cost worth it?
The answer is more complicated than a simple efficiency formula. It can vary by application, by maturity of technology in a particular market, and in the context of much larger systems. What’s considered important for one design may be very different for another.
In his 2021 DAC keynote, Bill Dally, chief scientist and senior VP of research at Nvidia, compared some of the processors his company has developed with custom accelerators for AI. “The overhead of fetching and decoding, all the overhead of programming, of having a programmable engine, is on the order of 10% to 20% — small enough that there’s really no gain to a specialized accelerator. You get at best 20% more performance and lose all the advantages and flexibility that you get by having a programmable engine,” he said.
Fig. 1: Bill Dally giving a DAC 2021 keynote. Source: Semiconductor Engineering
Later in his talk he broke this down into a little more detail. “If you are doing a single half-precision floating-point multiply/add (HFMA), which is where we started with Volta, your energy per operation is about 1.5 picojoules, and your overhead is 30 picojoules [see figure 2]. You’ve got a 20X overhead. You’re spending 20 times as much energy on the general administration as you are in the engineering department. But if you start amortizing (using more complex instructions), you get to only 5X with the dot product instruction, 20% with the half-precision matrix multiply accumulate (HMMA), and 16% for the integer multiply accumulate (IMMA). At that point, the advantages of programmability are so large, there’s no point making a dedicated accelerator. You’re much better off building a general-purpose programmable engine, like a GPU, and having some instructions you accelerate.”
Fig. 2: Energy overhead for various instructions. Source: Nvidia
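To make the amortization arithmetic concrete, the sketch below reproduces the ratios Dally quoted. Only the 1.5-picojoule HFMA payload and the roughly 30-picojoule per-instruction overhead come from the talk; the payload energies for the wider instructions are assumed values, back-calculated on the assumption that the fetch-and-decode overhead stays roughly constant per instruction.

```python
# Illustrative only: reproduces the overhead ratios quoted above.
# 1.5 pJ (HFMA payload) and ~30 pJ (per-instruction overhead) are from the talk;
# the other payload energies are assumptions chosen to match the quoted ratios.
OVERHEAD_PJ = 30.0

payload_pj = {
    "HFMA (one multiply/add)":      1.5,    # quoted
    "dot product":                  6.0,    # assumed -> 5X overhead
    "HMMA (half-precision matrix)": 150.0,  # assumed -> 20% overhead
    "IMMA (integer matrix)":        187.5,  # assumed -> 16% overhead
}

for name, pj in payload_pj.items():
    ratio = OVERHEAD_PJ / pj
    print(f"{name:30s} overhead = {ratio:5.2f}x the math energy")
```

The wider the instruction, the more math each fetch-and-decode pays for, which is the whole of the amortization argument.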
That does not sit well with many people, and it certainly is not reflected by the billions of dollars in venture capital flowing into AI accelerators.
“If you focus on a very small piece of a processor, the ALU, and you try to create a cost of programmability, that does not give us a complete picture,” says Sharad Chole, chief scientist for Expedera. “We need to zoom out a little, see how the workload is distributed on various execution blocks, how the execution blocks are synchronized, how the communication happens for operands, as well as aggregation of results. Then zoom out even further to see, at the system level, how the data transfers happen between multiple cores, and how data moves back and forth to DDR or global memory.”
Many-processor architectures are being driven by the software intended to run on them. “Not that long ago, people were designing hardware and didn’t know what software to put on it,” says Simon Davidmann, CEO of Imperas Software. “Today, people are trying to work out the best hardware to run applications most efficiently. What’s happening is that the software is driving the functionality in the hardware. Products need better software, which is more efficient, and that is driving the hardware architectures.”
Processor design has evolved, too. “The paradigm of hardware has changed,” says Anoop Saha, market development manager for Siemens EDA. “In the past, hardware was very general-purpose, and you had an ISA layer on top of it. The software sits on top of that. There is a very clean segmentation of the boundary between software and hardware and how they interact with each other. Now what you are doing is hardening some parts of the processor and looking at the impact. That requires hardware/software co-design because there is no longer a simple ISA layer on which the software sits.”
Understanding the overhead means we have to start by looking at the workloads. “This is the critical first step to creating a hardware architecture,” says Lee Flanagin, CBO for Esperanto Technologies. “Workloads in AI are abstractly described in models, and there are many different types of models across AI applications. These models are used to drive AI chip architectures. For example, ResNet-50 (Residual Networks) is a convolutional neural network that drives the needs for dense matrix computations for image classification. Recommendation systems for ML, however, require an architecture that supports sparse matrices across large models in a deep memory system.”
Put simply, we have entered the age of dedicated processing architectures. “AI is asking for more than Moore’s Law can deliver,” says Tim Kogel, principal applications engineer at Synopsys. “The only way forward is to innovate the computer architecture by tailoring hardware resources for compute, storage, and communication to the specific needs of the target AI application. We see the same trend in other domains, where the processing and communication requirements outpace the evolution of general-purpose compute, such as in the data center or 5G base stations. In data centers, a new class of processing unit for infrastructure and data-processing tasks has emerged. These are optimized for housekeeping and communication tasks, which otherwise consume a significant portion of the CPU cycles. Also, the hardware of extreme low-power IoT devices is tailored for the software to reduce overhead power and maximize computational efficiency.”
Pierre-Xavier Thomas, group director for technical and strategic marketing at Cadence, points to a similar trend. “When you do computer vision, it is different than if you do speech synthesis or other types of ML on different types of datasets. Computer vision requires a certain amount of data throughput. Other applications may have less data or require different throughput. Although they may have similar types of operation in some of the layers, people are looking at differentiation in the layers. It requires several types of processing capabilities. The data movement and the amount of data is changing layer after layer, and you need to be able to move the data very efficiently. That has to be balanced with the required latency.”
It is a question of how optimized for a particular task you want a processor to be. “You make a major investment when building a one-trick pony,” says Michael Frank, fellow and system architect at Arteris IP. “The way to deal with this is a combination of a general processor, a Turing-complete processor, and a combination of workload-optimized accelerators. Dataflow drives performance. But when you throw a thousand cores at a problem, how do you manage or orchestrate them efficiently?”
Part of the excitement surrounding RISC-V involves the gains that can be made by customizing the processor to a specific task set while maintaining the same level of programmability. “To produce an optimized processor, we start with a baseline and then explore various combinations of extensions,” says Zdeněk Přikryl, CTO for Codasip. “We can look at how these will impact the design in terms of area, performance, memory footprint, etc. After that, custom instructions can be looked at, and when you do that the gains can be huge. Those gains may be in terms of cycle count, or in a reduction in code size, which reduces area and the number of instruction fetch cycles that need to be performed.”
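As a rough illustration of that exploration loop, the sketch below scores hypothetical bundles of ISA extensions against a baseline core. The extension names, cost figures, and scoring model are invented for illustration; a real flow would derive them from synthesis results and cycle-accurate profiling of the target workload.

```python
# Hypothetical design-space exploration sketch: score candidate extension
# bundles for a profiled workload. All numbers and extension names are
# invented; real tools derive them from RTL synthesis and simulation.
from itertools import combinations

BASELINE = {"cycles": 1_000_000, "area_kgates": 50, "code_kb": 128}

# Each candidate: (extra area in kgates, cycle multiplier, code-size multiplier)
CANDIDATES = {
    "bit-manipulation": (3, 0.92, 0.97),
    "packed SIMD":      (8, 0.70, 0.90),
    "custom MAC":       (5, 0.60, 0.85),
}

def evaluate(bundle):
    cycles, area, code = BASELINE["cycles"], BASELINE["area_kgates"], BASELINE["code_kb"]
    for ext in bundle:
        extra_area, cyc_mul, code_mul = CANDIDATES[ext]
        area += extra_area          # custom logic costs area...
        cycles *= cyc_mul           # ...but cuts cycle count
        code *= code_mul            # ...and shrinks the instruction stream
    return cycles, area, code

for r in range(len(CANDIDATES) + 1):
    for bundle in combinations(CANDIDATES, r):
        cycles, area, code = evaluate(bundle)
        print(f"{bundle or ('baseline',)}: {cycles:,.0f} cycles, "
              f"{area} kgates, {code:.0f} KB code")
```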
Planning for the future
AI complicates the issue. “The more defined you get, the easier it is to lock things down,” says Manuel Uhm, director for silicon marketing at Xilinx. “But that limits flexibility in what you can do with it in the future. AI and ML is an area that is evolving so rapidly, it is actually difficult to lock anything down in terms of the algorithms, in terms of the base kernels and operations, or compute functions that are needed to support that. You also define the bandwidth, being careful not to over-design things. If you don’t need all the throughput that is capable of being moved, then there could be some over-provisioning. But to provide the flexibility, that’s always the case. If you don’t need that flexibility, just use an ASSP or build an ASIC. Lock all those things down and just never change them.”
AI applications on the edge often are driven by power more than anything else. “Customers want to run more and more AI, but they need to run it more efficiently,” says Paul Karazuba, vice president of marketing for Expedera. “This is especially true when you move away from the data center, and toward an edge node, where an edge node can be defined as anything from a smart light bulb all the way up to a car. There is a need for specialized AI acceleration because there is a need for power efficiency, and there is a need for performance efficiency. Just running the same engines is not powerful enough. You will start to see a lot of different architectures in chips, mostly because they can be much more efficient than others.”
The cost of memory
While the cost of fetching instructions is a consideration, it may be tiny compared with the cost of moving data. “Improving memory accesses is one of the biggest bang-for-the-buck architecture innovations for reducing overall system-level power consumption,” says Siemens’ Saha. “That is because an off-chip DRAM access consumes almost 1,000 times more power than a 32-bit floating-point multiply operation.”
In fact, the cost of memory access is leading to new architectures. “The industry is moving toward data-flow architectures,” says Expedera’s Chole. “The reason is to be able to achieve the scalability and system-level performance. If you just consider power by itself, the SRAM access power is 25 times the multiplier power that is required for integer multiplies. And DDR access power is another 200 times away from SRAM access power. There is a huge part of execution that needs to be accounted for in terms of power.”
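A back-of-the-envelope calculation shows why those ratios dominate the architecture. The sketch below uses only the relative costs Chole cites (an SRAM access at roughly 25 times an integer multiply, a DDR access at roughly 200 times an SRAM access); the operation and access counts for the hypothetical layer are invented for illustration.

```python
# Back-of-the-envelope energy breakdown for a hypothetical layer, using only
# the relative costs quoted above. Operation and access counts are invented.
MUL  = 1.0            # energy of one integer multiply (arbitrary unit)
SRAM = 25 * MUL       # per on-chip SRAM access
DDR  = 200 * SRAM     # per off-chip DDR access

layer = {
    "macs":          100e6,  # multiply-accumulates
    "sram_accesses":  30e6,  # operand/result traffic hitting on-chip SRAM
    "ddr_accesses":    2e6,  # weights/activations spilled to DDR
}

energy = {
    "compute": layer["macs"] * MUL,
    "sram":    layer["sram_accesses"] * SRAM,
    "ddr":     layer["ddr_accesses"] * DDR,
}
total = sum(energy.values())
for kind, joules in energy.items():
    print(f"{kind:8s} {joules / total:6.1%} of total energy")
```

Even with modest off-chip traffic, the DDR accesses end up dominating the energy budget, which is why dataflow architectures work so hard to keep operands local.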
This is leading to a rethink in the way compute and memory are connected. “Domain-specific memory concepts are being explored in the spatial compute domain,” says Matt Horsnell, senior principal research engineer at Arm Research. “As an example, DSPs tend to provide a pool of distributed memories, often directly managed in software, which can be a better fit for the bandwidth requirements and access patterns of specialized applications than traditional shared-memory systems. In order to bridge the efficiency gap with fixed-function ASICs, these processors often offer some form of memory specialization by providing direct support for specific access patterns (e.g., N-buffering, FIFOs, line buffers, compression, etc.). A crucial aspect of the orchestration within these systems, and a challenge in designing them, is determining the right granularity for data accesses, which can minimize communication and synchronization overheads whilst maximizing concurrency.”
Arun Iyengar, CEO for Untether AI, provides one example. “CPUs and GPUs make sense for conventional programming workloads, but are sub-optimal compared with accelerators built for machine learning inference from the ground up. By implementing an at-memory architecture of distributed processing elements, each abutted to local low-latency memory and connected by high-speed interconnect, we see a 6X improvement in power dissipation per multiply accumulate on our 16nm device compared to other 7nm solutions.”
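As one concrete instance of the memory specialization Horsnell mentions, the sketch below shows double buffering (N-buffering with N=2) in its simplest form: fetching the next tile of data while computing on the current one. The fetch and compute functions are stand-ins, not any vendor’s API.

```python
# Minimal double-buffering (N=2) sketch: overlap the "DMA" fetch of the next
# tile with compute on the current one. Fetch/compute bodies are stand-ins.
import threading

def fetch_tile(i, buf):          # stand-in for a DMA transfer into local SRAM
    buf[:] = [i] * len(buf)

def compute_tile(buf):           # stand-in for the accelerator kernel
    return sum(buf)

NUM_TILES, TILE = 8, 1024
buffers = [[0] * TILE, [0] * TILE]
results = []

fetch_tile(0, buffers[0])                    # prime the first buffer
for i in range(NUM_TILES):
    prefetch = None
    if i + 1 < NUM_TILES:                    # start fetching tile i+1 ...
        prefetch = threading.Thread(
            target=fetch_tile, args=(i + 1, buffers[(i + 1) % 2]))
        prefetch.start()
    results.append(compute_tile(buffers[i % 2]))  # ... while computing tile i
    if prefetch:
        prefetch.join()                      # synchronize before reusing buffer
```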
The software cost
When you have programmable hardware, you need to have software, and that invariably requires a compiler. This becomes increasingly difficult with AI, because it is not a simple mapping process. Moreover, the compiler in between can have a huge impact on overall performance and overhead.
“We look at the most stressful networks, feed those into the compiler and see how it’s doing,” says Nick Ni, director of product marketing for AI, Software and Ecosystem at Xilinx. “There is a measure that we use to determine how effective the compiler is: operational efficiency. The vendor says they can achieve a certain number of TeraOps when you’re making everything busy. If your compiler generates something that executes with only 20% efficiency, chances are there’s room to improve, or else your hardware architecture is obsolete. It is most likely a new model structure that you have never seen before. The application of old techniques may result in bad memory access patterns. You see this all the time with MLPerf and MLCommons, where the same CPU or the same GPU improves over time. That is where the vendors are improving the tools and compilers to better optimize those and map to specific architectures.”
Tools are a necessary part of it. “The answer lies in building toolsets that guide and automate the migration of AI workloads from their inception in the cloud with infinite numerical precision and compute resources, into inference deployment within constrained compute devices,” says Steve Roddy, vice president of product marketing for Arm’s Machine Learning Group. “For embedded developers mapping a pre-trained, quantized model to target hardware, a series of optimization tools are required that are specialized to a particular target. They are optimizing data flows, compressing model weights, merging operators to save bandwidth, and more.”
The goal is to increase utilization. “For AI inference on standard cores, people often see utilization figures around 30%,” says Chole. “That 30% is after the complete optimization by the compilers. These are standard benchmarks like ResNet, which have been around for about six years, and compilers are optimizing them like crazy. Yet even after that, if you actually look at the benchmarks, the utilization usually is around 30%. By designing deeply pipelined architectures that don’t have any stalls, and where the entire architecture is deterministic, we are able to achieve 70% to 90% utilization on most AI workloads.”
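The utilization figures being discussed are typically computed as achieved throughput divided by the device’s peak rating. A minimal sketch with hypothetical numbers, assuming roughly 7.7 billion operations per ResNet-50 inference when multiplies and adds are counted separately:

```python
# Sketch of the utilization metric: achieved ops/s versus peak ops/s.
# The peak rating and latency below are hypothetical.

def utilization(ops_per_inference, latency_s, peak_tops):
    """Fraction of peak compute actually used during one inference."""
    achieved_tops = ops_per_inference / latency_s / 1e12
    return achieved_tops / peak_tops

# ~7.7 GOPs per ResNet-50 inference, 1 ms latency, nominal 26-TOPS device
print(f"{utilization(7.7e9, latency_s=1e-3, peak_tops=26):.0%}")  # ~30%
```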
Others see similar benefits. “People have realized there are certain common parts of algorithms, especially in machine learning, where you are running vector matrix multiplies and where you can go and build a generalized accelerator that is useful for multiple algorithms,” says Arteris’ Frank. “If the data flow is not predictable, and you have systems that have multiple different applications, you cannot just think about the processing side or the compute side to be generic. You also have to think about how the dataflow is going to be somehow generic.”
That involves system-level control. “The efficiency of the control is about going from one task to another,” says Cadence’s Thomas. “You don’t want to waste too many cycles scheduling the next task. The ability to efficiently use your datapath is important so that you can reach the required latency within a useful power and energy budget. Controlling and dispatching tasks on different specialized hardware is very important. For some applications, it may be less important, because you may have a workflow that is completely datapath oriented, such as in computer vision. But for other applications, like communication, you need to do a lot of signal processing. At the same time, you are looking at characteristics of your communications, and you have control code to determine what kind of signal processing algorithm you need to apply. Then you need to go through the control code, as well.”
This can get complicated. “Assuming you have large systems with a sea of accelerators, you cannot predict that an accelerator will be ready by the time it is supposed to be ready,” says Frank. “There is always jitter and slack. In order to fully automate it, there are compilers that know about data flow, like OpenMP, which has the concept of tasks with dependencies. And the runtime system manages this. So static scheduling does work for a limited set of algorithms. But if you really want to make it shine, you need to do it dynamically, at least with a certain amount of dynamism.”
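A minimal sketch of that kind of dynamic, dependency-driven dispatch is shown below: tasks become ready only once the tasks they depend on have completed, which is the same idea behind OpenMP’s task dependencies. The task graph here is invented, and a real runtime would dispatch ready tasks to free accelerators rather than run them inline.

```python
# Minimal dynamic task scheduler sketch: tasks become ready as their
# dependencies complete. The task graph is invented for illustration.
from collections import deque

deps = {                       # task -> set of tasks it depends on
    "load_A": set(), "load_B": set(),
    "matmul": {"load_A", "load_B"},
    "activation": {"matmul"},
    "store": {"activation"},
}

remaining = {t: set(d) for t, d in deps.items()}
ready = deque(t for t, d in remaining.items() if not d)
while ready:
    task = ready.popleft()     # real runtime: dispatch to a free accelerator
    print("running", task)
    for t, d in remaining.items():
        if task in d:
            d.remove(task)
            if not d:          # all dependencies satisfied
                ready.append(t)
```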
There are different ways in which programmability can be provided. “In our architecture, we have packets that move through the processor,” says Chole. “Those packets contain both data and the function that needs to be performed on that data. That is a different type of programmability than you would find in a CPU or GPU because they are tuned for general-purpose execution. But the question is, given the flexibility and the level of utilization, is there a tradeoff between the amount of programmability the customer is willing to pay for, and the cost for that programmability in terms of power as well as area?”
Most solutions provide at least some degree of programmability. “It is very clear that programmability is key in order to keep up with ML, neural networks, etc.,” says Thomas. “The software part of it is becoming more important. There is always the hardware that is designed to be a little more functional. But in general, people are looking more carefully at the intended functionality. How are they going to be used, because there is a cost associated with die area? There will be leakage power. There is a lot of attention directed toward the type of programmable engines needed, and whether this programmable engine is efficient enough in terms of power. It has to be powerful enough to run the application I have today, but also to be able to handle those that come in a few months or years.”
Conclusion
Programmable engines are not efficient, but that is a cost the industry has been willing to bear for four decades, and it is not about to change. The objective with every design is to minimize the overhead as much as possible while maximizing flexibility. Acceptable overhead is defined by the end use case, and it is a combination of power, latency, performance, area, code size, etc.
Efficiency has to be looked at from a system perspective, because while it is great to optimize small parts of the problem, that effort can be swamped by other parts of the system. With AI systems, there is also the future that has to be looked at. Nobody knows exactly what will be required in 12 or 18 months, but the chip you are designing right now has to be ready for it, and that means you have to accept an additional level of overhead today.