The application of old techniques to new problems only gets you so far. To remove limitations in AI processors, new thinking is required.
Software and hardware both place limits on how fast an application can run, but finding and eliminating the limitations is becoming more important in this age of multicore heterogeneous processing.
The problem is certainly not new. Gene Amdahl (1922-2015) recognized the issue and published a paper about it in 1967. It provided the theoretical speedup for a defined task that could be expected when additional hardware resources were added, and became known as Amdahl’s Law. What it comes down to is that the theoretical speedup is always limited by the part of the task that cannot benefit from the improvement.
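In its usual form, the law says that if a fraction p of a task can be accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). As s grows without limit, the speedup approaches 1 / (1 - p), so the portion of the task that cannot be improved sets a hard ceiling.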
This concept is becoming particularly relevant when applied to machine learning, and especially to inferencing. Dedicated chips have large arrays of multiply/accumulate (MAC) functions, which perform the most time-consuming operations in inference. But if the time spent there were theoretically reduced to zero, what would then consume all of the time?
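As an illustration with hypothetical numbers, if MAC operations account for 90% of inference time and a dedicated array makes them effectively free, the overall speedup tops out at 1 / (1 - 0.9) = 10X. Whatever makes up the remaining 10% then consumes essentially all of the time.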
There is no definitive right answer to this question, and that lack of definition is driving a vast array of silicon being developed today. Companies must evaluate which end-user tasks they want to optimize and target those tasks. At the same time, because the field is evolving so rapidly, they must maintain a degree of flexibility to be able to handle other tasks, or variants of the tasks originally targeted.
The MAC arrays have received a lot of attention. “People have been looking at the number of MACs as a measure of the performance that you can achieve,” says Pierre-Xavier Thomas, group director for technical and strategic marketing for Tensilica IP at Cadence. “To some extent, this is parallel processing. How many of those operations need to be done to realize a neural network? But there are a lot of other layers in a neural network workload. You have to make sure you are not optimizing just a part of it. This is why you need to look at all of the layers in order to make sure that your performance is right, that you get the right frames per second or number of detections per second.”
That just moves the problem, though. “If you can’t parallelize a section, the serial section determines the maximum speedup,” says Michael Frank, fellow and system architect at Arteris IP. “If you assume that you can parallelize infinitely, the parallel section becomes limited by the bandwidth that you have available in the system. So there is a maximum that I can parallelize, because I run out of bandwidth.”
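A minimal sketch of that reasoning, using hypothetical machine and workload parameters, shows the speedup from adding compute engines flattening out once the parallel section is limited by memory traffic rather than by compute:

    #include <stdio.h>

    /* Hypothetical machine parameters, for illustration only. */
    #define PEAK_OPS_PER_ENGINE 1.0e12   /* ops per second per compute engine   */
    #define MEM_BANDWIDTH       1.0e11   /* bytes per second of external memory */

    /* Estimate run time for a workload with a serial fraction and a
     * parallel section that must also move 'bytes' through memory. */
    static double run_time(int engines, double ops, double bytes,
                           double serial_fraction)
    {
        double serial_time  = (ops * serial_fraction) / PEAK_OPS_PER_ENGINE;
        double compute_time = (ops * (1.0 - serial_fraction)) /
                              (PEAK_OPS_PER_ENGINE * engines);
        double memory_time  = bytes / MEM_BANDWIDTH;

        /* The parallel section takes at least as long as its memory traffic. */
        return serial_time + (compute_time > memory_time ? compute_time
                                                         : memory_time);
    }

    int main(void)
    {
        double ops = 1.0e12, bytes = 2.0e10, serial = 0.02;
        double base = run_time(1, ops, bytes, serial);

        for (int engines = 1; engines <= 64; engines *= 2)
            printf("%2d engines: %.2fx speedup\n",
                   engines, base / run_time(engines, ops, bytes, serial));
        return 0;
    }

With these numbers the speedup saturates at roughly 4.5X no matter how many engines are added, because external bandwidth becomes the limit before compute does.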
It becomes the law of diminishing returns. “It’s pretty clear that Amdahl’s Law has run out of juice,” says Manuel Uhm, director for silicon marketing at Xilinx. “You can’t just keep scaling processors infinitely and get infinite back. It just doesn’t work that way anymore. It’s becoming increasingly complex as you go more and more parallel, as you incorporate different architectures, different ISAs. It’s an increasing challenge.”
But what of the other functions that remain serialized? “Finding candidates to migrate into hardware isn’t that hard, but you need to consider parallelism,” says Russell Klein, HLS platform director for Siemens EDA. “For example, an accelerator built to perform a multiply function in hardware runs at the same speed as the multiplier inside the CPU. What makes hardware faster is that it can do things in parallel. You can do 1,000 multiplies at the same time in hardware. Moving a serial algorithm off the CPU into hardware isn’t going to help much. In fact, it will probably make things worse. Anything moved into hardware has to be able to take advantage of parallelism.”
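As a rough sketch of the distinction Klein describes (array sizes are arbitrary, and the unroll directive appears only as a comment because pragma syntax differs between HLS tools), the first loop below contains many independent multiplies that can be turned into parallel hardware, while the second carries a dependency from one iteration to the next and gains little from acceleration:

    /* Good candidate: the 1,024 multiplies are independent and the
     * additions form a reduction, so an HLS tool can unroll the loop
     * into parallel multipliers feeding an adder tree.
     * (A tool-specific directive such as "#pragma HLS UNROLL" would
     * go here; syntax varies by tool.)                              */
    int dot_product(const int a[1024], const int b[1024])
    {
        int acc = 0;
        for (int i = 0; i < 1024; i++)
            acc += a[i] * b[i];
        return acc;
    }

    /* Poor candidate: each iteration depends on the previous result,
     * so hardware offers little advantage over the CPU's multiplier. */
    int serial_chain(const int a[1024])
    {
        int x = 1;
        for (int i = 0; i < 1024; i++)
            x = x * a[i] + 1;
        return x;
    }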
In addition, it is a moving target. “You need some flexibility and programmability because there are a lot of neural networks being developed,” says Cadence’s Thomas. “It is a very active area for research. You have to be asking, ‘How do you make the algorithms more efficient? How do you limit the number of weights? How do you limit the bandwidth?'”
Performance analysis
The analysis of hardware architectures is becoming increasingly important. “We increasingly hear people saying that while your simulator is great, and it lets me get my software up and running, it doesn’t tell me everything,” says Simon Davidmann, CEO for Imperas Software. “They can simulate 64 cores, and they can choose how they are interconnected. They can model the network-on-chip (NoC) and set up shared memory. But then they ask, ‘How fast is it going to run? Where is the limit? What are the problems?'”
That becomes increasingly difficult for AI. It is not a simple mapping process, and the compiler in between can have a huge impact on overall performance.
“We look at the most stressful networks, feed those into the compiler and see how it’s doing,” says Nick Ni, director of product marketing for AI, Software and Ecosystem at Xilinx. “There is a measure that we use to determine how effective the compiler is — operational efficiency. The vendor says they can achieve a certain number of TeraOps when you’re making everything busy. If your compiler generates something that executes with only 20% efficiency, chances are there’s room to improve. Either that or your hardware architecture is obsolete. Chances are it is because it’s a new model structure that you have never seen before. The application of old techniques may result in bad memory access patterns. You see this all the time with MLPerf and MLCommons, where the same CPU or the same GPU improves over time. That is where the vendors are improving the tools and compilers to better optimize those and map to specific architectures.”
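In its simplest form, that measure is just achieved throughput divided by advertised peak. Using hypothetical numbers, a device rated at 100 TeraOps that sustains 20 TeraOps on a new model is running at 20% operational efficiency, leaving roughly 80% of the MAC array idle, with either the compiler’s mapping or the architecture itself to blame.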
This often requires a more holistic approach. “Performance is about the system, not the processors,” says Imperas’ Davidmann. “From a processor-centric view, you might say I need to have more caching, or I need to change my pipeline, I need to do out-of-order execution, and all of those things are great. But actually, the overall performance of the system is not completely controlled by the individual processor, it is controlled by lots of other things. Things like the memory and the architectures. There are going to be limitations in the throughput that any given piece of hardware can achieve.”
Those limitations have to be fully understood. “Consider you have many small processor cores that are connected using a mesh network,” says Xilinx’s Uhm. “When you’re programming in that type of architecture, it’s not just writing 100 kernels that operate in parallel. You actually have to define the data flows, and you have to consider how it maps to the memory structure because you only have so much tightly coupled memory. You can access adjacent memory, but you run out of adjacencies to be able to do that in a timely fashion. All that does have to be comprehended in the design.”
Bandwidth limited
In many cases, the cause of low processor utilization is a memory limitation. “Memory is a limiting part in this whole equation, mostly because bandwidth is limited,” says Arteris’ Frank. “Our processing capabilities have grown significantly faster in the past compared to memory. Memory bandwidth has been growing by a factor of two, while computing capabilities have gone up by factors of 10 or more.”
The problem compounds. “The more parallelism you add, the more data you have to be able to move to those elements,” says Thomas. “You need to bring the same amount of data to each of the processors, and if that data resides in the outside world, that means that you need to bring more data into the SoC. So potentially more sensors, bigger memories, bigger buffers so that you have more data close to the processing elements.”
The techniques of the past may not be good enough for the future. “This is why we have ideas like caches, where the cache can be used as a bandwidth multiplier,” says Frank. “But even if you have an on-chip cache, and you have multiple engines trying to feed off the cache, eventually, you run into the problem that your cache doesn’t have enough ports, or it doesn’t have enough clock cycles to feed multiple engines. At that point, you can consider replicating data in multiple caches. You have to keep looking at new techniques because you’re limited by external bandwidth, and that is defined both by physics, and by power.”
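The multiplier effect is easy to quantify with a hypothetical hit rate: if a cache services a fraction h of all requests, only (1 - h) of the traffic reaches external memory, so the bandwidth seen by the compute engines is amplified by roughly 1 / (1 - h). A 90% hit rate turns 100 GB/s of external bandwidth into an effective 1 TB/s, but only while the working set keeps hitting, and only if the cache itself has enough ports and clock cycles to serve every engine.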
For some applications, power is the limiter. “There is increasing concern being placed on performance per Watt,” says Thomas. “What is the energy and power dissipation associated with a solution? This is especially true for AI at the edge. Intelligent sensors and other things require us to look carefully at the power and energy dissipation. Big silicon area may mean more leakage power, so you may want to incorporate techniques to shut down the power. But there is a latency cost associated with those techniques, so there has to be a balance between performance and power.”
Synchronization
Synchronization, scheduling, and task allocation have long been limiters for parallel processing. “There is a sequence of layers that need to be processed, but even if the time for all those goes to zero, there is still synchronization,” says Thomas. “There are likely to be some dependencies between different tasks, things that have to happen before you are able to start the next one. There are synchronizations going on, making sure that data is in the right place and ready to go. Those are things that you would like to reduce to zero as well. The reality is, not all the tasks, not all the dependencies, finish at the same time.”
People have been trying to solve this problem for a long time. “One company built this fabric of small processors,” says Davidmann. “They split the software up, and these processors were all synchronously connected where the data moved from processor to processor in a controlled synchronous way. Their performance was fixed by design, rather than guessed at and analyzed. They built this system so that certain jobs would run effectively.”
But that does not provide the degree of flexibility that most AI hardware desires. “Static scheduling would require that at a certain time, a certain unit is ready for execution,” says Frank. “If you’re assuming that you have large systems with a sea of accelerators, you cannot predict that an accelerator will be ready by the time it is supposed to be ready. There is always jitter and slack. Static scheduling works for a very limited set of algorithms.”
There will most likely need to be a dynamic element to scheduling. “AI chips contain CPUs for flexibility and to implement layers that are not standardized,” adds Frank. “And there is exactly this problem — you have this CPU sitting there and doing this nonlinear function. How do you, with the smallest overhead, synchronize these two streams of data? You have the data edge, which delivers the particular data you want to process, and you have the CPU that doesn’t want to spend time trying to find out where the data is. We have shown that you can take a RISC-V processor, and by adding a few instructions for issuing tasks, generating tasks, announcing their dependencies, and then getting the next task ready for execution, you can achieve a 13X speedup of OpenMP for small tasks.”
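For a sense of where that overhead comes from in software, the sketch below uses plain OpenMP task dependencies to order an accelerator-style layer and a CPU-side nonlinear layer (this is standard OpenMP, not the RISC-V task-issue extension Frank describes, and the kernels are placeholders). When each task represents only microseconds of work, creating the tasks and resolving the depend clauses in the runtime can cost more than the work itself, which is what hardware-assisted task management aims to eliminate:

    #include <omp.h>
    #include <stdio.h>

    #define N 1024

    /* Placeholder kernels standing in for one layer of work each. */
    static void mac_layer(const float *in, float *out)        /* accelerator-style work */
    {
        for (int i = 0; i < N; i++) out[i] = in[i] * 2.0f;
    }

    static void nonlinear_layer(const float *in, float *out)  /* CPU-side work */
    {
        for (int i = 0; i < N; i++) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }

    int main(void)
    {
        static float a[N] = {1.0f}, b[N], c[N];

        #pragma omp parallel
        #pragma omp single
        {
            /* The second task may not start until the first has produced b. */
            #pragma omp task depend(out: b)
            mac_layer(a, b);

            #pragma omp task depend(in: b)
            nonlinear_layer(b, c);

            #pragma omp taskwait
        }
        printf("c[0] = %f\n", c[0]);
        return 0;
    }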
Looking forward
The industry continues to learn. “You would traditionally look at a problem and say the pinch point is there,” says Davidmann. “Wherever it has to be sequential is a problem. What we’re doing is learning how to build new types of algorithms, which don’t rely on the concept of single-threaded serialization. There always needs to be some point where things come together, but people are becoming more adept at working out how to map these problems to architectures that aren’t governed by the likes of Amdahl’s limits. In terms of architecture, Amdahl is still important because the architecture has to be right. If you get it wrong, you’ll find serialization points.”
We also need to define new metrics for optimization that look at things other than raw performance. “Looking at Amdahl’s Law means you are focused on how to improve performance,” says Thomas. “But the key aspect today is to improve performance per watt. You do not have unlimited hardware. You need to keep in mind how to scale and how much energy you require for the task that you need to process.”
Conclusion
While parallel processing is not new, the industry has rapidly moved from a small number of SMP (symmetric multiprocessing) cores to massively parallel heterogeneous processors. The techniques of the past have enabled many of the initial hurdles to be overcome, but they will not provide the answers going forward. Removing the bottlenecks requires getting the hardware architecture right, developing algorithms that are aware of issues such as data locality, and building intelligent compilers that understand the flow of data through those architectures.