Optimizing processor architectures requires a broader understanding of data flow, latency, power and performance.
The rapid growth and dynamic nature of AI and machine learning algorithms are sparking a rush to develop accelerators that can be optimized for different types of data. Where one general-purpose processor was considered sufficient in the past, there are now dozens vying for a slice of the market.
As with any optimized system, architecting an accelerator — which is now the main processing engine in many SoCs — requires a deep understanding of what data is going to be processed and where. So data flow and latency are just as important as the balance between power and performance, and system architectures are becoming much more complex than a series of components connected by a wire.
The general idea is to process algorithms faster than ever before using as little power as possible, whether at the edge, in the data center, or somewhere in between. But with machine learning algorithms evolving faster than hardware can keep up, accelerator architectures are all over the map. They include everything from ASICs to GPUs, FPGAs and DSPs, as well as hybrid versions of those.
“On the edge, accelerators are used for communication as the speed of connectivity is increasing,” said Lazaar Louis, senior director of product management, marketing and business development for Tensilica IP at Cadence. “Accelerators are used for some things in vision, in areas where the implementation is well known. Accelerators also are used for algorithms in AI, as well as in audio.”
Inside the data center, general-purpose processors like CPUs or GPUs typically are used for inferencing, although it is becoming clear that better performance and power efficiency can be achieved by building special accelerators there.
All told, this adds up to a lot of opportunity to turn up the performance while keeping power to a minimum.
Accelerator architectures
So what, exactly, does an accelerator architecture look like today? To Eric van Hensbergen, Arm fellow and director of HPC, there’s no obvious pattern.
“There is such a proliferation of accelerator technologies and no standardization,” van Hensbergen said. “One of the things that’s been a strength of Arm is our instruction set as a concrete standard that a stable software ecosystem can be built around. If we get into an environment where it’s all accelerators and we don’t step up and work with our partners to standardize different classes of accelerators and so forth, then we lose a little bit of that advantage of the standard ISA. My team, together with the architecture team and other groups within Arm, have been working on looking across the spread of different acceleration technologies and starting to think how to standardize but still let our partners shine in their own ways, because we don’t want everyone to look exactly the same.”
He noted that accelerators are unique in that they’re not an instruction set architecture (ISA). They’re an application programming interface (API). “What does that look like and is there enough commonality amongst different APIs and different classes? Can we find a library? Do we integrate it into the ISA? There are a bunch of factors. As a researcher, I don’t know that I’ve seen someone come forward with, ‘This is the general solution,’ or even, ‘This is the solution for this class of accelerators.’ It’s an interesting research question, and that’s why we’re involved and looking at it.”
What the winning formula ultimately will be is unknown.
“There are buckets of designs,” said Ludovic Larzul, CEO of MIPSOLOGY. “This is a CPU, this is a GPU, this is a network chip, and you have maybe five or six different buckets. Today a lot of accelerators are the same. They are built around matrix multiplies, implemented as a 2D array of multipliers/adders with some shifters for weights and values.”
Ideally, hardware would be a one-to-one match with the neural network, which is obviously impossible because it would mean having one set of hardware per neural network, he explained. “So the game of doing hardware architecture for accelerators is how do you create an architecture that is not using much silicon, whatever the implementation—whether it is FPGA or ASICs or something else. And this is not simple because of the size of neural networks, which are growing all the time. It’s not simple because it’s always a balance between the computation, the amount of memory you need to save the data related to those computations, and the distance between the data and the place you have to compute it.”
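To make that tradeoff concrete, here is a minimal C sketch of the operation Larzul is describing: a matrix multiply built from multiply-accumulate steps. The dimensions and names are purely illustrative. A real accelerator unrolls these loops across a 2D array of multipliers/adders, and the hard part, as he notes, is keeping weights and activations close to those units rather than in distant memory.

```c
/*
 * Minimal sketch of the operation most AI accelerators are built around:
 * a matrix multiply expressed as multiply-accumulate (MAC) steps.
 * Hardware unrolls the inner loops across a 2D array of multipliers/adders;
 * the sizes and names below are illustrative only.
 */
#include <stdio.h>

#define M 4   /* output rows   (e.g., a batch of activations) */
#define K 4   /* reduction dim (e.g., input channels)         */
#define N 4   /* output cols   (e.g., output channels)        */

/* C = A * B, the kernel an accelerator maps onto its MAC array. */
static void matmul(const float A[M][K], const float B[K][N], float C[M][N])
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;              /* one accumulator per MAC cell */
            for (int k = 0; k < K; k++)
                acc += A[i][k] * B[k][j];  /* multiply-accumulate          */
            C[i][j] = acc;
        }
    }
}

int main(void)
{
    float A[M][K], B[K][N], C[M][N];

    /* Simple test data: B is the identity, so C should equal A. */
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++)
            A[i][k] = (float)(i + k);
    for (int k = 0; k < K; k++)
        for (int j = 0; j < N; j++)
            B[k][j] = (k == j) ? 1.0f : 0.0f;

    matmul(A, B, C);
    printf("C[1][2] = %.1f\n", C[1][2]);   /* expect 3.0 */
    return 0;
}
```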
Still, with no standard accelerator architecture, everybody’s trying to put a stake in the ground with at least something good enough that can be a standard.
“In many ways, it’s no different than what was done before,” said Stephen Bailey, director of strategic marketing at Mentor, a Siemens business. “It’s just a continuation of the integration of more and more processing capabilities and the specialization of them. SoCs today already have a whole bunch of accelerators that were big a long time ago that are now standard, or were new a long time ago and now are standard. It’s the same thing here—new applications, and finding ways to accelerate them in hardware.”
What needs to be accelerated
As with any other type of optimized hardware, the how of acceleration depends on what must be accelerated. This depends heavily on the software algorithms, and sometimes there are multiple algorithms in the design.
“It’s software-driven hardware design, so it’s kind of backwards,” said Kurt Shuler, vice president of marketing at Arteris IP. “In the old days you’d build a chip and then say, ‘Given this chip, what software can I run on it? And how do I tweak the software to make it more efficient?’ With neural net processing, because you’re trying to get this super huge increase in efficiency, you’re saying, ‘Given the algorithms I’m trying to do, what do I have to do in the hardware?’ And there are two aspects to that. One is how finely to slice the mathematical problems into how many different types of hardware accelerators there are. Two, within the architecture, how do I connect it for the data flow?”
As such, the algorithm is more important than both the software and the hardware, he said. “Ideally you would accelerate everything as much as possible and just have a tiny little bit of software on there that just does station-keeping type stuff. If you wanted to map the best performance, the more you have in the hardware itself, the faster it’s going to run. So each of these teams is doing a calculation as far as, what’s the software bill of materials to do this, what’s the hardware bill of materials, and it’s a slider that they can work with.”
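As a rough illustration of that slider, the following C sketch shows the kind of partitioning decision involved: which operators in a network get a dedicated hardware block, and which fall back to software. The operator names and the "supported in hardware" table are invented for this example and do not describe any particular chip.

```c
/*
 * Hedged sketch of the hardware/software partitioning decision: math-heavy,
 * well-understood operators map to fixed-function blocks, while anything new
 * or rare stays in software. The operator list and hardware table are
 * invented for illustration.
 */
#include <stdio.h>
#include <string.h>

enum target { ON_ACCELERATOR, ON_CPU };

/* Invented table: operator classes this hypothetical SoC accelerates. */
static const char *hw_ops[] = { "conv2d", "matmul", "pooling" };

static enum target place(const char *op)
{
    for (size_t i = 0; i < sizeof hw_ops / sizeof hw_ops[0]; i++)
        if (strcmp(op, hw_ops[i]) == 0)
            return ON_ACCELERATOR;
    return ON_CPU;   /* new or uncommon operators fall back to software */
}

int main(void)
{
    /* A toy network: most layers land in hardware, a newer operator does
     * not -- moving the "slider" means changing where that line sits.     */
    const char *net[] = { "conv2d", "pooling", "matmul", "swish" };
    for (size_t i = 0; i < sizeof net / sizeof net[0]; i++)
        printf("%-8s -> %s\n", net[i],
               place(net[i]) == ON_ACCELERATOR ? "accelerator" : "CPU");
    return 0;
}
```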
Programmability
Because software algorithms are changing faster than hardware can be designed and manufactured, there is growing discussion around including some type of programmability in accelerators.
“This is a major challenge in the industry today,” said Ron Lowman, strategic marketing manager at Synopsys, noting that the impact varies depending upon the application and the device. “Recognizing license plates can be done with a very small processor in a very small footprint. But there are applications that still require very large data sets for coefficients and weights and very complex graphs that have to be computed to give you a good result.”
Mentor’s Bailey predicts one of the application areas that will take advantage of this is 5G, given the rate it is evolving. “Even though it’s going to be standardized, I’m sure that everybody knows that there’s going to be refinement on the implementation—especially as you try to get standardization across many different vendors, from the smaller base station form factors to the end devices that are communicating with them, to make sure that they work across different vendors. You can only do so much with FPGAs initially. You need to drive down that cost and get to an ASIC implementation. But [systems companies] still want to have some programmability left in them.”
As a result, industry players across the board are investigating embedded FPGA fabric IP companies. “That’s a real challenge, especially from a design perspective, to figure out where you need to have the programmability,” Bailey said. “If you don’t design it in, then you’re not going to be able to have the adaptability that you’re expecting to get.”
The problem with programmability in the context of accelerators, “is usually when you go to a programmable hardware fabric, you take a performance hit and an area hit, so you have to figure out where those tradeoffs are,” said Arm’s van Hensbergen. “But that is a grail for a whole bunch of the academic research community where they just say, ‘Okay, we’ll just be programmable and we’ll compile to it.’ That works in an academic environment for certain applications. But making it cost-effective, power-effective, area-effective — we’re a ways off. There are other technologies like CGRA (coarse-grain reconfigurable arrays) and things like that, which are variations on the theme. They seem a bit more practical, and we’re definitely involved through our academic partners as well as within our research team, which is looking at those sorts of technologies, as well.”
Architecture challenges
Still, one of the big challenges for accelerators is that they tend to become fixed-function devices, Louis noted. “If you’re doing an accelerator for something that’s well known and well understood, that’s one thing. But there are a lot of dynamic changes happening with [accelerator applications], so it’s important to build programmability into the accelerator. They have to have flexibility and programmability to adapt to evolving needs.”
What this theoretically could look like architecturally is an SoC with many functions, different processing blocks, with the accelerator being one of them. “To add programmability and flexibility to the accelerator, if you depend on the main processing units like the CPU, you run into a lot of challenges in transferring data between the two,” Louis said. “So the performance becomes a bottleneck. Also, you may run into power issues, so you have to somehow have a level of flexibility built in very close to the accelerator. This could be a DSP in combination with the accelerator, for example. Latency is your key criteria in an application like that. The programmability and the flexibility needs to happen at a very fast pace. That is where a DSP as an add-on to the accelerator helps do that in a flexible way, and also to be able to provide the control and share the data between the accelerator and the processor — essentially handle that in a more efficient way.”
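A simplified sketch of that arrangement, with all function and buffer names invented for illustration, might look like the following: a fixed-function block produces a tile of results into local memory, and a DSP applies whatever post-processing the latest algorithm requires on that same buffer, so the data never round-trips through the host CPU.

```c
/*
 * Hedged sketch of a fixed-function accelerator paired with a programmable
 * DSP sharing local memory. All names are invented; a real design would use
 * DMA, synchronization and a vendor toolchain rather than plain function
 * calls.
 */
#include <stdio.h>
#include <stddef.h>

#define TILE 8

/* Local SRAM shared by the accelerator and the DSP (modeled as an array). */
static float local_buf[TILE];

/* Fixed-function block: a stand-in that produces a tile of layer outputs. */
static void accel_run_layer(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (float)i - 3.0f;   /* pretend these are layer outputs */
}

/* Programmable DSP: applies whatever activation the latest algorithm wants
 * (a simple ReLU here), without the data ever leaving local memory.       */
static void dsp_postprocess(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] < 0.0f)
            buf[i] = 0.0f;
}

int main(void)
{
    accel_run_layer(local_buf, TILE);   /* fixed-function pass        */
    dsp_postprocess(local_buf, TILE);   /* flexible pass, same buffer */
    printf("local_buf[0] = %.1f, local_buf[7] = %.1f\n",
           local_buf[0], local_buf[7]);  /* expect 0.0 and 4.0 */
    return 0;
}
```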
System-level design
Still, the architecture of a device isn’t about the core. It’s about the system.
“The whole system has to work,” said Gajinder Panesar, CTO of UltraSoC. “You could have an Arm, you could have a RISC-V core, MIPS, or your own homegrown stuff. They’re not useful unless you make the whole system work.”
Systemic complexity is difficult enough to manage in homogeneous systems, but it is significantly harder to make sure something works in a heterogeneous system with algorithms that are in a state of constant flux and hardware that needs to adapt to those changes.
“Verifying something in one little bit is not enough,” Panesar said. “You need to verify whole SoCs, and then whole systems. In order to optimize and understand the behavior, you need visibility of that system to understand what your performance is going to be.”
Synopsys’ Lowman agrees. He said there are three components that need to be considered at the system level—processing power, memory constraints and real-time connectivity. All of those play a big role in the types of accelerators used in SoCs, which in turn are defined by whether chips are being used for training or inferencing, as well as where that processing is being done.
Conclusion
So where will the accelerator market be in a few years? No one can say for sure. In part, that will depend on the application.
“On the edge, because of power consumption requirements and the size, etc., it’s hard to go in all the areas that you’d want to go,” said Mentor’s Bailey. “It’s more constrained, and architecting it means how you define the set of functionality that absolutely has to be there. And then, how do you take advantage of connectivity and things that can be done and offloaded? 5G is going to be really big in enabling this because while it’s not quite real time, near-real-time types of reactivity will be possible through it.”
On the server side, meanwhile, the architecture will likely be very different.
“For server-based applications for machine learning, where you chunk all the data to come up with the implementation so they can do things like recognition out on the edge, we see that split happening,” Bailey said. “That will likely continue, and that’s a system-level type of architecture. How do you manage that? You’re still going to have all this big data coming in. The amount of data that’s driving all these applications is increasing by 10X every couple of years.”
That’s a lot of data that needs to be accelerated, which is why this market is suddenly getting a lot of attention.
Related Stories
FPGAs Becoming More SoC-Like
Lines blur as processors are added into traditional FPGAs, and programmability is added into ASICs.
Huge Performance Gains Ahead
Where the next boosts will come from and why.
Processing In Memory
Growing volume of data and limited improvements in performance create new opportunities for approaches that never got off the ground.
The Secret Life Of Accelerators
Unique machine learning algorithms, diminished benefits from scaling, and a need for more granularity are creating a boom for accelerators.
The Next Phase Of Machine Learning
Chipmakers turn to inferencing as the next big opportunity for this technology.