What Is An xPU?

Almost every letter of the alphabet has been used to describe a processor architecture, but under the hood they all look very similar.


Almost every day there is an announcement about a new processor architecture, and it is given a three-letter acronym — TPU, IPU, NPU. But what really distinguishes them? Are there really that many unique processor architectures, or is something else happening?

In 2018, John L. Hennessy and David A. Patterson delivered the Turing lecture entitled, “A New Golden Age for Computer Architecture.” What they concentrated on was the CPU and the way that it had evolved, but that is only a small part of the total equation. “Most of these things are not really a processor in the sense of being a CPU,” says Michael Frank, fellow and system architect at Arteris IP. “They’re more like a GPU, an accelerator for a special workload, and there is quite a bit of diversity within them. Machine learning is a class of processors, and you just call them all machine learning accelerators, yet there is a large variety of the part of the processing they accelerate.”

The essence of a processor can be boiled down to three things. “At the end of the day, it really does come back to the instruction set architecture (ISA),” says Manuel Uhm, director for silicon marketing at Xilinx. “That defines what you’re trying to do. Then you have the I/O and the memory, which support the ISA and what it’s trying to accomplish. It’s going to be a really interesting time going forward, because we are going to see a lot more innovation and change than we’ve seen in the last two- or three-plus decades.”

Many of the new architectures are not single processors. “What we are seeing is a combination of different types of processors, or programmable engines, that live in the same SoC or in the same system,” says Pierre-Xavier Thomas, group director for technical and strategic marketing at Cadence. “There is dispatching of the software tasks to different hardware or flexible programmable engines. All of the processors may share a common API, but the execution domain is going to be different. This is really where you will see different types of processing with different types of characteristics.”

The reality is that much of the naming is marketing. “The key thing is that people are using these names and acronyms for two different purposes,” says Simon Davidmann, CEO for Imperas Software. “One is for explaining the architecture of a processor, like SIMD (single instruction multiple data). The other defines the application segment that it is addressing. So it can define either the processor architecture, or a brand name like Tensor Processing Unit (TPU). They are putting a name to their heterogeneous or homogeneous architecture, which is not a single processor.”

A little history
Things were much simpler 40 years ago. There was the central processing unit (CPU), and while there were many variants of it, they were all fundamentally von Neumann architecture, Turing-complete processors. Each had a different instruction set that made it more efficient for certain tasks, and there was plenty of discussion about the relative merits of complex instruction set (CISC) versus reduced instruction set (RISC) designs.

The emergence of RISC-V brought a lot of attention to the ISA. “People want to understand the ISA because it is the ISA that defines how optimized the processor is for a defined task,” says Xilinx’s Uhm. “They can look at the ISA and start counting cycles. If one ISA has a native instruction and operates at one gigahertz, I can compare that to another processor ISA where the same function may require two instructions, but the processor runs at 1.5 gigahertz. Which one gets me further ahead? They do the math for the important functionality.”
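The comparison Uhm describes can be worked through in a few lines. This is only a sketch: it assumes every instruction retires in a single cycle, which the article does not state, and the function name `time_ns` is made up for illustration.

```python
def time_ns(instructions: int, clock_ghz: float,
            cycles_per_instruction: float = 1.0) -> float:
    """Time in nanoseconds to run `instructions` at `clock_ghz`.

    Cycles divided by GHz gives nanoseconds. The default of one cycle
    per instruction is an assumption, not a figure from the article.
    """
    return instructions * cycles_per_instruction / clock_ghz


# One native instruction on a 1 GHz processor.
native = time_ns(instructions=1, clock_ghz=1.0)

# The same function as two instructions on a 1.5 GHz processor.
split = time_ns(instructions=2, clock_ghz=1.5)

print(f"native: {native:.2f} ns, two-instruction: {split:.2f} ns")
```

Under that assumption the native instruction takes 1 ns and the two-instruction sequence about 1.33 ns, so the slower clock wins. A different cycles-per-instruction figure for either processor can reverse the result, which is why the math has to be done per function.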

CPUs have been packaged in many ways. When I/O or memory was put into the same package as the CPU, the result was called a microcontroller unit (MCU).

When modems became fashionable, digital signal processors (DSP) emerged, and they were different because they used the Harvard architecture. That separated the instruction bus from the data bus. Some of them also implemented SIMD architectures that made data crunching more efficient.

The separation of instructions and data was done to increase throughput rates, even though it restricted some fringe programming techniques, such as self-modifying programs. “Often, it is not compute that is the boundary condition,” says Uhm. “It is increasingly the I/O or memory. The industry switched from jacking up compute, to making sure that there’s enough data to keep the compute crunching and maintain performance.”
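One common way to reason about whether compute or memory is the limit is the roofline model. The article does not name it; it is brought in here purely as an illustration, and the peak and bandwidth figures below are hypothetical.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: performance is capped either by the compute
    peak or by how fast memory can deliver operands."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


PEAK_GFLOPS = 1000.0    # hypothetical compute peak
BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth

# Few operations per byte fetched: memory is the boundary condition.
low = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, flops_per_byte=2.0)

# Many operations per byte fetched: compute is the boundary condition.
high = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, flops_per_byte=50.0)

print(f"low intensity: {low} GFLOP/s, high intensity: {high} GFLOP/s")
```

In the first case adding more compute changes nothing, because only a fifth of the existing peak can be fed with data.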

When single processors stopped becoming faster, multiple processors were linked together, often sharing memory and maintaining the notion that each processor, and the total cluster of processors, remained Turing-complete. It didn’t matter which core any piece of a program was executed on. The result was the same.

The next major development was the graphics processing unit (GPU), and this broke the mold because each processing element or pipeline had its own memory that was not addressable outside of the processor. Because the memory was finite, it meant that it could not perform any arbitrary processing task, only the ones that could fit in the provided memory space.

“GPUs are very capable processors for certain types of functions, but they have extremely long pipelines,” notes Uhm. “Those pipelines keep the GPU units crunching on data, but at some point, if you have to flush the pipeline, that’s a huge hit. There is a significant amount of latency and non-determinism built into the system.”
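Why a flush hurts more as pipelines get longer can be seen in an idealized model. The sketch below assumes a pipeline that produces one result per cycle once full, and that every flush costs a full refill; both are simplifications, and the depths and counts are hypothetical rather than figures for any real GPU.

```python
def total_cycles(items: int, depth: int, flushes: int) -> int:
    """Cycles to push `items` through a pipeline of `depth` stages.

    Filling the pipeline costs depth - 1 cycles before the first result
    appears. Each flush is assumed to cost one more refill.
    """
    fill = depth - 1
    return items + fill + flushes * fill


short = total_cycles(items=1000, depth=5, flushes=10)
deep = total_cycles(items=1000, depth=100, flushes=10)

print(f"5-stage pipeline: {short} cycles, 100-stage pipeline: {deep} cycles")
```

With the same ten flushes, the overhead is a few percent for the short pipeline but more than doubles the run time for the deep one. Because the number of flushes depends on the data, so does the latency.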

While many other accelerators have been defined, the GPU — and later the general-purpose GPU (GPGPU) — defined a programming paradigm and software stack that made them more approachable than accelerators of the past. “Over the years, certain jobs have been specialized,” says Imperas’ Davidmann. “There was the CPU for sequential programs. There was the graphics processor, which focused on manipulation of data for a screen and introduced us to a highly parallel world. Tasks were performed using lots of little processing elements. And now there are machine learning tasks.”

What other construction rules are there to be broken that can explain all of the new architectures? In the past, processor arrays were often connected through memory, or a fixed network topology, such as mesh or toroid. What has emerged more recently is the incorporation of a network on chip (NoC) that enables distributed, heterogeneous processors to communicate in a more flexible manner. In the future, NoCs also may enable communications without using memory.

“At this point, NoCs only carry data,” says Arteris’ Frank. “In the future, the NoC could expand into other areas where communication between accelerators goes beyond data. It could send commands, it could send notifications, etc. The communication needs of an accelerator array, or sea of accelerators, might be different than the communication needs of, for example, CPUs or a standard SoC. But network on a chip does not constrain you to just a subset. You can optimize and improve performance by supporting special communication needs of accelerators.”

Implementation architecture
One way that processors differentiate is by optimizing for a particular operating environment. For example, software may run in the cloud, but you may also execute the same software on a tiny IoT device. The implementation architecture will be very different and achieve different operating points in terms of performance, power, cost, or the ability to operate under extreme conditions.

“Some applications were targeted for the cloud, and now we’re bringing them closer to the edge,” says Cadence’s Thomas. “This may be because of latency requirements, or for energy or power dissipation, and that would require a different type of architecture. You may want to have exactly the same software stack to be able to run in both locations. The cloud needs to provide flexibility because it will be receiving different types of applications and has to be able to aggregate a number of users. This requires the hardware on the server to be application-specific capable, but one size does not fit all.”

ML adds its own requirements. “When building intelligent systems with neural networks and machine learning, you need to program new networks and map this to hardware, using software frameworks and a common software stack,” adds Thomas. “You can then adapt the software application to the right hardware from a PPA standpoint. This drives the need for different types of processing and processors to be able to address these needs at the hardware layer.”

Those needs are defined by the application. “One company has created a processor for graph operations,” says Frank. “They optimize and accelerate how to follow graphs, and do operations such as reordering of graphs. There are others that mostly accelerate the brute force part of machine learning, which is matrix-matrix multiplies. Memory access is a particular problem for each architecture, because when you build an accelerator, the most important goal is to keep it busy. You have to get as much data through to the ALUs as it can consume and produce.”
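How much data it takes to keep a matrix-multiply accelerator busy can be estimated with simple arithmetic. The sketch below assumes square n-by-n matrices and ideal on-chip reuse, so that each of the three matrices crosses the memory interface exactly once. That is a best case, and the function names and the 1,000 GFLOP/s peak are hypothetical.

```python
def matmul_flops_per_byte(n: int, bytes_per_element: int = 4) -> float:
    """Operations per byte for an n x n matrix-matrix multiply,
    assuming A, B and C each move through memory exactly once."""
    flops = 2 * n**3                           # one multiply and one add per term
    bytes_moved = 3 * n**2 * bytes_per_element
    return flops / bytes_moved


def bandwidth_needed_gb_s(peak_gflops: float, n: int,
                          bytes_per_element: int = 4) -> float:
    """Memory bandwidth required to keep the ALUs fully occupied."""
    return peak_gflops / matmul_flops_per_byte(n, bytes_per_element)


small = bandwidth_needed_gb_s(peak_gflops=1000.0, n=6)
large = bandwidth_needed_gb_s(peak_gflops=1000.0, n=96)

print(f"n=6 needs {small} GB/s, n=96 needs {large} GB/s")
```

Larger tiles reuse each operand more often and so need less bandwidth per operation, which is one reason local memory size is a central design decision for these accelerators.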

Many of these applications have a number of things in common. “They all have some local memory, they have a network on chip to communicate things around, and each processor, which executes a software algorithm, is crunching on a small chunk of data,” says Davidmann. “Those jobs are scheduled by an OS which runs on a more conventional CPU.”

The tricky bit for hardware designers is predicting what tasks their hardware will be asked to perform. “Although you’re going to have similar types of operation in some of the layers, people are looking at differentiation in the layers,” says Thomas. “To be able to process the neural network requires several types of processing capabilities. It means that you need to be able to process a certain way for one part of the neural network, and then another type of operation may be required to process another layer. The data movement and the amount of data is also changing layer after layer.”

This differentiation can go beyond the data movement. “For genome sequencing, you need to do certain processing,” says Frank. “But you cannot accelerate everything with a single type of accelerator. You have to build a complete set of different accelerators for the pipeline. The CPUs become the guardians that shepherd the execution flow. They set things up, do the DMA, and provide the decision-making process in between. There is a whole architecture task to understand and analyze algorithms and define how you want to optimize the processing of them.”

Part of that process requires partitioning. “There is no single processor type that’s optimized for every single processor task — not FPGAs, not CPUs, not GPUs, not DSPs,” says Uhm. “We created a series of devices that contain all of those, but the hard part on the customer side is that they have to provide the intelligence to determine which parts of this entire system are going to be targeted at the processors, or at the programmable logic, or at the AI engines. Everyone wants that auto-magical tool, a tool that can instantly decide to put this on the CPU, put that on the FPGA, put that on the GPU. That tool does not exist today.”
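The decision itself is easy to express once the costs are known; obtaining those costs is the hard part that today falls to the designer. The toy sketch below makes that point with a hand-written cost table. Every task name, engine name, and number in it is hypothetical, and it ignores the cost of moving data between engines.

```python
from typing import Dict

# Hypothetical estimated cost (lower is better) of each task on each engine.
# In practice these figures come from profiling and human judgment.
COSTS: Dict[str, Dict[str, float]] = {
    "control_flow":    {"cpu": 1.0, "fpga": 5.0, "ai_engine": 8.0},
    "fir_filter":      {"cpu": 6.0, "fpga": 1.5, "ai_engine": 2.0},
    "matrix_multiply": {"cpu": 9.0, "fpga": 3.0, "ai_engine": 1.0},
}


def partition(costs: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Assign each task to the engine with the lowest estimated cost."""
    return {task: min(engines, key=engines.get)
            for task, engines in costs.items()}


mapping = partition(COSTS)
for task, engine in mapping.items():
    print(f"{task} -> {engine}")
```

A per-task minimum like this is far from the tool Uhm describes. It has no view of the whole system, so it cannot see that placing two communicating tasks on different engines may cost more than it saves.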

Still, there always will be a role for the CPU. “CPUs are needed to execute the irregular part of the program,” says Frank. “The general programmability of the CPU has its advantages. It just doesn’t work well if you have specialized data structures or mathematical operations. A CPU is a general processor, and it is not optimized for anything. It’s good at nothing.”

Changing abstraction
In the past, the hardware/software boundary was defined by the ISA, and memory was contiguously addressable. When multiple processors existed, they were generally memory-coherent.

“Coherence is a contract,” says Frank. “It is a contract between agents that says, ‘I promise you that I will always provide the latest data to you.’ Coherence between equal peers is very important and will not go away. But you could imagine that in a data flow engine, coherence is less important because you’re shipping the data that is moving on the edge, directly from one accelerator to the other. If you partition the data set, coherence gets in the way because it costs you extra cycles. You have to look things up. You have to provide the update information.”

That calls for different memory architectures. “You have to think about the memory structure because you only have so much tightly coupled memory,” says Uhm. “You could access adjacent memory, but you quickly run out of adjacencies to be able to do that in a timely fashion. That has to be comprehended in the design. As the tools mature, more of that will start to become understood by the tools. Today it is done by human intelligence, by being able to understand the architecture and apply it.”

There also is a need for higher levels of abstraction. “There are frameworks where you can map, or compile, known networks onto target hardware,” says Thomas. “You have a set of low-level kernels, or APIs, that will be used in the software stack, and then eventually used by the mapper of the neural network. Underneath, you may have different types of hardware, depending on what you want to achieve, depending on your product details. It implements the same functionality, but not with the same hardware, not on the same PPA tradeoff.”
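The idea of one low-level kernel API with different hardware underneath can be sketched as a small registry. This is an illustration only: the target names are hypothetical, and the "vector" version is ordinary Python that merely mimics processing several lanes per step.

```python
from typing import Callable, Dict, List

Vector = List[float]
Kernel = Callable[[Vector, Vector], float]

# One kernel name, several implementations keyed by hardware target.
KERNELS: Dict[str, Kernel] = {}


def register(target: str) -> Callable[[Kernel], Kernel]:
    """Decorator that records an implementation for a given target."""
    def wrap(fn: Kernel) -> Kernel:
        KERNELS[target] = fn
        return fn
    return wrap


@register("scalar_cpu")
def dot_scalar(a: Vector, b: Vector) -> float:
    total = 0.0
    for x, y in zip(a, b):          # one element per step
        total += x * y
    return total


@register("vector_engine")
def dot_chunked(a: Vector, b: Vector, lanes: int = 4) -> float:
    total = 0.0
    for i in range(0, len(a), lanes):   # `lanes` elements per step
        total += sum(x * y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return total


def dot(a: Vector, b: Vector, target: str) -> float:
    """The API the upper software stack calls; the mapper picks `target`."""
    return KERNELS[target](a, b)


print(dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], target="scalar_cpu"))
print(dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], target="vector_engine"))
```

Both calls return the same value through the same entry point. Only the execution differs, which is where the PPA tradeoff Thomas mentions would show up on real hardware.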

That puts a lot of pressure on those compilers. “The main question is how do you program accelerators in the future?” asks Frank. “Do you implement hardwired engines that are just strung together like the first generation of GPUs? Or do you build little programmable engines that have their own instruction set? And now you have to go and program these things individually and connect each of these engines, executing tasks, with a data flow. One processor has some subset of the total instruction set, another one has a different subset, and they will all share some overlapping part for the control flow. You might have some that have slightly different acceleration capabilities. The compilers, or the libraries that know about it, map accordingly.”

The architecture of processors is not changing. They still abide by the same choices that have existed for the past 40 years. What is changing is the way in which chips are being constructed. They now contain large numbers of heterogeneous processors that have memory and communications optimized for a subset of application tasks. Each chip has made different choices about the processor capabilities and what they are optimized for, about the required data throughput, and about the data flows that typically will be seen.

Every hardware provider wants to differentiate its chip from the others, but that’s a lot easier to do by branding than by talking about the technical details of the internals. And so they give it a name and call it the first, the fastest, the largest, and tie it to a particular type of application problem. The three-letter acronyms have become application task names, but they do not define the hardware architecture.
