Servers Are Becoming More Heterogeneous

Servers today feature one or two x86 chips, or maybe an Arm processor. In 5 or 10 years they will feature many more.


The number of CPUs in a server is growing, and so is the number of vendors that make those processors.

The standard server build has been one, two, four, and occasionally more x86 processors, with IBM’s Power and Z series as the major exceptions. While x86 processors aren’t necessarily being replaced, they are being complemented and augmented with new processor designs for a variety of more specialized tasks.

In the most recent Top500 supercomputer list, 140 of the supercomputers had Nvidia GPU co-processors, and that number will only grow. Within the next 5 to 10 years, general servers will be shipping with x86 processors, GPUs, FPGAs, Arm cores, AI co-processors, 5G modems, and networking accelerators.

This is recognition that one size does not fit all when it comes to application processing. End markets are splintering, and all of them are demanding customized solutions. As a result, the future of computing — particularly on the server side — is heterogeneous.

“What people are finding is that different chip architectures are better suited to do different types of workloads,” said Bob O’Donnell, president and chief analyst of TECHnalysis Research. “And because that workload diversification is going to continue, the need for diversified compute is going to continue. There are going to be other chips that are necessary. That doesn’t mean CPUs go away by any stretch, but there’s going to be a lot more variety in the other types of chips. And then the big question is going to be around interconnecting packaging.”

Intel has taken an aggressive stance on this with its XPU project, which combines CPU, GPU (through its new Xe GPU), FPGA from Altera, and AI processors, with an API to unify them. “I don’t think there’s going to be a single answer to how these will be in the future,” said Jeff McVeigh, vice president and general manager of data center XPU products and solutions at Intel. “But there will be a wide range of them, from tightly integrated monolithic to multi-chip packages integrated to system-level connections.”

The need for different compute architectures is driven by new data types, argues Manuvir Das, head of enterprise computing at Nvidia. “Every company has more and more data at their disposal. And companies are becoming willing to collect more and more data. And the reason for that is because they can now see that they can get value out of their data.”

The semiconductor industry has witnessed considerable M&A activity in recent months as companies diversify their offerings through purchase rather than organic growth.

  • Nvidia, a company that had not made a big acquisition in years, suddenly opened its checkbook to buy SmartNIC maker Mellanox for $7 billion and Arm Holdings for $40 billion.
  • AMD made its first major acquisition move in years with the proposed $35 billion purchase of Xilinx.
  • Marvell Technology acquired Arm server chip maker Cavium for $6 billion, and networking semiconductor maker Inphi for $10 billion.
  • Analog Devices has signed a deal to acquire Maxim Integrated Products for $21 billion.

“They’re diversifying because they all recognize that they have to have a wide variety of different chip architectures,” said O’Donnell. “The hard part is going to be doing what Intel is trying to do with oneAPI, which is, ‘How do I take these diverse architectures and make them usable by people?’ Each architecture requires different sets of instructions, different ways of programming, different types of compilers, etc.”

One chip or many?
The question then becomes whether there will be one big piece of silicon on the motherboard, or multiple sockets for each chipset. This is hardly a new idea. Systems-on-chip have existed for years. But SoCs are changing.

SoC designs typically pare down the processors, especially the GPU, to fit all of these chips into a reasonable thermal envelope. An SoC with a full CPU, GPU, and FPGA alone would have a TDP of about 700 watts, which would be thoroughly unappealing to anyone. If there are to be multi-chip package designs, they likely will use scaled-down processors.
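
As a rough back-of-envelope illustration, assuming representative TDPs for today’s full-strength parts (the individual numbers below are estimates; only the roughly 700-watt total comes from the discussion above), the arithmetic looks like this:

```cpp
#include <cstdio>

int main() {
  // Illustrative, assumed TDPs in watts -- not figures for any specific product.
  const int cpu_tdp  = 270;   // high-end server CPU
  const int gpu_tdp  = 300;   // data-center GPU
  const int fpga_tdp = 150;   // large FPGA accelerator
  std::printf("combined package TDP: ~%d W\n", cpu_tdp + gpu_tdp + fpga_tdp);
  return 0;
}
```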

“AMD has done some great work in the industry to show that chiplet packaging is possible for CPU cores and I/O chips. And if you wanted to get something a little beefier, you might build entire chiplets that are just, you know, one is maybe CPU cores, one is more neural network engine, and maybe one is a GPU, and you could put them together on the same package,” said Steven Woo, vice president of systems and solutions and distinguished inventor at Rambus.

Intel’s McVeigh is open to a multi-package design, as an option. “There are obviously benefits from doing single package design in terms of the memory bandwidth, but then there are also limits on just how much you can cram into any packaging. So I don’t think there’s going to be a single answer to how these will be in the future. But there will be a wide range of options, from tightly integrated, monolithic to multi-chip packages integrated to system-level connections,” he said.

Nvidia is open to the idea of multi-chip packages, as well, although its vision is like Intel’s: it supplies all of the silicon. Das noted that Nvidia already has an Arm/GeForce SoC in the form of Tegra, and the new BlueField-2 line of data processing units (DPUs) that combine Mellanox ConnectX-6 network controllers with Arm CPUs and Ampere GPUs. In Nvidia’s roadmap, BlueField-4 in 2022 will put all three on a single piece of silicon.

“If you just consider the amount of compute that is going to be done three years from now, and five years from now, if you don’t do it that way, the world just won’t be able to afford it. And so there will be multiple form factors. When you get closer to the edge, it will lean much more toward integrated solutions,” Das said.

But that’s Intel and Nvidia packing all of their own IP into one piece of silicon. When it comes to the prospect of two or more companies working together, say Marvell and AMD, the view is one of doubt.

“It’s going to be difficult,” said Vik Malyala, senior vice president for FAE and business development at Supermicro. “Why would Intel or AMD open up everything about their processor architectures to Nvidia? The same is the case with Nvidia. Why would Nvidia open up all things with respect to their GPU to work with somebody? There’s a reason why they are trying to buy Arm.”

Eddie Ramirez, senior director of marketing for the infrastructure business unit at Arm, said there is precedent for multi-vendor chips. “If you were to look at 10 years ago, we were barely in the infancy of separating your design from your manufacturing. For SoCs now, that’s commonplace. So in the timeframe that you’re talking about, in 5 to 10 years, the ecosystem will develop to where you can build an MCM [multi-chip module] using silicon from different vendors,” he said.

However, he questions whether this is a good idea given that different chips have different lifespans. “It’s one thing to have a server with PCI cards, and you can swap out a card. But when they’re in one package, you’ve got to replace everything at once. Does that work with different lifecycles? That is the interesting piece here,” he added.

Malyala also noted that chip vendors have multiple chips for different performance scenarios, and putting a bunch into one package limits customer options. “Say, for example, if I am Xilinx, I have a dozen different FPGAs. But if I’m putting one in a given piece of silicon, I’m saying this is exactly how it’s going to be, and I’m stuck with that even if I am overprovisioned or underprovisioned,” he said.

The CXL equation
The current fix for non-CPU processors in a server is a PCI Express card. GPUs, SSDs, FPGAs, and other co-processors take up a PCIe slot, and there is only so much room for cards in a server, especially in ultra-thin 1U and 2U designs.

PCIe also has the limitation of being a point-to-point communication protocol. The Compute Express Link (CXL) protocol is rapidly gaining acceptance as an alternative because it runs over the PCIe physical layer and auto-negotiates between the PCIe and CXL transaction protocols.

“What’s really required as we kind of go into these more sophisticated architectures is kind of the various topologies that can be supported, the peer-to-peer communication, the ability to kind of scale those out,” said McVeigh. “PCI Express by itself isn’t going to answer all those problems. But for cases where you want to be able to, obviously, upgrade from existing designs, where you’ve got individual cards, and maybe not needing that full kind of interconnectivity, it does very well there.”

A big plus for CXL is that it places the accelerator closer to the processor through its fast connection and, more importantly, it makes the memory attached to the accelerator part of system memory rather than private device memory. This takes the load off system memory and reduces the amount of data that has to be moved around, since data in a device’s memory (such as a GPU’s) is readily visible without shuttling it back and forth to system memory.
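
To make that distinction concrete, here is a minimal Linux-only sketch. It assumes CXL-attached device memory has been exposed to the host as a DAX character device (the path /dev/dax0.0 is hypothetical); once mapped, the host simply loads and stores to the accelerator’s memory instead of staging explicit copies through a driver.

```cpp
// Minimal sketch, assuming CXL-attached memory surfaced as a hypothetical /dev/dax0.0.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  const size_t len = 2UL << 20;                 // map 2 MB of device memory
  int fd = open("/dev/dax0.0", O_RDWR);         // hypothetical DAX device node
  if (fd < 0) { perror("open"); return 1; }

  void *mem = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (mem == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

  // Ordinary CPU stores land in accelerator-attached memory; a coherent
  // device sees the same bytes without an explicit DMA copy back and forth.
  std::memset(mem, 0xA5, len);

  munmap(mem, len);
  close(fd);
  return 0;
}
```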

Whether the multiple processors are on a single die or multiple dies, they have to be tied together somehow, and CXL is viewed as the mesh to bind them. PCIe has its uses, but it is a point-to-point protocol, not a mesh like CXL. Plus, CXL allows processors to share memory, something PCIe cannot do.

“CXL is definitely very credible,” said Rambus’ Woo. “If the industry really gathers around it, that will be a kind of a stepping stone to the evolution of a new type of interconnect, where we’ll optimize it more heavily around what has to happen to connect nodes to each other — and then maybe to connect processors to memory and disaggregation scenarios, and maybe even connect processors to things like GPUs and storage.”

One example of where CXL comes in is coherent memory access among the different endpoints, said Ramirez. If one accelerator is doing a certain amount of compute and needs to talk to other accelerators, they should be able to speak directly rather than use a hub-and-spoke model where everything has to go through one point for coordination. “PCI Express doesn’t inherently have that capability,” said Ramirez.

It’s possible that a whole new standard will evolve, based on the good parts of PCIe and leaving out the parts that aren’t needed. Woo noted that when two PCI Express devices first start talking to each other, they negotiate using PCIe Gen 1, then step up through successive generations until they find the top speed at which they can talk.

“That whole initialization sequence is a little bit more burdensome,” said Woo. “If you think about it from a silicon designer standpoint, you’ll say, ‘Wait a minute, I have to put all these gates in, and they’re going to get used just to figure out that I can talk faster — and I’m not going to use those transistors anymore.’ There’s a beauty in having that kind of simple protocol. As a silicon designer, I would rather use those gates for something else.”
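
The outcome of that training sequence is visible on any Linux server. A short sketch, assuming a hypothetical device address 0000:65:00.0, reads the negotiated versus maximum link speed that the generation-by-generation handshake Woo describes ultimately settles on:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Read one sysfs attribute, returning "unknown" if it is missing.
static std::string read_attr(const std::string &path) {
  std::ifstream f(path);
  std::string value;
  std::getline(f, value);
  return f ? value : "unknown";
}

int main() {
  const std::string dev = "/sys/bus/pci/devices/0000:65:00.0/";  // hypothetical BDF
  std::cout << "negotiated speed: " << read_attr(dev + "current_link_speed") << "\n"
            << "maximum speed:    " << read_attr(dev + "max_link_speed") << "\n"
            << "link width:       " << read_attr(dev + "current_link_width") << "\n";
  return 0;
}
```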

One API to rule them all
Hardware without software is just a pile of metal, so the real question behind these efforts is how they will be brought together. Intel has the most complete solution with its oneAPI program. oneAPI provides libraries for compute- and data-intensive domains, such as deep learning, scientific computing, video analytics, and media processing.

oneAPI interoperates with code written in C, C++, Fortran, and Python, and with standards such as MPI and OpenMP. It also has a set of compilers, performance libraries, analysis and debug tools, and a compatibility tool that aids in migrating code written in CUDA to Data Parallel C++ (DPC++), an open, cross-architecture language built upon the C++ and Khronos SYCL standards.

DPC++ extends these standards and provides explicit parallel constructs and offload interfaces to support a range of computing architectures and processors. Of course it supports Intel, but McVeigh said he hopes other chip firms will adopt it, as well.
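
As a flavor of what that looks like in practice, here is a minimal DPC++/SYCL sketch (assuming a oneAPI toolchain such as Intel’s icpx compiler is installed). The same source can target a CPU, a GPU, or an FPGA backend; the runtime picks a device, and unified shared memory makes the buffers visible on both sides:

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  sycl::queue q{sycl::default_selector_v};   // runtime picks a CPU, GPU, or other accelerator
  constexpr size_t N = 1024;

  // Unified shared memory: visible to both the host and the selected device.
  float *a = sycl::malloc_shared<float>(N, q);
  float *b = sycl::malloc_shared<float>(N, q);
  float *c = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  // One kernel expressed once, offloaded to whichever device the queue targets.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
    c[i] = a[i] + b[i];
  }).wait();

  std::printf("ran on: %s, c[0] = %.1f\n",
              q.get_device().get_info<sycl::info::device::name>().c_str(), c[0]);

  sycl::free(a, q);
  sycl::free(b, q);
  sycl::free(c, q);
  return 0;
}
```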

“We view it as very much as an industry initiative — glue to tie together these heterogeneous architectures with a unified programming model,” McVeigh said. “And we’ve used that as the critical element to really tie together these architectures so you have a way to program them with a common language, a common set of libraries that works with the OS vendor solutions, not only Intel products.”

O’Donnell believes the software solution will come across the board, from BIOS and driver vendors to Linux distros like Red Hat Enterprise Linux and Ubuntu from Canonical. “It’s such a multi-layered stack,” he said. “As it is now, it’s across the board. I don’t think you’re going to see a single point of solution. There’s just too many pieces involved.”

Conclusion
The server industry will need more proof points for the validity of heterogeneous computing. But it’s not a solution in search of a market. Many markets exist, and new ones are being developed with the rollout of the edge. What’s changed is that solutions are being tailored to them, rather than end markets adapting to the best off-the-shelf technology that’s available.

“Conceptually, it just makes sense that we’re going to need different chip architectures,” O’Donnell said. “We need a single software platform to take advantage of them, but it needs to kind of magically do so, under the covers through this hardware abstraction layer and everything else.”

As people start to use multichip architectures, are we going to see them working the way people expected? Are we getting the performance benefits that were expected? Is it cost-efficient? How does this actually work in the real world?

“Beyond the theory, that’s what remains to be seen,” he said. “We’re going to have to see that at multiple levels. Intel is going to drive it, but you’re going to see other companies try to drive it, as well.”



