Sharing resources can significantly improve utilization and lower costs, but it’s not a simple shift.
Data centers are undergoing a fundamental shift to boost server utilization and improve efficiency, optimizing architectures so available compute resources can be leveraged wherever they are needed.
Traditionally, data centers were built with racks of servers, each providing computing, memory, interconnect, and possibly acceleration resources. But when a server is selected for a job, some of those resources go unused, even though they are needed elsewhere in the data center. Under the current model those idle resources cannot be put to work, because the server blade is the basic unit of allocation.
This has led to a complete reorganization in the hyper-scaler data centers to use compute resources more efficiently, and the idea now is beginning to percolate through other data centers.
“Amazon, Microsoft, and Google are operating at a much bigger scale,” said Rakesh Renganathan, director of marketing, power management, power and sensors solutions business unit at Infineon. “A few years back, they were buying servers from PC vendors. Now they are building their own. The scale is big enough that simple architectural changes that they can influence and control today can turn into millions of dollars of savings.”
Adding that flexibility sounds straightforward enough, but it represents a massive change. “Everyone’s trying to make data center resources a service on tap — both on the software side and now adding hardware as a service on tap,” said Arif Khan, group director of product marketing for PCIe, CXL and interface IP at Cadence.
There is a new movement, generally referred to as “data-center disaggregation,” that moves away from the server as the basic unit. Instead, various resources are pooled together and allocated as jobs require. But composing and interconnecting the resources isn’t trivial. And a move to this architecture must be done in an evolutionary manner, without disrupting the older architecture resources already in place.
Focusing on the mix
The idea of a data center started as a place where multiple servers could be co-located and called on demand for computing. As the computing done in data centers has become more intensive, however, a single job may exceed the capacity of a single server. That has been addressed by allowing multiple servers to be engaged on one job, in theory limited only by the number of accessible servers.
As data centers are interconnected, the number of accessible servers is no longer restricted to those in a particular building or campus. And as fiber connects different locations, greater distances no longer carry the latency penalties they once did.
All of this has helped scalability — the ability to scale resources in accordance with the needs of any particular job. Having done that, however, we’re now faced with the next level of inefficiency — the mix of resources to be used on a given job. For instance, a given blade may have a fully utilized CPU, with a GPU helping perform some of the work. “You have a solution where I’m maximizing my CPU usage, but my GPU is only 30% virtualized,” said Renganathan. “That means 70% is overhead with no return on investment.”
Meanwhile, the amount of data that needs to be processed is accelerating. “We’re not even talking about 5G being mainstream yet,” noted Renganathan. “So the amount of data we generate is still kind of diluted.”
Data centers are well aware of this pending data deluge. “The people who have to deal with all the data, reflecting consumer behavior in this time of hyperconnectivity and hyper-scalability, have to react to this to make it as efficient as possible,” said Frank Schirrmeister, senior group director, solutions and ecosystem at Cadence. “And we now have the capability to do it.”
Different kinds of disaggregation
Disaggregation in the data center doesn’t mean the same thing to everyone, as there are multiple drivers for departing from the server-as-unit model.
One effort disaggregates the networking from the rest of the server. “Network disaggregation is now gaining traction in both cloud and enterprise data centers as a way to lower total cost of ownership,” said Eddie Ramirez, vice president of marketing for Arm’s Infrastructure line of business.
This places dedicated data-processing units (DPUs) onto boards for all of the networking functions, relieving the CPUs of any need to execute code relating to communications so they can concentrate on the actual data workload.
While the focus here is the data center, disaggregation also is happening at the chip level to a certain extent. “Large monolithic dies are being re-designed as chiplets to improve yield and lower cost,” said Ramirez. “At the same time, specialized processing (packet processing, GPUs, NPUs) are being integrated with general-purpose cores on-die, and also in chiplet form.”
Co-packaging of high-bandwidth memory (HBM) with CPUs or other accelerator SoCs brings memory and computing closer together, and integrating other kinds of processing within the same package pushes in the same direction. Both run counter to disaggregation, trading optimal utilization for the low latency that tight integration provides.
That said, disaggregation is largely a data-center phenomenon. At the edge, higher levels of integration bring obvious benefits.
Reducing the cost of a data-center refresh
There are two main motivators for disaggregating servers in the data center. One of them is the cost of updating or “refreshing” servers to bring in a new generation of processor and/or memory. Such updates are needed relatively often.
But a complete server blade contains much more than CPUs and memories or other chips. It also includes fans, chillers, and other infrastructural elements that don’t need to be refreshed. If an entire server is replaced, then all of the infrastructure has to be replaced, as well, even though only the computing part is really getting an upgrade.
“There are a lot of white papers on these different blades and how they’re trying to reduce total cost of ownership by just swapping out a blade that has a CPU and memory, leaving the other components intact – because each of these components has a different life cycle,” said Khan.
In fact, some of the infrastructure components may not have to be swapped out for decades. “The server platforms are typically on four-year cycles on average,” said Renganathan. “If you have to spend on infrastructure every time you refresh, it’s just not productive.”
By separating out the computing from the infrastructure, the cost of a refresh goes way down. And perfectly good infrastructure isn’t simply thrown away by virtue of sharing real estate with obsolete computing units.
The server as unit of computing
In the traditional CPU-centric view of computing, scaling by adding servers has made sense because it scales the computing power. But CPUs don’t operate on their own. At the very least, they need memory, storage, and a way to talk to other internal or external entities. They also may be assisted by alternative computing resources like GPUs or other purpose-built accelerators.
“On each server blade, you have a CPU, you have a memory, and you have storage or even a GPU,” said Jigesh Patel, technical marketing manager at Synopsys. “Now, once you select a server blade — let’s say the application occupies just the CPU but not the memory or GPU — that resource is wasted.”
Fig. 1: A simplified view of multiple server blades, each with a prescribed set of resources. Source: Bryon Moyer/Semiconductor Engineering
With the current server model, the only way to provide flexibility is to have different server blades with different mixes of resources. But that can become complicated to manage. For example, say a particular server comes with a four-core CPU and 16 GB of RAM. If a job needs the four cores but only half the memory, then 8 GB of memory will sit unused. Alternatively, say it needs 24 GB of RAM, and no server with that configuration is available. That means one of two things: either storage must be used to hold less-frequently accessed data, slowing performance, or a second blade is needed.
If a second blade is needed, it comes with its own CPU. That CPU will be used either to do nothing but manage access to the extra DRAM, which is a waste of computing power, or the program needs to be partitioned to run across two CPUs. That latter approach is not trivial and, depending on the application, might not even be possible.
Each of these examples shows a mismatch between the needed computing power and the required memory. The same could hold true for other resources, as well.
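To make the arithmetic concrete, here is a minimal sketch of the mismatch, using the hypothetical blade shape and job sizes from the example above (the figures and helper names are illustrative only, not drawn from any vendor's configuration):

```python
import math

# Hypothetical fixed blade shape from the example above.
BLADE = {"cores": 4, "ram_gb": 16}

def blades_needed(job_cores, job_ram_gb):
    """Whole blades required when the blade is the unit of allocation."""
    return max(math.ceil(job_cores / BLADE["cores"]),
               math.ceil(job_ram_gb / BLADE["ram_gb"]))

def stranded(job_cores, job_ram_gb):
    """Resources allocated to the job but left idle."""
    n = blades_needed(job_cores, job_ram_gb)
    return {"cores": n * BLADE["cores"] - job_cores,
            "ram_gb": n * BLADE["ram_gb"] - job_ram_gb}

print(stranded(4, 8))   # {'cores': 0, 'ram_gb': 8}  -> half the blade's memory sits idle
print(stranded(4, 24))  # {'cores': 4, 'ram_gb': 8}  -> a second blade strands a whole CPU
```

Either way, the fixed blade shape, not the job, determines how much hardware gets tied up.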
In its purest sense, the idea of the disaggregated architecture is to break the server up and pool like resources with like. So CPUs all go into one bucket, memory goes into another bucket, and perhaps GPUs go into yet a different bucket.
“You could completely imagine re-architecting everything if you had connections that would allow you almost unlimited distance at extremely high speed,” said James Pond, principal product manager, photonics at Ansys. “You might start putting all your CPUs together in one place and all your memory together in another place.”
Others point to similar trends. “People are trying to disaggregate each of [the server elements],” said Patel. “Instead of selecting a server blade, you select the CPU, memory, hard disk, and the GPU separately so that you can use the resources more efficiently.”
When an application spins up in this case, an assessment would be made as to what resources are needed and how many of each are required. Those resources would be allocated out of their respective buckets and composed into a virtual custom server. That server would have just enough resources to do the job, leaving the unused resources available for other jobs.
Fig. 2: In a disaggregated architecture, the different resources are pooled in separate “buckets.” A given application will have a processing platform composed out of the necessary components. Source: Bryon Moyer/Semiconductor Engineering
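A minimal sketch of that composition step might look like the following. The pool sizes and the compose/release interface are assumptions for illustration, not a real orchestration API:

```python
# Resources pooled per rack or per pod; sizes are purely illustrative.
pools = {"cpu_cores": 512, "ram_gb": 8192, "gpus": 64}

def compose(request):
    """Carve a job-sized 'virtual server' out of the shared pools."""
    for resource, amount in request.items():
        if pools.get(resource, 0) < amount:
            raise RuntimeError(f"pool exhausted: {resource}")
    for resource, amount in request.items():
        pools[resource] -= amount
    return dict(request)

def release(server):
    """Return a composed server's resources to the pools when the job finishes."""
    for resource, amount in server.items():
        pools[resource] += amount

node = compose({"cpu_cores": 4, "ram_gb": 24, "gpus": 1})
# ... run the workload on the composed node ...
release(node)
```

Anything not claimed stays in the pool for the next job, which is the point of the model.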
Making this happen, however, isn’t a matter of simply moving things from one board to another. Much of our infrastructure assumes the server-based approach, so alternatives will need work.
Memory as a conundrum
Memory access theoretically can be accomplished using the new CXL protocol, which allows different memories to be aggregated into pools while abstracting away their differences. That allocation isn’t limited to whole physical devices. Memory chips can be partitioned, and those partitions can be allocated as virtual memories.
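Conceptually, and setting aside CXL’s actual transaction and management mechanics, the pooling works something like the sketch below. The class and method names are illustrative assumptions, not the CXL API:

```python
class PooledMemoryDevice:
    """Toy model of a shared memory device carved into per-host partitions."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.partitions = {}                    # host -> GB assigned to it

    def free_gb(self):
        return self.capacity_gb - sum(self.partitions.values())

    def assign(self, host, size_gb):
        """Hand a slice of the pool to a host as if it were local capacity."""
        if size_gb > self.free_gb():
            raise MemoryError("pool exhausted")
        self.partitions[host] = self.partitions.get(host, 0) + size_gb

    def release(self, host):
        """Return a host's partition to the pool."""
        self.partitions.pop(host, None)

pool = PooledMemoryDevice(capacity_gb=1024)
pool.assign("server-17", 24)    # the 24 GB job from the earlier example
pool.assign("server-42", 8)
```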
“This way you don’t have to buy the largest amount of memory for every server,” said Steven Woo, fellow and distinguished inventor at Rambus. “You buy something based on what you expect across your whole data center. This allows you to have a minimal amount of memory per server — what you need for the minimum size job — and then to borrow what you need out of this pool for the larger jobs. That will make more economic sense if you can get it right.”
Woo noted that this also allows memory to be replaced on a different schedule than servers. “The natural life cycles of all those technologies are different, so it would be nice to replace them differently,” he said. “Being able to pool your resources means you don’t necessarily have to replace the disks or the DRAM at the same time you replace the CPUs. You can do it when it makes the most sense.”
This helps address the problem of accessing larger quantities of memory than might be available today. “There’s a size versus speed tradeoff, because a lot of these datasets tend to be very large,” said Khan. “And there’s a limit to the amount of DRAM or storage-class memory that can be physically close to the CPU. That’s what’s pushing some of these things out.”
The big challenge is latency. “We’re seeing demand for even more memory,” added Marc Greenberg, group director of product marketing for DDR, HBM, flash/storage and MIPI IP at Cadence. “At some point, you do need to have datasets close enough. If everything is going to be a network packet, latencies are going to kill you. While you can visualize CPUs and accelerators and memories as virtualized, there are physical limitations to how far away they can be.”
However, this raises a fundamental question. “If you put the memory a little bit further away, but you had a lot more of it, are you now better off?” asked Greenberg.
Memory pooling
This is leading to discussions on how to pool memory farther away from the processor. Memory like HBM can be close, while other memory can be much farther away, but they could still work together in a hierarchy. “If I put all my DDR memory in a big pool, I could have this local HBM memory that acts more like a cache,” Woo said. “You might have a pool per rack, you might have a pool for every couple of racks, because those would be the distances where you could service them with copper interconnects. If you start thinking about longer distances, you might start thinking about a different kind of interconnect technology. You might have a pool for every row or every few rows in a data center. They could all make sense, depending on what the workloads really look like. It’s like one of these Goldilocks type problems, where you don’t want it to be too far, and you don’t want it to be too near. Right now, the industry is still researching and experimenting to see exactly how far away they want these.”
The challenge is hiding that level of hierarchy from the programming. “The right mechanisms are going to be necessary for more advanced users to allow them to specifically place data in different tiers,” Woo noted. “But also providing an abstract model where programmers may not have to think about it if they don’t want to.”
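A minimal sketch of that two-level arrangement, with a small near tier acting like a cache in front of a large pooled far tier, could look like this (the LRU promotion policy and the class are illustrative assumptions, not a shipping memory manager):

```python
from collections import OrderedDict

class TieredMemory:
    """Toy model: fast near tier (e.g. HBM) in front of a large pooled far tier."""

    def __init__(self, near_capacity):
        self.near = OrderedDict()       # hot pages, kept in LRU order
        self.far = {}                   # everything else lives in the pool
        self.near_capacity = near_capacity

    def write(self, page, data):
        self.far[page] = data           # new data lands in the pooled tier first

    def read(self, page):
        if page in self.near:           # near-tier hit: low-latency access
            self.near.move_to_end(page)
            return self.near[page]
        data = self.far[page]           # far-tier access: higher latency
        self._promote(page, data)
        return data

    def _promote(self, page, data):
        """Pull a hot page into the near tier, demoting the coldest page if full."""
        if len(self.near) >= self.near_capacity:
            cold_page, cold_data = self.near.popitem(last=False)
            self.far[cold_page] = cold_data
        self.far.pop(page, None)
        self.near[page] = data
```

An LRU policy is only one possible choice; real tiering schemes also weigh access frequency, page size, and the cost of moving data across the fabric.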
At this week’s OCP Global Summit there was considerable discussion about tiered memory and how it might work. There is much work yet to do, but it is an area of significant investment.
Making far look like near
All of this connecting up of resources has one huge requirement: the communication that happens while the application is running must be fast enough that the composed system feels like a single server. Once copper connections run out of speed, optical connections are the next obvious step. But rebuilding data-center racks with fiber involves other choices that the industry is still working out.
“If you have optical interconnects between ICs, you’ve now got something where the distance [signals] travel really doesn’t matter up to maybe a couple hundred meters,” said Pond. “That would suddenly change your thinking about how you would design a motherboard or a data center.”
“Photonics is much faster,” said Patel. “For such short distances, there is hardly any latency.” For copper interconnect, “… you might need several serial links, which you could cover with one optical waveguide for the same data. Disaggregation, in other words, is going to push more towards photonic connections.”
As a caution, however, optical is not guaranteed to provide the kind of short-hop speedup it delivers for long-reach applications. This is because of the need for data conversion.
“You still need to convert data into something else,” said Greenberg. “With the SerDes, you convert it into a serial waveform. And in an optical system, you convert it into light. So there’s still a conversion that happens that takes latency.”
Over long distances, the conversion times would be very small in comparison with the actual transmission times. This is where the faster transmission of optical wins out.
But over short distances, that transmission time will make up a smaller percentage of the overall communication time, including conversions. Given the higher complexity (or at least the increased specialization) of optical, it may be a harder sell than just pushing serial copper links faster.
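A back-of-the-envelope model makes that tradeoff visible. Every figure below is an illustrative assumption rather than a measured value; the point is only that a fixed conversion overhead dominates at short reach and washes out at long reach:

```python
def link_latency_ns(distance_m, conversion_ns, prop_fraction_of_c=0.7):
    """One-way latency: conversion overhead plus time of flight."""
    c = 3.0e8                                       # speed of light, m/s
    flight_ns = distance_m / (c * prop_fraction_of_c) * 1e9
    return conversion_ns + flight_ns

for d in (1, 100, 10_000):                          # board, rack row, campus (meters)
    copper = link_latency_ns(d, conversion_ns=5)    # SerDes only (assumed figure)
    optical = link_latency_ns(d, conversion_ns=20)  # SerDes + electro-optical (assumed)
    print(f"{d:>6} m: copper {copper:8.1f} ns, optical {optical:8.1f} ns")
```

At one meter the assumed extra conversion is most of the latency budget; at ten kilometers it is noise.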
That said, bandwidth will also play a role, driving towards photonics in the long term. “Electronic transceivers just can’t handle 800 or 1000 Gbps,” said Rich Goldman, director, electronics and semiconductors business unit at Ansys. “So it’s going to be a requirement to go to photonics.”
The disaggregated data center won’t arise overnight
One of the major challenges of any architectural change like this is that it cannot require ripping out the immense base of computing resources built up over the last decades. And however it happens, the business of the internet cannot shut down while hardware is overhauled. That means there must be a way to maintain compatibility with the current default architecture while incrementally adding resources under the new one.
Even so, it will take years for a new model to appear. “By the time IP solutions manifest themselves in available SoCs, and when those products get into servers and systems, there’s a long lead time for that,” said Khan.
CXL, an accepted standard, is only now seeing implementations of its first version come to market. CXL 2.0 implementations are a couple of years out. Other standardized ways of disaggregating and composing resources have yet to be finalized, so broader disaggregation is even further away.
“How realistic is it to make this practically viable?” asked Khan. “That’s something the system guys have to demonstrate. And then, is it commercially viable?”
So far, the answer isn’t clear. “These are architectures that are discussed and debated,” said Twan Korthorst, director, photonic solutions at Synopsys. “And in the meantime, technologists are trying to build all those individual building blocks to support these kinds of future architectures.”
Intel summarized its view of the disaggregated rack in a whitepaper: “Various data-center hardware resources, such as compute modules, non-volatile memory modules, hard disk (HDD) storage modules, FPGA modules, and networking modules, can be installed individually in a rack. These can be packaged as blades, sleds, chassis, drawers, or larger physical configurations…Resources being managed will have at least two network connections: a high-bandwidth data connection (Ethernet or other fabric), and a separate out-of-band Ethernet link to a dedicated management network…A rack can be connected to other racks making up a management domain called a ‘pod.’ …Once composed, the node can be provisioned with a bare-metal stack (operating system and application code), a virtual machine environment such as KVM or VMware, or a container environment like Docker.”
The payoff
Of the two main drivers of compute disaggregation (simpler refreshes and better resource utilization), only one has a direct financial payoff. Replacing compute units while leaving infrastructural elements in place will reduce the capital expenditure needed for a refresh.
Optimizing compute resource utilization, on the other hand, is not likely to show up as a direct dollar savings. The return on investment comes instead from throughput: more jobs can be done with the new architecture than with the old.
If computing load requirements were stable, then data centers could downsize — or perhaps delay the building of new racks based on doing more with the old ones. But requirements are anything but stable, and they promise to keep increasing without obvious end. So data center managers will still be able to do more with a better-optimized setup, and it may well be worth doing, but it’s not likely to show up as an outright reduction in spending — capital or operational.
As for network disaggregation, it has support from existing and new players. “Disaggregation is recognized as necessary across all aspects of infrastructure providers,” said Ramirez. “It is more disruptive to legacy business and operating models and friendlier to new entrants. However, even incumbent suppliers have been big advocates for moving such functions to network-optimized resources.”
Ultimately, segregating resources of whatever kind is necessary simply because there is no single universal best way to do things. “There’s never going to be one optimal architecture that fits everything,” said Khan. “You can’t design one perfect system that’s going to do your genomic studies and commercial transaction processing because the workload behaviors are different, and the peak performances are different.”
By composing resources differently for each application, one can come closer to optimal.
Related
Shifting Toward Data-Driven Chip Architectures
Rethinking how to improve performance and lower power in semiconductors.
RISC-V Targets Data Centers
Open-source architecture is gaining some traction in more complex designs as ecosystem matures.
Sweeping Changes Ahead For Systems Design
Demand for faster processing with increasingly diverse applications is prompting very different compute models.
Will Co-Packaged Optics Replace Pluggables?
New options open the door to much faster and more reliable systems.
Thank you for this excellent overview of the challenges and promises of disaggregation. One aspect not fully explored is what IDC calls “composable disaggregated infrastructure,” or CDI. It’s a mouthful, but it describes the pooling of resources normally found inside the server, such as accelerators and storage, into expansion chassis or pooling appliances that can then be shared across servers. It also lets a server enumerate more resources than it could natively (say, 32 GPUs to one CPU), opening up opportunities for faster time to results. Yet another possibility today is an all-PCIe rack, where PCIe becomes a routable network fabric connecting nodes and pooling appliances at native PCIe latency and bandwidth, still of course with an Ethernet connection for out-of-band management. GigaIO has installed such a system at TACC and other locations.