System Design For Next-Generation Hyperscale Data Centers

New networking and architecture co-design opportunities are creating a fundamental shift in the data center.


We are hyperscaling the volumes of data that our devices and sensors create, processing this data along the way at far and near edges, and transmitting it through networks to data centers for further processing. As a result, the data center itself is undergoing a fundamental shift, with new networking and architecture co-design opportunities. In a previous blog post, I addressed some aspects of everything becoming programmable and the four core elements that Facebook considers for its data center development: compute, storage, memory, and networking. This post adds Google’s view to that picture, specifically how networking and compute intertwine in the next, fifth epoch of distributed computing.

What’s in a system? What is a system to begin with?

Very early in my career, I had humbling experiences with “system design” and “system complexity.” My first chip design, the topic of my master’s thesis, was a chip for the fast Fourier transform (FFT), funded by the German Telekom. It seemed incredibly complex at the time I was working on it, like a system in itself.

That was until I realized that it would be used four times on a single board. The first two instances performed the two-dimensional transformation from the discrete time domain to the discrete frequency domain, and their output fed a chip doing the intermediate calculations. Two more instances then transformed the result back from the frequency domain to the time domain, feeding yet another chip that determined the maximum of the data for the duration of a video frame. Voilà, six chips on a board were doing what is referred to as phase correlation.
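For readers who have not met phase correlation, the data flow above can be sketched in a few lines of Python. This is only the textbook form of the algorithm under my own assumptions, not a description of the original chips: the names `dft2`, `phase_correlate`, and `shifted` are hypothetical, the transform is a naive DFT rather than an FFT, the block is a random 8x8 test pattern, and the shift is circular.

```python
import cmath
import random


def dft2(block, inverse=False):
    """Naive 2D DFT done as a row pass followed by a column pass."""
    sign = 1 if inverse else -1

    def dft1(vec):
        n = len(vec)
        out = [
            sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t, v in enumerate(vec))
            for k in range(n)
        ]
        # The inverse transform is scaled by 1/n in each pass.
        return [x / n for x in out] if inverse else out

    rows = [dft1(list(r)) for r in block]        # transform every row
    cols = [dft1(list(c)) for c in zip(*rows)]   # then every column
    return [list(r) for r in zip(*cols)]         # transpose back to [y][x]


def phase_correlate(prev, curr):
    """Return (dy, dx), the circular shift that maps prev onto curr."""
    f_prev, f_curr = dft2(prev), dft2(curr)
    # Intermediate calculation: normalized cross-power spectrum.
    cross = []
    for row_p, row_c in zip(f_prev, f_curr):
        out_row = []
        for p, c in zip(row_p, row_c):
            prod = c * p.conjugate()
            mag = abs(prod)
            out_row.append(prod / mag if mag > 1e-12 else 0j)
        cross.append(out_row)
    # Back to the spatial domain: a sharp peak marks the displacement.
    surface = dft2(cross, inverse=True)
    # Maximum search over the correlation surface.
    positions = ((y, x) for y in range(len(surface))
                 for x in range(len(surface[0])))
    return max(positions, key=lambda yx: surface[yx[0]][yx[1]].real)


def shifted(frame, dy, dx):
    """Circularly shift a frame down by dy rows and right by dx columns."""
    n, m = len(frame), len(frame[0])
    return [[frame[(y - dy) % n][(x - dx) % m] for x in range(m)]
            for y in range(n)]


rng = random.Random(0)  # fixed seed so the test pattern is repeatable
frame = [[rng.random() for _ in range(8)] for _ in range(8)]
motion = phase_correlate(frame, shifted(frame, 2, 3))
print(motion)
```

The three stages map loosely onto the board: forward transforms, an intermediate per-frequency calculation, inverse transforms, and a maximum search whose position is the candidate motion vector.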

That board became part of the next bigger system, which basically pre-processed the values before the pattern matching of blocks was performed in yet another chip. When combined, all of these accelerated the calculation of the motion vectors needed to do the MPEG encoding of HDTV video signals, the predecessor of the algorithms powering the video in our Zoom/WebEx/Teams calls today.

The point of the story? Whatever looks like a system to its developer becomes a component within the next system of bigger scope.

Turning to data centers, what is referred to as system design depends heavily on the scope and context of the object under design, driven by top-down requirements. Inspired by keynotes from Facebook and Mellanox, I charted the progression of hardware and software in the data center from 2008 to 2020 in my blog post “The Four Pillars of Hyperscale Computing.” The inspiration for next-generation data center design requirements comes from Google’s keynote, “Plotting a Course to a Continued Moore’s Law,” given by Partha Ranganathan and Amin Vahdat at the Open Networking Foundation in September 2019.

Partha’s part of the talk was about “the slowing of Moore’s law and the opportunities it has for co-design across the intersection of computer architecture and networking.” He motivated it with graphs contrasting the improvement of SPECInt Rates 2006 Median with the hours of video uploaded to YouTube per minute (eerily similar to the design gap graphs from ITRS), and with the 300,000X increase in AI compute demand in just five years, from AlexNet in 2012 to AlphaGo Zero in 2017. From there, the talk illustrated the lessons learned when building efficient hardware accelerators. It is all about the proper system design of software, hardware compute, storage, memory, and networking. If an accelerator is not properly accessible, it does not really accelerate. The conclusion drawn from all these challenges is to extend programmability to the full data center, as the graph in my previous post showed, and to further integrate networking, compute, memory, and storage.
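To put the 300,000X figure into perspective, here is a quick back-of-the-envelope calculation. The growth factor and the five-year span come from the keynote as quoted above; the two-year doubling cadence used for comparison is my own simplifying assumption about the classic Moore's law pace, not a number from the talk.

```python
import math

GROWTH = 300_000          # AI compute-demand increase quoted in the keynote
YEARS = 5                 # AlexNet (2012) to AlphaGo Zero (2017)
MOORE_DOUBLING_YEARS = 2  # assumption: classic two-year doubling cadence

doublings = math.log2(GROWTH)                    # doublings needed for 300,000X
months_per_doubling = YEARS * 12 / doublings     # implied doubling period
moore_gain = 2 ** (YEARS / MOORE_DOUBLING_YEARS) # gain from two-year doubling
gap = GROWTH / moore_gain                        # demand growth vs. that gain

print(f"{doublings:.1f} doublings, one every {months_per_doubling:.1f} months")
print(f"two-year doubling over the same span: {moore_gain:.1f}x, gap {gap:,.0f}x")
```

Under these assumptions, demand doubled roughly every three to four months, while a two-year cadence would have delivered less than a 6X gain over the same period. That difference is the gap the co-design argument is trying to close.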

Amin’s portion of the Google keynote is a great illustration of the latter. He discussed how the next-generation network interconnect between servers has to reach microsecond latency in the context of new software and hardware for direct reads/writes of remote servers across the resource pools “compute,” “flash,” “NVRAM,” and “DRAM.” He referred to this as the fifth epoch of distributed computing and named it “Max General Compute, Accelerators and Tail at Scale.” It requires 10µs latencies for memory/storage access with distributed runtimes and multi-server, multi-threaded concurrency. The fourth epoch, “Max Single Thread, Massive Scale Out,” is the one that we are in today. It requires 100µs latencies with fault tolerance, load balancing, non-uniform memory accesses, and multithreaded concurrency for performance.
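A small sketch shows why the step from 100µs to 10µs, and the “Tail at Scale” part of the epoch name, matter so much. The 1 ms request budget, the 1% chance of a slow response per server, and the fan-out of 100 are illustrative assumptions of mine, not numbers from the keynote, and the function names are hypothetical.

```python
def serial_accesses(budget_us: int, latency_us: int) -> int:
    """How many dependent remote accesses fit into a latency budget."""
    return budget_us // latency_us


def slow_request_fraction(p_slow: float, fanout: int) -> float:
    """Chance that at least one of `fanout` parallel calls is slow,
    assuming each call is slow independently with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


BUDGET_US = 1_000  # assumed end-to-end budget of 1 ms per request

fourth_epoch = serial_accesses(BUDGET_US, 100)  # 100 µs per remote access
fifth_epoch = serial_accesses(BUDGET_US, 10)    # 10 µs per remote access
tail = slow_request_fraction(0.01, 100)         # 1% slow servers, fan-out 100

print(f"dependent accesses per request: {fourth_epoch} vs. {fifth_epoch}")
print(f"requests touching a slow server: {tail:.1%}")
```

With these assumed numbers, a request can chain ten times as many remote accesses in the same budget, yet nearly two out of three wide fan-out requests still wait on at least one slow server. Average latency alone is therefore not enough; the tail has to be designed for as well.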

These are some ginormous challenges. How do you deal with the design to address them?

Well, it looks like one big set of co-design exercises across hardware and software, networking and compute architecture, and electronic/mechanical/thermal effects. These challenges exist at several levels of scope, from data centers all the way to the semiconductor IP in the underlying chips. They require co-design starting at the multi-rack data center infrastructure, which by itself is just a component of the full hyperscale network enabling the data journey from sensors through devices, near and far edge, and networks to the data center. From there, co-design extends through the clusters at which microsecond latency is achieved and down through the levels below: the boards; the packaged chips, which due to the reticle limit are often 3D-IC integrations of chiplets; the systems on chips (are we allowed to call them systems anymore?); the subsystems within the chips; and the processor and design IP enabling the subsystems.

The illustration below shows the different levels of scope from a tools and semiconductor IP perspective. Computational software is at the core of all of it, and traditional EDA is only a component of the much bigger market of technical software. The tools must properly interact to address the challenges at the scope of racks, servers, PCBs, daughtercard modules, packaging, the IC itself, and the underlying processor and design IP. The related challenges are truly cross-disciplinary, with critical cross-dependencies: electromagnetic and electro-thermal effects requiring finite element analysis (FEA) and computational fluid dynamics (CFD), complex signal and power integrity, and EM radiation.

As indicated, artificial intelligence and machine learning (AI/ML) play a dual role. AI/ML hardware is enabled by EDA and computational software flows, and those flows in turn use AI/ML to become more productive. For more details, please see my recent keynote at the ACCAD workshop at ICCAD.

I haven’t even touched on the software aspects yet. As Amin pointed out at ONF, teams can tune and optimize software to the latency needs of specific high-value use cases today, but some of the upcoming challenges in the fifth epoch can no longer be handled correctly by humans. We will need the equivalent of operating systems to manage the distributed runtime state of the software/hardware architecture.

I can only echo Partha’s sentiment at the opening of the ONF keynote: “I’m here to tell you how your career is secure for the next two decades so it’s a pleasure to be here!” Happy Holidays, and here’s to the next 20 years.
