Data Center Architectures In Flux

The processor market is pushing toward customized multi-chip systems, creating churn; RISC-V emerges as both catalyst and opportunity.


Data center architectures are becoming increasingly customized and heterogeneous, shifting from processors made by a single vendor to a mix of processors and accelerators made by multiple vendors — including system companies’ own design teams.

Hyperscaler data centers have been migrating toward increasingly heterogeneous architectures for the past half decade or so, spurred by the rising cost of powering and cooling racks of servers, by the need for tighter integration to handle AI/ML applications, and by a massive uptick in the volume of data that needs to be processed. Coupled with the build-out of various levels of edge data centers, this shift means the whole data center industry is undergoing change.

This helps explain why Intel, which in the past resisted opening up its architecture to third-party IP, is shifting toward a more “democratized environment” for its chips. In addition to the company’s willingness to include Arm cores in its solutions — an agreement that has been in place for years — Intel has now joined RISC-V International as a premier member.

It’s still not entirely clear how this will play out. On one hand, it opens the door to more customized processing elements and accelerators based on the RISC-V architecture, which would pull RISC-V designs into the data center for the first time — although in what capacity remains to be seen. But perhaps even more far-reaching, it sets the stage for much more customization by major chip vendors, which in the past relied on each new rev of Moore’s Law as their competitive weapon.

That approach no longer works, as evidenced by Apple’s M1 chip. Apple swapped out Intel chips in its laptops and tablets for internally designed processors based on Arm cores, tightly integrating its native software to improve performance and increase battery life by as much as five times. Apple reportedly plans to change over its desktop machines and servers to Arm-based chips over the next couple of years.

Arm has made inroads into the enterprise, as well. “Cloud computing plays a key role in enabling existing applications in media consumption, e-commerce, remote learning, unified communication, IT services, digital transformation, etc., and will play a pivotal role in driving the success of a new class of applications like machine learning, metaverse, autonomous driving, and smart IoT, among others,” said Dhaval Parikh, director of segment marketing for Arm‘s Infrastructure Line of Business.

Parikh noted that to keep up with the increased demands of existing applications, as well as new applications enabled by cloud computing, hyperscale and cloud service providers are looking to re-architect their next-generation data centers with purpose-built heterogeneous infrastructures.

And this is where things get competitive. While unlikely to replace the primary processing elements anytime soon, RISC-V adds yet another option for customization, and it’s one that has been expected to begin infiltrating the data center for several years. Intel’s move will only accelerate that shift. Intel Foundry Services said earlier this month it is working with Andes Technology, Esperanto Technologies, SiFive, and Ventana Micro Systems to ensure that RISC-V runs best on Intel Foundry silicon and to accelerate time to market.

“Where everybody seems to be focused at the moment is on the two main advantages that RISC-V brings — it’s an open-source ISA, and there are no licensing fees,” observed Gajinder Panesar, fellow at Siemens EDA. “First, the open ISA is just for the CPU. But it’s not about the CPU. It’s about the system. You’ve still got to stick it in an SoC, that SoC needs to go in a box and a data rack, and so on. So even if you develop a CPU core, that’s not the end of it. Not paying licensing fees, that’s okay, especially if you’re a startup, because licensing fees can be quite crippling. You’re living hand-to-mouth on VC money. But for big players in this market, licensing fees are dwarfed by the cost of actually making a chip. Then there’s the cost of developing the chip, from design, implementation, verification, and validation through to fabricating it. You’ll be lucky to have some change left, especially with chips on bleeding-edge technology. The fact that you saved yourself $2 million or $3 million on a license is neither here nor there when you’re paying around $80 million to $100 million to get a chip made, and you might have to re-spin because you’ve screwed up. Developing a chip based on an open-source ISA is one thing. You can compensate and come up with a special deal from the EDA tool providers, but you’ve still got to make the whole thing work. You’ve still got to put in the software stack, the OS, the security layers. If you’ve got security, then you need to have it audited. All those costs add up.”
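
Panesar’s arithmetic is easy to check. Here is a back-of-envelope sketch in Python, using only the illustrative figures from the quote above rather than real program costs:

```python
# Back-of-envelope check of the economics Panesar describes above.
# All figures are the illustrative numbers from the quote, not real
# program costs for any specific chip.
license_saving = 3_000_000   # upper end of the $2M-$3M license fee saved
chip_cost = 90_000_000       # midpoint of the $80M-$100M cost to get a chip made

share = license_saving / chip_cost
print(f"License saving as a share of one chip build: {share:.1%}")  # ~3.3%

# If a re-spin costs roughly as much as the first run (an assumption),
# the saving shrinks even further.
print(f"With one re-spin: {license_saving / (2 * chip_cost):.1%}")  # ~1.7%
```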

Making the pieces work together
Behind the scenes, the shift toward increasing heterogeneity in the processor world has set off a massive scramble. The ecosystem to accommodate and support heterogeneous integration is still being built, and probably will remain under construction for the foreseeable future. The shift from billion-unit processors to customized designs that can incorporate various chiplets in much smaller manufacturing runs remains a huge challenge for design teams.

“If the choice of processor were the only challenge, people would really be adamant about it,” said Frank Schirrmeister, senior group director for solutions and ecosystem at Cadence. “But when you are trying to build something that is, from a RISC-V perspective, custom, there are so many challenges you need to deal with, from the selection of the right IP, through the verification of hardware and software, to selecting the right software in the context of the IP, to all the potential 3D-IC challenges of integration. Then, verifying all of that, putting it on the board, making sure to have enough airflow for it to not burn up the rest of the data center — there’s no shortage of challenges when you make these decisions. The choice of architecture is really just one. That’s why you go through the process to consider what will make your life easier, and to also make sure that it’s not all your fault if things go wrong. To put it bluntly, the situation is challenging for the system designer.”

Fig. 1: Homogeneous and heterogeneous die stacks showing memory on logic. Source: Cadence

What’s particularly interesting for EDA vendors is the ability to drive much deeper into systems and large processor companies, potentially using RISC-V as an entry point. “It’s a huge opportunity because it’s open source, but development is still costly,” said Natalija Colic, digital design engineer at Vtool. “It’s such a highly customizable processor, and the verification needs to follow that trend. The timing may have arrived for RISC-V to shine in clusters of servers, but effort is still needed to make this ISA a valid competitor.”

The buzz around RISC-V has been positive in other ways, she said. “This trend may force Arm, which has long held a monopoly, to include RISC-V in some of its legacy products, for example. And because Intel, Google, and Arm are investing in RISC-V, this definitely will affect the markets, not only for clusters where you have these accelerators in the form of RISC-V, but also for smaller embedded chips like the ones we design here at Vtool.”

Market shifts in slow motion
None of this will happen overnight, of course. Data centers historically have been conservative when it comes to change, and EDA tools take time to develop. But competition in the data center world is intense, and the rollout of heterogeneous architectures marks an inflection point.

“We’re seeing AI processors starting to use RISC-V with varying levels of customization, enhancements, and extensions,” said Rupert Baines, chief marketing officer at Codasip, noting that RISC-V’s success so far has been limited to AI, accelerators, and function-specific components from the likes of Esperanto, Mythic, and others. “You’re seeing deeply embedded applications. Nvidia has been using RISC-V for a number of years for the minion cores and the controller cores — not for the actual GPU functionality or the AI functionality, but for everything else. So we are seeing RISC-V steadily being used in data centers in a lot of ways, but not yet for the heavy-duty Intel Xeon application-class processors. That’s still Intel, big time, AMD to a degree, and Arm just coming in. Nvidia, Ampere, Marvell — people like that with their Arm products — are moving into that space, and RISC-V isn’t there yet. But it will be.”

In fact, Baines predicts that mainstream data center application processor cores based on the RISC-V ISA are likely to be more commonplace in as little as three or four years.

At that point, the real value may lie more in integrating various components than in one vendor owning everything. Disaggregation is simple on paper. Re-aggregating the pieces into a secure, efficient, and reliable device is much more difficult, and big chipmakers like Intel and AMD have been scrambling to put all the pieces together using a chiplet/tile approach. Foundries such as TSMC have been working on this as well, using hybrid bonding to speed up the flow of data between chiplets.

Fig. 2: A chiplet with pieces. Source: Cadence

Fig. 3: RFIC package co-design. Source: Cadence

Fig. 4: IC package 3D-IC. Source: Cadence

This accounts for the steady stream of announcements and continual repositioning across the processor segment. Industry sources report that Arm has recently begun working with startups on more flexible licensing terms, which can save licensees time and effort.

“If Arm really suits your project, you should go with Arm, because it is already tested, has all the features that you need, and so on,” said Olivera Stojanovic, project manager at Vtool. “But if you need something more specific, then RISC-V might be the option. Keep in mind, though, that checking the functionality of the CPU is a huge verification effort. You need to go through the full verification process to be sure that this open-source ISA-based CPU is completely verified.”

Underlying forces
While RISC-V certainly has garnered lots of interest, its success may be due less to its ability to drive massive change in the data center than to broader shifts in the market.

“Consumer needs are driving data center architecture developers to change the architecture accordingly to enable the right workloads,” said Cadence’s Schirrmeister. “That’s really how it trickles down. There comes a decision at one point where, ‘Now that I know all of this, the consumer wants their insights, the data center provider needs a solution for specific sets of workloads, how do I best enable that from a processor underneath?’ That’s why it’s not RISC-V by itself. Now you have a slew of decisions to make — the interface to the rest of the world. So which buses are supported so everybody has their own little variation there? Can I extend it nicely? Can I extend it well? Does it fulfill the requirements which come from it?”

In that context, RISC-V might be just one of a number of options. “If I’m a system architect and I take a chiplet-based RISC-V core and integrate it, now I have to figure out if the software support is there,” he said. “It also raises the question of my appetite for risk. Can I offload the risk to somebody else to blame if it goes wrong, versus taking it all on myself? That’s a hurdle to overcome. If you have that figured out, if the software support is there, if you feel comfortable with the risk induced by potentially bringing in somebody who is a RISC-V enabler, then in the context of all the 50 decisions you had to make, the RISC-V choice plays a role. But it’s a steep set of considerations, because the others have very compelling arguments and reference designs and so forth.”

The uncertain future
So will an ISA like RISC-V impact data center architecture over time? Codasip’s Baines believes it will.

“One of the reasons for this is about who controls things, who makes the decisions,” Baines said. “If you’re a Google or a Facebook, the hardware supplier is you, and for the last 5 or 10 years, every Google data center has been full of Google servers designed to a Google specification. Increasingly, those people — the Googles, Facebooks, Microsofts — are not only specifying their own hardware, they’re specifying their own silicon. And with that, they control the stack from top to bottom. Therefore, if they want to, they will be specifying programming languages. Maybe they use Swift, Objective-C, or Go. It might be different from what the rest of the world uses, and they don’t really care. They’ve got their own tool chains, and if they were to switch to a different ISA, that would be under their own control, and they could do it. And they will do that if they can see an advantage. This brings back around the idea of functional compute and domain-specific compute. If you’re vertically integrated, and you control the software and the silicon, then it makes an awful lot of sense to invest in functional-compute, heterogeneous-compute, domain-specific architectures, which means you need control of the architecture. You cannot rely on an arm’s-length third-party supplier.”

Fig. 5: Google TPU for Google Cloud at Hot Chips 2019. Source: Semiconductor Engineering / Susan Rambo

At the same time, compute architectures are being constantly re-evaluated by some of these companies. “When we look at the architectures of today’s systems, it should be about the system, not about CPUs,” said Siemens’ Panesar. “People talk about high-end CPUs and how to do this, that, and the other thing. But actually, you need to put it in the context of the application. I’m disappointed because there’s very little innovation. If you scratched off RISC-V and put an Arm label on it, you couldn’t really tell. There’s no differentiation other than it’s a 32-bit or a 64-bit processor. There’s an opportunity being lost here, in that an awful lot more could be done to change architectures in a more dramatic way than they are at the moment. Domain-specific architectures, in-memory compute — those concepts aren’t happening in the mainstream. There are probably niches looking at this, but innovation going forward will come from breaking away from the existing way of doing things. For instance, a cache-based system is the same architecture as it was when I started out in this industry a very long time ago, except it’s got new buzzwords or acronyms. But it’s more or less the same. I’m not a great believer in things like caches and coherency, because that’s a paradigm that people have hung onto, and they’re putting Band-Aids on it for new applications. The reason I’m disappointed is that there’s an opportunity lost here.”
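
Panesar’s differentiation point has a concrete hook in the ISA itself: RISC-V reserves major-opcode space (custom-0 through custom-3) for vendor-specific extensions. The sketch below is a hypothetical illustration, not anything cited in this article; it packs a made-up accelerator instruction into the standard R-type format in the custom-0 slot:

```python
# Minimal sketch: encoding a hypothetical R-type instruction in RISC-V's
# reserved custom-0 opcode space (major opcode 0b0001011). The field layout
# follows the standard R-type format:
#   funct7[31:25] | rs2[24:20] | rs1[19:15] | funct3[14:12] | rd[11:7] | opcode[6:0]

CUSTOM_0 = 0b0001011  # opcode slot the RISC-V spec reserves for vendor extensions

def encode_rtype(funct7: int, rs2: int, rs1: int, funct3: int, rd: int,
                 opcode: int = CUSTOM_0) -> int:
    """Pack R-type instruction fields into a 32-bit instruction word."""
    return (((funct7 & 0x7F) << 25) | ((rs2 & 0x1F) << 20) | ((rs1 & 0x1F) << 15)
            | ((funct3 & 0x7) << 12) | ((rd & 0x1F) << 7) | (opcode & 0x7F))

# Hypothetical "MAC step" instruction for a domain-specific accelerator:
# rd = x5, rs1 = x6, rs2 = x7; the funct3/funct7 values are made up for illustration.
word = encode_rtype(funct7=0b0000001, rs2=7, rs1=6, funct3=0b010, rd=5)
print(f"{word:#010x}")  # the 32-bit word a custom toolchain would emit
```

Of course, encoding the instruction is the easy part. The toolchain, the verification environment, and the software stack all have to be taught about it, which is exactly the system-level cost Panesar and the Vtool engineers describe.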

Arguments for applying more domain-specific architectures in data centers have been made for some time.

“Data centers today tend to be very generic, one-size-fits-all sorts of things,” Panesar said. “You’re either taxing people who don’t want all of something, or you’re not efficiently servicing other potential customers by failing to provide a coherent, properly optimized solution for their applications. We need to step back and ask what the objective is. The objective is to have something for the 21st century that innovates and delivers products tackling the problems we actually want solved, in a way that isn’t just taking something that existed and polishing it up. There is an opportunity to take something that can be modified, and generally it’s an ISA, and put it into a system that is specific or domain-specific. That’s where innovation will come from. It’s not going to come from how well you designed your CPU. It’s about the system. And in order for that to happen, there needs to be an opportunity where all CPUs don’t look the same.”


