Why Intel, AMD, Arm, and IBM are focusing on architectures, microarchitectures, and functional changes.
Big chipmakers are turning to architectural improvements such as chiplets, faster throughput both on-chip and off-chip, and packing more work into each operation or cycle in order to ramp up processing speed and efficiency.
Taken as a whole, this represents a significant shift in direction for the major chip companies. All of them are wrestling with massive increases in processing demands and the inability of traditional approaches to provide sufficient improvements in power, performance and area. Scaling benefits have been dwindling since 28nm, and in some cases well before that. At the same time, an increasing amount of data, collected from new devices, new applications, and a proliferation of sensors everywhere, needs to be processed much more quickly using the same or less power.
This amounts to a perfect storm for chipmakers, which in the past have utilized such approaches as speculative execution to augment the benefits of scaling. But speculative execution has been shown to create security vulnerabilities, and just shrinking features no longer provides a 30% to 50% improvement in power and performance. The numbers today are closer to 20%, and even that requires new materials and structures.
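To see why that matters, here is a back-of-the-envelope calculation, not from any chipmaker, that compounds per-node gains over three hypothetical node transitions. The 40% and 20% figures bracket the ranges cited above; the three-node horizon is an arbitrary choice for illustration.

```python
# Back-of-the-envelope compounding of per-node scaling gains.
# Illustrative only; the percentages bracket the ranges cited above.
def compounded_gain(per_node_gain: float, nodes: int) -> float:
    """Total improvement after `nodes` successive shrinks."""
    return (1.0 + per_node_gain) ** nodes

# Three node transitions at the historical ~40% per node...
print(f"historical: {compounded_gain(0.40, 3):.2f}x")  # ~2.74x
# ...versus three transitions at today's ~20% per node.
print(f"today:      {compounded_gain(0.20, 3):.2f}x")  # ~1.73x
```

The gap between those two trajectories is roughly the shortfall that architectural changes are now being asked to close.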
Meanwhile, large chipmakers are seeing incursions by companies such as Google, Amazon and Facebook into one of their key markets—giant data centers. In addition, they are being challenged in the AI/machine learning market and at the edge by a slew of startups developing specialized accelerators, which are promising orders of magnitude improvements through architectural changes.
Rather than trying to fight this trend, the largest chipmakers are starting to embrace it. AMD, for example, has introduced its Zen 2 architecture, which relies on a combination of chiplets, some made by AMD and some by others, with high-speed chip-to-chip interconnects and prioritization schemes that can be tuned so that data moves faster in one direction or the other.
Dan Bouvier, client products chief architect at AMD, said in a presentation at the Hot Chips conference that small die improve yield. But he noted that chiplets also can be used to push the effective die size to 1,000mm², which is larger than the reticle limit, by using a common interconnect (AMD’s Infinity Fabric) and putting all of these components on a substrate. That interconnect also can be used to connect chips developed at different process nodes, depending on what makes the most sense for a particular function.
Fig. 1: AMD’s chiplet architecture. Source: AMD/Hot Chips
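As a rough illustration of how such prioritization schemes might be tuned, consider this hypothetical sketch of a lane-budgeted die-to-die link. The class, method names, and lane counts are invented for illustration; this is not AMD's Infinity Fabric API.

```python
# Hypothetical sketch of a tunable chip-to-chip link: a fixed lane
# budget is split between the two directions so traffic can move
# faster one way than the other. Names and numbers are illustrative.
class ChipletLink:
    def __init__(self, total_lanes: int = 16):
        self.total_lanes = total_lanes
        self.tx_lanes = total_lanes // 2  # symmetric by default

    def prioritize(self, tx_fraction: float) -> None:
        """Re-balance lanes toward the transmit direction."""
        self.tx_lanes = max(1, min(self.total_lanes - 1,
                                   round(self.total_lanes * tx_fraction)))

    @property
    def rx_lanes(self) -> int:
        return self.total_lanes - self.tx_lanes

link = ChipletLink()
link.prioritize(0.75)                # favor die-to-die writes 3:1
print(link.tx_lanes, link.rx_lanes)  # 12 4
```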
Intel’s strategy relies heavily on chiplets, as well, which it connects using a variety of approaches, including its internally developed chip-to-chip bridge (embedded multi-die interconnect bridge, or EMIB). But the company also has been working on the memory access and storage problem. One piece of that solution involves persistent memory, which helps bridge the gap between DRAM and solid-state drives.
For some time, Intel has been shipping one persistent memory type called 3D XPoint. Based on phase-change memory technology, 3D XPoint devices are integrated into Intel’s own SSDs and DIMMs, which speeds up the operations in those systems.
“One of the big challenges is that you’ve got all of this data that you need to process, but you are limited on space to put it,” said Lily Looi, senior principal engineer at Intel. “There has been an explosion of data over the last couple of years, and there are two things that have changed. First, nanoseconds matter, so you need more capacity. The second thing is that you need a persistent feature so that the data is still there if you turn off the power. But you don’t have to save all of that data. You may only need to save a block or even a few kilobytes of that data, which is a lot more efficient.”
Fig. 2: Where to store exponentially more data. Source: Intel/Hot Chips
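Looi's point about saving only a block of data can be sketched in a few lines. This is a toy model with invented names; production code would go through a persistent-memory library such as Intel's PMDK rather than plain dictionaries.

```python
# Toy model of persisting only modified blocks, per the point above:
# rather than checkpointing a whole working set, flush only what changed.
DIRTY = set()

def write_block(cache: dict, addr: int, data: bytes) -> None:
    cache[addr] = data
    DIRTY.add(addr)           # track modified blocks only

def flush_to_pmem(cache: dict, pmem: dict) -> int:
    """Persist just the dirty blocks; returns bytes written."""
    written = 0
    for addr in sorted(DIRTY):
        pmem[addr] = cache[addr]
        written += len(cache[addr])
    DIRTY.clear()
    return written
```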
Smarter tradeoffs
Bigger chips and faster interconnects aren’t the only ways to achieve better performance, though. There are many other knobs to turn that haven’t been seriously re-architected in years.
Arm, for example, introduced its Neoverse N1 architecture, which significantly improves the accuracy of branch prediction—basically the equivalent of pre-fetch in search. Arm also continues its push to do more with less power, with a coherent mesh network to connect IP tiles together, allowing processors to be sized according to the needs of a particular application.
Key to Arm’s strategy are larger level 2 caches and faster context switching, which Andrea Pellegrini, a senior principal engineer at Arm, said is 2.5 times faster than previous approaches. “We’re also seeing a 7X reduction in branch mis-predicts,” he said. Arm also has focused on shrinking its instruction footprint, which Pellegrini said has cut the cache miss rate 1.4X and L2 accesses 2.25X.
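A simple cycles-per-instruction model shows why a 7X cut in mis-predicts is worth chasing. Only the 7X factor comes from Arm's numbers; the baseline mis-predict rate, branch frequency, and flush penalty below are assumed purely for illustration.

```python
# Rough CPI model: only the 7X factor comes from the text above;
# all other constants are assumptions chosen for illustration.
def cpi(base_cpi: float, branch_freq: float,
        mispredict_rate: float, penalty_cycles: int) -> float:
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles

before = cpi(1.0, branch_freq=0.20, mispredict_rate=0.05, penalty_cycles=15)
after  = cpi(1.0, branch_freq=0.20, mispredict_rate=0.05 / 7, penalty_cycles=15)
print(f"{before:.3f} -> {after:.3f} CPI")  # 1.150 -> 1.021
```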
This is a different way of looking at processor efficiency and performance per watt. While most processor companies approach it from the angle of doing more in the same power budget, other companies are looking at doing more with less power, which is important in devices with a battery. That includes smart phones, but it also includes chips developed for electric vehicles and robotics.
Arm also will use its mesh network approach to add in third-party accelerators that are customized for specific data types.
Fig. 3: Arm’s customizable Neoverse architecture. Source: Arm/Hot Chips
IBM, meanwhile, introduced an architecture that is both simple and very different. One of IBM’s goals was to make assumptions about when data packets arrive, which essentially raises the pre-fetch concept to a higher level of abstraction. Understanding how to make those assumptions is the hard part, because it effectively applies use models to the architecture ahead of time.
IBM’s approach is to use the most likely configuration for its chip, making tradeoffs up front and setting limits. That has allowed it to consolidate its physical interfaces, according to Jeff Stuecheli, Power systems hardware architect at IBM, running some data through PCIe Gen 4 and the remainder through 25G SerDes. “This is more power- and area-efficient,” Stuecheli said. The company also has moved toward an asymmetric architecture, in which the state of one accelerator doesn’t affect the operation of another. “We want to hide state tables from accelerators.”
Fig. 4: IBM’s emphasis on data throughput. Source: IBM/Hot Chips
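One way to read the “hide state tables” comment is that each accelerator gets only an opaque handle while a host-side manager owns the actual state, so no accelerator can perturb another’s. The sketch below is an interpretation of that idea, not IBM's implementation.

```python
# Hypothetical sketch of state-table isolation: accelerators hold
# opaque handles; a host-side manager owns the per-context state.
class StateManager:
    def __init__(self):
        self._tables = {}            # handle -> private state table
        self._next = 0

    def open_context(self) -> int:
        handle, self._next = self._next, self._next + 1
        self._tables[handle] = {}    # isolated per-accelerator state
        return handle

    def update(self, handle: int, key: str, value) -> None:
        self._tables[handle][key] = value   # no cross-handle access

mgr = StateManager()
a, b = mgr.open_context(), mgr.open_context()
mgr.update(a, "stride", 64)   # accelerator A's state...
mgr.update(b, "stride", 8)    # ...never visible to accelerator B
```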
Connecting the pieces
Putting all of this in perspective, the major chipmakers are tackling similar problems in their target markets. They are improving performance per watt through a combination of general-purpose processors and custom accelerators, and in many cases they are making it possible to swap modules more easily and quickly from one market to the next, and as algorithms are updated. They also are improving data throughput, both on-chip and off-chip to memory, and prioritizing the movement of different kinds of data.
Many of these approaches are not new ideas, but some of the technology to make this all happen did not exist in the past.
“Creating a common PHY to enable accelerators is one of the key things happening,” said Stuart Fiske, senior design engineering architect at Cadence. “What you’re also seeing is that the processors don’t get simpler. A lot of these companies are trying to create interfaces to accelerators. That doesn’t solve the complexity problem. It’s still a couple-year design cycle, and there is no way around that. But you can enable accelerators to adapt to whatever the latest neural network is.”
The key is balancing the integration of all of these components with enough flexibility to make changes. In effect, all of these chipmakers are designing multi-chip platforms that can be customized for specific markets and use cases, while optimizing performance per watt and improving data throughput.
“Designs are hitting the wall in terms of clock speeds,” said Loren Hobbs, head of product and technical marketing at Silexica. “The way forward is to make each clock cycle as efficient as possible. And with the addition of multicore heterogeneous multiprocessors, that is accelerating the complexity of these chips. You can combine all of these chiplets to improve the processing power, but you need tooling to help distribute and analyze that. You have to map the code base, which is infinitely complex. It requires static, dynamic and contextual analysis.”
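A toy version of the mapping problem Hobbs describes, assigning each task to the core type where it finishes earliest, might look like the greedy pass below. Real tools combine static, dynamic and contextual analysis; every name and cost here is invented for illustration.

```python
# Toy greedy mapping of tasks onto heterogeneous cores. Costs are
# invented; real flows derive them from static/dynamic analysis.
def map_tasks(task_costs: dict, cores: list) -> dict:
    """task_costs[task][core_type] = estimated cycles."""
    finish = {c: 0 for c in cores}              # per-core ready time
    placement = {}
    for task, costs in task_costs.items():
        # place each task where it would finish earliest
        best = min(cores, key=lambda c: finish[c] + costs[c.split('#')[0]])
        finish[best] += costs[best.split('#')[0]]
        placement[task] = best
    return placement

tasks = {"fft": {"dsp": 100, "cpu": 400},
         "ui":  {"dsp": 900, "cpu": 150},
         "ml":  {"dsp": 300, "cpu": 500}}
print(map_tasks(tasks, ["dsp#0", "cpu#0"]))
# {'fft': 'dsp#0', 'ui': 'cpu#0', 'ml': 'dsp#0'}
```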
The common denominator here is a growing volume of data, regardless of whether that is at the edge or in the cloud. Where that data gets processed and how quickly it can be moved around are critical parts of the architecture.
“Everyone is struggling with CCIX,” said K. Charles Janac, president and CEO of Arteris IP. “If you have an accelerator and two coherent dies, there are too many corner cases to get it to work easily. But now you can use 3D interconnects to hook together a planar CPU and a planar I/O. So this looks like one system to the software, and you have inter-chip links between the network on chip and different die. That way you can support non-coherent and coherent read/write across two die. It makes the interconnect more valuable, but it also makes it more complicated.”
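The coherent/non-coherent split Janac describes can be pictured as a tag on every request that crosses the inter-chip link, with only coherent traffic taking the snoop path. The following sketch is purely illustrative and is not the CCIX protocol itself.

```python
# Illustrative tagging of die-to-die requests; not the CCIX protocol.
from dataclasses import dataclass

@dataclass
class DieRequest:
    addr: int
    write: bool
    coherent: bool    # tag decides the path on the remote die

def route(req: DieRequest) -> str:
    if req.coherent:
        return "snoop filter -> home node -> remote cache"
    return "direct to remote memory controller"

print(route(DieRequest(0x1000, write=True, coherent=True)))
print(route(DieRequest(0x2000, write=False, coherent=False)))
```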
In fact, this is one of the reasons why these architectures have been in the works for a while. Getting all of the pieces to work together has proven much harder than anyone initially thought.
“The memory controller and the NoC will have to be much more tightly integrated,” said Janac. “The problem is that neither one understands the QoS of the entire chip, and there aren’t any independent memory controller companies left. But memory traffic has to be better integrated to make this work.”
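Tighter NoC/memory-controller integration implies some shared arbitration policy. A minimal weighted round-robin arbiter, with traffic classes and weights invented for illustration, might look like this sketch.

```python
# Minimal weighted round-robin arbiter of the kind a shared NoC/memory-
# controller QoS policy might use. Classes and weights are invented.
from collections import deque

def arbitrate(queues: dict, weights: dict, slots: int) -> list:
    """Issue up to `slots` requests, honoring per-class weights."""
    order, credits = [], dict(weights)
    while len(order) < slots and any(q for q in queues.values()):
        # pick the non-empty class with the most remaining credit
        cls = max((c for c in queues if queues[c]),
                  key=lambda c: credits[c])
        order.append(queues[cls].popleft())
        credits[cls] -= 1
        if all(credits[c] <= 0 for c in queues if queues[c]):
            credits = dict(weights)          # refresh a spent round
    return order

queues = {"cpu": deque(["c1", "c2", "c3"]),
          "dma": deque(["d1", "d2"]),
          "rt":  deque(["r1"])}
print(arbitrate(queues, {"cpu": 2, "dma": 1, "rt": 3}, slots=6))
# ['r1', 'c1', 'c2', 'd1', 'c3', 'd2']
```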
For the chiplet market to really take off, there also need to be open standards.
“There is no standard yet for connecting chiplets,” said Steve Mensor, vice president of marketing at Achronix. “The problem is that you have to be able to talk to them. So you should be able to develop a chip for a socket and have a link and a protocol stack to support it. There are proprietary solutions from AMD and Intel. There also are standard solutions being developed. If I build an ASIC and buy chiplets, I want a standard solution so I can build that chip independently. That’s a fundamental requirement for this model.”
Nevertheless, it does open the door for accelerators built on different ISAs, such as RISC-V.
“This is a new opportunity for small and lightweight hardware accelerators,” said Chris Jones, vice president of marketing at Codasip. “An open interface for startups to build chips could provide another boom cycle for semiconductors, and that will happen all the way to full packaging. There are still some questions around this, such as who’s ultimately responsible for testing of the whole interface, and how this will work with signoff to the interface. We also still have to see what chiplet interfaces look like, whether they will standardize or remain proprietary. But it certainly adds new opportunities for more verification IP, emulation and simulation.”
Changing out components
What isn’t clear yet is what else can change in these architectures. Most of those introduced this week are planar, but there is an option to push some of these designs into the Z axis, as well.
For example, a SerDes adds latency to a design, but the same connectivity can be achieved with advanced packaging technology. TSMC’s CoWoS (chip-on-wafer-on-substrate) and InFO MS (integrated fan-out with memory on substrate) are two such options. Patrick Soheili, vice president of business and corporate development at eSilicon, said the company just developed a CoWoS-type approach using an interposer from UMC.
“You can pick it apart and take it to a different level of abstraction,” Soheili said. “If you look at some of these architectures, it’s inefficient to have lots of little SRAMs if you have a lot of data flowing through them, and efficient when you use large memories. This may sound counterintuitive, but we’ve found that larger memories are more efficient, particularly for AI types of applications.”
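The intuition behind Soheili’s observation is that every SRAM macro carries fixed periphery overhead (decoders, sense amplifiers), which many small instances amortize poorly. The constants in this model are placeholders chosen only to show the shape of the tradeoff, not measured silicon data.

```python
# Crude overhead model: each SRAM macro carries a fixed periphery cost,
# so many small macros amortize it poorly. All constants are made-up
# placeholders that only show the shape of the tradeoff.
def total_overhead(total_kb: int, macro_kb: int,
                   fixed_cost: float = 1.0, per_kb: float = 0.1) -> float:
    macros = -(-total_kb // macro_kb)        # ceiling division
    return macros * fixed_cost + total_kb * per_kb

for size in (4, 64, 512):
    print(f"{size:>3} KB macros: {total_overhead(2048, size):.1f} units")
# 4 KB macros: 716.8 units ... 512 KB macros: 208.8 units
```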
What’s next
The market for all of these approaches is just getting going. The key now is to figure out ways to build repeatability and reliability into these different architectures so they can be used in safety-critical applications such as automotive or industrial, as well as across a wide variety of end markets that today are being flooded with various types of data.
What makes these new architectures so compelling is the ability to customize them for specific applications, leveraging architectures that serve as a foundation for this kind of customization. All of the processor vendors are adopting these types of architectures, from FPGA vendors to companies like Nvidia, which rolled out a new chip architecture in a record-breaking six months. But what’s clear is that going forward, the industry will require more tooling, more data analysis, and much better understanding of potential interactions over time as devices are modified and updated.
This is just the beginning of a shift that ultimately will involve the entire semiconductor supply chain. And while scaling will continue, in the processor world it is becoming just one additional knob to turn in a long list that now includes architectures, packaging, materials and workload optimization. Architects are now the drivers of change, and most of them anticipate architectural changes will accelerate as Moore’s Law decelerates. What a difference a year makes.