New Approaches For Processor Architectures

Flexibility and customization are now critical elements for optimizing performance and power.


Processor vendors are starting to emphasize microarchitectural improvements and data movement over process node scaling, setting the stage for much bigger performance gains in devices that narrowly target what end users are trying to accomplish.

The changes are a recognition that domain specificity, and the ability to adjust or adapt designs to unique workloads, are now the best way to improve performance and energy efficiency. While process shrinks will continue to provide some benefits — typically no more than 15% to 20% improvements in performance and power — it’s clear that banking on those improvements alone is no longer a recipe for success. Customization and intelligent optimization are now essential, and a one-size-fits-all processor strategy is obsolete for most markets.

“Two things have happened in the last few years,” said Aart de Geus, chairman and co-CEO of Synopsys. “One is the amount of data massively increased. As a matter of fact, since 2018, machine-created data dwarfs what humans are creating. At the same time, machine learning has just arrived at the point where computation is good enough. So now you can do really cool stuff with it. This has not gone unnoticed. Every vertical market is now saying, ‘I have a lot of data. What if I could do something smart with it.’ And the notion of taking a lot of data, and changing a vertical market even slightly to make it more efficient, has very big economic ramifications. Credit Suisse estimates this is a more than $40 trillion opportunity for smart everything. So people are experimenting with this, and the minute they see a little bit of success, the next question is, ‘How come your chips are so darn slow?'”

This raises some serious challenges for processor vendors, however. They need to include enough flexibility in their designs to win new customers and keep existing customers, but they also need to achieve economies of scale. In addition, they have to prove the benefits of those new designs to customers. Just because a chip outperforms another in a benchmark test doesn’t mean it will perform better in a specific application or use case. And just because it runs faster or more efficiently at one point in time doesn’t mean it will continue to do so in the future. As a result, processor vendors have begun wrestling with what makes one chip better than another for a particular job, as well as how to mass customize these devices to make them affordable and profitable.

Presentations at the Hot Chips 2021 conference had a very different focus than in past years. Microarchitectures are one of several areas that received extra attention as a way of achieving performance and power improvements, because they raise the level of abstraction for how a chip’s architecture is actually utilized. That, in turn, makes it easier to customize. Microarchitectures essentially are the implementation details for a particular hardware design. Rather than changing the hardware architecture and creating a new chip for each application, microarchitectures can be used to prioritize and partition processing jobs as needed. That can include everything from dynamically configurable datapaths to smarter caching and optimized branch prediction — within the bounds of the underlying architecture, of course.

Programmable logic vendors have embraced these kinds of approaches for years in order to narrow the performance gap with ASICs, but ASIC vendors have not taken full advantage of them in the past. That’s changing. The selling points for new chip architectures from companies such as Intel, AMD, IBM, and Arm, among others, look very different than in past years.

Intel’s new Alder Lake architecture is a case in point. It utilizes two different types of cores, one optimized for single-threaded performance and the other optimized for multi-threaded performance, with dynamic scheduling based upon which is best for a particular application at any particular moment in time.

“These two cores are architecturally equivalent, with different microarchitectures and different design points,” said Efraim Rotem, an Intel Fellow, in a presentation at Hot Chips 2021 this week. “The performance goal is to push the limits of low-latency, single-thread performance by building a wider machine, deeper and smarter. It is built to excel on large-footprint code and data, and the design point is high speed. The efficient core is designed to construct a throughput machine and deliver the most efficient computational density.”

Those cores can be mixed in different configurations, depending upon the application. But the goal is to begin segmenting processing architectures based upon the application and the software that will run on it. On the surface, the strategy is similar to Intel’s old 386/387 pairing (circa 1986), which separated integer and floating-point computing, although much has changed since then. Intel has a new operating system scheduler to monitor the runtime instruction mix of each thread and core, as well as the ability to dynamically adapt the processing based upon thermal conditions and power requirements. The different cores are themselves smarter, and the control mechanisms adjust the mix based upon data volume and prioritization.
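For readers who think in code, the scheduling idea can be reduced to a few lines. The sketch below is a deliberately simplified model, not Intel’s Thread Director (whose actual heuristics are not public); the thread fields, thresholds, and thermal-headroom parameter are invented for illustration.

```cpp
// Toy model of a hybrid-core scheduler: route latency-sensitive or
// high-IPC threads to a performance core, everything else to an
// efficiency core. NOT Intel's Thread Director; all fields and
// thresholds here are hypothetical.
#include <iostream>
#include <string>
#include <vector>

enum class CoreType { Performance, Efficiency };

struct Thread {
    std::string name;
    double ipcDemand;      // observed instructions per cycle (assumed metric)
    bool latencySensitive; // e.g., foreground/interactive work
};

// Choose a core class from runtime telemetry, as an OS scheduler might.
CoreType schedule(const Thread& t, double thermalHeadroom) {
    // Under thermal pressure, bias everything toward efficiency cores.
    if (thermalHeadroom < 0.2) return CoreType::Efficiency;
    if (t.latencySensitive || t.ipcDemand > 2.0) return CoreType::Performance;
    return CoreType::Efficiency;
}

int main() {
    std::vector<Thread> threads = {
        {"game_render", 3.1, true},
        {"background_indexer", 0.8, false},
        {"video_encode", 2.4, false},
    };
    for (const auto& t : threads) {
        CoreType c = schedule(t, /*thermalHeadroom=*/0.6);
        std::cout << t.name << " -> "
                  << (c == CoreType::Performance ? "P-core" : "E-core") << "\n";
    }
}
```

The point is the shape of the decision: runtime telemetry, not static assignment, determines which core class a thread lands on.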


Fig. 1: Alder Lake conceptual approach. Source: Intel/Hot Chips 2021

AMD is utilizing some of the same approaches in its Zen 3 microarchitecture, again splitting up the integer and floating-point data flows for specific applications, such as gaming. AMD also has overhauled its branch predictor, schedulers, and instruction fetch/decode logic.

“Zen 3 supports simultaneous multi-threading to get that extra performance in an energy-efficient manner when additional threads are available,” said Mark Evers, microprocessor architect at AMD, in a presentation at Hot Chips 2021. “The flow of instructions through the pipeline starts with the state-of-the-art branch predictor feeding a sequence of addresses to the front end of the core. Instructions are then fetched and decoded, 4 instructions per cycle, from the 32 KB I-Cache, or 8 ops per cycle from the Op-cache, which can hold 4,000 instructions. The resulting ops are placed into the Op Queue, then dispatched up to 6 ops per cycle to the integer or floating point schedulers. And to execute the ops, there are four integer units plus dedicated branch and store units.”
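Those bandwidth figures explain why the op-cache matters so much. A back-of-the-envelope model using only the numbers Evers quoted (and ignoring queueing, branches, and instruction-to-op expansion entirely) shows that the 6-wide dispatch stage stays fully fed only when a healthy fraction of ops come from the 8-wide op-cache path rather than the 4-wide I-cache path:

```cpp
// Back-of-the-envelope model of the quoted Zen 3 front-end numbers:
// 4 instructions/cycle from the I-cache, 8 ops/cycle from the op-cache,
// dispatch capped at 6 ops/cycle. A sketch only; real behavior depends
// on queueing, branches, and fusion, none of which are modeled here.
#include <algorithm>
#include <iostream>

// Average ops delivered per cycle for a given op-cache hit rate.
double sustainedOpsPerCycle(double opCacheHitRate) {
    const double opCacheBw = 8.0;  // ops/cycle from the op-cache
    const double icacheBw  = 4.0;  // instructions/cycle from the I-cache
    const double dispatch  = 6.0;  // dispatch width
    double frontEnd = opCacheHitRate * opCacheBw
                    + (1.0 - opCacheHitRate) * icacheBw;
    return std::min(frontEnd, dispatch); // dispatch is the ceiling
}

int main() {
    for (double hit : {0.0, 0.5, 0.9}) {
        std::cout << "op-cache hit rate " << hit << " -> "
                  << sustainedOpsPerCycle(hit) << " ops/cycle\n";
    }
}
```

At a 50% op-cache hit rate, the front end just matches the 6-op dispatch ceiling; below that, fetch becomes the bottleneck.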

While the Zen 3 microarchitecture, introduced in late 2020, is geared for high-performance gaming, AMD decided not to migrate to the latest process node. “It’s all about delivering performance that matters to the user,” Evers said. “As we stayed in the same 7nm technology as the prior Zen 2 generation, these improvements are all down to the new architecture and physical design optimizations. The 19% IPC (instructions per cycle) uplift, access to a larger portion of the L3 cache per core, higher frequencies across the stack, and a unified 8-core complex altogether add up to great gaming performance. It adds up to an approximate 26% average gaming improvement, and as high as 50% on some games.”
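Those numbers compound rather than add. As a rough sanity check (illustrative arithmetic, not AMD’s published breakdown): single-thread performance scales approximately with IPC × frequency, so the quoted 19% IPC uplift combined with a roughly 6% frequency gain works out to about 1.19 × 1.06 ≈ 1.26, consistent with the ~26% average gaming figure.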


Fig. 2: AMD Zen 3 architecture/microarchitecture. Source: AMD/Hot Chips 2021

There are numerous variations on the same themes throughout the processor market. Arm’s new Neoverse processor IP architecture takes advantage of TSMC’s 5nm process, but also heavily leverages changes in the microarchitecture to maximize performance per watt. Arm already has built a strong reputation in low-power computing, particularly in phones. It is now pushing heavily into other markets, such as edge servers, where there are no entrenched market leaders. Arm said it has achieved a 40% improvement in IPC performance based on microarchitecture improvements.

Part of that improvement comes from better branch prediction, which is conceptually similar to the query suggestions in a Google search: the hardware guesses the most likely outcome before it is known. What’s different in the processor world is that prediction accuracy has a big impact on both performance and energy efficiency, because every misprediction wastes the work done down the wrong path.
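To make the idea concrete, here is the classic two-bit saturating-counter predictor from the textbooks, in a minimal runnable form. It is shown purely to illustrate learned prediction; production predictors, including Arm’s, are far more elaborate, with long path histories and multiple tables.

```cpp
// Minimal two-bit saturating-counter branch predictor (the classic
// textbook scheme). Counter values 0-1 predict not-taken, 2-3 predict
// taken, so a single anomalous outcome does not flip the prediction.
#include <array>
#include <cstdint>
#include <iostream>

class TwoBitPredictor {
    std::array<uint8_t, 1024> table{}; // one 2-bit counter per entry
public:
    bool predict(uint64_t pc) const {
        return table[pc % table.size()] >= 2; // 2 or 3 => predict taken
    }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = table[pc % table.size()];
        if (taken) { if (c < 3) ++c; }  // saturate at 3
        else       { if (c > 0) --c; }  // saturate at 0
    }
};

int main() {
    TwoBitPredictor bp;
    const uint64_t pc = 0x400123;   // hypothetical branch address
    int correct = 0, total = 0;
    // A loop branch: taken 9 times, then falls through once, repeated.
    for (int iter = 0; iter < 100; ++iter, ++total) {
        bool taken = (iter % 10) != 9;
        if (bp.predict(pc) == taken) ++correct;
        bp.update(pc, taken);
    }
    std::cout << "accuracy: " << (100.0 * correct / total) << "%\n";
}
```

Even this toy scheme mispredicts a regular loop branch only once per loop exit after warm-up, which is the intuition behind spending transistors on prediction state.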

“Performance is often lost when the core tries to keep up with large footprint applications with lots of hard-to-predict branches,” said Andrea Pellegrini, distinguished engineer in Arm’s Infrastructure Line of Business. “Improvements from version one (N1) are higher bandwidth and lower latency for both the fetch and decode logic. Version two (N2) can fetch up to two times the number of instructions as version one.”

Pellegrini noted that N2 was developed at 5nm, while N1 was developed at 7nm. But the biggest improvements come from a variety of other techniques, such as prefetch, intelligent caching, multi-chip implementations, memory partitioning and expansion, and coherent accelerators. The architecture also relies on dynamic memory bandwidth management, so when there is contention for memory bandwidth, the device will adjust the aggressiveness of memory pre-fetch (see fig. 3 below).


Fig. 3: Memory pre-fetch monitoring and adjustment. Source: Arm
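In code, the feedback loop shown in Fig. 3 might look something like the sketch below. This is a toy model of demand-aware prefetch throttling in general; the thresholds and the 1-to-8 “degree” range are invented, and Arm’s actual mechanism is not public in this form.

```cpp
// Toy feedback loop for demand-aware prefetch throttling: back off
// prefetch aggressiveness when memory bandwidth is contended, and ramp
// it back up when there is headroom. All bounds and thresholds are
// hypothetical.
#include <algorithm>
#include <iostream>

struct PrefetchControl {
    int degree = 4; // cache lines fetched ahead (assumed range 1..8)

    void adjust(double bandwidthUtilization) {
        if (bandwidthUtilization > 0.9)      degree = std::max(1, degree - 1);
        else if (bandwidthUtilization < 0.6) degree = std::min(8, degree + 1);
        // between 0.6 and 0.9: hold steady
    }
};

int main() {
    PrefetchControl pc;
    // Simulated utilization samples as contention rises, then eases.
    for (double util : {0.5, 0.7, 0.95, 0.97, 0.92, 0.8, 0.55}) {
        pc.adjust(util);
        std::cout << "util=" << util << " -> prefetch degree "
                  << pc.degree << "\n";
    }
}
```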

IBM’s new Z processor stands out for emphasizing higher frequency as a differentiator. Processor makers have been stalled in the 3 GHz to 4 GHz range for years due to power and thermal constraints. The Z’s 5 GHz out-of-order pipeline has a new branch prediction scheme, bigger and faster caches, and embedded accelerators, according to Christian Jacobi, distinguished engineer at IBM and chief architect for the processor.

IBM’s architecture is all about data throughput and the speed at which that data can be processed. The data path is greater than 600 GB/s, allowing the server to perform inference on more than 3.5 million paths with a latency of less than 1 millisecond. All of that is managed by an intelligent data mover and formatter, allowing the chip to optimize performance and to move data wherever it can be optimally processed.


Fig. 4: IBM’s new Z processor. Source: IBM/Hot Chips 2021

Moving data faster is vital to improving performance, and managing that movement requires its own architecture. But moving less data in the first place is an alternative way of achieving the same goal. Juanjo Noguera, engineering director at Xilinx, described the new Versal AI Edge architecture, in which the results of four matrix multiplication kernels can be combined over a high-speed connection to produce a single output.

“In the end, we will produce only one output matrix,” Noguera said. “The core has the capability of doing the vector operations, computing on the input data while reading from the stream, and then pushing it out to the output stream. There also is the capability to do data multi-casting to produce multiple outputs. So using this same approach, we can have four chains running in parallel to produce ‘n’ output results. Each of these tiles will have its own independent data coming in through a stream, but we can read the weights only once and multi-cast them ‘n’ times. This significantly reduces the amount of memory bandwidth that is needed in this architecture.”
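The bandwidth saving is easy to quantify. The sketch below uses invented sizes (1 MB of weights, four parallel kernels, per the example above) to show the arithmetic: reading the weights once and multi-casting them replaces n separate reads.

```cpp
// Bandwidth arithmetic behind weight multi-casting: read a weight
// matrix from memory once and broadcast it to n parallel kernels,
// instead of having each kernel fetch its own copy. Sizes are
// illustrative; actual Versal AI Edge tile sizes are not modeled.
#include <iostream>

int main() {
    const double weightBytes = 1 << 20; // 1 MB of weights (assumed)
    const int n = 4;                    // parallel kernels, as above

    double withoutMulticast = n * weightBytes; // each kernel reads a copy
    double withMulticast    = weightBytes;     // read once, multi-cast n times

    std::cout << "weight traffic without multicast: "
              << withoutMulticast / 1e6 << " MB\n";
    std::cout << "weight traffic with multicast:    "
              << withMulticast / 1e6 << " MB (" << n << "x reduction)\n";
}
```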

Other options for power/performance improvements
Intelligent and faster movement of data is a significant shift in focus for architects. But it’s not the only way to get there. Moving data shorter distances also can have a big impact on performance. The latest processor architectures include everything from pooled memory to virtual caches.


Fig. 5: IBM’s L2 caches interconnected with bi-directional rings. Source: IBM/Hot Chips 2021

The key is to shorten the distance between memory and processor. L2 cache is extremely fast memory, but utilizing that cache across multiple cores and keeping everything in sync is extremely difficult. Achieving that, and sharing the cache by intelligently managing the data flowing in and out, is an effective way to speed up data movement. It also helps chipmakers to customize their chips for specific applications.
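A toy latency model shows why sharing L2 slices pays off. All of the cycle counts below are invented for illustration (they are not IBM’s figures); the point is that turning some memory misses into remote-slice hits, even at a multi-hop ring penalty, cuts average access time sharply.

```cpp
// Toy latency model for L2 slices shared over a ring: a hit in the
// local slice is cheap, a hit in a remote slice pays per-hop ring
// latency, and a miss goes all the way to memory. All cycle counts
// are assumed values for illustration only.
#include <iostream>

double avgAccessCycles(double localHit, double remoteHit, int avgHops) {
    const double localLatency  = 12;   // cycles, assumed
    const double hopLatency    = 5;    // cycles per ring hop, assumed
    const double remoteLatency = localLatency + avgHops * hopLatency;
    const double memLatency    = 300;  // cycles, assumed
    double miss = 1.0 - localHit - remoteHit;
    return localHit * localLatency
         + remoteHit * remoteLatency
         + miss * memLatency;
}

int main() {
    // Sharing remote slices turns some memory misses into remote L2 hits.
    std::cout << "private L2 only:  "
              << avgAccessCycles(0.60, 0.00, 0) << " cycles\n";
    std::cout << "shared L2 slices: "
              << avgAccessCycles(0.60, 0.25, 3) << " cycles\n";
}
```

With these assumed numbers, average access latency drops from about 127 cycles to about 59, because a 300-cycle memory miss is replaced by a 27-cycle remote hit a quarter of the time.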

This is important for economic reasons, as well. The challenge for commercial processor makers is how to build enough customization into the architectures without producing one-off designs. Systems companies such as Google, Facebook, and Amazon have been developing their own custom chips, particularly for data centers, and more recently automotive OEMs such as Volkswagen and Tesla have been doing the same for cars. If semiconductor history repeats itself, at some point the processor industry will catch up, offering competitive processors for less money.

The goal for processor vendors is to close the gap sufficiently to lure systems companies back to commercially developed processors, and to use the same architecture platforms to go after new markets such as edge servers. One way to achieve that is with customized accelerator chips that can plug into those platforms, almost like LEGO bricks, loaded up with security features so that commercial silicon becomes the safe bet.

Chiplets (aka tiles) offer one solution. AMD, Intel, and Marvell all have successfully deployed chiplets, and the U.S. government has been pushing a chiplet strategy for mil/aero hardware as a way of reducing the design costs. The challenge there is developing standard interfaces and ways to characterize chiplets — particularly customized accelerators — so they can be integrated into a package.

Until now, however, those chiplets have been developed by the same companies that integrate them into a package. Standardized interfaces would open the door to third-party chiplets, including RISC-V cores and other specialized accelerators, security modules, and programmable accelerators from companies such as Flex Logix.

The whole chip industry has embraced some form of advanced packaging to make this possible. OSATs now offer a variety of options, from increasingly complex fan-outs to 2.5D and 3D-IC implementations. The number of developments around these technologies is exploding. But the challenge is developing a simple way to connect them together, and all of the leading-edge foundries are working on this.

Intel is using a bridge technology it calls Embedded Multi-die Interconnect Bridge (EMIB), as well as interposers, and Samsung reportedly is developing similar approaches. TSMC, meanwhile, is working on everything from embedding chiplets into a die at the front end of the line to 2.5D and 3D-ICs.

Conclusion
Scaling will continue for at least five more nodes, according to Intel’s process roadmap. But the benefits of each new node are limited, both in terms of performance per watt and value to the end customer. What’s increasingly important is the ability to develop customized solutions for customers that can be built on architectural platforms.

Much faster and better on-chip data management and much more sophisticated microarchitectures are two key elements in that direction, and all of the major processor vendors are now on board with this kind of approach. The end result will be significantly faster devices with much better performance per watt — in some cases, orders of magnitude faster. But there also will be increasing divergence between devices aimed at different end markets, each of which has unique needs and demands, and that’s likely to make this a very interesting market to track for years to come.



