Can Today’s Processor Architectures Be More Efficient?

The low-hanging fruit of processor optimization may be gone, but new technologies are emerging.

For years, processors focused on performance, and that performance had little accountability to anything else. Performance still matters, but now it must be accountable to power.

If small gains in performance result in disproportionate power gains, designers may need to discard such improvements in favor of more power-efficient ones. Although current architectures undergo a steady cadence of performance and power improvements, additional gains are becoming harder to achieve.

“Everybody’s going through and redoing their micro-architectures, seeing how they can improve them to contain power,” said Prakash Madhvapathy, product marketing director for Tensilica audio/voice DSPs at Cadence.

Many processor features intended to improve computing throughput, such as out-of-order execution, add complex circuitry that raises power and circuit area. Similar improvements would likely not be accepted today because of the power cost. So what opportunities remain with our current processor architectures?

Efficient implementations are not good enough
Many efforts to improve efficiency involve better designs of existing architectures, and there are still some gains to be won there. “When it comes to implementation, there are a lot of power-savings techniques,” said Marc Swinnen, director of product marketing at Ansys.

One very basic approach leverages process improvements to do more with less power. “Moore’s Law isn’t dead,” Swinnen said. “We’re still getting smaller process technologies, and that’s always been the number one way to reduce power. It will run out soon, but it’s not quite there yet.”

This also can drive process decisions. “When you choose a certain process node, you need to have power efficiency in mind, as well,” said Madhvapathy. “22nm is basically 28nm with a much better energy profile.” He noted that 12nm is another popular node for power-efficient designs.

3D-ICs provide a new power point somewhere between a monolithic die and a PCB-level assembly. “A 3D-IC is going to be higher power than a monolithic chip, but a 3D-IC has a much lower-power, higher-speed implementation than is available through multiple chips connected through traditional PCB traces,” noted Swinnen.

Co-packaged optics (CPO) bring optics closer to silicon, and that also can reduce power, but it’s been a long time coming. “CPO has been around for a long time, but it’s been economically very difficult to justify the technological complexity, and the tradeoffs in the end were not necessarily favorable,” explained Swinnen. “That seems to be shifting. It’s partially that the technology’s gotten better, and partially that the need for high-speed digital communication has grown so intense that people are willing to pay more for it.”

Not all techniques are practical
Some implementation techniques sound interesting but bring their own challenges. Asynchronous design is one of them. “On the plus side, every register talks as quickly to the next one as it possibly can,” explained Swinnen. “There is no central clock, so the whole clock architecture goes away. You don’t have slack, where one data path is waiting for some other data path. It’s been around for many decades, but it has failed to break through (except in specific cases) because the performance is unpredictable. It’s a guessing game what that timing is going to be, and every chip can be slightly different because of process variability.”

It’s also not clear that it really saves power in the end. “The self-timing handshakes mean that the flip-flops have to be much more complicated,” Swinnen said. “When you net it all out, all flip-flops use more power. And a question remains: ‘Does it really end up saving you much power for all this complication and lack of predictability?’ The net of it has been that it hasn’t really taken off as a design methodology.”

Spurious power, or glitch power, also can be tamed using data and clock gating. “It will add area, but the effect on spurious power can be pretty large,” said Madhvapathy.

This requires power-analysis tooling to identify the primary contributors. “Not only does it measure the glitch power, but it can also identify what is causing this glitch,” noted Swinnen.

In the end, one can have only so much impact at the implementation level. “There’s a limit to how far you can go with the RTL, which is ironic because most of the power savings opportunities are at the RTL level,” said Swinnen. “The biggest benefit really is at the architectural level.”

Expensive features
Artificial intelligence (AI) computing has pushed design teams up against the memory wall, so given the industry focus on AI training and inference, a huge amount of attention has gone to putting trillions of parameters where they need to be when they need to be there — without burning down the house. But the processors themselves also burn energy, and other workloads will show a different balance between execution power and data-movement power.

Although clock frequencies continue to climb gradually, those increases no longer fuel performance gains the way they once did. The real target of improvements has been keeping as much of the processor busy as possible. Three architectural features illustrate the complex changes made for such gains: speculative execution (guided by branch prediction), out-of-order execution, and limited parallelism.

The purpose of speculative execution is to avoid a situation where the processor reaches a branch instruction and must wait for its outcome before knowing which path to follow. Waiting would delay results until the system fetches the instructions that the branch outcome dictates, potentially all the way from DRAM. Instead, one path is followed speculatively, ideally the most likely one. Usually, resolution of the branch validates the prediction, but sometimes it doesn’t. At that point, the processor must discard the speculative work and restart down the other path (including that potential instruction fetch from DRAM).
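
A common way to make that guess is a small table of saturating counters indexed by bits of the branch address. The sketch below is a generic, minimal illustration rather than any particular vendor’s design; the 256-entry table and 2-bit counters are assumptions chosen for brevity.

#include <stdint.h>

/* Toy 2-bit saturating-counter branch predictor (illustrative only).
   Each counter ranges 0..3: values 0-1 predict not-taken, 2-3 predict taken. */
#define BP_ENTRIES 256                      /* assumed table size */
static uint8_t bp_table[BP_ENTRIES];

/* Predict the direction of a branch, indexed by its address's low bits. */
static int bp_predict(uint32_t branch_pc) {
    return bp_table[branch_pc % BP_ENTRIES] >= 2;    /* 1 = predict taken */
}

/* Train the predictor once the branch actually resolves. */
static void bp_update(uint32_t branch_pc, int taken) {
    uint8_t *ctr = &bp_table[branch_pc % BP_ENTRIES];
    if (taken && *ctr < 3) (*ctr)++;
    else if (!taken && *ctr > 0) (*ctr)--;
}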

Branch prediction typically accompanies out-of-order execution, a feature that allows some instructions to execute in an order different from how they appear in the program. The idea is that one instruction may stall while awaiting data even though a later instruction is ready now. That later instruction can’t depend on the earlier one, but a key limitation of the serial programming paradigm is that instructions must be listed in order even when they don’t depend on each other. Out-of-order execution is therefore a complex system that can start multiple independent instructions early while ensuring that the original program semantics are honored.
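
At the heart of that machinery is a dependency check: a younger instruction may start early only if it does not read a value that an older, still-pending instruction will write. The fragment below is a heavily simplified sketch with an invented toy instruction format; real cores add register renaming, reservation stations, and a reorder buffer to retire results in program order.

#include <stdbool.h>
#include <stdint.h>

/* Toy instruction format: one destination and two source registers. */
typedef struct { uint8_t dst, src1, src2; } insn_t;

/* A younger instruction may issue ahead of stalled older ones only if it
   reads no register that an older in-flight instruction will still write
   (a read-after-write dependency). Register renaming removes the false
   dependencies this sketch ignores. */
static bool can_issue_early(insn_t younger, const insn_t *older_in_flight, int n) {
    for (int i = 0; i < n; i++) {
        uint8_t pending_dst = older_in_flight[i].dst;
        if (younger.src1 == pending_dst || younger.src2 == pending_dst)
            return false;    /* must wait for the older result */
    }
    return true;
}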

Fig. 1: Intel processor micro-architecture example. This particular unit includes out-of-order processing. Because of the need for backward code compatibility, instructions are first converted to microcode before execution. This model has 11 function units, 8 for execution and 3 for data load/stores. Source: I, Appaloosa, CC BY-SA 3.0.

Area vs. performance
These are not simple systems, and they may come with a price disproportionate to their benefit depending on how they’re built. “For example, a branch predictor keeps a list of prior branches taken,” said Russ Klein, program director, high-level synthesis division at Siemens EDA. “And like a cache, that list often uses the bottom N bits of the branch target as a hash key into the list of branches taken. So N could be 4 or 16 or more, and the number of entries in the list could be 1 or 2 or 32. You can store the full target branch address, or maybe only the bottom 12 or 16 bits. A bigger and more detailed memory of what branches were taken results in better performance, but obviously takes more space (and power).”

The resulting benefit can vary accordingly. “A small simple branch predictor might speed up a processor by 15%, whereas a large, complex one could improve performance by 30%. But it might be 10X larger (or more) than the small, simple one,” Klein explained. “In terms of area, who cares, but for power it does become a big deal.”
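
Klein’s tradeoff can be made concrete with a toy branch target buffer. The sizes below (an 8-bit index and 16-bit stored targets) are arbitrary choices for illustration; widening either dimension improves prediction but costs area and power.

#include <stdint.h>

/* Toy branch target buffer: indexed by the bottom N bits of the branch
   address and storing only the bottom bits of the last target taken.
   Bigger tables and wider entries predict better but burn more area/power. */
#define BTB_INDEX_BITS 8                        /* "N" in Klein's description */
#define BTB_ENTRIES    (1u << BTB_INDEX_BITS)

static uint16_t btb_target[BTB_ENTRIES];        /* low 16 bits of the target */
static uint8_t  btb_valid[BTB_ENTRIES];

static void btb_record(uint32_t branch_pc, uint32_t target) {
    uint32_t idx = branch_pc & (BTB_ENTRIES - 1);
    btb_target[idx] = (uint16_t)target;         /* keep only the low bits */
    btb_valid[idx]  = 1;
}

/* Returns 1 and fills *predicted when there is a (possibly aliased) guess. */
static int btb_lookup(uint32_t branch_pc, uint32_t *predicted) {
    uint32_t idx = branch_pc & (BTB_ENTRIES - 1);
    if (!btb_valid[idx]) return 0;
    *predicted = (branch_pc & 0xFFFF0000u) | btb_target[idx];
    return 1;
}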

Cadence improved the performance of some codecs by restructuring them, yielding code with few branches. “We are seeing performance improvement of about 5% to 15%,” said Madhvapathy. “The number of branches in codecs is less than 5%, and almost none in the inner execution loops where we use ZOL (Zero-Overhead Looping).”

More generally, the company finds more branches in typical programs. “Code in the wild has about 20% instructions that are branches,” said Madhvapathy. “Each of these represents an opportunity for speculative execution. Performance gains can be 30% or higher, as the average instructions executed per cycle goes up significantly — even if half of these predictions are successful. Combined overhead [branch prediction and out-of-order execution] may be in the 20% to 30% range.”
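
As a rough, hypothetical back-of-the-envelope check on those numbers (the 5-cycle misprediction stall is an assumed figure, not one from Cadence): if 20% of instructions are branches and every unresolved branch stalls the pipeline for 5 cycles, the baseline averages 1 + 0.20 × 5 = 2.0 cycles per instruction. If speculation resolves even half of those branches correctly, the average drops to 1 + 0.10 × 5 = 1.5 cycles per instruction, a speedup of 2.0 / 1.5, or roughly 33%.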

Klein recalled Tilera’s founder, Anant Agarwal, discussing a Kill Rule. “What the Kill Rule stated was, if you’re going to put a feature into your CPU, it’s going to increase the area, and if the increasing area is greater than the increase in performance that you get, you do not add that feature,” he said.
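
Stated informally (this is a paraphrase, not Tilera’s exact formulation), the rule boils down to a simple inequality, with both deltas measured relative to the whole core:

    add the feature only if  ΔPerformance / Performance  >  ΔArea / Area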

Parallel computing is the “easy” answer
Parallelism clearly provides another means to higher performance, but what’s available in current processors is limited. There are two ways that today’s mainstream processors provide parallelism — by instantiating multiple cores, and through multiple function units within a core.

A function unit is what would have been a simple arithmetic logic unit (ALU) in the past. It’s what executes the actual instruction. A given function unit typically can execute some number of instructions beyond simple math, and function units also may include multipliers, dividers, address generation, and even branching. By providing multiple such units, when one is busy, another may be available to work on a different instruction, which may be out of order.

Different processors have different numbers of function units, and code profiling helps to determine the mix and distribution of instruction support in them. This helps parallelize instruction execution where possible, but the processor overhead — such as instruction fetch — happens serially.

Truly parallelizing computation is one of the best opportunities for improving performance, and with a less tricked-out processor, it could be more power efficient. But such a solution isn’t new. Many-core processors were commercially available more than a decade ago and failed to get traction.

Few algorithms are fully parallelizable. Those that are fully parallelizable are usually called “embarrassingly parallel.” Everything else has a mix of parallelizable code and segments that must run serially. Amdahl’s Law identifies those serial portions as the ultimate limiter. Some programs can be highly parallelized, others not. But even when an algorithm doesn’t appear parallel, other opportunities might exist.
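
Formally, if a fraction p of a program’s runtime can be parallelized across n cores, Amdahl’s Law caps the overall speedup at:

    S(n) = 1 / ((1 − p) + p/n)

Even with p = 0.9 and an unlimited number of cores, the speedup never exceeds 10X, because the remaining serial 10% dominates.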

Fractals are an example. “Your f of x is f of x – 1,” explained Klein. “Each pixel is individually computed through a long serial chain. But if you’re doing an image, you’ve got 1,024 x 1,024, or whatever the image size is, so you’ve got a lot of opportunities for parallelism [by computing multiple pixels at the same time].”
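
A minimal sketch of that idea in C, using a Mandelbrot-style iteration as a stand-in (the specific recurrence, image size, and OpenMP pragma are illustrative assumptions, not Klein’s exact example): each pixel’s escape count is an inherently serial chain, but the pixels are independent of one another and can be computed in parallel when built with OpenMP enabled.

#include <complex.h>

#define WIDTH    1024
#define HEIGHT   1024
#define MAX_ITER 256

static int iterations[HEIGHT][WIDTH];

void render_fractal(void) {
    /* Each pixel is a serial recurrence (z = z*z + c, used here purely as an
       illustration), but the pixels are independent, so the outer loops can
       be split across cores. */
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            double complex c = ((x - WIDTH / 2) / 256.0) +
                               ((y - HEIGHT / 2) / 256.0) * I;
            double complex z = 0;
            int i = 0;
            while (i < MAX_ITER && cabs(z) < 2.0) {
                z = z * z + c;        /* the long serial chain per pixel */
                i++;
            }
            iterations[y][x] = i;
        }
    }
}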

Processors for data-center servers today come with as many as 100 or so cores. But unlike the many-core processors that came before, they’re not used for a single program. They run multiple programs for the different users sharing a cloud server.

The problems with parallel
Even when an algorithm can be parallelized, the catch is that it must be explicitly programmed in parallel. That typically has meant managing the parallel structure of the code by hand, for instance by invoking POSIX threads (pthreads). This is much fussier than typical programming, requiring knowledge of data dependencies to ensure that in-order semantics are preserved. Although some tools have existed to help with this, none have entered mainstream software development.
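
A minimal sketch of what that explicit management looks like with pthreads (the four-way split and the trivial workload are arbitrary): the programmer carves the data into chunks, creates and joins the threads, and must already know that no chunk depends on another.

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_ITEMS   1000000

static double data[N_ITEMS];

typedef struct { int start, end; } chunk_t;

/* Each thread scales its own slice; correctness depends on the programmer
   knowing that no element depends on any other. */
static void *worker(void *arg) {
    chunk_t *c = (chunk_t *)arg;
    for (int i = c->start; i < c->end; i++)
        data[i] *= 2.0;
    return NULL;
}

int main(void) {
    pthread_t threads[N_THREADS];
    chunk_t chunks[N_THREADS];

    for (int t = 0; t < N_THREADS; t++) {
        chunks[t].start = t * (N_ITEMS / N_THREADS);
        chunks[t].end   = (t + 1) * (N_ITEMS / N_THREADS);
        pthread_create(&threads[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(threads[t], NULL);

    printf("done\n");
    return 0;
}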

In addition, manually managing parallelism may require a different program for different processors. Code may still run, but not optimally, if it needs more threads than a given processor can support in hardware, and falling back on software threading can hurt performance because of context-switching overhead.

The biggest problem is that software developers have turned up their noses at explicit parallel programming. There’s a strong desire that anything new be programmable using the current methodologies. “Software guys have soundly rejected the notion of the 100‑core processor except in one area we’re seeing it start to creep in — the GPU and the TPU,” Klein observed.

This is why the many-core processors failed commercially. Even so, parallelization is primarily about performance. Getting the power down requires a modest core and an aggressive power-down strategy so that idle cores aren’t consuming energy. Parallelism also helps restore overall performance that might have been lost when making a core more efficient.

“My argument would be that a big array of really simple CPUs is the way to go, but it does require a change in the programming methodology,” he said. “The only hope that I’ve got for that happening is for AI to be able to create a parallelizing compiler, which is something we as an industry have never been able to do.”

The practical way to deal with algorithms that bog down general-purpose processors today is to employ accelerators as non-blocking offloads, so the accelerator can handle its task efficiently while the CPU does something else (or sleeps).

Accelerators can be broad or narrow
Accelerators of all types have existed for decades. Today, much attention is devoted to those accelerators that can speed up training and inference given the very specific intense computations required. But such accelerators aren’t new.

“Heterogeneous computing combines processing cores to provide optimized power and performance,” said Paul Karazuba, vice president of marketing at Expedera. “This obviously includes NPUs, which offload AI processing from less-efficient CPUs and GPUs. However, not all NPUs are created equal — not only in approach, but also in architecture and utilization.”

This is because accelerators may be highly specific — even customized — while others will remain more general-purpose. “If the AI workload is well-known and stable, a custom NPU can deliver significant gains in power and cost efficiency,” Karazuba continued. “If you need flexibility to support multiple models or future AI trends, a general-purpose NPU is more adaptable and easier to integrate with existing software ecosystems.”

Customizing an accelerator will tune it more specifically to its workload, and that effort should improve the power efficiency.

“A way to increase the efficiency of the processor subsystem, specific to NPUs, is to create a more application-focused NPU rather than employ a more general-purpose one,” said Karazuba. “Custom NPUs often use specialized MAC arrays and execution pipelines that may be tuned for a specific data type and model structure. General-purpose NPUs comprise configurable compute elements that support multiple data types and typically address a broader range of layers and operators.”

Jettisoning features that aren’t necessary for a given task can yield significant results. “In real-world applications, Expedera typically sees ~3 to 4X gains in processor efficiency (measured in TOPS/W) and >2X gains in utilization, defined as actual throughput divided by theoretical maximum throughput, when a custom NPU is deployed,” said Karazuba.

What happens when we run out of tricks?
There clearly remain some opportunities for making processors — and processing subsystems — more efficient. But at some point in the not-too-distant future we risk running out of ideas. What happens then?

This is where new processor architectures might prove helpful. Such a change is non-trivial, however, given the massive ecosystems that underlie current architectures. Fortunately, there are some new architectural ideas as well as the possibility of giving up some generality.

Related Reading
Chip Architectures Becoming Much More Complex With Chiplets
Options for how to build systems increase, but so do integration issues.
Connecting AI Accelerators
How to put the pieces together in a complex design with AI is an unsolved problem.
RISC-V’s Increasing Influence
Does the world need another CPU architecture when that no longer reflects the typical workload? Perhaps not, but it may need a bridge to get to where it needs to be.


