CPU Performance Bottlenecks Limit Parallel Processing Speedups

Hardware optimizations and well-thought-out software architectures can help, but only incrementally.

Multi-core processors theoretically can run many threads of code in parallel, but some categories of operation currently bog down attempts to raise overall performance by parallelizing computing.

Is it time to have accelerators for running highly parallel code? Standard processors contain many CPUs, and keeping all of those cores coordinated through cache coherency and synchronization can involve thousands of cycles of low-level system code. CPUs also have a limited ability to exploit instruction-level parallelism, constrained by CPU width and data dependencies.

These CPU performance bottlenecks are real, pervasive, and not easily resolved. Although software developers have a huge role in creating parallelizable algorithms, there may be room for hardware specifically suited to executing parallel code.

Three major bottlenecks
CPU architects spend countless cycles seeking ways to improve their processors to raise performance or reduce power. It’s the main motivator behind the continuing generations of improved CPUs. “CPUs are built to run unstructured application code,” explained Steve Roddy, chief marketing officer at Quadric. “To speed up execution and not put any burden on the programmer to think about either the code or the target machine, modern CPUs have accumulated a laundry list of ever-more exotic features designed to run random unknown code as fast as possible.”

Techniques that have evolved over the years include:

  • Superscalar architectures, which feature decoders that can issue multiple instructions at the same time to a series of function units in parallel with each other.
  • Speculative execution, where the CPU speculatively executes what it considers the most likely outcome of a branch ahead of the branch decision. If the guess is right, it is steps ahead. If not, it must flush the pipeline and run the other branch.
  • Out-of-order execution, where multiple instructions in a sequence run out of order as long as they’re not interdependent.
  • In-flight memory-access tracking, where multiple memory requests may be issued to memory at the same time. Some requests issued later may return earlier, depending on congestion and traffic.

Despite those efforts, several bottlenecks remain. First is the time it takes to fetch data from memory, which typically requires hundreds of cycles. Multiple CPUs and threads can issue multiple requests, overlapping their access times to minimize the apparent latency. Caches help minimize that latency for future accesses, but coherency takes time if one thread changes a value being used by other threads.

The second bottleneck is synchronization arising from data dependencies between threads. If multiple threads want to access the same data, it may be necessary to lock that data for exclusive use while it’s changed, releasing that lock afterwards. Or, if multiple threads are contributing to an overall calculation, there may be a point where all the threads must finish their work before others can proceed. The overhead in managing synchronization can be significant.

The third bottleneck involves instruction-level parallelism, where multiple instructions execute simultaneously. Superscalar CPUs explicitly enable that behavior, but they have limits.

“CPUs can’t handle latency hiding properly. And they can’t handle synchronization properly. And they can’t handle low-level parallelism properly,” said Martti Forsell, CTO and chief architect at Flow Computing.

Coherency delays
Additionally, caches buffer data against the long DRAM access times. But caching systems are typically hierarchical, with a mixture of private and shared caches. “The fundamental choice impacting performance is optimizing for locality of reference through the addition of multiple levels of cache memory,” said Arif Khan, senior product marketing group director in the Silicon Solutions Group at Cadence. Level-1 (L1) caches are closest to the CPU and prioritize fast access over everything else. They are typically private, meaning only the one CPU can access them.

Some CPUs also provide private level-2 (L2) caches. The benefit is the ability to keep more data in cache while relegating some of it to the larger and somewhat slower (but cheaper) L2 cache. With these private caches, if multiple CPUs need to access the same data, then a copy is kept in each individual cache. “In a CPU, coherency issues are inevitable due to private caches,” noted Forsell.

Some CPUs have shared L2 caches, and most level-3 (or last-level) caches are also shared. The benefit of a shared cache is that multiple CPUs can access a single piece of data without requiring separate copies. But often L2 sharing is between, say, two or four CPUs. So an eight-CPU processor with that structure would have two L2 caches, and the sharing doesn’t cross between them. That means if two threads executing on two CPUs with different L2 caches need the same data, then each L2 cache must have its own copy to be kept coherent. Only the last-level cache (LLC) is always shared among all CPUs and requires no coherency.

Fig. 1: Cache hierarchy example. L1 caches are typically private to one CPU. L2 caches may also be, or they may be shared between all or some of the CPUs. The last-level cache is typically shared by all. Source: Bryon Moyer/Semiconductor Engineering

But even sharing gets complicated. For a processor having many CPUs, those CPUs typically are spread all over the die, making it difficult to place one uniform LLC so that all CPUs can access it with the same latency. And bigger caches come with a price. “The larger the cache, the higher the latency,” noted Steven Woo, fellow and distinguished inventor at Rambus. In practice, while the LLC is logically seen as a single unit, it’s physically divided into blocks, each of which is near a CPU.

Coherency protocols have been well established for many years. “MESI is the most common cache coherency protocol, which describes a cache line as being modified (M), exclusive (E), shared (S), or invalid (I),” said Fred Piry, fellow and vice president of CPU technology at Arm.

“MESI is a pretty stable protocol at this point,” observed Woo. “The fact that it hasn’t really changed dramatically tells us that there’s been a lot of past work [to fine tune it].”
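
To make the protocol’s bookkeeping concrete, below is a deliberately simplified C++ sketch of MESI-style state tracking for a single cache line. The structure and names are ours for illustration only; real coherence controllers handle many more cases (write-backs, snoop filters, and MOESI/MESIF variants).

```cpp
#include <cstdio>
#include <vector>

// Simplified MESI model: one cache line tracked across N private caches.
enum class State { Modified, Exclusive, Shared, Invalid };

struct Line {
    State state = State::Invalid;
};

// A write by core `writer` invalidates every other copy. Those are the
// "please forget about it" messages described above.
void write_line(std::vector<Line>& caches, int writer) {
    for (int i = 0; i < (int)caches.size(); ++i) {
        if (i == writer) {
            caches[i].state = State::Modified;   // local copy is now dirty
        } else if (caches[i].state != State::Invalid) {
            caches[i].state = State::Invalid;    // invalidate message sent
        }
    }
}

int main() {
    std::vector<Line> caches(4);
    caches[0].state = caches[1].state = State::Shared;  // two cores share the line
    write_line(caches, 0);                               // core 0 writes
    printf("core 1 copy invalid? %s\n",
           caches[1].state == State::Invalid ? "yes" : "no");
}
```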

CPUs send messages to other CPUs indicating changes in status. “If you’re writing to lines that are already shared, you need to tell the other CPUs, ‘I’m modifying that cache line, so please forget about it. I now have a new version of it.’ Within a mobile phone or laptop, you don’t have that many CPUs, so it’s pretty fast,” Piry said. “If you have hundreds of CPUs in large-scale servers, the delay might be visible.”

Keeping track of those messages is non-trivial. “It can be challenging, because you can get these cases where messages are chasing each other through the system,” said Woo. “In the worst case, if you’ve got a big chip, the core on one corner of the chip has to invalidate the core on the other corner. There’s the time to traverse the whole chip plus the time for the message to work its way through the hierarchy.”

This typically happens over a dedicated coherency mesh that carries the messages, and for smaller processors, the delay may be limited to 10 or so cycles. “Arm’s Coherent Mesh Network (CMN) with AMBA CHI support is one such fabric,” noted Khan. “IP providers also have solutions in this space.”

But for a processor having many cores, that mesh may experience congestion and delays as messages move through, and worst-case scenarios can cost on the order of 1000 cycles.

Fig. 2: Cache-update delay. Updating caches can take a significant number of cycles for manycore systems. Source: Bryon Moyer/Semiconductor Engineering

This coherency activity happens at the hardware and system-software level behind the scenes, so application-software developers have little control. And yet they’re in the best position to affect performance based on how a program is structured and how variables are shared between threads. For hardware designers, the best options are playing with cache sizes, cache privacy, and the coherency mesh. Improving them can boost performance, but with added die size and cost and with higher power.

“An efficient coherent shared-cache architecture is indispensable to minimize shared-data access latencies for all threads participating in the shared computation,” said Mohit Wani, product manager for ARC-V RPX processors at Synopsys.

Synchronization delays
The two primary synchronization mechanisms are locks (also called mutexes, for “mutual exclusion”) and barriers. A lock takes data that’s accessible to and in use by multiple CPUs and restricts access to a single CPU at a time. This helps ensure that each CPU is working on the same data with the same value, and that no CPU will be working with an old or intermediate version while others work with a new version.

Barriers restrict execution of code beyond a specific point, allowing all threads contributing to a result to catch up and finish before any thread then proceeds with the result. Reduction algorithms are an example that may involve several threads computing a single result.
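
Below is a minimal C++20 sketch of both mechanisms in a toy reduction, using the standard library rather than any particular CPU’s primitives: a mutex serializes updates to the shared sum, and a barrier holds every thread back until all partial results have been added.

```cpp
#include <barrier>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    long total = 0;
    std::mutex lock;                 // mutual exclusion around `total`
    std::barrier sync(kThreads);     // no thread proceeds until all arrive

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            long partial = 0;
            for (int i = t; i < 1000; i += kThreads) partial += i;

            {   // lock: one thread at a time updates the shared sum
                std::lock_guard<std::mutex> guard(lock);
                total += partial;
            }
            sync.arrive_and_wait();  // barrier: wait for every partial result

            if (t == 0) printf("reduction result: %ld\n", total);  // safe to read now
        });
    }
    for (auto& w : workers) w.join();
}
```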

There are at least two ways of implementing these two tools, and both have a cost. “Sometimes what people do when you reach a barrier is the equivalent of going to sleep, and something like an interrupt wakes them up again,” explained Woo. “Interrupts are expensive, and it’s going to take a lot of time both to send all the release messages and to get the cores up and running again.” The upside is this saves power because the cores sleep while waiting.

Another approach employs flags or semaphores, and cores execute a busy loop polling the flag to see when it changes. “When the memory location changes value, that’s when they know they can move forward,” said Woo. Such a technique has a much faster response since it involves no interrupt, but more energy is consumed by the cores spinning as they wait.
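
A minimal sketch of that spin-wait pattern, using a standard C++ atomic flag purely for illustration:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> ready{false};

int main() {
    std::thread producer([] {
        // ... produce the result the other thread is waiting for ...
        ready.store(true, std::memory_order_release);  // publish it
    });

    // Busy loop: reacts quickly, but the waiting core burns power spinning.
    while (!ready.load(std::memory_order_acquire)) {
        // a pause/yield hint could go here to ease pressure on the core
    }
    printf("flag observed, consumer proceeds\n");

    producer.join();
}
```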

In addition, threads on an idle core may be swapped out for some other thread in the meantime, and those context swaps are also expensive. “When a task is suspended to yield a processor core while waiting for other tasks to catch up, a significant penalty is incurred from context switching overhead,” noted Wani.

Finally, just as with coherency, once a barrier or lock is released, messages go out to other cores, and for a many-core unit those cycles can add up to thousands. “If you have 100 CPUs in your system, commands will be propagated to all the CPUs in the system,” explained Piry. “And all CPUs will receive the synchronization request and acknowledge that it is received and completed. In addition, they will need to operate on the command, and this is what can take thousands of cycles.”

Fig. 3: Synchronization delays. The top shows CPU n setting up a barrier; it then takes longer to compute its part than the other CPUs. Therefore, the others must wait once finished until CPU n releases the barrier. Similarly, when a CPU sets up a lock, no other CPU can access that data until the lock is released. Source: Bryon Moyer/Semiconductor Engineering

“The CPU hardware can help by providing an instruction set and memory architecture that optimizes thread synchronization and data sharing,” said Wani. Some small processors for embedded use come with hardware semaphores or mailboxes to speed synchronization. That approach, while suitable for a small CPU, doesn’t scale well.

More complex units implement specific instructions, such as in-memory atomics. They perform a read/modify/write sequence atomically, meaning that no other entity can see or influence the results until it’s complete. Such instructions help avoid what would otherwise be data hazards requiring explicit synchronization.
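
As an illustration of the idea (standard C++ atomics rather than any specific instruction set), the sketch below lets eight threads update one shared counter with atomic fetch-and-add, so no lock is needed and no updates are lost:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < 100000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // atomic read/modify/write
        });
    }
    for (auto& w : workers) w.join();
    printf("counter = %ld\n", counter.load());  // always 800,000; no lost updates
}
```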

Program architecture is also an important tool. “Synchronization overhead is best addressed by adopting a software architecture that reduces the frequency at which threads need to synchronize,” said Wani.

Instruction-level parallelism
Parallel execution can take place at many levels. Instruction-level parallelism refers to executing more than one instruction at the same time. It’s something modern superscalar CPUs allow, but with limitations. “You can execute multiple instructions in multiple function units in superscalar processors, but the requirement is that the instructions must be independent,” said Forsell.

Figure 4 shows a portion of a 10‑wide CPU micro-architecture, reflecting the widest CPUs commercially available today. By definition, the most parallelism possible here is 10, but with two catches. The first catch has to do with the distribution of function units, while the second arises from data dependencies.

A 10-wide CPU has 10 groups of function units available, and which group gets the instruction depends on what the instruction is. In the fictitious example in Figure 4, each group has two function units, one of which is an integer ALU. So for integer operations, one could have up to 10 running together. But only five have a floating-point unit, two have bit-manipulation units, and there is one each for a multiplier, divider, and fused multiply-add (FMA). So, for example, up to two bit operations can run together, but no parallelism is possible for multiplication, division, and FMA.

Fig. 4: Superscalar-CPU function units. Not all function-unit channels are capable of the same functions, and some functions might be available in only a few channels, or even just one. Source: Bryon Moyer/Semiconductor Engineering

These parallelism opportunities reflect the maximum possible, and they can be realized only if the variables involved in parallel calculations aren’t dependent on each other. If one operation, for example, relies on the result of a prior operation, then those two must execute serially.
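
The contrast shows up even in a few lines of ordinary code. In the illustrative sketch below, the first function’s multiplies are independent and can issue together on a wide machine, while the second forms a chain in which each operation must wait for the previous result:

```cpp
#include <cstdio>

// Independent operations: no result feeds another, so a wide,
// out-of-order CPU can issue all four multiplies in parallel.
float independent(float x0, float x1, float x2, float x3) {
    float a = x0 * 2.0f;
    float b = x1 * 3.0f;
    float c = x2 * 4.0f;
    float d = x3 * 5.0f;
    return a + b + c + d;
}

// Dependency chain: each multiply consumes the previous result, so the
// operations execute serially no matter how many multipliers exist.
float chained(float x0) {
    float p = x0 * 2.0f;
    p = p * 3.0f;
    p = p * 4.0f;
    p = p * 5.0f;
    return p;
}

int main() {
    printf("%f %f\n", independent(1, 1, 1, 1), chained(1));
}
```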

Here again, software architecture is the most powerful lever available for maximizing parallelism. “The algorithm is always the place to optimize for better performance on a given hardware platform,” said Khan. “Distributed-data-parallel and fully-sharded-data-parallel techniques for model replication and data partitioning allow for the best system utilization and model performance.”

Identifying data dependencies requires software analysis. Declared variables can be analyzed statically, and a compiler can order the object code in a way that maximizes performance in the target CPU. Many programs make extensive use of pointers, however, and those can’t be analyzed statically because their values are determined in real time. Dynamic analysis, monitoring loads and stores for access patterns, may be necessary. In that case, the results of the analysis can help the developer optimize the program for parallelism while ensuring that dependencies are honored.
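
The pointer problem is easy to illustrate. In the sketch below, the loop can run its iterations in parallel only if `dst` and `src` never overlap, and nothing in the code lets a compiler prove that statically:

```cpp
#include <cstddef>

// If `dst` and `src` overlap, iteration i may depend on an earlier
// iteration's store, so the compiler must assume a possible dependency.
void scale(float* dst, const float* src, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;   // parallelizable or vectorizable only if no aliasing
}
```

Many compilers accept a non-standard restrict-style qualifier as a programmer’s promise that the pointers don’t alias; absent that, they must either generate conservative code or add runtime overlap checks.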

Control and data planes
Maximal parallelism is possible for threads running through a series of instructions without any branching (so-called basic blocks). Branches introduce uncertainty, and attempts to execute speculatively may end up with a wrong guess and a hiccup as the system adjusts.

Engineers sometimes describe two styles of coding: one for control and one for data manipulation. This distinction underlay the architecture for network-processing chips, with dedicated hardware provided for long strings of calculations on data complemented by a CPU for executing control code, which typically involves many branches. Code in the data plane is much more amenable to parallel execution than is control code.

Given the limited opportunities to improve performance in the classical CPU architecture, some suggest that a separate hardware unit for data-plane-like code would improve parallelism dramatically over what’s possible with a CPU alone. “Platforms of all performance levels are now embracing the idea of offloading structured tasks from the generic CPU and moving those tasks onto dedicated processing engines that are task-specific,” said Roddy.

Startup Flow Computing is providing such a unit. “It’s an IP block tightly integrated into the same silicon as the CPU,” explained Timo Valtonen, CEO at Flow Computing. “It would hang off the bus just as a GPU or neural accelerator would. It’s a configurable array of small processors that can execute parallel code with less need for barriers or locks.”

Fig. 5: Proposed architecture with dedicated parallel-processing unit. The CPU controls the flow and hands workloads to the parallel unit. The CPU effectively executes the control flow, and the parallel unit executes the equivalent of data-plane code. Source: Bryon Moyer/Semiconductor Engineering

The proposed unit has a single large cache shared by all its cores, eliminating coherency delays. If the main CPU and the parallel unit were both working on some of the same data, then coherency between the CPU caches and the parallel unit’s cache would be necessary, but Flow sees that as unlikely.

The use of fibers provides a lightweight means of handling what would otherwise be threads. Created by the CPU, they can be passed to the accelerator and avoid the operating-system service calls necessary for handling threads. This can alleviate some of the synchronization delays.

The ability to chain instructions together then maximizes instruction-level parallelism. “Chaining is where you take the output of one instruction and feed it directly into the next instruction,” explained Woo. “Supercomputers from companies like Cray have been doing this exact thing since the 1980s and early ’90s.”
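
The textbook case is a multiply feeding an add, as in the DAXPY-style loop below; on a chained vector machine the adder starts consuming products as soon as the multiplier produces them. The sketch illustrates only the general idea, not Flow Computing’s implementation:

```cpp
#include <cstddef>

// y = a*x + y. On a chained vector machine, the multiply's output streams
// directly into the add without a round trip through registers or memory.
void daxpy(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // multiply result chains into the add
}
```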

Using a parallel unit requires recompiling code for it. “We do apply a special compilation algorithm to handle code with dependencies for chained execution,” said Forsell. The new code can have the CPU hand off highly parallel sections, and dependency analysis allows the chaining of instructions that will flow through the parallel unit.

The catch, however, is that the entire codebase will be in the language native to the CPU. The compiler doesn’t generate separate object code for the parallel portions. That means that the parallel unit must be equipped with decoders and other logic that mirrors the accompanying CPU. Such work needs to be done in collaboration with the CPU vendor, especially since decoding someone else’s instruction set can be viewed as a violation of intellectual property rights.

Compilers aren’t magic tools, however. “You have to keep in mind that compilers value correctness over performance,” cautioned Woo. “They sometimes have to make conservative assumptions rather than aggressive assumptions. To get the most out of a system, it’s up to the programmer to put directives in or to bypass the compiler’s automatic code generation.”

Moving past incremental gains
Although CPU designers toil endlessly in search of opportunities to improve performance, the low-hanging fruit is long gone. Roofline models, which depict where computation is limited by memory bandwidth and where it’s compute-bound, can help developers identify opportunities for improvement, along with their associated costs. But absent some game-changing development, improvements will be incremental with each generation.
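
The roofline calculation itself is simple: attainable throughput is the lesser of the compute peak and memory bandwidth times arithmetic intensity. The sketch below uses made-up numbers purely for illustration:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_gflops   = 500.0;   // hypothetical compute roof
    const double bandwidth_gbs = 100.0;   // hypothetical memory bandwidth roof

    for (double intensity : {0.5, 2.0, 8.0}) {   // FLOPs per byte moved
        double attainable = std::min(peak_gflops, bandwidth_gbs * intensity);
        printf("intensity %.1f FLOP/byte -> %.0f GFLOP/s (%s-bound)\n",
               intensity, attainable,
               bandwidth_gbs * intensity < peak_gflops ? "memory" : "compute");
    }
}
```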

Synopsys pointed to four architectural ideas it uses to improve performance.

  • Make higher-level caches configurable as either a larger cache or a smaller cache plus some memory, with the ability to reconfigure dynamically to suit the current workload;
  • At each cache level, employ pre-fetchers to hide more fetch latency;
  • Even with in-order CPUs, allow out-of-order memory requests so that whatever comes back first can be used even if an earlier request is still pending, and
  • Implement quality-of-service (QoS) using software to give some cores priority for critical workloads.

Would a parallel offload unit be a bigger game changer? It’s possible, providing system developers see it as sliding smoothly into the way they already work. If too much changes, designers, who are mostly conservative by nature, will resist the risks that so much change can introduce into a project. In addition, the incremental silicon real estate must add enough value to keep a chip suitably profitable. That may make it more attractive for purpose-built SoCs, where the value of the added performance is clear.

The tools and development flow associated with such an accelerator would also need to fit smoothly into existing flows, here again respecting coders’ resistance to changing too much. Even if a hardware idea and its associated tools were beyond conceptual reproach, new users would still need to convince themselves that the hardware and software implementations were correct and bug-free.

Whether or not Flow sees success may help shape the direction in which such technology evolves. If it fails, was it because of a fundamental conceptual flaw or because of an execution issue? If the latter, someone else may pick up the ball and run with it. If the former, then the whole idea comes into question.

If, on the other hand, it’s successful, you can bet that others will be out trying to imitate and improve on that success. Given the early days of this proposal, we probably have at least a year, and probably more, to figure out whether we have a winner.

Further Reading
Architecting Chips For High-Performance Computing
Data center IC designs are evolving, based on workloads, but making the tradeoffs for those workloads is not always straightforward.
Optimizing Energy At The System Level
A lot of effort has gone into optimization of hardware during implementation, but that is a small fraction of the total opportunity.


