The High But Often Unnecessary Cost Of Coherence

Cache coherency is expensive and provides little or negative benefit for some tasks. So why is it still used so frequently?


Cache coherency, a common technique for improving performance in chips, is becoming less useful as general-purpose processors are supplemented with, and sometimes supplanted by, highly specialized accelerators and other processing elements.

While cache coherency won’t disappear anytime soon, it is increasingly being viewed as a luxury necessary to preserve a long-standing programming paradigm. As a result, architectures are beginning to limit its use whenever it makes sense.

Caches effectively move selective areas of memory closer to the processor, reducing the wait time to process data and improving the overall performance. A contiguous memory space is an essential element of the von Neumann processor architecture, and one that is relatively cheap and easy to implement, even when multiple processors are added. But there are other ways to speed up processing, and using general-purpose processors and cache for everything is neither the fastest nor the most efficient.


Fig. 1: Simplified cache coherency concept. Source: Wikipedia

Caching is complicated to manage, too. It creates a problem because there are now multiple copies of certain pieces of the memory space. If a variable is in use and can be manipulated by multiple processors, the hardware has to ensure that all of those processors see the same value for that variable.
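The hazard is easy to show with a toy model. The Python sketch below (all class and variable names are invented for illustration, not a real API) gives two cores private copies of the same memory location; without any invalidation mechanism, one core keeps reading a stale value after the other writes.

```python
# Toy model of the stale-copy problem that cache coherency solves.
# Names and structure are illustrative only.

class Memory:
    def __init__(self):
        self.data = {"x": 0}

class Cache:
    """A private per-core cache with no coherency hardware."""
    def __init__(self, mem):
        self.mem = mem
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = self.mem.data[addr]
        return self.lines[addr]             # hit: return the cached copy

    def write(self, addr, value):
        self.lines[addr] = value            # write-back: memory not updated yet

mem = Memory()
core0, core1 = Cache(mem), Cache(mem)

core1.read("x")          # core 1 caches x == 0
core0.write("x", 42)     # core 0 updates only its private copy

stale = core1.read("x")  # core 1 still sees the old value
print(stale)             # 0, not 42 -- the two copies have diverged
```

Coherency hardware exists precisely to close this gap: core 0's write would invalidate or update core 1's copy before it could be read again.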

Within the context of a single chip, packaged solutions exist and can be adapted to most requirements. However, when coherence is required across chips, it becomes more difficult. Recently, standards have come into existence, such as CCIX and CXL, that ensure compatibility between systems from multiple vendors.

“Coherence is a contract between agents that says, ‘I promise you that I will always provide the latest data to you,'” says Michael Frank, fellow and system architect at Arteris IP. “It is mostly important when you have a lot of people sharing the same data set. Coherence between equal peers is very important and will not go away.”

But over time, as increasing numbers of processors have been added into designs, the cost associated with this solution has increased. “Heterogeneous, dataflow-driven compute means doing more compute on more data,” says Adnan Hamid, executive president and CTO for Breker Verification Systems. “This data is being shared across more networks, while managing security and power. It is hampered by poor old Mr. Von Neumann and his memory bottleneck. Far from losing caches, high-performance systems are forced to implement multiple levels of distributed caches within and across chips, systems, and data centers.”

Many new programming paradigms do not have this problem. Memory — or more correctly, the contents of memory — is handed to a processor, which performs its task on it. On completion, the contents of that memory are made available to whoever wants to work on them next. It is simple, clean, fast, and efficient.

Still, change takes time. “The nature of processing continues to be multiple processing threads collaborating to complete some task,” says Millind Mittal, vice president for strategic architecture and fellow at Xilinx. “Even for heterogeneous architectures, one way to see the overall processing scope is that now, instead of homogeneous general-purpose processing (GP) threads on CPUs, work collaboration is among the processing threads that comprise GP threads and task-optimized processing (TP) threads.”

When the tasks are well understood, this can be avoided. “In a dataflow engine, coherence is less important, because you’re shipping the data that is moving on the edge directly from one accelerator to the other,” says Arteris’ Frank. “If you partition the data set, coherence gets in the way because it costs you extra cycles. You have to look up and provide the update information. But if you have a general accelerator type, say, a sea of equal processes with vector engines, coherency is the lifesaver because it allows you to rely on where the data is.”

That simplifies the design process. “Invariably there is a need for shared data structures between these collaborating threads,” says Xilinx’s Mittal. “Cache coherence helps in two ways — ease of achieving memory consistency for a shared data structure (no explicit software-driven coherence operation needed) or performance (when multiple threads have frequent access to a data structure). There are scalability and complexity issues with hardware-based coherency solutions.”

Some of these schemes can get even more complex for particular workflows. “Do I need caches because I need to multiply bandwidth?” asks Frank. “If I have to feed the same data to multiple engines, but I cannot feed the data multiple times out of a single cache, I copy data from main memory into multiple copies of caches. Now, you have to think about how to guarantee these copies remain consistent with each other. That leads us to coherency because now you have to make sure that multiple CPUs that share data, or multiple accelerators that share data, have a way to maintain coherency between them. Sometimes it’s really just as simple as a copy of each other. If it is one-time use, don’t bother with coherency. The effort that is required to make sure that data is coherent is probably too expensive.”

When initially adding coherence, the cost per node is relatively linear. “As the number of nodes increases, you have to make sure that everyone has a way to snoop, or interrogate, and make sure that other processors’ data is consistent and coherent,” adds Frank. “It means that you need communication paths between each of the blocks. In addition, you need to make sure that you are not having glitches in the protocol, such as race conditions. You might check for data being at a place, and between you checking the data and getting the response, something has happened to the state of the system. You may get stale information, so the more agents you have, the more opportunities there are for race conditions, which means you have to create a protocol that is safe against them. To avoid going to every agent each time, because that is expensive in power and time, you try to make a decision at some central point, such as a directory, which records whether there are multiple copies of that cache line in the system, whether a cache line is shared between agents, or whether the cache line is unique and a single copy exists at agent 5. Then when you come and ask for the data, agent 5 will send it to you directly. You have costs that definitely grow superlinearly.”
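The directory idea Frank describes can be sketched in a few lines. This Python model (names invented for illustration) keeps one directory entry per cache line recording which agents hold a copy, so a write needs to contact only the recorded sharers rather than broadcasting to every agent in the system.

```python
# Minimal directory-based coherence model: one entry per cache line
# records the set of sharing agents, so invalidations are targeted,
# not broadcast. Illustrative sketch, not a real protocol.

class Directory:
    def __init__(self):
        self.sharers = {}      # line address -> set of agent ids holding a copy

    def record_read(self, line, agent):
        """A read miss adds the requesting agent as a sharer."""
        self.sharers.setdefault(line, set()).add(agent)

    def invalidate_for_write(self, line, writer):
        """Return only the agents that must be invalidated for this write."""
        to_invalidate = self.sharers.get(line, set()) - {writer}
        self.sharers[line] = {writer}      # writer now holds the unique copy
        return to_invalidate

d = Directory()
d.record_read("0x100", 1)
d.record_read("0x100", 3)
d.record_read("0x100", 5)

victims = d.invalidate_for_write("0x100", 5)
print(sorted(victims))   # [1, 3] -- agent 5 keeps the now-unique copy
```

The directory itself is part of the superlinear cost: its storage and lookup traffic grow with both the number of agents and the number of lines tracked.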

The full cost has to be fully considered. “Coherency is complicated and expensive to implement in silicon,” says Simon Davidmann, founder and CEO for Imperas Software. “When you go to multiple levels of caching, the memory hierarchy becomes more and more complex, and increasingly it’s full of bugs and consumes larger amounts of power.”

“Gone are the days when it was sufficient to worry only about cache coherency,” says Breker’s Hamid. “Now we must validate system coherency where critical dirty data is not only in a cache far, far away, but highly likely to be in transit somewhere in the network. All this data must remain consistent through traversals of cross-network security firewalls, while various sub-systems along the way are being clock- and voltage-throttled for power management.”

The primary reason to add cache is to improve performance. “The credit goes to Dave Patterson, who introduced well-known analytical methods for performance analysis,” says Frank. “He did all the ground work to make informed decisions about what to build. It is a mixture of art and science. You need to understand what you’re trying to do, but you really have to look at what an algorithm is going to do. Then you can analyze your algorithm for dataflow and for execution sequences. You try to make sure that you can group things into areas. When you have identified flow patterns, you can start to build a model. Using experience and back-of-the-envelope calculation, you propose an architecture. Then you run the model against it and make sure and validate your assumptions.”
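The back-of-the-envelope style Frank describes often starts from the standard average memory access time (AMAT) formula popularized by Patterson and Hennessy: hit time plus miss rate times miss penalty, composed level by level. The sketch below applies it to a two-level hierarchy; all latencies and miss rates are made-up example figures, not measurements of any real design.

```python
# Average memory access time (AMAT), composed level by level:
#   AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * mem_penalty)
# All numbers below are illustrative, in cycles.

def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    l2_penalty = l2_hit + l2_miss_rate * mem_penalty
    return l1_hit + l1_miss_rate * l2_penalty

# Example: 4-cycle L1 with 5% misses; 12-cycle L2 with 20% misses; 200-cycle DRAM.
cycles = amat(l1_hit=4, l1_miss_rate=0.05,
              l2_hit=12, l2_miss_rate=0.20, mem_penalty=200)
print(cycles)   # 4 + 0.05 * (12 + 0.2 * 200) = 6.6 cycles
```

Even this crude model makes the trade-offs visible: halving the L1 miss rate saves far more than shaving a cycle off the L2, which is the kind of informed decision such analysis is meant to support.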

How much art is involved? “You can do some calculations for performance, but we can never model the details of the cache subsystems,” warns Imperas’ Davidmann. “We can from a functionality point of view, but not from a performance point of view. We can only provide approximations. For example, with an instruction-accurate model, you don’t model the instruction pipelines, so you don’t know which order memory accesses come in. This can happen because of things like prefetches, branch predictions, and speculative execution. All of those require instruction and data fetch that are out of order. Each of those can change the data in the caches. They might never actually be retired, or they might be discarded. So you might see access to the cache that doesn’t follow program flow.”

To get a more accurate view, you have to be prepared to do more work. “You can make educated decisions about where to place data, how to organize the data in memory, where to process it, and how to process it,” says Frank. “From that, you may see, for example, that some data is shared but rarely updated, and some data is passed from one engine to the other. And sometimes there is enough memory on-chip so you can keep the data on-chip, or if you have to send the data off-chip, you have to figure out if you have enough bandwidth.”
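A bandwidth check of the kind Frank describes is simple arithmetic. The sketch below (all figures invented for illustration) compares the off-chip traffic an algorithm would generate, after accounting for on-chip reuse, against an assumed memory interface bandwidth.

```python
# Back-of-envelope off-chip bandwidth feasibility check.
# All numbers are illustrative assumptions, not measurements.

def required_bandwidth_gbs(bytes_per_item, items_per_sec, reuse_factor):
    """Off-chip traffic if each item is fetched once per `reuse_factor` uses."""
    return bytes_per_item * items_per_sec / reuse_factor / 1e9

# 64-byte items consumed at 2 billion/sec, each reused 4 times on-chip.
needed = required_bandwidth_gbs(bytes_per_item=64, items_per_sec=2e9,
                                reuse_factor=4)
available = 25.6   # assumed DRAM interface bandwidth, GB/s

print(needed)                # 32.0 GB/s required
print(needed <= available)   # False -> need more on-chip reuse,
                             # more on-chip memory, or a wider interface
```

When the check fails, the options are exactly the ones in the quote: increase reuse (bigger caches or on-chip buffers), restructure the dataflow, or pay for more off-chip bandwidth.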

The future for coherence
What does the future look like for coherence? “Our view is that cache coherence is still desirable,” says Xilinx’s Mittal. “However, the domain of hardware coherence will stay bounded to dual sockets and accelerators or memory expansion solutions within a small chassis. Separately, the hardware coherency requirements will be limited to a subset of data structures. Data sharing will be done through a combination of hardware coherence for a portion of data, and others shared through a software coherency model.”

Software coherence is seeing increased attention. “Software can control coherency,” says Davidmann. “You can use instructions to flush the caches, instead of doing it with hardware. You could delay writes to memory, for example. Solutions that have done this have scaled well. When you look at modern processor architectures, they have the ability for software control. RISC-V has instructions for atomicity. You can lock regions. Maybe you still need some level of coherency, but the trend would appear to be that more can be done in software.”
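Software-managed coherence can be modeled the same way as the hardware variant: the program, not the hardware, decides when to flush dirty lines back to memory and when to invalidate possibly stale copies. The Python sketch below is a toy model with invented names; on real hardware the `flush` and `invalidate` calls would map to explicit cache-management instructions.

```python
# Software-managed coherence: the program decides when to flush dirty
# data to memory and when to invalidate stale copies.
# Toy model -- all names are illustrative.

class SWCache:
    def __init__(self, mem):
        self.mem = mem
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.mem[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value           # stays dirty until flushed

    def flush(self, addr):
        self.mem[addr] = self.lines[addr]  # explicit write-back, by software

    def invalidate(self, addr):
        self.lines.pop(addr, None)         # explicit drop of a stale copy

mem = {"x": 0}
producer, consumer = SWCache(mem), SWCache(mem)

consumer.read("x")        # consumer caches the old value
producer.write("x", 7)
producer.flush("x")       # software-controlled write-back...
consumer.invalidate("x")  # ...and software-controlled invalidation
print(consumer.read("x")) # 7 -- fresh value, no coherency hardware involved
```

The catch, and the reason hardware coherence persists, is that every missed `flush` or `invalidate` is a silent data-corruption bug rather than a performance loss.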

That requires a different way of thinking. “How will you program accelerators in the future?” asks Frank. “Do you make hard-wired engines that are just strung together like the first-generation GPUs? Or do you build programmable engines that have their own instruction set? If you have to program these individually, you then connect these engines, each executing specific tasks, with a dataflow. The task is the new instruction. You build your instructions, and you’re still programming as if it was a sequential execution. But the compiler figures out, based on the hints that you give it, the hints describing the dataflow, how to partition and to schedule your algorithm. You have to do complete system analysis to understand the dataflow, to understand your needs, when data has to be visible to which agent, and how much this data and visibility is predictable.”
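The “task is the new instruction” model Frank describes is essentially a dataflow pipeline: each engine owns the data only while it works on it, then hands it on, so no two engines ever need a coherent shared view. A minimal Python sketch, with the engines and stage functions invented for illustration:

```python
# Dataflow pipeline: each stage receives data, transforms it, and passes
# it downstream. Ownership moves with the data, so no coherence is needed.
# Stage functions are placeholders for real accelerator tasks.

from queue import Queue
from threading import Thread

def engine(stage_fn, inbox, outbox):
    """Run one 'accelerator': consume items until the None sentinel."""
    while (item := inbox.get()) is not None:
        outbox.put(stage_fn(item))
    outbox.put(None)   # propagate shutdown downstream

q0, q1, q2 = Queue(), Queue(), Queue()
stages = [
    Thread(target=engine, args=(lambda x: x * 2, q0, q1)),   # scale stage
    Thread(target=engine, args=(lambda x: x + 1, q1, q2)),   # bias stage
]
for t in stages:
    t.start()

for value in [1, 2, 3]:
    q0.put(value)
q0.put(None)           # end-of-stream sentinel

results = []
while (out := q2.get()) is not None:
    results.append(out)
for t in stages:
    t.join()

print(results)   # [3, 5, 7] -- each item flowed through both engines in order
```

The queues here play the role of the on-chip links between accelerators: data is copied forward exactly once, and no stage ever re-reads a location another stage might have modified.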

That makes coherence less useful for some applications. “Today inter-host data sharing is through libraries that utilize remote direct memory access (RDMA) or message passing to copy data,” says Mittal. “Moving forward, data sharing across hosts is likely to include the option of using load/store access to inter-host data, with and without hardware coherency. When present, hardware coherency will likely limit the size of the data structures shared. Another evolution likely to happen is toward driverless context management of the processing threads in accelerators (task-optimized processing), similar to context management of GP threads on the CPUs.”

As tasks become increasingly well understood, and architectures are fine-tuned for them, dataflow becomes more predictable and coherence can be more targeted. Modern machine-learning architectures show that coherence adds little value to tasks such as vector-matrix multiplies.

But parts of all systems will remain unpredictable. “The first part that you think about is typically how to get data on and off the chip,” says Frank. “That means you think about your overall memory bandwidth, your interfaces, the footprint of memories that you have to support. This defines your subsystem decision. From there, you work your way closer to your programming engines and ask if you need caches because, perhaps, you have pieces of data that have a large level of reuse.”

“Keeping design complexity manageable and interfaces performant typically would limit the scope of hardware coherence to a small number of agents,” says Mittal.

The need for speed requires that memory be brought closer to processing, and that requires cache. The general-purpose programming paradigm and the failure of single processors to scale in terms of performance, have mandated coherence between caches that sit close to each processor. But the solution is too expensive to be used unless it is really required. Thankfully, as many tasks become more data-oriented, rather than control-dominated, dataflow within the system can be better understood and addressed. In some cases, software can manage it more effectively than automatically doing it in hardware. In other cases, the architecture can ensure there is never a need for it.

