How Cache Coherency Impacts Power, Performance

Second of two parts: Why and how to add cache coherency to an existing design.


As discussed in part one, one of the reasons cache coherency is becoming more important is the shared common memory resource in designs today. Various agents in the design want to access that data as fast as they can, putting pressure on the CPU complex to manage all of the requests.

Until a generation ago, it was okay for the CPU to control that memory and have access to it, as well as be the arbiter for other agents like a GPU or an accelerator. “They could all say, ‘I need this memory. Hey CPU, fetch it for me,’” said Sundari Mitra, CEO of NetSpeed Systems. “‘And I will allow the CPU to tell me which memory it is that I can work with and that transaction is fast enough,’ because a CPU has its Level 1 Cache, its Level 2 Cache, potentially its Level 3 Cache, and then access to main memory. The CPU can manage this transaction.”

But as more real-time decisions are made—whether graphics-based, or in a new acceleration unit for the IoT or automotive markets—the agent making them needs access to that memory independent of what is happening on the CPU, she explained. So cache coherency, rather than being captive to a CPU subsystem, is now being opened up so that the GPU and other agents have access to it. This becomes, in effect, I/O coherency, or some form of coherency that is coupled to the CPU/GPU fabric.

In many designs, coherency is maintained in software because the hardware doesn't provide it. Design teams would like to move it into the hardware, but that can be a daunting task, said Kulbhushan Kalra, R&D manager for ARC processors at Synopsys.

Chris Rowen, fellow and CTO of the IP Group at Cadence, agreed that it’s difficult but possible. “You will need to change probably both the cores and the interconnect to do it, so conceptually you can have the same structure, but you’re going to replace the plumbing in a meaningful way to do that. You could say you’re going to have the same general type of cores, maybe running the same software or largely the same software, and the same instruction set, connections, memory system, peripherals, and accelerators. But the bus structure that needs to run the coherency protocol, and the processor cores, which are participating in the cache coherency, need to be changed in order to support that enhanced protocol.”

When a design evolves from generation to generation, the engineering team typically considers a number of upgrades to that platform. So the platform may be a close conceptual descendant of the original, but usually there are enough changes taking place when the bus and cores are swapped out that it is thought of as at least a partially new design, rather than a retrofit, Rowen said.

How coherency works
To understand how to add coherency, it helps to note where exactly coherency operates.

Neil Parris, senior product manager at ARM, explained that hardware cache coherency operates at multiple levels:

  • Within a processor cluster. A quad-core processor, for example, has cache coherency among its processor cores.
  • Between processor clusters. An ARM big.LITTLE system has big and LITTLE processor clusters allowing improvements in peak performance and overall system efficiency by using the right processor for the right task.
  • Between processors and I/O, such as a networking application with hardware accelerators and interfaces that share data with the processors. I/O coherency is a term used for one-way coherency. In other words, the I/O (in this case an accelerator or interface) can read from processor caches, but the processor cannot read from the I/O.
  • Heterogeneous processing between different types of processors, such as a CPU, GPU and DSP.

Parris noted this last point is an exciting area of development in the industry right now, and said ARM’s latest IP enables the GPU to be fully coherent with the CPU across a cache coherent interconnect, and allows new software models such as fine-grain shared virtual memory. This means sharing of data is as simple as passing a pointer, and that both CPU and GPU can work on the same data at the same time.

When it comes to adding coherency where it previously didn’t exist, Parris said that for a processor that includes caches but has a simple bus interface, like AXI, it would be quite a big change to add support for full hardware cache coherency. The interface would need to be upgraded to support AMBA ACE or CHI, and the processor would need to respond to the full coherency protocol.

He noted that it may be much easier to take advantage of an I/O coherent interface such that the interconnect takes responsibility for the coherency. An example of this would be adding an AXI co-processor or accelerator to an I/O coherent interface (AMBA ACE-Lite) on the interconnect. “I/O reads can then snoop into and read from coherent processor caches, and any writes will automatically invalidate old stale data in the processor caches. This I/O device could also benefit from the system cache in the coherent interconnect. For example, the ARM CCN (Cache Coherent Network) includes a system cache that allows allocation by I/O coherent interfaces. This might be thought of as a proxy cache that gives performance benefits to the I/O device, as well as the processors in the system.”

Proxy caches
The concept of proxy caches is gaining in popularity. Arteris takes this approach, as well. According to David Kruckemyer, chief hardware architect at Arteris, the proxy-cache approach views an SoC as containing a coherence subsystem within which all of the communication is effectively coherent—meaning that it has coherent access semantics.

There is also a non-coherent subsystem, which has non-coherent access semantics. A component in the system bridges those two subsystems, called a non-coherent bridge, which allows for integration of legacy IP and new, non-coherent IP. That relies on a cache in the bridge, called the proxy cache, which acts as a proxy for the non-coherent IPs that are talking to it. The proxy cache is kept coherent with respect to all of the other caches in the coherent subsystem, he explained.

Big picture, when putting a system together and considering cache coherency in the context of processors, Rowen noted that engineering teams typically either buy a cluster of processors together in which the cache coherency mechanism is internal to the subsystem IP that they are buying, or else they use a standard bus. “The ACE protocol from ARM on AXI implements it, and there is verification IP to confirm that when you plug into that you are properly following the standard. So it is part of the bus protocol, and conformance with the bus protocol carries with it the conformance with the coherency mechanism.”

However, design teams go in two directions, he observed. “One is they say they don’t need all that sophistication and choose a subset of that functionality, and the simplicity will come from throwing things out. They manage complexity by getting rid of some of the complexity. The other way the design teams solve the cache coherency issue is by packaging up the complexity by saying, ‘No, it’s going to fit within a protocol and here is a core that does it. Here is some verification IP that does it. Here is a standard bus that does it. Here is an ecosystem, which creates some rich tools and IP to implement the complexity so that you don’t have to worry about it.’”

Do it yourself
And of course, there will also always be a few advanced computer architects who design something different for a very good reason and don’t want to dumb it down. “Every system—ARM ACE, for example — is a compromise among different things that people might want to do,” Rowen said. “It will have strengths and weaknesses, and some people say they will gladly take strengths and weaknesses as long as they don’t have to reinvent the wheel. Other people recognize they can get a 50% efficiency gain from doing it a different way and therefore will take on that challenge.”

He added that getting cache coherency working correctly is a challenging problem because there are so many possible interactions. “You have two or three or four different bus masters, which are all doing these transactions to update the state of different caches, and the ordering of events is terribly important. And it is a grand challenge for verification. As we scale to larger numbers of cores, or we scale to much bigger ratios between the performance of the core and the performance of off-chip memory, we need to continue to innovate in these kinds of questions. It is complicated, and it is a Tier-one sport.”

Related Stories
How Cache Coherency Impacts Power, Performance Part 1
Part 1: A look at the impact of communication across multiple processors on an SoC and how to make that more efficient.
Coherency, Cache And Configurability
The fundamentals of improving performance.
Heterogeneous Multi-Core Headaches
Using different processors in a system makes sense for power and performance, but it’s making cache coherency much more difficult.