Part 1: A look at the impact of communication across multiple processors on an SoC, and how to make it more efficient.
Managing how the processors in an SoC talk to one another is no small feat, because these chips often contain multiple processing units and caches.
Bringing order to these communications is critical for improving performance and reducing power. But it also requires a detailed understanding of how data moves, the interaction between hardware and software, and what components can or cannot support it.
“Cache coherency represents a classic tradeoff between hardware complexity and software complexity,” said Chris Rowen, fellow and CTO of the IP Group at Cadence. “It is used in a circumstance where you have multiple tasks or threads or processes running on a set of processors, and they need to communicate with one another, which means the data has to be moved from one to the other.”
Without cache coherency, movement of data from one memory to the next needed to be explicitly mapped out. “You had to leave the shared data off in some remote memory, like in DRAM off-chip, and then everybody suffered performance due to the latency to get to the shared memory,” Rowen explained. “The problem with those approaches is either you had a performance problem—everybody had to go off-chip to find it—or you had a little bit of a software complexity challenge in that you had to identify which things needed to be moved, and explicitly move them from one local memory to another local memory.”
Caches can potentially hide the latency of going off-chip, but doing so requires either flushing caches or moving data from one cache to another. That, in turn, requires complex logic to keep track of the state of each line in a cache, such as whether the line may be shared. And when one processor makes a request, it may have to check all of the other processors’ caches, which greatly increases memory traffic.
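As a rough sketch of the bookkeeping involved, the snippet below models a cache line’s coherence state using the classic MESI scheme (Modified, Exclusive, Shared, Invalid) and shows how a snooped request from another processor downgrades or invalidates the local copy. The types and function names are illustrative, not any particular vendor’s design; real controllers implement this in hardware.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-line MESI state, the classic snooping bookkeeping. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state_t;

typedef struct {
    uint64_t     tag;    /* which address block this line holds    */
    line_state_t state;  /* coherence state tracked per cache line */
} cache_line_t;

/* Another core broadcasts a read of this address: downgrade our copy.
 * Returns true if we must supply (write back) dirty data.            */
bool snoop_read(cache_line_t *line, uint64_t tag)
{
    if (line->tag != tag || line->state == INVALID)
        return false;                    /* we don't hold it: no action */
    bool must_supply = (line->state == MODIFIED);
    line->state = SHARED;                /* M/E -> S on a remote read   */
    return must_supply;
}

/* Another core wants to write: invalidate our copy entirely. */
bool snoop_write(cache_line_t *line, uint64_t tag)
{
    if (line->tag != tag || line->state == INVALID)
        return false;
    bool must_supply = (line->state == MODIFIED);
    line->state = INVALID;               /* any valid state -> I on a remote write */
    return must_supply;
}
```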
“The performance for everyone suffers a little bit, and often suffers whether there is actually any sharing or not, but it made the software guys’ job easier,” Rowen said. “You paid a hardware price in complexity and bus loading and you got a benefit in terms of a simpler programming model because then the programmer didn’t have to say, ‘This is potentially shared, it needs to be moved.’ You just sort of said everything is potentially shared and it will magically appear in the right place as far as the programmer is concerned, but the little hardware elves underneath are working madly to shuffle the data around in anticipation of possible sharing. That’s the basic tradeoff at work in cache coherency.”
All of this just gets more complex as the number of processors in a chip increases because the traffic that everyone sees is the sum of all the other traffic in the system. “So as you go from two processors to four processors, there is going to be twice as much traffic, which means every one of those twice as many processors is seeing twice as much happening to it, so there is four times as much total work that’s going on as you scale up,” he said. “At a certain point, you can’t keep scaling it up, and use conventional cache coherency because if everybody has to listen to everybody else, you get cacophony.”
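A back-of-the-envelope model makes that scaling concrete. Assuming every cache miss must be observed by every other cache (no filtering), total snoop traffic grows roughly with the square of the processor count; the miss rate below is an arbitrary illustrative number.

```c
#include <stdio.h>

/* Total snoop messages per second if every one of n caches must observe
 * every other cache's misses (no filtering). Grows roughly as n^2.     */
static double snoop_traffic(int n_cores, double misses_per_core)
{
    return (double)n_cores * (n_cores - 1) * misses_per_core;
}

int main(void)
{
    double m = 1e6;  /* illustrative: 1M misses per core per second */
    printf("2 cores:  %.0f snoops/s\n", snoop_traffic(2, m));   /*   2M */
    printf("4 cores:  %.0f snoops/s\n", snoop_traffic(4, m));   /*  12M */
    printf("16 cores: %.0f snoops/s\n", snoop_traffic(16, m));  /* 240M */
    return 0;
}
```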
The case for cache coherency
This is where cache coherency fits in. Preliminary benchmarking of CPU and GPU coherency shows up to a 90% reduction in overhead related to memory operations, according to Neil Parris, senior product manager at ARM.
Hardware cache coherency can add some complexity to the IP, but it also can provide significant savings for system power. “For example, if data is available on-chip, then we can avoid external, expensive DRAM accesses,” Parris said. “Also, if we can avoid the cache cleaning and cache maintenance, then that will also save memory accesses and wasted processor cycles, allowing us to do more useful work or enter a low power state. Finally, optimizations in the system, such as a snoop filter (also known as a directory), can help save hundreds of milliwatts of memory system power as the amount of shared data and coherent devices increases.”
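The idea behind a snoop filter or directory can be sketched in a few lines: keep a record of which caches might hold each line, so a request probes only those caches instead of broadcasting to all of them. The structure below is a simplified, hypothetical bit-vector directory, not ARM’s implementation; most lines have no sharers, so most requests trigger no snoops at all.

```c
#include <stdint.h>

#define MAX_CACHES 32   /* one bit per cache in the sharer mask */

/* Hypothetical directory (snoop filter) entry. */
typedef struct {
    uint64_t tag;
    uint32_t sharer_mask;   /* bit i set => cache i may hold a copy */
} dir_entry_t;

/* On a read miss from cache 'requester': return the set of caches that
 * must be snooped, then record the requester as a sharer.             */
uint32_t directory_read_miss(dir_entry_t *e, int requester)
{
    uint32_t to_snoop = e->sharer_mask & ~(1u << requester);
    e->sharer_mask  |= (1u << requester);
    return to_snoop;        /* often zero: no broadcast, no wasted power */
}

/* On a write: invalidate all other sharers, leaving the writer as owner. */
uint32_t directory_write(dir_entry_t *e, int writer)
{
    uint32_t to_invalidate = e->sharer_mask & ~(1u << writer);
    e->sharer_mask = (1u << writer);
    return to_invalidate;
}
```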
Still, cache coherency comes down to ensuring that all processors or bus masters in the system see the same view of memory. “If I have a processor that is creating a data structure, then passing it to a DMA engine to move, both the processor and DMA (direct memory access) must see the same data,” he said. “If that data were cached in the CPU and the DMA reads from external DDR, the DMA will read old, stale data. There are three ways to maintain cache coherency: disable caches, software coherency or hardware coherency.”
Software and hardware coherency are distinctly different. “There is a common misunderstanding that a system without ‘hardware cache coherency’ does not have cache coherency at all,” Parris explained. “In fact, without hardware coherency the software, such as device drivers, must take care of this. Software cache coherency can be complex to implement and debug, and it must carefully time the cleaning and invalidating of caches. Cache cleaning involves writing ‘dirty’ data from a local cache out to system memory, while cache invalidation removes stale data from the cache before new data is read from system memory. Both tasks need the software developer to execute code on the processor to either clean line by line or flush the whole cache, and both are expensive in processor effort and power.”
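In concrete terms, a driver that shares a buffer with a non-coherent DMA engine has to wrap every transfer in exactly this kind of maintenance. The sketch below is a pattern only: cache_clean_range() and cache_invalidate_range() are hypothetical placeholders for whatever primitives the CPU architecture actually provides, and the DMA hooks are likewise invented.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache-maintenance and device hooks; real systems expose
 * equivalents via the CPU architecture and the DMA controller driver.  */
void cache_clean_range(void *addr, size_t len);      /* write dirty lines out */
void cache_invalidate_range(void *addr, size_t len); /* drop stale lines      */
void dma_start_read_from_memory(uintptr_t phys, size_t len);
void dma_start_write_to_memory(uintptr_t phys, size_t len);
void dma_wait_done(void);

/* CPU produced a buffer that the DMA engine will read (e.g., transmit path). */
void send_buffer(void *buf, uintptr_t phys, size_t len)
{
    cache_clean_range(buf, len);           /* push dirty data to DRAM first */
    dma_start_read_from_memory(phys, len); /* now DMA sees the latest data  */
    dma_wait_done();
}

/* DMA writes a buffer that the CPU will then read (e.g., receive path). */
void receive_buffer(void *buf, uintptr_t phys, size_t len)
{
    dma_start_write_to_memory(phys, len);
    dma_wait_done();
    cache_invalidate_range(buf, len);      /* discard stale cached copies   */
    /* CPU reads of buf now miss and fetch the DMA-written data from DRAM. */
}
```

Hardware coherency removes both of the maintenance calls above; the cost moves into the interconnect instead.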
Ittiam Systems, a startup focused on multimedia distribution, has been using software coherency for GPU compute applications. The company reports that 30% of the development effort was spent on the design, implementation and debugging of complex software coherency.
Hardware-based cache coherency removes all these software challenges. The hardware ensures that all the processors see the same view of memory.
The evolution of coherency
As compute jobs evolved into what we now call ‘tasks,’ caches came into use based on the principle of locality as a way of reducing average memory latency. Adding caches made tasks run a lot faster by bringing data closer to the processing units.
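That benefit is usually expressed as average memory access time, which is the hit time plus the miss rate multiplied by the miss penalty. A quick illustrative calculation, with made-up but plausible numbers, shows why even a modest hit rate pays off.

```c
#include <stdio.h>

/* Average memory access time (AMAT) = hit_time + miss_rate * miss_penalty. */
static double amat(double hit_ns, double miss_rate, double penalty_ns)
{
    return hit_ns + miss_rate * penalty_ns;
}

int main(void)
{
    /* Illustrative numbers only: 2ns cache hit, 100ns DRAM penalty. */
    printf("No cache  : %.1f ns\n", amat(0.0, 1.00, 100.0)); /* 100.0 */
    printf("95%% hits  : %.1f ns\n", amat(2.0, 0.05, 100.0)); /*   7.0 */
    printf("99%% hits  : %.1f ns\n", amat(2.0, 0.01, 100.0)); /*   3.0 */
    return 0;
}
```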
Then, as tasks became parallelized, subtasks could run in parallel as well. Each subtask has a compute component, the piece that actually does the computation, and a communication component that defines how the subtasks communicate with one another. That communication happens through some shared level of the memory hierarchy, and in a non-coherent system without hardware cache coherence, that is typically the main system memory, said David Kruckemyer, chief hardware architect at Arteris.
“Now you’re in a situation where you know that you have these caches that help the individual subtasks run faster. You want to communicate among these things, but once you introduce caches, you introduce multiple copies. So this is where the notion of hardware cache coherence comes into play, such that software doesn’t incur the penalty of communicating through memory,” Kruckemyer said.
There are two aspects to this communication. “First, if you are producing data on one functional unit and consuming it on another, there’s the cost of producing the data,” he said. “If you don’t have caches or cache coherence, that means pushing the data out to the point where it can be read by the other functional unit, through some shared level of the memory hierarchy. So on the producer/sender side, there is a cost of pushing that data out to some shared memory location. This ‘cost’ gets into the notion of software cache coherence. Suppose you have a processor with a cache, and some I/O device sitting on the system that is not cache-coherent-aware. Whenever the driver for that I/O device told the device it needed to read data from memory, the driver would have to go and flush the cache and push that data out into memory, at which point the driver would tell the I/O device it’s time to go read the data out of memory. That’s the pushing component: the cost of getting the data to a location where it can be read by another thread of execution or subtask.

“Once it’s actually out to that shared memory location, then there is the cost of reading that data back in by the receiving subtask. What caches do is allow data to be published faster and read faster, and what cache coherence effectively does is allow all of those caches to operate in concert. What it is really doing is addressing the communication cost of parallel processing. It’s about reducing the time you spend communicating.”
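With hardware coherence in place, that producer/consumer handoff collapses into ordinary shared-memory code: the producer writes the data and publishes a flag, and the coherence hardware, rather than explicit flushes, makes both visible to the consumer. A minimal C11 sketch, with the names invented for the example:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Shared between producer and consumer; coherence hardware keeps the
 * cached copies consistent, so no explicit cache maintenance appears. */
static int        payload;
static atomic_int ready = 0;

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                            /* write the data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish it     */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                    /* wait for the flag            */
    printf("got %d\n", payload);             /* guaranteed to observe 42     */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```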
Aligning hardware and software
Building cache coherency also requires an understanding of the relationship between hardware and software.
“With multiple processor cores a common theme in SoCs today, one of the challenges for the SoC designer is that each core brings its own local caches, so we end up with many different caches at different levels,” said Kulbhushan Kalra, R&D manager for ARC processors at Synopsys. “How do we maintain coherency? Dealing with this in software is not easy and often kills performance.”
In order to make the right choice, the SoC architect needs to have a good understanding of the software architecture and how the data will flow across a chip. “If you know how much data movement needs to happen from one processor to another, in addition to the type of data, number of transactions, etc., then the cache can be programmed to accommodate that,” Kalra said. “In an application processor, normally you’d go with the maximum value. But in the embedded space you can make the tradeoffs depending on the knowledge you have of your software. These tradeoffs have significant impact on the size of the coherency units and the cache sizes you use.”
To be sure, one of the biggest challenges is understanding and aligning the complete system, both software and hardware. “Both need to work together to get the best from hardware coherency,” said ARM’s Parris. “For GPU compute to use full coherency we need the hardware support as well as the software written to use the appropriate API, such as OpenCL 2.0. To reduce risk, many hardware designers will look to license IP with cache coherency implemented rather than build their own, as the complexity of cache coherence protocols is much greater than a simple AXI bus. Many system designers will also check that the IP they use has been tested together. That could include taking a processor, GPU and interconnect all from one vendor.”
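As an example of what writing software “to use the appropriate API” can look like, OpenCL 2.0 exposes full coherency through fine-grained shared virtual memory (SVM): the host allocates one buffer that the CPU and GPU access through the same pointer, with no copies and no cache maintenance calls. The sketch below assumes the OpenCL context, queue and kernel have already been created, and that the hardware supports fine-grained SVM.

```c
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stddef.h>

/* Assumes ctx, queue and kernel were created elsewhere (platform/device
 * setup omitted). Fine-grained SVM requires hardware coherency support. */
int run_with_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                 size_t n)
{
    /* One allocation visible to both CPU and GPU at the same address. */
    float *data = clSVMAlloc(ctx,
                             CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                             n * sizeof(float), 0);
    if (!data)
        return -1;

    for (size_t i = 0; i < n; i++)              /* CPU writes directly...     */
        data[i] = (float)i;

    clSetKernelArgSVMPointer(kernel, 0, data);  /* ...GPU works on the same   */
    size_t global = n;                          /* memory, no copy and no     */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);      /* explicit cache cleaning.   */
    clFinish(queue);

    float result = data[0];                     /* CPU sees the GPU's results */
    (void)result;

    clSVMFree(ctx, data);
    return 0;
}
```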
Another challenge is how to choose the right combination, and whether to just go flat out and put in the maximum configuration of cache coherency, Kalra said. “This requires analysis of the SoC architecture on the bandwidth required and the latency requirements to get to the right tradeoff. This is also where fast models come into play, along with instruction set simulators, which have some of this modeling built in. These are tools for the architects to do some of this analysis.”
Kalra cautioned that this is not a straightforward analysis. “It’s not something you can run on a core and simulate in a few minutes, and solving this within the processor only solves half the problem for an SoC.”
So where should the SoC architect begin? First, Cadence’s Rowen said, the scale of the system being built must be understood. “When you have a very large-scale system, if you are talking about high-performance computing, you’re going to have some level of explicit data movement. And then it’s just a question of where you draw the line. Am I going to propagate that explicit data movement all the way down to the individual core level? Am I going to stop at something on the order of 100 cores and then do coherency at some lower level? Or am I going to take it down to a four-core cluster as the point at which I switch over to a more coherent model?”
There is also the question of what legacy software is running on it. “If the programmer wrote it assuming cache coherency, you’re probably going to be under enormous pressure to maintain that programming paradigm. Taking a program that hasn’t been developed for explicit sharing and finding the places where the explicit sharing needs to go in is very difficult. Adopting cache coherency is a little bit of a one-way street from the standpoint of software. If you have been explicit, you can stay explicit, but once you’ve gone to implicit data sharing, i.e., assumption of cache coherency, it’s painful to go the other way,” Rowen said.
Further, ARM’s Parris said this needs to be looked at from a big picture standpoint. “Hardware cache coherency design needs a system view. Implemented correctly, hardware cache coherency can bring big performance and power efficiency gains. Of course, the tradeoff is the complexity of the IP involved increases.”
For the system designer, it’s important to trust that the hardware cache-coherent IP, such as the processors and the cache-coherent interconnect, has been sufficiently validated. In fact, system testing of all the IP components together is critical to getting the best performance.
Part two of this series will discuss how to add cache coherence to an existing design.
Additional Resources
Introduction to cache coherency
Extended System Coherency
Cache coherency and heterogeneous computing
Related Stories
Coherency, Cache And Configurability
The fundamentals of improving performance.
Heterogeneous Multi-Core Headaches
Using different processors in a system makes sense for power and performance, but it’s making cache coherency much more difficult.
Comments

Cache coherency is great if you only have a few dozen cores. Once you go big, cache coherency is impossible due to the network bandwidth required to maintain it.
The following trends make future cache coherency mostly irrelevant:
1. Many of the most interesting application platforms are written so that coherency is not required, because there is no inter-processor communication, e.g., MapReduce (Hadoop).
2. Coherency only matters if there are writes as well as reads. The really big systems are almost totally read-only, apart from infrequent updates.
Cache coherency is not impossible at scale, just hard. SGI UV supports more than 1000 Intel Xeon processor cores in a cache-coherent shared memory machine. A high-end 8-socket Intel E7-8890v4 system has 192 cores using the QPI interconnect.
In-memory database users would disagree with your suggestion that coherency doesn’t matter.