Hardware coherency manages sharing automatically, which can simplify software.
Coherency is about ensuring that all processors or bus masters in the system see the same view of memory. Cache coherency means that all components have the same view of shared data. Just as you need both of your eyes to have the same view in order to see properly, it’s critical for every IP block that has access to a shared data source to view consistent data.
For example, if I have a processor that is creating a data structure then passing it to a DMA engine to move, both the processor and the DMA engine must see the same data. If that data is cached in the CPU and the DMA engine reads from external DDR, the DMA engine will read old, stale data. Coherency traffic grows roughly with the square of the number of processors, since each one may need to check every other one's cache. With a single processor there is no other cache to check, so chip designers did not have to worry about it until dual-core processors were introduced. Today we have quad- and even octa-cores, highlighting the importance of an effective strategy to implement cache coherence.
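The stale-read problem described above can be sketched with a toy model. This is only an illustration: `dram`, `cpu_cache`, `cpu_write`, `cpu_read` and `dma_read` are hypothetical names, and the "cache" is just a dictionary with a write-back policy.

```python
# Toy model: a write-back CPU cache in front of DRAM, plus a DMA engine
# that reads DRAM directly. All names are illustrative, not a real API.

dram = {0x1000: "old"}   # external memory contents
cpu_cache = {}           # address -> value held in the CPU's cache


def cpu_write(addr, value):
    """Write-back policy: update the cache only; DRAM is not touched yet."""
    cpu_cache[addr] = value


def cpu_read(addr):
    """The CPU reads through its cache, so it sees its own latest write."""
    return cpu_cache.get(addr, dram[addr])


def dma_read(addr):
    """A non-coherent DMA engine bypasses the CPU cache and reads DRAM."""
    return dram[addr]


cpu_write(0x1000, "new")
print(cpu_read(0x1000))  # new
print(dma_read(0x1000))  # old (stale: the update is still only in the cache)
```

The two masters disagree about the same address, which is exactly the situation a coherency mechanism has to prevent.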
There are three mechanisms to maintain coherency:
• Disabling caching is the simplest mechanism, but may cost significant CPU performance. To get the highest performance, processors are pipelined to run at high frequency and to run from caches that offer very low latency. Caching data that is accessed multiple times increases performance significantly and reduces DRAM accesses and power. Marking data as “non-cached” could therefore hurt both performance and power.
• Software managed coherency is the traditional solution to the data sharing problem. Here the software, typically device drivers, must clean or flush dirty data from caches, and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power.
• Hardware managed coherency offers an alternative to simplify software. With this solution, any cached data marked ‘shared’ will always be up to date, automatically. All processors and bus masters in that sharing domain see the exact same value.
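To make the cost of the first option more concrete, here is a rough sketch that counts DRAM accesses when the same address is read repeatedly with and without a cache. The `Memory` class and both helper functions are hypothetical, and the model ignores latency, line sizes and evictions.

```python
class Memory:
    """Hypothetical external DRAM that counts how often it is read."""

    def __init__(self):
        self.dram_reads = 0
        self.data = {}

    def read(self, addr):
        self.dram_reads += 1
        return self.data.get(addr, 0)


def read_uncached(mem, addr, times):
    """Non-cached data: every access goes out to DRAM."""
    for _ in range(times):
        mem.read(addr)
    return mem.dram_reads


def read_cached(mem, addr, times):
    """Cached data: only the first access (the miss) reaches DRAM."""
    cache = {}
    for _ in range(times):
        if addr not in cache:
            cache[addr] = mem.read(addr)
        _ = cache[addr]  # subsequent accesses hit in the cache
    return mem.dram_reads


print(read_uncached(Memory(), 0x2000, 1000))  # 1000
print(read_cached(Memory(), 0x2000, 1000))    # 1
```

The counts only show the direction of the effect; real savings depend on the access pattern and the memory system.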
Challenges with software coherency
A cache stores external memory contents close to the processor to reduce the latency and power of accesses. On-chip memory accesses are significantly lower power than external DRAM accesses.
Software managed coherency manages cache contents with two key mechanisms:
• Cache Cleaning (flushing): If any data stored in a cache is modified, it is marked as ‘dirty’ and must be written back to DRAM at some point in the future. The process of cleaning or flushing caches will force dirty data to be written to external memory.
• Cache Invalidation: If a processor has a local copy of data, but an external agent updates main memory, then the cache contents are out of date, or ‘stale’. Before reading this data, the processor must remove the stale data from caches. This is known as ‘invalidation’ (a cache line is marked invalid). An example is a region of memory used as a shared buffer for network traffic, which may be updated by network interface DMA hardware; a processor wishing to access this data must invalidate any old stale copy before reading the new data.
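The two operations can be sketched in a small simulation. This is a minimal model under stated assumptions, not how any real driver or cache controller is written: `WriteBackCache`, `TX_BUF` and `RX_BUF` are hypothetical names, one address stands for one cache line, and the DMA engine is modelled as direct access to the `dram` dictionary.

```python
class WriteBackCache:
    """Hypothetical single-level write-back cache, one value per line."""

    def __init__(self, dram):
        self.dram = dram
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses modified but not yet written back

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from DRAM
            self.lines[addr] = self.dram[addr]
        return self.lines[addr]

    def clean(self, addr):
        """Write a dirty line back to DRAM; the line stays in the cache."""
        if addr in self.dirty:
            self.dram[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        """Drop the line so the next read refetches from DRAM."""
        self.lines.pop(addr, None)
        self.dirty.discard(addr)


TX_BUF, RX_BUF = 0x100, 0x200
dram = {TX_BUF: 0, RX_BUF: 0}
cache = WriteBackCache(dram)

# CPU -> device: clean before the DMA engine reads the buffer from DRAM.
cache.write(TX_BUF, 42)
cache.clean(TX_BUF)
print(dram[TX_BUF])        # 42

# Device -> CPU: invalidate before the CPU reads what the DMA engine wrote.
cache.read(RX_BUF)         # CPU now holds a copy of the old value
dram[RX_BUF] = 7           # DMA engine writes new data straight to DRAM
print(cache.read(RX_BUF))  # 0 (stale cached copy)
cache.invalidate(RX_BUF)
print(cache.read(RX_BUF))  # 7
```

Skipping either call leaves one side looking at old data, which is where the debugging difficulty discussed next comes from.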
Challenge 1: Software Complexity
Software coherency is hard to debug because the cache cleaning and invalidation must be done at exactly the right time. If done too often, it wastes power and CPU effort. If done too rarely, it results in stale data that may cause unpredictable application behaviour, if not a crash. Debugging this is extremely difficult because it shows up as occasional data corruption.
Challenge 2: Performance and power
Where there are high rates of sharing between requesters, the cost of software cache maintenance can be significant, and can limit performance. For example, ARM benchmarking has found that a networking application that processes the header of every data packet might spend more than a third of its CPU cycles on cache maintenance. Part of the challenge is working out which data needs to be maintained. In the worst case, the complete cache contents must be flushed, which may displace valuable data that then needs to be read back from DRAM.
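The difference between maintaining only the shared buffer and flushing the whole cache can be sketched by counting cache lines. Everything here is an assumption for illustration: the 64-byte line size, the 512-line cache, the buffer address and size, and the function names. The counts are line operations, not cycles.

```python
LINE = 64  # assumed cache line size in bytes


def lines_in_range(start, size):
    """Indices of the cache lines that a byte range [start, start+size) touches."""
    first = start // LINE
    last = (start + size - 1) // LINE
    return set(range(first, last + 1))


def maintain_by_range(cached_lines, start, size):
    """Clean and invalidate only the lines that cover the shared buffer."""
    target = lines_in_range(start, size) & cached_lines
    return len(target), cached_lines - target  # (line ops, lines still cached)


def maintain_all(cached_lines):
    """Worst case: clean and invalidate the entire cache."""
    return len(cached_lines), set()


cached = set(range(512))  # hypothetical fully populated 32 KiB cache

ops, kept = maintain_by_range(cached, 0x1000, 1500)
print(ops, len(kept))     # 24 488

ops, kept = maintain_all(cached)
print(ops, len(kept))     # 512 0
```

In the second case, every line that was still useful has to be fetched from DRAM again on its next access.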
The chart below shows a simple example of DMA transfer performance for hardware vs. software coherency. In this example, the performance advantage of hardware coherency grows as the amount of dirty data in processor caches (the hit rate) increases. This is because the software coherency version takes longer to clean and invalidate the cache when it holds more dirty data.
Extending hardware coherency to the system
Hardware coherency is not a new concept. In fact, the first implementation at ARM was within the ARM11 MPCore processor that was released more than 10 years ago. Here, up to four processor cores are integrated in a single cluster and can run as a “Symmetric Multi-Processor” (SMP), with visibility of each other’s L1 caches and shared L2.
Extending hardware coherency to the system requires a coherent bus protocol. The full AMBA 4 ACE (AXI Coherency Extensions) interface enables hardware coherency between processor clusters and allows an SMP operating system to extend to more cores. With the example of two clusters, any shared access to memory can ‘snoop’ into the other cluster’s caches to see if the data is already on chip; if not, it is fetched from external memory (DDR).
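The two-cluster snoop flow described above might be sketched as follows. This is a deliberately simplified model: `Cluster` and `CoherentInterconnect` are hypothetical names, and it ignores the line states (such as dirty or shared ownership) that a real coherency protocol tracks.

```python
class Cluster:
    """Hypothetical processor cluster with a single cache (address -> value)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}


class CoherentInterconnect:
    """Toy interconnect: snoop the other clusters before going to DDR."""

    def __init__(self, clusters, ddr):
        self.clusters = clusters
        self.ddr = ddr

    def read_shared(self, requester, addr):
        if addr in requester.cache:
            return requester.cache[addr], "local cache"
        for other in self.clusters:
            if other is not requester and addr in other.cache:
                # Snoop hit: the data is already on chip.
                requester.cache[addr] = other.cache[addr]
                return requester.cache[addr], f"snoop hit in {other.name}"
        # Snoop miss everywhere: fetch from external memory.
        requester.cache[addr] = self.ddr[addr]
        return requester.cache[addr], "DDR"


cluster0, cluster1 = Cluster("cluster0"), Cluster("cluster1")
ddr = {0x80: 1, 0xC0: 2}
bus = CoherentInterconnect([cluster0, cluster1], ddr)

cluster0.cache[0x80] = 99                # newer value held only in cluster0

print(bus.read_shared(cluster1, 0x80))   # (99, 'snoop hit in cluster0')
print(bus.read_shared(cluster1, 0xC0))   # (2, 'DDR')
print(bus.read_shared(cluster1, 0xC0))   # (2, 'local cache')
```

Note that no software maintenance call appears anywhere: the requester gets the up-to-date value without cluster0 having to clean its cache first.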
The AMBA 4 ACE-Lite interface is designed for I/O (or one-way) coherent system masters like DMA engines, network interfaces and GPUs. These devices may not have any caches of their own, but they can read shared data from the ACE processors. Alternatively, they may have caches but not cache shareable data.
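One-way (I/O) coherency can be sketched in the same toy style. `IoCoherentSystem` is a hypothetical name, and the write path shown here, where a device write invalidates the CPU's cached copy, is an assumption of this sketch rather than a description of the protocol's exact transactions. The device has no cache, so nothing ever snoops it.

```python
class IoCoherentSystem:
    """Toy model: a cacheless device whose accesses snoop the CPU cache."""

    def __init__(self, ddr):
        self.ddr = ddr
        self.cpu_cache = {}  # address -> value (may be newer than DDR)

    def cpu_write(self, addr, value):
        self.cpu_cache[addr] = value  # write-back: DDR is not updated

    def cpu_read(self, addr):
        if addr not in self.cpu_cache:
            self.cpu_cache[addr] = self.ddr[addr]
        return self.cpu_cache[addr]

    def device_read(self, addr):
        # The device's read snoops the CPU cache before falling back to DDR.
        if addr in self.cpu_cache:
            return self.cpu_cache[addr]
        return self.ddr[addr]

    def device_write(self, addr, value):
        # Assumed behaviour: drop any cached copy so the CPU cannot read stale data.
        self.cpu_cache.pop(addr, None)
        self.ddr[addr] = value


system = IoCoherentSystem({0x40: 0})

system.cpu_write(0x40, 5)
print(system.device_read(0x40))  # 5 (served from the CPU cache)
print(system.ddr[0x40])          # 0 (DDR was never updated)

system.device_write(0x40, 9)
print(system.cpu_read(0x40))     # 9
```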
While hardware coherency may add some complexity to the interconnect and processors, it massively simplifies the software and enables applications that would not be possible with software coherency.
Hardware coherency is fundamental to big.LITTLE processing, as it allows the big and LITTLE processor clusters to see the same view of memory and run the same operating system. Global Task Scheduling (GTS) places tasks on the appropriate core at a given time.
Summary
Cache coherency is an important concept to understand when sharing data. Disabling caches can impact performance; software coherency adds overhead and complexity; and hardware coherency manages sharing automatically, which can simplify software. ARM’s AMBA 4 ACE bus interface can extend hardware cache coherency outside of the processor cluster and into the system.