In the intricate world of modern chip architectures, the “memory wall” – the gap created as the ability to compute data grows faster than the performance and power efficiency of external DRAM accesses – has emerged as a pivotal challenge. Architects must strike a delicate balance between leveraging local data reuse and managing external memory accesses. While caches are critical for optimizing performance, the resulting coherency management requirements are potentially expensive. How is coherency used most appropriately? When is it required? Let’s dig in a little deeper.
Modern chip architectures are undoubtedly complex. In the recent article “AI Accelerator Architectures Poised For Big Changes,” I discussed what the industry calls the memory wall: external DRAM accesses limit performance and drive power consumption, so architects must balance local data reuse against external memory accesses. To better visualize the effect of memory accesses, I overlaid the memory access latencies from Joe Chang’s 2018 analysis onto the timeline of a 1 GHz clock.
Source: Arteris, Joe Chang’s Server Analysis at https://bit.ly/3Tu335X
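To make the comparison concrete, here is a minimal sketch that expresses access latencies in 1 GHz clock cycles. The latency figures in the code are rough, assumed orders of magnitude for illustration only; they are not taken from Joe Chang’s analysis.

```cpp
// Illustrative sketch: expressing memory-access latency in 1 GHz clock cycles.
// The latency figures below are assumed orders of magnitude, not measured data.
#include <cstdio>

int main() {
    constexpr double clock_ghz = 1.0;               // 1 GHz -> 1 ns per cycle
    struct Access { const char* level; double latency_ns; };
    const Access accesses[] = {
        {"L1 cache",  1.0},
        {"L2 cache",  4.0},
        {"L3 cache", 15.0},
        {"DRAM",    100.0},
    };
    for (const Access& a : accesses) {
        double cycles = a.latency_ns * clock_ghz;   // cycles = ns * GHz
        std::printf("%-8s ~%6.0f ns -> ~%6.0f cycles spent waiting\n",
                    a.level, a.latency_ns, cycles);
    }
    return 0;
}
```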
The results are pretty striking and explain why caches are crucial for performance. Going out to DRAM is just very expensive. The challenge is keeping the information synchronized when multiple processing units access the same memory. That’s where coherence comes in. Our former chief architect, Michael Frank, described it best in “The High But Often Unnecessary Cost Of Coherence”: coherence is a contract between agents that says, “I promise you that I will always provide the latest data to you.” He pointed out that if users have to feed the same data to multiple engines but cannot feed it multiple times out of a single cache, they end up copying data from main memory into multiple caches. As a result, they have to think about how to guarantee these copies remain consistent. Multiple CPUs that share data, or multiple accelerators that share data, must have a way to maintain coherency between them.
The Intel Xeon architecture, with its L1, L2, and L3 caches, represents what the industry calls “Homogeneous Cache Coherency.” It is an example of a symmetric multiprocessing (SMP) system that uses the MESI (Modified, Exclusive, Shared, Invalid) protocol for cache coherency—this way, multiple cores can efficiently share and manage data coherently. In contrast, the AMD Opteron processor uses a directory-based cache coherency mechanism, where a directory keeps track of the state of each cache line in the processor’s cache. The HyperTransport technology links the processors with other high-bandwidth devices and enables high-speed communication between them, ensuring that all processors in the system have a consistent view of memory and that any changes made by one processor are immediately visible to the others.
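To illustrate the idea behind MESI, here is a deliberately simplified sketch of how a single cache line’s state might transition on local and remote accesses. It abstracts away buses, directories, and data movement, and it is not Intel’s actual implementation.

```cpp
// Minimal, simplified MESI state-transition sketch for one cache line.
// Real protocols involve many more events and transient states.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// State after this core reads the line.
Mesi onLocalRead(Mesi s, bool otherCacheHasCopy) {
    if (s == Mesi::Invalid)                       // miss: fetch the line
        return otherCacheHasCopy ? Mesi::Shared : Mesi::Exclusive;
    return s;                                     // hit: state unchanged
}

// State after this core writes the line.
Mesi onLocalWrite(Mesi s) {
    (void)s;
    // A write always ends in Modified; from Shared or Invalid it first
    // requires invalidating every other copy (omitted here).
    return Mesi::Modified;
}

// State after snooping a read or write from another core.
Mesi onRemoteAccess(Mesi s, bool remoteIsWrite) {
    if (remoteIsWrite) return Mesi::Invalid;      // our copy is now stale
    if (s == Mesi::Modified || s == Mesi::Exclusive)
        return Mesi::Shared;                      // downgrade and share the data
    return s;
}
```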
Some systems require a consistent view of memory across compute engines whose architectures for accessing that memory space, including their cache hierarchies, are not the same. That’s what the industry calls “Heterogeneous Cache Coherency.” The Arm big.LITTLE architecture is a prime example of heterogeneous computing, combining high-performance “big” cores with energy-efficient “LITTLE” cores. While these cores have different cache sizes and performance characteristics, they must share the same memory space. Arm’s Cache Coherent Interconnect (CCI) ensures that data written by the high-performance cores is immediately visible to the energy-efficient cores and vice versa, despite their differences in cache architecture.
Another example is AMD’s Ryzen processors with Infinity Fabric, combining traditional CPU cores with GPU cores, each with its own cache hierarchy. The Infinity Fabric is a high-speed interconnect that maintains cache coherence between these different cores. It ensures that all computing engines have a consistent view of the memory, allowing for efficient data sharing despite their different cache structures.
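To show what hardware coherency means for software, here is a minimal, generic C++ sketch: two threads, which could just as well be scheduled on a “big” and a “LITTLE” core, share data through ordinary atomics with no explicit cache maintenance. It is not tied to any vendor’s API.

```cpp
// Sketch of what hardware cache coherency buys software: a producer and a
// consumer share data with ordinary atomics and no cache clean/invalidate.
#include <atomic>
#include <cstdio>
#include <thread>

int shared_payload = 0;
std::atomic<bool> ready{false};

void producer() {
    shared_payload = 42;                         // lands in the producer's cache
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // Coherency hardware guarantees we observe the latest value of
    // shared_payload, whichever core's cache last held it.
    std::printf("payload = %d\n", shared_payload);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
    return 0;
}
```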
It is easy to imagine that the synchronization mechanisms to ensure full coherency can be complex and come at a cost. As the number of processing engines increases, every computing engine must be able to snoop, or interrogate, the others to ensure their data remains consistent and coherent. That requires communication paths between the blocks, and designers must ensure the protocol is free of glitches such as race conditions.
There are also relevant subsets of full coherency. “I/O coherency,” or “one-way coherency,” strikes a balance by allowing one-way coherent agents such as accelerators or peripherals to access the memory of the central processing unit coherently, while the main processing unit does not need to see the memory space of the accelerator or peripheral. For instance, with a coherent CPU and an I/O coherent GPU, the GPU can read the CPU caches without the need to clean data from the CPU caches. Because the CPU cannot see the GPU caches, maintenance operations must clean the GPU caches after processing is complete. Shared writes will also snoop and invalidate old data from the CPU. As with full coherency, the hardware coherency operations perform all of the above transparently to the software running on the system, ensuring all shared data remains in sync.
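The asymmetry is easier to see in code. The sketch below contrasts the two directions; the helper functions are hypothetical placeholders for whatever primitives a real driver or OS would expose, not an actual API.

```cpp
// Sketch of the two directions of I/O (one-way) coherency. The helper
// functions are hypothetical stand-ins for driver/OS primitives.
#include <cstddef>

struct Buffer { void* data; std::size_t size; };

// Hypothetical stubs standing in for driver/OS primitives.
void accel_read(const Buffer&) {}         // accelerator consumes a buffer
void accel_cache_clean(const Buffer&) {}  // write accelerator caches back to memory
void cpu_read(const Buffer&) {}           // CPU consumes a buffer

// CPU -> accelerator: the accelerator is I/O coherent, so it snoops the CPU
// caches and can read the latest data with no cache maintenance at all.
void acceleratorConsumesCpuData(const Buffer& buf) {
    accel_read(buf);
}

// Accelerator -> CPU: the CPU cannot see the accelerator's caches, so the
// accelerator side must be cleaned back to memory before the CPU reads.
void cpuConsumesAcceleratorData(const Buffer& buf) {
    accel_cache_clean(buf);
    cpu_read(buf);
}

int main() { return 0; }
```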
As Michael Frank pointed out previously, in a dataflow engine, coherence is less critical because the data moving along the graph edges is shipped directly from one accelerator to the next. Coherence gets in the way if architects partition the data set because it costs extra cycles: computing engines must look up and provide updated information. But suppose a design has a general accelerator type, like a sea of identical processors with vector engines. In that case, coherency is a lifesaver because it allows designers to rely on the hardware to keep track of where the data is.
Non-coherency eliminates the necessity for synchronization, which is helpful for components like audio engines or serial ports that don’t heavily rely on shared data structures.
When making system decisions, architects typically think about how to get data on and off the chip, considering the memory wall mentioned above. They must weigh the overall memory bandwidth, the system interfaces, and the footprint of memories when making subsystem decisions. From there, they work closer to the computing engines and decide whether caches are appropriate for large amounts of data reuse.
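As a back-of-the-envelope illustration of that trade-off, the sketch below computes how much on-chip reuse is needed to fit an assumed compute demand within an assumed DRAM bandwidth budget; both figures are made up for illustration.

```cpp
// Back-of-the-envelope sketch of the bandwidth arithmetic behind subsystem
// decisions. All numbers are illustrative assumptions.
#include <cstdio>

int main() {
    const double bytes_needed_per_s = 2.0e12;  // assumed raw compute demand: 2 TB/s
    const double dram_bw_per_s      = 1.0e11;  // assumed DRAM budget: 100 GB/s
    // Required on-chip reuse factor so external traffic fits the budget:
    double reuse = bytes_needed_per_s / dram_bw_per_s;
    std::printf("Each byte fetched from DRAM must be reused ~%.0fx on-chip\n",
                reuse);
    return 0;
}
```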
The Toshiba Visconti 5 design – as previously discussed in “Implementing Low Power AI SoCs Using NoC Interconnect Technology” at the Linley Spring Conference 2020 and by Toshiba at MPSoC 2019 – is an excellent example of such a mixed approach, illustrated below. It contains Arteris’ cache coherent interconnect Ncore, alongside FlexNoC with the resilience package as the non-coherent interconnect addressing ISO 26262 safety aspects. In total, there are eight NoC instances, separated into a safety island and a processing island. This combination of coherent and non-coherent areas of the design is quite typical, and in this case, we find homogeneous coherency for the two quad-core Cortex-A53 processing units.
Source: Toshiba, Arteris, Linley Spring Conference 2020
Navigating the labyrinth of modern chip architectures, particularly around cache coherency, is no trivial feat. The myriad of approaches, from homogeneous to heterogeneous coherency, each serves a unique purpose in optimizing performance and efficiency. The intricate dance between processing engines, memory, and cache coherence becomes evident when looking at architectures like Intel’s Xeon, AMD’s Ryzen, Arm’s big.LITTLE, and Toshiba’s Visconti 5. The synchronization mechanisms, although complex, are integral to maintaining a seamless flow of consistent and coherent data across the system. Understanding and mastering these coherency strategies is crucial for architects, shaping the future of efficient and powerful chip designs. Arteris’ offerings are critical to helping implement the appropriate levels of coherency.
Happy Holidays! Thanks to Andy Nightingale and Michael Frank for their previous writing that influenced this blog.