Balancing Memory And Coherence: Navigating Modern Chip Architectures

In the intricate world of modern chip architectures, the “memory wall” – the growing gap between compute throughput and the latency and power cost of external DRAM accesses – has emerged as a pivotal challenge. Architects must strike a delicate balance between leveraging local data reuse and managing external memory accesses. While caches are critical for optimizing performance, the resulting coherency management requirements are potentially expensive. How is coherency used most appropriately? When is it required? Let’s dig in a little deeper.

Modern chip architectures are undoubtedly complex. In the recent discussion “AI Accelerator Architectures Poised For Big Changes,” I discussed what the industry calls the memory wall: external DRAM accesses limit performance and drive up power consumption, so architects must balance local data reuse against external memory accesses. To better visualize the effect of memory accesses, I overlaid the memory access latencies from Joe Chang’s 2018 analysis onto the timeline of a 1 GHz clock.

Source: Arteris, Joe Chang’s Server Analysis at https://bit.ly/3Tu335X
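To make that overlay concrete, here is a minimal sketch that converts access latencies into clock cycles at 1 GHz. The latency values are illustrative placeholders for the typical cache-versus-DRAM gap, not Joe Chang’s measured numbers:

```cpp
#include <cstdio>

int main() {
    // Illustrative latencies in nanoseconds (placeholders, not measured data).
    struct Access { const char* level; double latency_ns; };
    const Access table[] = {
        {"L1 cache", 1.0},
        {"L2 cache", 4.0},
        {"L3 cache", 15.0},
        {"DRAM",     80.0},
    };

    const double clock_ghz = 1.0;  // at 1 GHz, one cycle lasts exactly 1 ns
    for (const Access& a : table) {
        // cycles = latency [ns] * clock [GHz]
        printf("%-8s ~%5.1f ns = ~%5.1f cycles\n",
               a.level, a.latency_ns, a.latency_ns * clock_ghz);
    }
    return 0;
}
```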

The results are pretty striking and explain why caches are crucial for performance. Going out to DRAM is just very expensive. The challenge is keeping the information synchronized when multiple processing units access the same memory. That’s where coherence comes in. Our former chief architect, Michael Frank, described it best in “The High But Often Unnecessary Cost Of Coherence”: coherence is a contract between agents that says, “I promise you that I will always provide the latest data to you.” He pointed out that if users must feed the same data to multiple engines but cannot serve all of them out of a single cache, they copy data from main memory into multiple caches. As a result, they have to think about how to guarantee these copies remain consistent. Multiple CPUs that share data, or multiple accelerators that share data, must have a way to maintain coherency between them.
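A small software analogy makes the contract tangible. In the sketch below, an atomic variable stands in for coherent shared memory, while a private, unsynchronized copy shows what breaks without the contract; hardware coherence protocols automate exactly this kind of refresh for cached copies:

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

// Shared "main memory" location. std::atomic plays the role of the
// coherence contract here: every reader is guaranteed the latest value.
std::atomic<int> shared_value{0};

int main() {
    // A consumer that keeps a private, unsynchronized copy goes stale.
    int stale_copy = shared_value.load();

    std::thread producer([] {
        shared_value.store(42);  // Producer publishes new data.
    });
    producer.join();

    // The coherent read sees 42; the private copy still holds 0 until it
    // is explicitly refreshed, which is exactly the bookkeeping hardware
    // coherence performs transparently for caches.
    printf("coherent read: %d, stale private copy: %d\n",
           shared_value.load(), stale_copy);
    return 0;
}
```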

The various types of coherency

The Intel Xeon architecture, with its L1, L2, and L3 caches, represents what the industry calls “Homogeneous Cache Coherency.” It is an example of a symmetric multiprocessing (SMP) system that uses the MESI (Modified, Exclusive, Shared, Invalid) protocol for cache coherency—this way, multiple cores can efficiently share and manage data coherently. In contrast, the AMD Opteron processor uses a directory-based cache coherency mechanism, where a directory keeps track of the state of each cache line in the processor’s cache. The HyperTransport technology links the processors with other high-bandwidth devices and enables high-speed communication between them, ensuring that all processors in the system have a consistent view of memory and that any changes made by one processor are immediately visible to the others.
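For intuition, here is a deliberately simplified sketch of MESI state transitions for one cache line, seen from a single cache. It omits data movement, write-backs, and bus arbitration, and it assumes a read miss always finds another sharer (a real protocol would install the line as Exclusive when no other cache holds it):

```cpp
#include <cstdio>

// Simplified MESI states for one cache line in one cache.
enum class State { Modified, Exclusive, Shared, Invalid };

// Events: what this core does locally, and what it snoops from other cores.
enum class Event { LocalRead, LocalWrite, SnoopRead, SnoopWrite };

// One possible transition function; real protocols also move data and
// write back dirty lines, which is omitted here.
State next(State s, Event e) {
    switch (e) {
        case Event::LocalRead:
            // Fill on miss; we assume another cache already holds a copy,
            // so the line arrives Shared.
            return s == State::Invalid ? State::Shared : s;
        case Event::LocalWrite:
            // Any local write ends in Modified (after invalidating others).
            return State::Modified;
        case Event::SnoopRead:
            // Another core reads: Modified/Exclusive lines degrade to Shared.
            return s == State::Invalid ? State::Invalid : State::Shared;
        case Event::SnoopWrite:
            // Another core writes: our copy becomes stale.
            return State::Invalid;
    }
    return s;
}

int main() {
    State s = State::Invalid;
    s = next(s, Event::LocalRead);   // Invalid  -> Shared
    s = next(s, Event::LocalWrite);  // Shared   -> Modified
    s = next(s, Event::SnoopRead);   // Modified -> Shared
    s = next(s, Event::SnoopWrite);  // Shared   -> Invalid
    printf("final state: %d\n", static_cast<int>(s));
    return 0;
}
```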

Some systems require a consistent view of memory across compute engines even though those engines access the memory space through different architectures. That’s what the industry calls “Heterogeneous Cache Coherency.” The Arm big.LITTLE architecture is a prime example of heterogeneous computing, combining high-performance “big” cores with energy-efficient “LITTLE” cores. While these cores have different cache sizes and performance characteristics, they must share the same memory space. Arm’s Cache Coherent Interconnect (CCI) ensures that data written by the high-performance cores is immediately visible to the energy-efficient cores and vice versa, despite their differences in cache architecture.

Another example is AMD’s Ryzen processors with Infinity Fabric, which combine traditional CPU cores with GPU cores, each with its own cache hierarchy. The Infinity Fabric is a high-speed interconnect that maintains cache coherence between these different cores. It ensures that all computing engines have a consistent view of the memory, allowing for efficient data sharing despite their different cache structures.

It is easy to imagine that the synchronization mechanisms to ensure full coherency can be complex and come at a cost. As the number of processing engines increases, every computing engine must be able to snoop, or interrogate, the others to keep shared data consistent and coherent. That requires communication paths between the blocks, and designers must ensure the protocol avoids hazards such as race conditions.
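A back-of-the-envelope calculation shows why this becomes painful. If every caching agent had to snoop every other agent directly, the number of pairwise paths would grow quadratically; real interconnects mitigate this with buses, directories, or snoop filters, but the underlying pressure is the same:

```cpp
#include <cstdio>

int main() {
    // Pairwise snoop paths between n caching agents: n * (n - 1) / 2.
    const int agent_counts[] = {2, 4, 8, 16, 32};
    for (int n : agent_counts) {
        printf("%2d agents -> %3d pairwise snoop paths\n",
               n, n * (n - 1) / 2);
    }
    return 0;
}
```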

There also are relevant subsets of full coherency. “I/O coherency,” or “One-way Coherency,” strikes a balance, allowing one-way coherent agents like accelerators or peripherals to access the memory of the central processing unit coherently, while the main processing unit does not need to see the memory space of the accelerator or peripheral. For instance, in the case of a coherent CPU and an I/O coherent GPU, the GPU can read the CPU caches without the need to clean data from the CPU caches first. Because the CPU cannot see the GPU caches, maintenance operations must clean the GPU caches after processing is complete. Shared writes from the GPU also snoop and invalidate stale copies in the CPU caches. As with full coherency, the hardware performs all of the above transparently to software running on the system, ensuring all shared data remains in sync.
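The sketch below walks through that one-way flow. The helper functions are hypothetical stand-ins for platform-specific driver and cache-maintenance calls, not a real API; the point is where software-managed maintenance is, and is not, required:

```cpp
#include <cstdio>
#include <cstddef>

// Hypothetical helpers standing in for platform-specific driver and
// cache-maintenance operations; the names are illustrative only.
void cpu_prepare_buffer(int* buf, size_t n) {
    for (size_t i = 0; i < n; ++i) buf[i] = static_cast<int>(i);
}

void gpu_run_kernel(const int* in, int* out, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = in[i] * 2;  // stand-in for GPU work
}

void gpu_cache_clean(const int*, size_t) {
    // Stand-in for a real GPU cache clean (flush) operation.
    puts("cleaning GPU caches so the CPU sees the results");
}

int main() {
    constexpr size_t n = 4;
    int in[n], out[n];

    cpu_prepare_buffer(in, n);

    // I/O coherent GPU: its reads snoop the CPU caches in hardware,
    // so no CPU-side clean is needed before launching the work.
    gpu_run_kernel(in, out, n);

    // The CPU cannot see the GPU caches, so software must clean them
    // after processing before the CPU consumes the output.
    gpu_cache_clean(out, n);

    printf("out[0..3] = %d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```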

Do all systems require coherency?

As Michael Frank pointed out previously, in a dataflow engine, coherence is less critical because architects ship the data directly from one accelerator to the next along the edges of the dataflow graph. Coherence gets in the way if architects partition the data set, because it costs extra cycles: computing engines must look up and provide updated information. But suppose a design has a more general accelerator type, like a sea of identical processors with vector engines. In that case, coherency is the lifesaver because it allows designers to rely on where the data is.
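The contrast is easy to see in a toy model. In the dataflow-style sketch below, each datum is shipped point-to-point and ownership is handed off wholesale, so there is no shared cached state left for a coherence protocol to track:

```cpp
#include <cstdio>
#include <queue>

// Minimal sketch of a dataflow-style handoff: the producer ships each
// result directly to the consumer's queue, so no shared, cached state
// ever needs to be kept in sync by a coherence protocol.
int main() {
    std::queue<int> link;  // stands in for a point-to-point hardware link

    // "Accelerator A": produce and ship data downstream immediately.
    for (int i = 0; i < 4; ++i) link.push(i * i);

    // "Accelerator B": consume in arrival order; ownership of each
    // datum is transferred, never shared.
    while (!link.empty()) {
        printf("consumed %d\n", link.front());
        link.pop();
    }
    return 0;
}
```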

Non-coherency eliminates the necessity for synchronization, which is helpful for components like audio engines or serial ports that don’t heavily rely on shared data structures.

When making system decisions, architects typically think about how to get data on and off the chip, considering the memory wall mentioned above. Architects must consider the overall memory bandwidth, system interfaces, and the footprint of memories to define subsystem decisions. From there, they work closer to the computing engines and decide whether caches are appropriate for large amounts of data reuse.

The Toshiba Visconti 5 design – as previously discussed in “Implementing Low Power AI SoCs Using NoC Interconnect Technology” at the Linley Spring Conference 2020 and by Toshiba at MPSoC 2019 – is an excellent example of such a mixed approach, illustrated below. It combines Arteris’ cache coherent interconnect Ncore with FlexNoC as the non-coherent interconnect, using the resilience package to address ISO 26262 safety aspects. In total, there are eight NoC instances, separated into a safety island and a processing island. This combination of coherent and non-coherent areas of the design is quite typical, and in this case, we find homogeneous coherency for the two quad-core Cortex-A53 processing units.

Source: Toshiba, Arteris, Linley Spring Conference 2020

Where do we go from here?

Navigating the labyrinth of modern chip architectures, particularly around cache coherency, is no trivial feat. The myriad of approaches, from homogeneous to heterogeneous coherency, each serves a unique purpose in optimizing performance and efficiency. The intricate dance between processing engines, memory, and cache coherence becomes evident when looking at architectures like Intel’s Xeon, AMD’s Ryzen, Arm’s big.LITTLE, and Toshiba’s Visconti 5. The synchronization mechanisms, although complex, are integral to maintaining a seamless flow of consistent and coherent data across the system. Understanding and mastering these coherency strategies is crucial for architects, shaping the future of efficient and powerful chip designs. Arteris’ offerings are critical to helping implement the appropriate levels of coherency.

Happy Holidays! Thanks to Andy Nightingale and Michael Frank for their previous writing that influenced this blog.


