Heterogeneous Cache Coherence Requires A Common Internal Protocol

How to increase memory bandwidth and reduce latency in next-generation SoCs.


Machine learning and artificial intelligence systems are driving the need for systems-on-chip containing tens or even hundreds of heterogeneous processing cores. As these systems expand in size and complexity, it becomes too difficult to manage data flow solely through software means. An approach that simplifies software while improving performance and power consumption is to implement hardware-based cache coherency, which manages the sharing of data between these processing elements. Hardware cache coherence frees up processing resources for more useful software tasks while providing a common view of the system memory map for all coherent processors.

Figure 1. A shared common internal protocol enables heterogeneous cache-coherent systems, such as this example in which AMBA ACE and CHI clusters share a common view of memory with hardware accelerators.

However, implementing a shared view of memory among all the various hardware accelerators and processing elements of a system-on-chip (SoC) is difficult. Many accelerators have no caching architecture at all, and so do not innately understand coherency. Processors with dissimilar instruction set architectures and memory transaction protocols are not able to communicate directly with one another. These issues have hindered attempts to make coherency possible on a SoC-wide basis. What is needed is a way to share memory that replaces the inefficient software options used previously and brings all of the processing elements onto a common communication platform, one that implements a chip-wide caching protocol that all processor types can use to exchange data.

Shared common protocol
Both hardware accelerators and dissimilar processor clusters can now share memory in a single coherent system on a heterogeneous SoC through a shared common caching protocol. The protocol is designed to interface with all known protocols, including those produced by Arm, IBM/Power Architecture, Intel, and others. Distributing coherency chip-wide through the SoC interconnect enables dramatic gains in performance and efficiency without adversely impacting silicon area. This shared common protocol is key to increasing memory bandwidth and reducing system latency for the new breed of custom processors and accelerators required to optimize performance in machine learning systems. At the system level, the power consumption and performance benefits result from minimizing off-chip DRAM access.

Table 1. A common internal protocol can represent all the possible cache states of existing commercial cache coherence protocols.
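The idea behind such a table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the internal protocol described in this article: the `CommonState` record and the `translate` helper are hypothetical names, and the sketch assumes a line's state can be reduced to three attributes (valid, unique, dirty). The MESI and MOESI letters and the AMBA ACE state names are the standard ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonState:
    """Hypothetical protocol-neutral cache-line state (an assumption for illustration)."""
    valid: bool   # the cache holds a usable copy of the line
    unique: bool  # no other cache holds a copy
    dirty: bool   # this cache is responsible for writing the line back


INVALID = CommonState(valid=False, unique=False, dirty=False)

# MOESI states expressed in the common form.
MOESI = {
    "M": CommonState(True, True, True),
    "O": CommonState(True, False, True),
    "E": CommonState(True, True, False),
    "S": CommonState(True, False, False),
    "I": INVALID,
}

# MESI is MOESI without the Owned state.
MESI = {name: state for name, state in MOESI.items() if name != "O"}

# AMBA ACE uses descriptive names for the same five combinations.
ACE = {
    "UniqueDirty": MOESI["M"],
    "SharedDirty": MOESI["O"],
    "UniqueClean": MOESI["E"],
    "SharedClean": MOESI["S"],
    "Invalid": INVALID,
}


def translate(state_name, source, target):
    """Map a state from one protocol to another by way of the common state.

    Raises ValueError when the target protocol has no equivalent state.
    """
    common = source[state_name]
    for name, state in target.items():
        if state == common:
            return name
    raise ValueError(f"target protocol cannot represent {state_name!r}")
```

In this sketch, `translate("SharedDirty", ACE, MOESI)` returns `"O"`, while `translate("O", MOESI, MESI)` raises `ValueError`, because MESI has no shared-and-dirty state. A real bridge would have to resolve that case with an extra action, such as writing the line back before sharing it, which is why the common protocol needs to be at least as expressive as every protocol it connects.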

Lingua franca: A metaphor from history
To help everyone understand what this caching protocol accomplishes, I often use a metaphor: Lingua franca. As defined in the dictionary, lingua franca is an Italian term for a language that is adopted as a universal dialect between speakers whose native tongues are different. To give historical context, a lingua franca was developed during the early Renaissance era by merchant traders who converged on seaports to conduct business and international relations in the eastern Mediterranean Sea. Wikipedia states lingua francas have developed around the world throughout human history, sometimes for commercial reasons but also for cultural, religious, diplomatic and administrative convenience, and also as means of exchanging information among scientists and other scholars of different nationalities.

In many ways, a shared common caching protocol serves as the lingua franca of SoCs that are composed of multiple heterogeneous processing elements, such as CPUs, GPUs, DSPs, and application-specific hardware accelerators. Different processing clusters could be from Arm, Intel, IBM, Ceva, or Imagination. Hardware accelerators are typically developed in-house to address specific system-level requirements, such as executing specialized tasks for deep learning or visual processing. Regardless of the ISAs or memory transaction protocols, the SoC interconnect can help all of them share a common view of memory through a common protocol that is understood by all.

This lingua franca for SoCs can be implemented with the support of proxy caches, or IO caches, to allow non-caching processing elements to participate as equals with CPU clusters and other coherent processors. This feature can be used by architects to develop a system which simplifies software and hardware scaling when using multiple accelerators in pipelined hardware configurations, such as those often used for neural networks to support machine learning.
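A toy model helps show what a proxy cache does for a pipelined pair of accelerators. The sketch below is a deliberate simplification and not the design of any particular interconnect: `Directory` and `ProxyCache` are hypothetical classes, each address holds a single value rather than a cache line, and a write always overwrites the whole value, so invalidating the previous owner loses no data.

```python
class Directory:
    """Toy coherence directory: tracks which agent holds the newest copy of each address."""

    def __init__(self, memory):
        self.memory = memory  # dict of address -> value, standing in for DRAM
        self.owner = {}       # address -> agent currently holding the newest copy
        self.dram_reads = 0   # counts reads that had to go off-chip

    def read(self, addr):
        owner = self.owner.get(addr)
        if owner is not None:
            return owner.snoop(addr)  # cache-to-cache transfer, no DRAM access
        self.dram_reads += 1
        return self.memory.get(addr, 0)

    def claim(self, addr, agent):
        """Make `agent` the owner of `addr`, invalidating any previous owner."""
        previous = self.owner.get(addr)
        if previous is not None and previous is not agent:
            previous.invalidate(addr)
        self.owner[addr] = agent


class ProxyCache:
    """Holds data on behalf of an accelerator that has no cache of its own."""

    def __init__(self, directory):
        self.directory = directory
        self.lines = {}

    def write(self, addr, value):
        # Called by the accelerator; the proxy takes ownership on its behalf.
        self.directory.claim(addr, self)
        self.lines[addr] = value

    def read(self, addr):
        # Called by the accelerator; a miss is resolved through the directory.
        if addr in self.lines:
            return self.lines[addr]
        return self.directory.read(addr)

    def snoop(self, addr):
        # Called by the directory when another agent needs the newest copy.
        return self.lines[addr]

    def invalidate(self, addr):
        # Called by the directory when another agent takes ownership.
        self.lines.pop(addr, None)


# Two pipeline stages, each a non-caching accelerator behind its own proxy.
directory = Directory(memory={})
stage1 = ProxyCache(directory)
stage2 = ProxyCache(directory)

stage1.write(0x1000, 42)          # stage 1 produces a result
result = stage2.read(0x1000)      # stage 2 consumes it through the directory
```

Here the second stage receives the first stage's output through the directory, and `directory.dram_reads` stays at zero because the data never left the chip. Neither accelerator needed to understand coherence; the proxies answered snoops and invalidations on their behalf.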

Establishing this caching protocol through interconnect technology keeps system costs low by keeping data in motion within and between on-chip system caches, thereby reducing the need for larger, higher-performance off-chip DRAM interfaces. This approach, an architecture of elements connected by interconnect links, is modular and distributed, and it both improves area efficiency and eases physical chip placement. It also saves development cost, because it is easier to achieve final timing closure with a system distributed throughout the SoC than with one centralized IP block, which creates routing issues and incurs timing penalties.

Rising to the challenge
The rise in many-core systems leads to a significant challenge: How can all the processing elements within increasingly heterogeneous SoC designs benefit from a shared view of memory? The solution that offers the most scalable approach begins with a lingua franca for all coherent transactions. Employing this shared common protocol via interconnect technology provides a distributed and scalable way for coherency to be communicated in next-generation SoCs.
