Level the SoC playing field for hardware accelerators with heterogeneous cache coherency.
In the not-too-distant past, the standard mobile application processor architecture was the predominant one used for most System-on-Chip (SoC) designs, but that is rapidly changing as new systems and applications emerge in the post-mobile computing era. New requirements for autonomous driving are motivating technology innovations: visual processing, deep neural networks and machine learning place greater emphasis on custom hardware accelerators, and SoC design teams are clamoring for ways to scale these elements using the design expertise gained from application-processor scaling.
Until now, these hardware accelerators haven’t been able to take advantage of all the resources and SoC infrastructure that application processors have enjoyed. But heterogeneous cache coherency holds the promise of leveling the playing field for all the processing elements on a chip.
Extending heterogeneous cache coherency to non-cached accelerators can allow designers to improve performance, reduce power consumption, ease software development, increase bandwidth and lower latency in a balanced approach that is more suited for these emerging applications. The key is to do this in a manner that is simple to implement from both a hardware and software perspective.
Figure 1. Implementing Proxy Caches (shown on the right) allows the five hardware accelerators to act as cache-coherent peers to the two CPU clusters.
Of course, the benefits of cache coherency are well known for SoCs using CPU IP with identical instruction set architectures (ISAs). Individual CPUs are implemented in a cluster that shares a Level 2 (L2) cache. These caches are small, fast memories tightly coupled to the processing elements. Benefits include:
• Reduced memory latency, which translates to higher performance;
• Higher bandwidth due to high frequency and wide interfaces;
• Lower power consumption due to fewer off-chip DRAM accesses.
ARM-based SoCs, for example, often have cache-coherent processors within homogeneous CPU clusters. Sometimes, multiple clusters are implemented in a “big.LITTLE” fashion, where the “big” CPU cluster executes high-performance tasks and the lower-power “LITTLE” cluster is used when reduced power consumption is desired. Cache coherency between the clusters is what allows tasks to migrate from one to the other. However, this coherency does not extend to the rest of the system.
The issue now is that the use of non-coherent system design elements, i.e. hardware accelerators, continues to grow in importance as customized IP is developed for neural computing, visual processing, networking and other specialized functions that are not optimally processed by CPUs. These non-coherent elements are customized for the application, and they are vital for capturing market share through the introduction of disruptive capabilities in emerging, rapid-growth markets like autonomous driving and machine learning.
Typically, these customized, non-coherent hardware accelerators are used to offload specialized tasks, accelerating portions of the algorithm and freeing the application processors to focus on other work. Because the data flows usually involve heavy communication between the hardware accelerators and the CPU clusters, these accelerators can benefit greatly by becoming coherent peers with the CPU clusters through an on-chip cache-coherent interconnect. Communicating through caches is much more power- and latency-efficient than communicating through DRAM, and the resulting software is much simpler than code that must explicitly manage memory accesses to dedicated internal SRAMs.
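To make that software simplification concrete, the sketch below contrasts the two programming models in C. It is only illustrative: the cache-maintenance and accelerator calls (flush_dcache_range, invalidate_dcache_range, accel_start, accel_wait) are hypothetical placeholders for platform-specific driver routines, not any particular vendor's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform hooks -- stand-ins for driver/HAL routines, not a real API. */
extern void flush_dcache_range(void *addr, size_t len);
extern void invalidate_dcache_range(void *addr, size_t len);
extern void accel_start(void *buf, size_t len);
extern void accel_wait(void);

/* Non-coherent accelerator: software must manage the CPU caches explicitly,
 * and all shared data makes a round trip through DRAM. */
void run_job_noncoherent(uint8_t *buf, size_t len)
{
    flush_dcache_range(buf, len);      /* push the CPU's writes out to DRAM  */
    accel_start(buf, len);             /* accelerator reads/writes DRAM      */
    accel_wait();
    invalidate_dcache_range(buf, len); /* drop stale lines before CPU reads  */
}

/* Coherent (proxy-cached) accelerator: the interconnect keeps the caches
 * consistent, so the handshake collapses to start/wait and shared data can
 * move cache-to-cache instead of through DRAM. */
void run_job_coherent(uint8_t *buf, size_t len)
{
    accel_start(buf, len);
    accel_wait();
}
```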
But how does a design team that decides to extend cache coherency to their custom non-coherent hardware accelerators accomplish this goal?
The first challenge is to provide one or more caches for the non-coherent hardware accelerators through the SoC interconnect. These processing elements must be able to organize as clusters to optimize data sharing between them while also being coherent with the existing CPU clusters. And these caches, which we can call Proxy Caches, must be configurable regarding their cache organization and protocols. Implementing cache coherency throughout the whole system using other technologies is difficult because of the incompatibility of cache organizations between hardware acceleration IP and homogeneous processor IP.
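What “configurable” might mean in practice is sketched below as a C parameter structure for a single proxy cache instance. The field names, enumerations and example values are assumptions made for the sake of illustration, not any vendor's configuration format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative knobs for one proxy cache instance (hypothetical names/values). */
typedef enum { PORT_AXI, PORT_ACE_LITE, PORT_ACE, PORT_CHI } port_protocol_t;
typedef enum { WRITE_BACK, WRITE_THROUGH } write_policy_t;

typedef struct {
    uint32_t        size_kib;        /* total capacity of the proxy cache       */
    uint32_t        ways;            /* set associativity                       */
    uint32_t        line_bytes;      /* cache-line size; must match the system  */
    write_policy_t  write_policy;    /* caching policy presented to the system  */
    port_protocol_t accel_port;      /* protocol spoken by the attached IP      */
    uint32_t        cluster_id;      /* accelerator cluster this cache serves   */
    bool            in_snoop_filter; /* track its lines in a snoop filter       */
} proxy_cache_cfg_t;

/* Example: a 256 KiB, 8-way proxy cache fronting a non-coherent AXI vision
 * accelerator, grouped into accelerator cluster 2 and tracked by the filter. */
static const proxy_cache_cfg_t vision_proxy_cache = {
    .size_kib = 256, .ways = 8, .line_bytes = 64,
    .write_policy = WRITE_BACK, .accel_port = PORT_AXI,
    .cluster_id = 2, .in_snoop_filter = true,
};
```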
One scalable approach to address this issue implements cache coherency across the SoC interconnect in a way that allows execution of all known caching policies. In addition to allowing easier integration of non-coherent accelerators using proxy caches, this approach permits the mixing of IP using different cache protocols within the same coherent system. An example is a coherent system that uses both ACE and CHI clusters, implemented as coherent peers to each other.
The second challenge is to limit snoop bandwidth as the system scales. The key to solving this issue is to use configurable snoop filters that support all of the coherence models. Snoop filter configurability matters not only to accommodate different cache organizations but also to assign processing elements and clusters to particular snoop filters based on system requirements and behavior, thus limiting the required snoop bandwidth.
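The sketch below illustrates, in simplified C, why a directory-style snoop filter limits snoop bandwidth: on a coherent write, only the clusters recorded as possible sharers of a line are snooped, rather than every agent in the system. The data structure and the send_snoop hook are hypothetical simplifications of what a hardware snoop filter actually implements.

```c
#include <stdint.h>

#define NUM_CLUSTERS 7   /* e.g. 2 CPU clusters + 5 accelerator clusters */

/* Toy directory entry: one presence bit per cluster for a cache line.
 * A real snoop filter is a set-associative hardware structure; this sketch
 * only shows why tracking sharers reduces snoop traffic. */
typedef struct {
    uint64_t line_addr;  /* cache-line-aligned physical address          */
    uint8_t  sharers;    /* bit i set => cluster i may hold the line     */
} sf_entry_t;

extern void send_snoop(int cluster, uint64_t line_addr); /* hypothetical hook */

/* On a coherent write, snoop only the clusters recorded as possible sharers
 * instead of broadcasting to all NUM_CLUSTERS agents. */
void snoop_on_write(const sf_entry_t *e, int requester)
{
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if (c != requester && (e->sharers & (1u << c)))
            send_snoop(c, e->line_addr);
    }
}
```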
Implementing the new generation of autonomous driving SoCs in this manner provides additional benefits beyond tighter system integration of previously non-coherent hardware accelerators. Architects can now implement their designs in a modular approach that allows them to scale up or down the number of clusters or processing elements to meet performance goals, without wasting resources. Furthermore, existing hardware accelerators (and their software) do not need to be redesigned to take advantage of this approach, which shortens time to market. In effect, these methods of implementing heterogeneous cache coherency provide a level playing field for all the on-chip processing elements, thereby optimizing performance, power, area and software reuse.