All You Need Is Cache (Coherency) To Scale Next-Gen SoC Performance

As ‘fixing it in software’ adds bloat and transistor scaling slows down, designers should get heterogeneous.


Life on the SoC performance front can be a withering battle, and things sometimes look bleak. As transistor scaling becomes more expensive below 10-nanometer feature sizes, it gets harder every day to double performance every 18 months or so and stay competitive. Nowhere is the pain of this battle more acute than in consumer and automotive systems, where low cost is the key to sustained success.

It used to be that we could fix some of the shortcomings of hardware with a monumental software development effort. That’s no longer the most viable option: code bloat takes its toll on performance, and the cost of developing and maintaining software results in unmanageable headcounts and runaway costs.

Designers need to start looking for new ways to optimize performance, cost, area and power consumption. Because, quite frankly, if we can’t improve delivery on performance expectations, we are not providing value to the supply chain and we become replaceable.

That’s why I say, “All you need is cache.” Coherency, that is.

Cache coherency has been highly effective in homogeneous multi-core processing subsystems, like SMP clusters. However, it has proven much more difficult to bring the highly differentiated portions of systems, like hardware accelerators, into the realm of cache coherency. Cache coherency today is largely the domain of processors within the same family or instruction set architecture (ISA).

[Figure: block diagram from The Linley Group]

Heterogeneous cache coherency to the rescue

The idea behind heterogeneous cache coherency is to provide the whole system with a shared view of memory, allowing processing elements outside of the SMP cluster to become fully coherent peers to the existing cache coherent CPU clusters. This can boost system performance, both in terms of lower latency for critical transactions and higher achievable bandwidth. The trick is to do it in a scalable fashion without breaking the bank on process migration or monumental software development.

In essence, heterogeneous cache coherency democratizes cache coherency, enabling not only CPU clusters, but also accelerators for video imaging, machine learning, graphics, and other functions to share in the benefits of cache coherence. When implemented correctly, heterogeneous cache coherent systems access external DRAM less frequently than their non-coherent counterparts, resulting in better performance and, sometimes just as important, significantly better power consumption.
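To make the DRAM-traffic claim concrete, here is a deliberately tiny model (my own illustrative sketch, not any vendor’s actual protocol or product) of a CPU producing data that a hypothetical accelerator consumes. In the coherent case, a read miss can be satisfied by a cache-to-cache transfer from a peer; in the non-coherent case, every handoff has to round-trip through DRAM. The agent names and the producer/consumer workload are invented for illustration.

```python
# Toy coherence-traffic model (illustrative only): counts DRAM accesses for a
# producer/consumer pattern between a CPU and a hypothetical accelerator.

class Agent:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # addr -> value (a perfect, unbounded private cache)

class System:
    def __init__(self, coherent):
        self.coherent = coherent
        self.dram = {}
        self.dram_accesses = 0
        self.agents = []

    def add_agent(self, name):
        agent = Agent(name)
        self.agents.append(agent)
        return agent

    def write(self, agent, addr, value):
        agent.cache[addr] = value
        if self.coherent:
            # Invalidate peer copies; the dirty line stays on chip.
            for other in self.agents:
                if other is not agent:
                    other.cache.pop(addr, None)
        else:
            # Non-coherent: write through to DRAM so peers can see the data.
            self.dram[addr] = value
            self.dram_accesses += 1

    def read(self, agent, addr):
        if not self.coherent:
            # Non-coherent: must re-fetch from DRAM to get fresh data.
            self.dram_accesses += 1
            return self.dram.get(addr)
        if addr in agent.cache:
            return agent.cache[addr]
        # Coherent miss: try a cache-to-cache transfer before going to DRAM.
        for other in self.agents:
            if addr in other.cache:
                agent.cache[addr] = other.cache[addr]
                return agent.cache[addr]
        self.dram_accesses += 1
        value = self.dram.get(addr)
        agent.cache[addr] = value
        return value

def producer_consumer(coherent, n=100):
    system = System(coherent)
    cpu = system.add_agent("cpu")
    accel = system.add_agent("accel")  # hypothetical ML accelerator
    for i in range(n):
        system.write(cpu, i, i * 2)  # CPU produces a value
        system.read(accel, i)        # accelerator consumes it
    return system.dram_accesses

print(producer_consumer(coherent=False))  # 200: every handoff hits DRAM twice
print(producer_consumer(coherent=True))   # 0: all traffic stays on chip
```

Real systems have finite caches, write-back policies and directory overheads, so the gap is never this stark, but the direction of the effect is exactly what the paragraph above describes: shared on-chip data means fewer external DRAM accesses, which buys both performance and power.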

The death of Moore’s Law for CMOS: no more free lunch

DRAM efficiency is not the only benefit of heterogeneous cache coherency. Extending cache coherence beyond the CPU processing complex allows designers to select the appropriate processing element for each task while simplifying software development. It allows designers to create more efficient systems from both the hardware standpoint (useful work per die area, or per mAh) and software standpoint (useful work per line of code).

Unless they can create a better product on their existing process node, design teams that choose not to migrate to more advanced semiconductor process technologies will be at a competitive disadvantage. After all, consumers always want the next product to be faster and have longer battery life than the previous version. The only way to do this while staying in the same process technology node is to create a more efficient processing system and to use the best processing element for each task.

Flexibility is key

The problem is that until now, the market has left much to be desired in terms of options for extending cache coherency across entire designs. System designers want the appropriate processing elements to execute the tasks for which they are best suited. But they also don’t want to overdesign a system and pay a power consumption or die cost penalty for it. And for performance and power consumption, they can’t afford to have every processing element, no matter how big or how small, constantly going off-chip for memory access.

To meet these needs, cache coherency must be customizable, flexible and configurable. No two chip designs are ever the same. Engineers in each industry need to select the right mix of processing elements that will accomplish their tasks with the lowest latency, the highest performance and the lowest cost. And the chosen cache coherent interconnect IP needs to adapt gracefully to these choices.

Want to learn more?

The Linley Group recently published a paper describing such a configurable heterogeneous cache coherent interconnect: “Easing Heterogeneous Cache Coherent SoC Design using Arteris’ Ncore Interconnect,” written by Senior Analyst Loyd Case.

In addition, two industry organizations advocating a heterogeneous approach are the Heterogeneous System Architecture (HSA) Foundation and the Cache Coherent Interconnect for Accelerators (CCIX) Consortium.

When Moore’s Law was chugging along and software efforts were predictable, a heterogeneous approach may have seemed esoteric or even exotic. But as the arc of CMOS transistor scalability reaches its apex, the most innovative companies are looking more closely at unexplored but promising ways to achieve their system performance goals. After all, system designers will always need higher performance and lower costs to deliver true innovation.