Adapt Or Perish: A Unified Theory Of Coherency

Why HG Wells’ observations apply to dynamic workloads in SoC applications.


Evolution is a natural and, more importantly, relatively slow process that has eventually brought us here, capable of perceiving, analyzing, and handling complex tasks. As our environment, society, and surroundings became more complex, we learned to adapt quickly, even instantaneously, in this melting pot of a heterogeneous world. The evidence can be seen at all ages, from the politician who adapts to the needs of the voters by projecting an image that would help gain popularity, to a 5-year-old who turns on the charm just long enough to entice you into giving her another piece of candy. The point is that adaptability is vital in an ever-changing and dynamic world.


Similarly, coherency architecture in SoCs has evolved from software-based and snoop-based schemes to more advanced approaches such as the latest scalable directory-based architectures. Most of this evolution has been driven by the need for scalability to support the growing number of cores and to improve performance in terms of latency and bandwidth.

But is that enough? How do we deal with the various dynamic workloads in today’s SoC applications? Can these architectures survive without the ability to readjust to these shifting environments?

Today’s applications have complex requirements that can create conflicting demands on system resources depending on the workload. This is even more true in mobile applications, where the need for power efficiency requires adaptability. Power-optimization technology such as big.LITTLE relies on the ability to turn CPUs on and off based on the workload. GPUs support switching between fixed-point and floating-point modes, while some also provide dynamic rendering modes. Modems can be turned off, and dynamic voltage and frequency scaling (DVFS) and software power-saving features are other techniques used in SoCs to deal with dynamic workloads.

What role does the NoC play in this? Since the NoC is an integral part of the SoC that connects all these IPs, there is a burning need for today’s NoCs and coherent architectures to adapt to dynamic requirements.

Let’s discuss a few of these runtime coherency demands and the various features that enable coherency architectures to adapt to them in real time.

Coherency needs to adapt, not just evolve

Core/cluster shut down (inactive): A very common feature in today’s SoCs is the option to shut down processors for specific workloads in order to save power. This is no different from the real world, where we choose to take the high road and shut ourselves off during those awkward conversations with the intent of saving our energy for a potentially more fruitful conversation in the future. Not sure about you, but I use that technique quite often during my conversations with in-laws. Don’t worry, I know for sure they have not subscribed to the SemiEngineering blogs. In big.LITTLE configurations (and other asymmetric configurations), each cluster or core is tuned for specific loads and performance, and depending on the application running, some cores are dynamically turned off or put in an inactive state. This knowledge can be taken advantage of in two main ways:

Reduce directory evicts: Caches that are turned off (inactive) need not be tracked in the coherent directory. The directory is an important resource which can instead be used by the active caches in the system.

Improve latency: Snooping can impact latency. Apart from prefetching, knowing which caches are turned off makes it possible to avoid snooping them, which can dramatically reduce coherent traffic latency for all other IPs that generate coherent traffic.

To handle these situations, the ports within the interconnect need to be programmable at runtime, so that the coherency architecture is aware of these modes and can exploit them to reduce directory evicts and improve latency.
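The two benefits above can be illustrated with a minimal sketch. Everything here is hypothetical (the `ToyDirectory` class, its method names, and the agent names are made up for illustration and do not correspond to any real interconnect API), and it assumes a cache is flushed before it is powered down:

```python
class ToyDirectory:
    """Hypothetical toy model of a directory that knows which caches are active."""

    def __init__(self, agents):
        self.active = set(agents)   # every agent starts powered on
        self.sharers = {}           # line address -> set of agents holding the line

    def record_fill(self, agent, addr):
        # Only active caches consume directory entries.
        if agent in self.active:
            self.sharers.setdefault(addr, set()).add(agent)

    def power_down(self, agent):
        # Assumption: the cache was flushed first, so its tracking state can be dropped.
        self.active.discard(agent)
        for addr in list(self.sharers):
            self.sharers[addr].discard(agent)
            if not self.sharers[addr]:
                del self.sharers[addr]   # entry is freed for the active caches

    def power_up(self, agent):
        self.active.add(agent)

    def snoop_targets(self, requester, addr):
        # Snoop only active sharers other than the requester.
        return (self.sharers.get(addr, set()) & self.active) - {requester}


d = ToyDirectory({"big0", "little0", "gpu"})
d.record_fill("big0", 0x1000)
d.record_fill("little0", 0x1000)
d.record_fill("big0", 0x2000)
d.power_down("big0")                      # software programs the port as inactive
print(d.snoop_targets("gpu", 0x1000))     # only little0 is snooped
print(len(d.sharers))                     # the entry for 0x2000 has been freed
```

In real hardware the "power down" step would be a register write to the interconnect port; the point of the sketch is only that a single piece of runtime state drives both the directory savings and the snoop filtering.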

In some use cases, like a GPU with a large cache or an external coherent system, it is advantageous not to track that cache in the directory and instead mark it as snoop-only. In this case, it would be useful to be able to programmatically specify a mix of coherent agents that do or do not use the directory. This would keep the directory small and efficient, while potentially reducing latency by eliminating snoops to certain address ranges. With runtime programmability, software can later decide to bring such agents back into the directory tracking mechanism.
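As a rough sketch of such a mix, the following hypothetical model gives each agent a tracking mode. The `Tracking` enum, the `snoop_targets` function, and the per-agent address window in `snoop_ranges` are all assumptions made for illustration; the window is one possible way to eliminate snoops to address ranges the snoop-only agent never caches:

```python
from enum import Enum


class Tracking(Enum):
    DIRECTORY = "directory"      # sharers are looked up in the directory
    SNOOP_ONLY = "snoop_only"    # not tracked; snooped whenever the address may be cached


def snoop_targets(requester, addr, modes, dir_sharers, snoop_ranges):
    """Return the agents to snoop for addr (hypothetical helper)."""
    targets = set()
    for agent, mode in modes.items():
        if agent == requester:
            continue
        if mode is Tracking.DIRECTORY:
            if agent in dir_sharers.get(addr, set()):
                targets.add(agent)
        else:
            window = snoop_ranges.get(agent)   # None means "snoop for every address"
            if window is None or window[0] <= addr < window[1]:
                targets.add(agent)
    return targets


modes = {"cpu0": Tracking.DIRECTORY, "cpu1": Tracking.DIRECTORY, "gpu": Tracking.SNOOP_ONLY}
dir_sharers = {0x2000: {"cpu1"}}             # the directory holds CPU state only
snoop_ranges = {"gpu": (0x8000, 0x10000)}    # assumed GPU-visible window

print(snoop_targets("cpu0", 0x2000, modes, dir_sharers, snoop_ranges))  # cpu1 only
print(snoop_targets("cpu0", 0x9000, modes, dir_sharers, snoop_ranges))  # gpu only

# Software later decides to track the GPU in the directory again.
modes["gpu"] = Tracking.DIRECTORY
print(snoop_targets("cpu0", 0x9000, modes, dir_sharers, snoop_ranges))  # nothing to snoop
```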

Similar to non-coherent traffic, coherent traffic comes in various shapes and forms. Some traffic works better with a direct-mapped cache, while other traffic demands a highly associative cache. With today’s multithreaded multicore systems, all of this traffic coexists, demanding an interconnect that can cater to every need and request. Most architectures allow the associativity of the directory to be configured based on the system needs. However, it would also be useful to be able to adapt the associativity dynamically to the traffic. The benefit is that, although the associativity is configured at design time for a typical use case, runtime traffic can take advantage of increased associativity when it needs it. This could be done under the hood to reduce directory evicts and, as a result, reduce latency.


NetSpeed Gemini’s SmartDir technology with Dynamically Adaptive Associative Directory.
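To make the idea of adaptive associativity concrete, here is a toy sketch. It is not a description of how SmartDir works internally, which this article does not detail. It assumes one possible scheme: each set has a fixed number of base ways, plus a small shared pool of spare ways that a hot set can borrow before it has to evict. The class and parameter names are invented for the example:

```python
from collections import OrderedDict


class ToySetAssocDirectory:
    """Hypothetical LRU directory whose sets can borrow ways from a shared pool."""

    def __init__(self, num_sets, base_ways, spare_ways=0):
        self.num_sets = num_sets
        self.base_ways = base_ways
        self.spare_ways = spare_ways            # pool any set may borrow from
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.borrowed = [0] * num_sets
        self.evictions = 0

    def access(self, line_addr):
        idx = line_addr % self.num_sets
        entries = self.sets[idx]
        if line_addr in entries:
            entries.move_to_end(line_addr)      # refresh LRU position
            return
        if len(entries) >= self.base_ways + self.borrowed[idx]:
            if self.spare_ways > 0:
                self.spare_ways -= 1            # grow this set by one way
                self.borrowed[idx] += 1
            else:
                entries.popitem(last=False)     # evict the least recently used entry
                self.evictions += 1
        entries[line_addr] = True


# Six lines that all map to set 0, touched three times in a row.
trace = [0, 4, 8, 12, 16, 20] * 3

fixed = ToySetAssocDirectory(num_sets=4, base_ways=4)
adaptive = ToySetAssocDirectory(num_sets=4, base_ways=4, spare_ways=2)
for addr in trace:
    fixed.access(addr)
    adaptive.access(addr)

print(fixed.evictions)      # 14: the hot set thrashes with 4 ways
print(adaptive.evictions)   # 0: two borrowed ways absorb the working set
```

The trace is deliberately adversarial for a fixed 4-way set; real traffic would show a smaller gap, but the mechanism is the same.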

Participation in IO coherency has also become a dynamic requirement. Which IPs need to participate in IO coherency, and when, is a decision made by software based on the application running. While some IO masters need a coherent view of the memory space for most applications, that need diminishes for others. Ideally, the architecture would offer the ability to fine-tune the performance of IO masters that need to be switched from non-coherent to IO coherent.

Maintaining coherency in hardware comes at a cost in terms of both performance and power. Hardware coherency adds some extra decision points in the interconnect, and care has to be taken to make sure this does not affect the overall quality of service. Furthermore, fully coherent and IO coherent traffic need to be treated fairly based on the application needs. Ideally, QoS capabilities would be runtime programmable, not only to handle application-level requirements but also to suppress the noise and congestion created by rogue IPs. This results in a QoS architecture that is robust but malleable enough to conform to the changing and sometimes contradictory traffic demands on today’s multicore SoCs.
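One way to picture runtime-programmable QoS is a weighted arbiter whose weights software can rewrite while traffic is flowing. The sketch below is only an illustration under that assumption: `ToyArbiter`, `set_weight`, and the master names are hypothetical, and the credit-based weighted round-robin it uses is a generic technique, not a description of any particular NoC:

```python
from collections import Counter


class ToyArbiter:
    """Hypothetical credit-based weighted round-robin arbiter."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.credit = {master: 0 for master in weights}

    def set_weight(self, master, weight):
        # Stands in for a runtime register write by software.
        self.weights[master] = weight

    def grant(self, requesting):
        """Pick one winner among the masters requesting this cycle."""
        if not requesting:
            return None
        for master in requesting:
            self.credit[master] += self.weights[master]
        winner = max(requesting, key=lambda m: self.credit[m])
        self.credit[winner] -= sum(self.weights[m] for m in requesting)
        return winner


arb = ToyArbiter({"cpu": 1, "rogue_dma": 1})
both = ["cpu", "rogue_dma"]               # both masters request every cycle

before = Counter(arb.grant(both) for _ in range(8))
arb.set_weight("cpu", 3)                  # software reacts to the congestion
after = Counter(arb.grant(both) for _ in range(8))

print(before)   # equal weights: 4 grants each
print(after)    # 3:1 weights: cpu gets 6 grants, rogue_dma gets 2
```

Lowering the share of a misbehaving master this way does not stop it from requesting, but it bounds how much bandwidth it can take from everyone else.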

To wrap up, coherency has gone through many dramatic transformations, for the better, in the last decade or so. With the widespread adoption of SoCs and multithreaded cores, multicore systems, asymmetric architectures, multiple OSes, dynamic workloads, and diverse applications, there is a pressing need for coherent architectures that are agile and can readjust to current and future requirements.


