Reducing SoC Power With NoCs And Caches

Minimizing power consumption while maintaining high performance and scalability.

Today’s system-on-chip (SoC) designs face significant challenges with respect to managing and minimizing power consumption while maintaining high performance and scalability. Network-on-chip (NoC) interconnects coupled with innovative cache memories can address these competing requirements.

Traditional NoCs

SoCs consist of intellectual property (IP) blocks that need to be connected. Early SoCs used bus-based architectures, which worked well for a single initiator and multiple targets but couldn’t scale for systems with multiple initiators due to performance issues with bus arbitration. Later, crossbar switches were introduced, enabling multiple initiators to communicate with multiple targets, but this solution increased area, routing congestion and power consumption.

Today’s high-end SoCs employ chip-spanning IP in the form of NoCs, in which data from initiator IPs is serialized, packetized and transmitted through the on-chip network to the designated target IPs. Multiple packets can be “in-flight” throughout the NoC at the same time. A NoC-based architecture results in fewer wires that consume less silicon real estate while dramatically reducing routing congestion and power consumption.
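To make the serialize-and-packetize step more concrete, here is a minimal Python sketch. It is not FlexNoC's actual packet format: the `Flit` type, the two-byte header, the 4-byte link width and the `packetize`/`depacketize` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str        # "head", "body", or "tail"
    payload: bytes   # routing info for the head flit, data otherwise


def packetize(src: int, dst: int, data: bytes, flit_bytes: int = 4) -> List[Flit]:
    """Split one write transaction into a head flit plus fixed-width data flits."""
    # Hypothetical header: one byte each for the source and destination IDs.
    flits = [Flit("head", bytes([src, dst]))]
    chunks = [data[i:i + flit_bytes] for i in range(0, len(data), flit_bytes)]
    for i, chunk in enumerate(chunks):
        kind = "tail" if i == len(chunks) - 1 else "body"
        flits.append(Flit(kind, chunk))
    return flits


def depacketize(flits: List[Flit]) -> bytes:
    """Reassemble the payload at the target's network interface."""
    return b"".join(f.payload for f in flits if f.kind != "head")
```

In this toy model, a 10-byte write travels over a 4-byte-wide link as one head flit and three data flits, which is the sense in which a packetized network can get by with fewer wires than a full-width bus.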

Physically-aware NoCs

Despite all the advantages offered by NoCs, each new generation of SoCs tends to employ an increasing number of IP blocks, with modern SoCs often containing hundreds. Furthermore, the IPs themselves can be significantly more complex than their predecessors. In fact, many IP blocks contain their own IPs connected via an internal NoC.

At the top level of the SoC, any long wires linking widely separated IPs may require the insertion of buffers. The long wires themselves increase delays and power consumption. In addition to consuming power, buffers also increase latency.

Traditional NoCs are routed by hand. The problem is that there are invariably multiple paths by which initiator IPs can be connected to target IPs, and optimal routing solutions are often not obvious. This challenge led to the creation of physically-aware NoC technology, such as FlexNoC from Arteris, that can automatically select the optimum routing solution to minimize the wire lengths and buffer usage, reducing area, power consumption (both static and dynamic) and latency. Real-world customer designs using physically-aware FlexNoC technology have reduced total SoC wire lengths by as much as 50%.
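The idea of choosing among candidate routes by wire length and buffer count can be sketched as follows. This is only an illustration of the trade-off, not how FlexNoC works internally: the coordinates, the one-buffer-per-millimeter rule of thumb and the `route_cost`/`best_route` helpers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) position on the die, in mm


def manhattan(a: Point, b: Point) -> float:
    """On-chip wires run horizontally and vertically, so use Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def route_cost(path: List[Point], max_unbuffered_mm: float = 1.0) -> Tuple[float, int]:
    """Return (total wire length, buffers needed) for a path through switch locations."""
    length = 0.0
    buffers = 0
    for a, b in zip(path, path[1:]):
        seg = manhattan(a, b)
        length += seg
        # Assumed rule of thumb: a segment needs one buffer for every
        # max_unbuffered_mm of wire beyond the first stretch.
        buffers += max(0, math.ceil(seg / max_unbuffered_mm) - 1)
    return length, buffers


def best_route(candidates: Dict[str, List[Point]]) -> str:
    """Pick the candidate with the least wire, breaking ties on buffer count."""
    return min(candidates, key=lambda name: route_cost(candidates[name]))
```

A real physically-aware flow works from the actual floorplan and timing data; the point here is only that wire length and buffer count can be scored and compared automatically rather than judged by eye.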

Domain-aware NoCs

Today’s high-end SoCs typically feature multiple clock and power domains (figure 1). Clock domains allow power consumption to be reduced using clock-gating, which involves suppressing or blocking the clock to portions of the device that are not currently being used. Similarly, power domains allow power consumption to be reduced by shutting off the power to unused portions of the circuit.

Fig. 1: Today’s SoCs typically feature multiple clock and power domains. (Source: Arteris)

Performance and power consumption are both functions of voltage and clock frequency: higher performance burns more power. In addition to clock- and power-gating, many SoCs also support advanced power management features such as Dynamic Voltage and Frequency Scaling (DVFS), in which the processor IP, in conjunction with a Power Management Unit (PMU), can modify the operating voltage and clock frequency. This may take place chip-wide or on a per-domain basis for individual clock and power domains.

During intensive tasks such as video processing or gaming, the system raises both voltage and frequency to maintain high performance. Conversely, during idle periods or less demanding tasks, it lowers both to conserve energy, thereby extending the runtime of battery-powered systems and managing the device’s thermal profile.
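The relationship between voltage, frequency and power can be illustrated with the classic CMOS switching-power estimate, P = a · C · V² · f. The sketch below uses that textbook formula only; the operating points and the capacitance value are invented for the example and do not describe any particular SoC.

```python
def dynamic_power(c_eff_farads: float, voltage: float, freq_hz: float,
                  activity: float = 1.0) -> float:
    """Textbook CMOS switching-power estimate: P = activity * C * V^2 * f (watts)."""
    return activity * c_eff_farads * voltage ** 2 * freq_hz


# Hypothetical operating points as (volts, hertz).
HIGH = (1.0, 2.0e9)
LOW = (0.7, 1.0e9)
C_EFF = 1.0e-9  # assumed effective switched capacitance, in farads


def relative_saving(high, low, c_eff: float = C_EFF) -> float:
    """Fraction of dynamic power saved by moving from the high point to the low point."""
    p_high = dynamic_power(c_eff, *high)
    p_low = dynamic_power(c_eff, *low)
    return 1.0 - p_low / p_high
```

With these made-up numbers, halving the frequency alone cuts dynamic power by about 50%, while also dropping the voltage from 1.0 V to 0.7 V cuts it by roughly 75%. The quadratic voltage term is why scaling voltage together with frequency is attractive whenever the workload allows it.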

In addition to being physically aware, FlexNoC interconnect IP can also be used to generate clock- and power-gating aware NoCs.

Multi-domain AI accelerators

Many SoCs, including those intended for AI applications, employ “xPU” accelerators that comprise arrays of Processing Elements (PEs). For example, a Neural Processing Unit (NPU) may contain arrays of PEs, such as an 8×8 array or larger, each of which may contain multiple IPs and an internal NoC (figure 2).

Fig. 2: High-level block diagram of an SoC containing an NPU. (Source: Arteris)

These PEs, connected by a mesh NoC, may be referred to as soft tiles. FlexNoC includes a NoC tiling capability in which the designer creates a single PE, including one or more Network Interface Units (NIUs), to connect the PE to the mesh NoC. FlexNoC tools can then automatically replicate the PEs, generate the mesh NoC and configure the NIUs in the PEs, all in a matter of seconds.
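The replicate-and-connect step can be pictured with a short Python sketch. It is a conceptual model only and does not reflect the FlexNoC tool flow or its file formats; the dictionary-based PE description, the `niu_id` field and the `build_mesh` helper are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, col) of a tile in the mesh


def build_mesh(rows: int, cols: int, pe_template: Dict[str, object]):
    """Replicate one PE description across a rows x cols mesh and link neighbors."""
    tiles: Dict[Coord, Dict[str, object]] = {}
    links: List[Tuple[Coord, Coord]] = []
    for r in range(rows):
        for c in range(cols):
            tile = dict(pe_template)        # copy of the single hand-authored PE
            tile["niu_id"] = r * cols + c   # hypothetical per-tile NIU configuration
            tiles[(r, c)] = tile
            if c + 1 < cols:
                links.append(((r, c), (r, c + 1)))   # link to the east neighbor
            if r + 1 < rows:
                links.append(((r, c), (r + 1, c)))   # link to the south neighbor
    return tiles, links
```

For an 8×8 array, this produces 64 tiles and 112 neighbor-to-neighbor links from one PE description, which gives a feel for how much manual work automated tiling removes.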

Furthermore, accelerators like the NPU can themselves be configured to have multiple clock domains (figure 3). Dynamic frequency scaling (DFS) is used in the NPU case because the voltage transitions involved in DVFS take time and can add performance overhead. By scaling frequency alone, the NPU can still drop to a lower clock rate while avoiding those voltage-scaling delays, which leads to better performance than DVFS.

Fig. 3: Arrays of PEs can be created with multiple clock domains. (Source: Arteris)

In general, NoC tile boundaries can interface into existing NoC clock and voltage domains. Groups of NoC tiles can be turned off when not required, reducing power consumption. The SoC’s PMU can dynamically adjust the clock frequency of the PEs depending on the current workload.

Broadcast dataflow

Some SoC designs require an initiator IP, such as a processor, to send the same message to multiple target IPs like accelerators. One example is the PE accelerators forming the NPU, as previously mentioned.

As opposed to sending the same message multiple times, FlexNoC supports the concept of write broadcast dataflow for use in tasks like AI training and inference. In this case, the initiator IP sends the original message once, and broadcast stations in the NoC then distribute it to any applicable target IPs, reducing internal bandwidth usage and saving power.
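A rough way to see the bandwidth saving is to count link traversals in a mesh. The sketch below assumes dimension-ordered (X then Y) routing and models a broadcast as each shared link carrying the message once, with duplication at the branch points. Both are simplifying assumptions for illustration and not a description of how FlexNoC's broadcast stations are implemented.

```python
from typing import Iterable, List, Set, Tuple

Coord = Tuple[int, int]      # (x, y) position of a mesh router
Link = Tuple[Coord, Coord]


def xy_path(src: Coord, dst: Coord) -> List[Link]:
    """Links used by dimension-ordered (X then Y) routing from src to dst."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def unicast_traversals(src: Coord, targets: Iterable[Coord]) -> int:
    """Total link traversals when the same message is sent once per target."""
    return sum(len(xy_path(src, t)) for t in targets)


def broadcast_traversals(src: Coord, targets: Iterable[Coord]) -> int:
    """Link traversals when each shared link carries the message only once."""
    used: Set[Link] = set()
    for t in targets:
        used.update(xy_path(src, t))
    return len(used)
```

For a source at (0, 0) and four targets in column 3, this model counts 18 link traversals for separate unicasts against 6 for a single broadcast. Fewer traversals means less switching activity on the interconnect.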

Cache and coherency considerations

In addition to all the power-saving techniques associated with the non-coherent FlexNoC, the coherent Ncore NoC has a few more tricks up its sleeve.

Maintaining cache coherency is similar to the broadcast station concept. When something changes the contents of a cache, other processing elements must be made aware of the change, which involves sending messages around the system.

Unless applicable filtering is performed, the amount of cache coherency traffic can be substantial. Ncore’s coherence protocol is specifically designed to apply filters to reduce unnecessary message transmission. The NoC decides what messages to send and what data needs to be refreshed, ensuring efficient communication without excessive traffic.
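The general idea of filtering coherency traffic can be shown with a toy directory-style snoop filter. This is a generic textbook technique sketched under simplifying assumptions and not Ncore's protocol; the `SnoopFilter` class and its counters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class SnoopFilter:
    """Toy directory that tracks which agents may hold a copy of each cache line."""

    def __init__(self, num_agents: int) -> None:
        self.num_agents = num_agents
        self.sharers: Dict[int, Set[int]] = defaultdict(set)
        self.messages_sent = 0
        self.messages_unfiltered = 0   # what notifying every other agent would cost

    def read(self, agent: int, line: int) -> None:
        """Record that this agent now holds a copy of the line."""
        self.sharers[line].add(agent)

    def write(self, agent: int, line: int) -> Set[int]:
        """Invalidate only the other recorded sharers; return who was notified."""
        notified = self.sharers[line] - {agent}
        self.messages_sent += len(notified)
        self.messages_unfiltered += self.num_agents - 1
        self.sharers[line] = {agent}   # the writer is now the only holder
        return notified
```

In an eight-agent system where only two agents share a line, a write in this model triggers one invalidation message instead of seven. Every message avoided is interconnect activity, and therefore power, saved.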

In addition to the power consumed by the SoC itself, designers are also concerned with overall system power. One of the more power-hungry activities is transferring data to and from an external memory device. CodaCache, a configurable standalone non-coherent cache IP, addresses this challenge. When used as a Last-Level Cache (LLC), CodaCache improves system performance, data locality, scalability and application responsiveness, with knock-on benefits for cost and market competitiveness. Notably, it also enhances the system’s power efficiency, which is a crucial factor in modern SoC design.
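The effect of a last-level cache on external memory traffic can be illustrated with a small simulation. The sketch models a generic fully associative LRU cache for a read-only trace. It says nothing about CodaCache's actual organization or replacement policy, and the `dram_accesses` helper and the 64-byte line size are assumptions.

```python
from collections import OrderedDict
from typing import Iterable


def dram_accesses(addresses: Iterable[int], capacity_lines: int,
                  line_bytes: int = 64) -> int:
    """Count external-memory fetches for a read trace behind an LRU cache."""
    cache = OrderedDict()              # keys are cache-line numbers, oldest first
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)    # hit: mark as most recently used
            continue
        misses += 1                    # miss: data must come from external memory
        if capacity_lines == 0:
            continue                   # no cache: every access goes off-chip
        cache[line] = None
        if len(cache) > capacity_lines:
            cache.popitem(last=False)  # evict the least recently used line
    return misses
```

For a trace that loops over the same four cache lines ten times, this model reports 40 external fetches with no cache and 4 with a four-line cache. Since each off-chip access costs far more energy than an on-chip one, cutting their number is where the power saving comes from.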

Conclusion

The growing complexity of integrating multiple processing elements, memory systems and communication interfaces into a single SoC demands innovative solutions to optimize power efficiency. Arteris offers a comprehensive suite of IP products, including FlexNoC, Ncore and CodaCache, that address these challenges by providing configurable, high-performance interconnect and cache technologies designed to minimize power usage.

FlexNoC provides a physically aware, non-coherent NoC interconnect solution that reduces power consumption through optimized interconnects and advanced power management features. Ncore extends this capability by delivering a scalable, cache-coherent interconnect with heterogeneous coherence support, enabling efficient data sharing and reduced power usage. CodaCache offers a highly configurable last-level cache solution that minimizes main memory accesses, thereby saving power and boosting overall SoC performance.

All these products enable designers to create low-power, high-performance SoCs suitable for a wide variety of markets and applications. Learn more at www.arteris.com.


