Systems & Design
SPONSOR BLOG

Reducing Avoidable Memory Trips In HBM Systems

Last-level cache helps manage data movement and reduces pressure on the external memory subsystem.

Picture a highway during rush hour. When a road has limited capacity, traffic backs up quickly because only so many cars can move through at once. Adding more lanes increases capacity, but it does not always guarantee a smoother commute. If cars keep flooding onto the highway, if exits are poorly placed, or if drivers have to stay on the road for long distances, congestion can still build. More lanes help, but the system still depends on how efficiently traffic moves.

Memory systems face many of the same challenges. High-bandwidth memory (HBM) enables advanced AI accelerators and high-performance systems-on-chip (SoCs) to move large data sets quickly.

When bandwidth is not enough

Increased memory bandwidth, however, does not eliminate delays. Bandwidth determines how much data can move, while latency determines how quickly the system can respond. Even when total throughput is high, each round trip to external memory adds time before the compute engine can continue, creating idle cycles that can become a performance bottleneck. This is where memory hierarchy becomes important. When data is fetched suboptimally, bandwidth headroom in an HBM system can mask the inefficiency while the system still suffers from poor data reuse, unpredictable access patterns, and repeated trips outside the compute die.
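The split between bandwidth and latency can be illustrated with a first-order model in which one request costs a fixed round-trip latency plus its size divided by the bandwidth. This is a minimal sketch, not a model of any specific HBM device: the function name and the example figures (100 ns latency, 500 and 1,000 GB/s) are hypothetical placeholders chosen only to show the trend.

```python
def transfer_time_ns(size_bytes, latency_ns, bandwidth_gb_per_s):
    """First-order time for one request: fixed latency plus serialization time.

    1 GB/s equals 1 byte per nanosecond, so size / bandwidth is already in ns.
    """
    return latency_ns + size_bytes / bandwidth_gb_per_s


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    for size in (64, 1024 * 1024):
        base = transfer_time_ns(size, latency_ns=100, bandwidth_gb_per_s=500)
        wide = transfer_time_ns(size, latency_ns=100, bandwidth_gb_per_s=1000)
        print(f"{size} bytes: {base:.3f} ns -> {wide:.3f} ns with 2x bandwidth")
```

Under these assumed numbers, doubling bandwidth nearly halves the time for a 1 MiB transfer (about 2,197 ns to about 1,149 ns) but leaves a single 64-byte read almost unchanged (100.128 ns to 100.064 ns), because the fixed round trip dominates small, dependent accesses.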

A practical answer is to keep more reusable data on chip. A last-level cache (LLC) provides a solution because it sits between compute engines and external memory, as shown in Figure 1. CPUs, GPUs, NPUs, and other accelerators typically include their own local caches to reduce access latency for frequently used data. However, when data must be shared across engines or exceeds the capacity of the smaller caches, the LLC provides a common cache layer that can satisfy those requests before they reach external memory.

Fig. 1: An LLC keeps reusable data closer to compute. (Source: Arteris)

When the requested data is found in on-chip cache, the compute engine avoids the longer trip to external memory, reducing wait cycles and off-chip traffic. HBM provides the high-throughput movement required by large data sets. In these systems, an LLC improves locality by reducing how often requests have to reach external memory.
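How a shared cache turns reuse into fewer external trips can be sketched with a toy simulation. The example below assumes a fully associative cache with least-recently-used (LRU) replacement, which is a simplification chosen for brevity and not a description of any particular LLC product; the class and function names are hypothetical.

```python
from collections import OrderedDict


class TinyLLC:
    """Toy fully associative cache with LRU replacement (an assumption)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # line address -> placeholder value
        self.hits = 0
        self.misses = 0

    def access(self, line_addr):
        """Return True on a hit; on a miss, fill the line and evict the LRU one."""
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1  # this request has to go to external memory
        self.lines[line_addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # drop the least recently used line
        return False


def external_trips(trace, capacity_lines):
    """Count how many accesses in the trace reach external memory."""
    llc = TinyLLC(capacity_lines)
    for addr in trace:
        llc.access(addr)
    return llc.misses


if __name__ == "__main__":
    trace = [0, 1, 2, 3] * 4  # four cache lines, each reused four times
    print(external_trips(trace, capacity_lines=4))
    print(external_trips(trace, capacity_lines=2))
```

With a capacity of four lines, only the four first-touch accesses in this 16-access trace leave the chip. With a capacity of two lines, the cyclic pattern evicts every line before it is reused, so all 16 accesses miss. Reuse only pays off when the working set fits in the cache.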

Using HBM more efficiently

Table 1 below shows three representative tiers in the memory hierarchy. GDDR5, HBM3E, and on-chip SRAM used as an LLC each play a different role in moving data through a modern SoC. Comparing them side by side helps illustrate the tradeoffs involved.

| Memory Type | Where It Sits in the System | Primary Role | Engineering Takeaway |
| --- | --- | --- | --- |
| GDDR5 | External memory subsystem | Provides external memory bandwidth for graphics processors and accelerators | Delivers significantly more bandwidth than traditional DRAM, but data still travels over relatively long paths to reach the compute engines |
| HBM3E | In the same package as the SoC, connected through a high-bandwidth interface | Provides extremely high-throughput memory access for AI, HPC, and data-intensive workloads | Dramatically increases available bandwidth and reduces latency compared to traditional external memory, but data must still leave the compute die and return |
| On-chip SRAM used as LLC | On the compute die, close to CPUs, GPUs, NPUs, and other accelerators | Stores frequently accessed or latency-sensitive data | Fastest access path in the hierarchy; reduces trips to external memory and helps convert available bandwidth into usable system performance |

Table 1: How memory tiers fit into the SoC data path. (Source: Arteris)

CodaCache by Arteris is a configurable LLC IP solution designed for complex SoCs. It helps designers place a high-performance cache layer between processing elements and external memory resources. CodaCache sits in the LLC path between upstream interconnect traffic and downstream memory access, using on-chip SRAM for cache storage. Its role is to help keep high-value data closer to the initiators that request it.

This approach is useful in complex SoCs where data reuse, irregular access patterns, latency sensitivity, or contention among multiple compute engines can affect performance. In these situations, keeping more accesses local can reduce pressure on the external memory subsystem.

By handling more requests on chip, this cache layer helps manage data movement efficiently and maintain overall performance. Figure 2 illustrates the difference between a cache hit and a cache miss, where a hit follows a shorter on-chip path while a miss must travel to external memory.

The Arteris analysis in the figure below shows that adding an LLC can reduce average memory latency from 83 ns to 67 ns.

Fig. 2: LLC hits avoid longer external memory access. (Source: Arteris)

The diagram above highlights the performance impact of adding an LLC to the memory subsystem. An LLC with a 25% hit rate can reduce average memory latency by nearly 20%. This demonstrates how even a modest cache hit rate can improve system responsiveness and memory access efficiency.
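The quoted figures can be checked against the standard weighted-average latency model, in which average latency is the hit rate times the hit latency plus the miss rate times the miss latency. The sketch below makes two assumptions that the article does not state: that the 83 ns and 67 ns figures both belong to the 25% hit-rate scenario, and that a miss costs the same 83 ns as an access with no LLC (lookup overhead on a miss is ignored). The function names are hypothetical.

```python
def average_latency_ns(hit_rate, hit_ns, miss_ns):
    """Weighted average of hit and miss latency."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


def implied_hit_latency_ns(hit_rate, avg_ns, miss_ns):
    """Solve the weighted-average model for the hit latency."""
    return (avg_ns - (1.0 - hit_rate) * miss_ns) / hit_rate


def relative_reduction(before_ns, after_ns):
    """Fractional latency reduction going from before_ns to after_ns."""
    return (before_ns - after_ns) / before_ns


if __name__ == "__main__":
    hit_ns = implied_hit_latency_ns(hit_rate=0.25, avg_ns=67.0, miss_ns=83.0)
    print(f"implied hit latency: {hit_ns:.1f} ns")
    print(f"average at 25% hits: {average_latency_ns(0.25, hit_ns, 83.0):.1f} ns")
    print(f"reduction: {relative_reduction(83.0, 67.0):.1%}")
```

Under those assumptions, the model implies an on-chip hit latency of about 19 ns, and the step from 83 ns to 67 ns works out to a reduction of about 19.3%.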

Reducing power beyond latency

The shorter hit path also matters for power. The advantage comes from reducing the number of accesses that reach external memory. Every HBM transaction requires activity across the memory subsystem, and sustained HBM use can consume significant power.

  • If the requested data is already present in CodaCache, the access can be satisfied on chip.
  • Cache hits avoid unnecessary HBM accesses.
  • Fewer external memory transactions reduce PHY and memory subsystem activity.
  • Lower demand on the memory path can improve total system efficiency.
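The points above can be turned into a rough energy estimate. The sketch below assumes that every access pays an LLC lookup cost and that only misses go on to pay an HBM access cost. The per-access energy values (10 pJ for the LLC, 100 pJ for HBM) are hypothetical placeholders, not measured or vendor figures, and the function names are illustrative.

```python
def energy_without_llc_pj(accesses, hbm_pj):
    """Every access goes to HBM."""
    return accesses * hbm_pj


def energy_with_llc_pj(accesses, hit_rate, llc_pj, hbm_pj):
    """Every access looks up the LLC; only misses continue to HBM."""
    misses = accesses * (1.0 - hit_rate)
    return accesses * llc_pj + misses * hbm_pj


def break_even_hit_rate(llc_pj, hbm_pj):
    """Hit rate above which the LLC lowers total access energy in this model."""
    return llc_pj / hbm_pj


if __name__ == "__main__":
    # Placeholder per-access energies, for illustration only.
    LLC_PJ, HBM_PJ, ACCESSES = 10.0, 100.0, 1000
    print(energy_without_llc_pj(ACCESSES, HBM_PJ))
    for rate in (0.25, 0.50, 0.75):
        print(rate, energy_with_llc_pj(ACCESSES, rate, LLC_PJ, HBM_PJ))
    print(break_even_hit_rate(LLC_PJ, HBM_PJ))
```

In this simplified model the cache reduces access energy only once the hit rate exceeds the ratio of LLC lookup energy to HBM access energy, and the savings grow linearly with hit rate beyond that point. A real design would also need to account for leakage, PHY idle power, and write traffic, none of which is modeled here.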

In HBM-based AI SoCs, a high CodaCache hit rate can reduce HBM traffic and limit the cycles compute engines spend waiting for data to return from external memory. Even with wider data paths, a read request still has to leave the compute die, pass through the interface, reach the stack, retrieve the data, and send it back.

CodaCache last-level cache IP from Arteris supports SoC performance and power efficiency by reducing effective memory latency, memory bandwidth demand, and memory subsystem activity. The standalone LLC solution is designed to work in conjunction with FlexGen smart NoC IP or Ncore cache-coherent interconnect IP.

The HBM era is not just about building bigger memory paths. It is about making sure compute engines are not left waiting for data. The systems that perform best will not simply be the ones with the most bandwidth. They will use that bandwidth wisely.


