Systems & Design
SPONSOR BLOG

Turbocharging Cost-Conscious SoCs With Cache

Squeeze the most performance out of lower-cost, earlier generation, mid-range processor and accelerator cores.

Some design teams creating system-on-chip (SoC) devices are fortunate to work with the latest and greatest technology nodes coupled with a largely unconstrained budget for acquiring intellectual property (IP) blocks from trusted third-party vendors. However, many engineers are not so privileged. For every “spare no expense” project, there are a thousand “do the best you can with a limited budget” counterparts.

One way to squeeze the most performance out of lower-cost, earlier-generation, mid-range processor and accelerator cores is to apply caches judiciously.

Cutting costs

A simplified example of a typical cost-conscious SoC scenario is illustrated in figure 1. Although the SoC may be composed of many IPs, only three are shown here for clarity.

Fig. 1: Portion of a cost-conscious, non-cache-coherent SoC. (Source: Arteris)

The predominant technology for connecting the IPs inside an SoC is network-on-chip (NoC) interconnect IP. This may be thought of as an IP that spans the entire device. The example shown in figure 1 may be assumed to reflect a non-cache-coherent scenario. In this case, any coherency requirements will be handled in software.

Let’s assume the SoC’s clock is running at 1GHz. Suppose a central processing unit (CPU) based on a reduced instruction set computer (RISC) architecture executes a typical instruction in a single clock cycle. However, an access to external DRAM can take anywhere between 100 and 200 processor clock cycles (we’ll use an average of 150 cycles for the purposes of this article). This means that if the CPU lacked a Level 1 (L1) cache and was connected directly to the DRAM via the NoC and DDR memory controller, each instruction would consume 150 processor clock cycles, resulting in a CPU utilization of only 1/150 = ~0.67%.

This is why CPUs, along with some accelerators and other IPs, employ cache memories to increase processor utilization and application performance. The cache concept rests on the principle of locality: only a small portion of the main memory is in use at any given time, and locations in that portion are accessed multiple times. Mainly due to loops, nested loops and subroutines, instructions and their associated data exhibit temporal, spatial and sequential locality. This means that once a block of instructions and data has been copied from the main memory into an IP’s cache, the IP will typically access it repeatedly.
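To make the principle of locality concrete, here is a minimal Python sketch of a direct-mapped cache model. The cache geometry (64 lines of 64 bytes), the 2KB loop and the 16MB random address range are illustrative assumptions, not figures from this article or from any particular IP.

```python
import random


def hit_rate(addresses, num_lines=64, line_bytes=64):
    """Fraction of accesses that hit in a simple direct-mapped cache.

    The geometry (num_lines x line_bytes) is an assumed, illustrative value.
    """
    tags = [None] * num_lines          # one stored block number per cache line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes     # which memory block holds this byte
        index = block % num_lines      # which cache line that block maps to
        if tags[index] == block:
            hits += 1                  # block already cached: a hit
        else:
            tags[index] = block        # miss: fetch the block, evict the old one
    return hits / len(addresses)


# A loop that sweeps the same 2KB array 100 times, one 4-byte word at a time.
loop_trace = [addr for _ in range(100) for addr in range(0, 2048, 4)]

# The same number of accesses scattered uniformly over a 16MB address space.
rng = random.Random(0)
random_trace = [rng.randrange(16 * 1024 * 1024) for _ in range(len(loop_trace))]

print(f"loop hit rate:   {hit_rate(loop_trace):.4f}")
print(f"random hit rate: {hit_rate(random_trace):.4f}")
```

In this model the looping trace misses only on its first pass through the array, while the scattered trace almost never finds its data in the cache. That difference is why a small cache can serve a much larger main memory.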

Today’s high-end CPU IPs usually have at least a Level 1 (L1) and a Level 2 (L2) cache, and they often have a Level 3 (L3) cache as well. Also, some accelerator IPs, like graphics processing units (GPUs), often have their own internal caches. However, these latest-generation high-end IPs often cost 5X to 10X as much as their previous-generation mid-range counterparts. As a result, as illustrated in figure 1, the CPU in a cost-conscious SoC may come equipped with only an L1 cache.

Let’s consider the CPU and its L1 cache in a little more depth. When the CPU requests something that is already in its cache, the result is called a cache hit. Since the L1 cache typically runs at the same speed as the processor core, a cache hit will be processed in a single processor clock cycle. By comparison, if the requested item is not in the cache, the result is called a cache miss. A miss requires access to the main memory, which will consume 150 processor clock cycles.

Now consider running 1,000,000 instructions. If the cache were large enough to contain the whole program, then this would consume only 1,000,000 clock cycles, resulting in a CPU efficiency of 1,000,000 instructions/1,000,000 clock cycles = 100%.

Unfortunately, the L1 cache in a mid-range CPU will typically be only 16KB to 64KB in size. If we assume a 95% cache hit rate, then 950,000 of our 1,000,000 instructions will take one processor clock cycle. The remaining 50,000 instructions will each consume 150 clock cycles. Thus, the CPU efficiency in this case can be calculated as 1,000,000/((950,000 * 1) + (50,000 * 150)) = ~12%.
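The arithmetic above can be reproduced with a short Python sketch. The function name `cpu_efficiency` and its parameters are hypothetical, introduced only for illustration. The model follows the article’s simplification that every instruction costs either one hit or one miss.

```python
def cpu_efficiency(instructions, hit_rate, hit_cycles=1, miss_cycles=150):
    """Instructions per clock cycle under the article's simplified model.

    Each instruction costs hit_cycles on a cache hit, or miss_cycles
    (a DRAM access) on a cache miss.
    """
    hits = round(instructions * hit_rate)
    misses = instructions - hits
    total_cycles = hits * hit_cycles + misses * miss_cycles
    return instructions / total_cycles


n = 1_000_000
print(f"no cache (0% hits):    {cpu_efficiency(n, 0.0):.2%}")   # 1/150
print(f"whole program cached:  {cpu_efficiency(n, 1.0):.2%}")   # 100%
print(f"95% L1 hit rate:       {cpu_efficiency(n, 0.95):.2%}")  # about 12%
```

Even a 5% miss rate dominates the total, because each miss costs 150 times as much as a hit.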

Turbocharging performance

A cost-effective way of turbocharging the performance of a cost-conscious SoC is to add cache IPs. For example, CodaCache from Arteris is a configurable, standalone non-coherent cache IP. Each CodaCache instance can be up to 8MB in size, and multiple copies can be instantiated in the same SoC, as demonstrated in figure 2.

Fig. 2: Portion of a turbocharged, non-cache-coherent SoC. (Source: Arteris)

It is not the intention of this article to suggest that every IP should be equipped with a CodaCache. Figure 2 is intended only to provide examples of potential CodaCache deployments.

If a CodaCache instance is associated with an IP, it’s known as a dedicated cache (DC). Alternatively, if a CodaCache instance is associated with a DDR memory controller, it’s referred to as a last-level cache (LLC). A DC will accelerate the performance of the IP with which it is associated, while an LLC will enhance the performance of the entire SoC.

As an example of the type of performance boost we might expect, consider the CPU shown in figure 2. Let’s assume the CodaCache DC instance associated with this IP is running at half the processor speed and that any access to this cache consumes 20 processor clock cycles. Let’s also assume a 95% cache hit rate for this DC. Of the 50,000 instructions that miss in the L1 cache, 47,500 will then hit in the DC and only 2,500 will go all the way to DRAM. For 1,000,000 instructions, our overall CPU+L1+DC efficiency can be calculated as 1,000,000/((950,000 * 1) + (47,500 * 20) + (2,500 * 150)) = ~44%. Compared with the ~12% achieved with the L1 cache alone, that’s roughly 3.7X the throughput, or a performance boost of ~271%!
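The same calculation can be sketched for any number of cache levels in Python. The function `hierarchy_efficiency` and its argument layout are hypothetical, chosen only for illustration. The cycle counts and hit rates are the ones assumed in this article, and each level’s hit rate applies only to the accesses that missed in the level before it.

```python
def hierarchy_efficiency(instructions, levels, dram_cycles=150):
    """Instructions per clock cycle for a chain of caches backed by DRAM.

    levels: list of (hit_rate, access_cycles) tuples, ordered from the
    cache closest to the CPU outward.
    """
    remaining = float(instructions)   # accesses not yet satisfied
    total_cycles = 0.0
    for hit_rate, access_cycles in levels:
        hits = remaining * hit_rate
        total_cycles += hits * access_cycles
        remaining -= hits
    total_cycles += remaining * dram_cycles   # whatever is left goes to DRAM
    return instructions / total_cycles


n = 1_000_000
l1_only = hierarchy_efficiency(n, [(0.95, 1)])
l1_plus_dc = hierarchy_efficiency(n, [(0.95, 1), (0.95, 20)])

print(f"L1 only:    {l1_only:.1%}")
print(f"L1 + DC:    {l1_plus_dc:.1%}")
print(f"speedup:    {l1_plus_dc / l1_only:.2f}x")
```

Changing the tuples in `levels` is a quick way to see how sensitive the result is to the assumed DC hit rate and latency.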

Conclusion

In the past, embedded programmers relished the challenge of squeezing the highest performance possible out of small processors with low clock speeds and limited memory resources. In fact, it was common for computer magazines to issue challenges to their readers along the lines of, “Who can perform task X on processor Y in the minimum number of clock cycles using the smallest amount of memory?”

Today, many SoC developers enjoy the challenge of squeezing the highest performance possible out of their designs, especially if they are constrained to use lower-performing mid-range IPs. Deploying CodaCache IPs as dedicated and last-level caches provides an affordable way for engineers to turbocharge their cost-conscious SoCs. To learn more about CodaCache from Arteris, visit arteris.com.


