Making high-capacity data caches more efficient

A cache-management scheme that researchers say improves the data rate of in-package DRAM caches


Source: Researchers from MIT, Intel, and ETH Zurich
Xiangyao Yu (MIT), Christopher J. Hughes (Intel), Nadathur Satish (Intel), Onur Mutlu (ETH Zurich), Srinivas Devadas (MIT)

Technical Paper link

MIT News article

As transistor counts in processors have gone up, the relatively slow connection between the processor and main memory has become the chief impediment to improving computing performance. Now, researchers from MIT, Intel, and ETH Zurich have created a cache-management scheme that they say improves the data rate of in-package DRAM caches by 33 to 50 percent.


Kev says:

Sticking a bigger or better cache on a CPU doesn’t fix the fundamental problems with the architecture. What you really want is processor-in-memory (PiM). Most NVMe storage cards have ARM (or similar) CPUs managing the storage, and it’s a lot quicker to ask them to do the work than to shovel the data back and forth to a central x86 CPU.

This gets worse as you start die-stacking memory for density, but don’t get any more I/O pins (as with pins on a PCIe card), so you need to move the memory controller and compute into the die stacks.

Surfaces and edges are a dimension down from volumes and areas.

tom andreson says:

I checked Apple iPhone Support and read something about how high-capacity data caches are made. This is very important, and efficient for the user as well. We should know about it in detail.

Jacob Monberg says:

Cache management schemes become more important as core counts go up, and when chiplet CPUs create more complex overall caches.

Chiplet CPUs could really benefit from an L4 cache. For example, if we have a 64-core (128-thread) CPU that has 4 chiplets, each having two 4-core clusters that both have their own L3 eviction caches, things become complex.

It would mean 64 x L1 data, 64 x L1 instruction, 64 x L2, and 8 or 16 L3 eviction caches. We could add a 512 MB L4 cache and maybe create an L3 that is neither a normal cache nor an eviction cache, but a hybrid of the two. The L4 could act as a “separate entity” with some logic to predict which data to preload.
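As a rough check of those counts, here is a minimal Python sketch. It assumes one private L1 data, L1 instruction, and L2 cache per core and one L3 per cluster; the function name `cache_counts` and its parameters are made up for illustration and do not describe any particular CPU.

```python
def cache_counts(cores: int, cores_per_cluster: int) -> dict[str, int]:
    """Count caches in a hypothetical chiplet CPU.

    Assumption: every core has private L1 data, L1 instruction, and L2
    caches, and every cluster of cores shares exactly one L3 cache.
    """
    if cores <= 0 or cores_per_cluster <= 0:
        raise ValueError("core counts must be positive")
    if cores % cores_per_cluster != 0:
        raise ValueError("cores must divide evenly into clusters")
    return {
        "l1_data": cores,
        "l1_instruction": cores,
        "l2": cores,
        "l3": cores // cores_per_cluster,
    }


if __name__ == "__main__":
    # Compare 8-core and 4-core clusters on a 64-core part.
    for cluster_size in (8, 4):
        print(cluster_size, cache_counts(64, cluster_size))
```

Under these assumptions, a 64-core part has 8 L3 caches with 8-core clusters and 16 with 4-core clusters, which is where the "8 or 16" range comes from.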

I would like to see more research on caching. In the next decade things are going to change pretty drastically. APUs are becoming more popular and powerful, which complicates L4 caches and/or opens up new possibilities for them.

Stacked chips could change things and allow much bigger L3 caches, which could benefit from very different schemes. On the GPU front, machine learning/AI and GPGPU workloads might also benefit from changes to caches.

Caches have been my hobby for a long time. I have spent countless hours wondering about possible improvements. Larger shared caches with higher latency almost certainly have optimizations available. I just don’t have quite enough “transistor level” knowledge to be sure whether those are feasible in real life.

Anyway, this article made me happy.
