Cache-management scheme that they say improves the data rate of in-package DRAM caches
Source: Researchers from MIT, Intel, and ETH Zurich
Xiangyao Yu (MIT), Christopher J. Hughes (Intel), Nadathur Satish (Intel), Onur Mutlu (ETH Zurich), Srinivas Devadas (MIT)
As transistor counts in processors have gone up, the relatively slow connection between the processor and main memory has become the chief impediment to improving computing performance. Now, researchers from MIT, Intel, and ETH Zurich have created a cache-management scheme that they say improves the data rate of in-package DRAM caches by 33 to 50 percent.
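As a back-of-the-envelope sketch only (this is not the researchers' actual scheme, and every number below is a made-up assumption), here is why better management of an in-package DRAM cache translates into a higher effective data rate:

    # Toy model of effective bandwidth for an in-package DRAM cache.
    # The model and all numbers are illustrative assumptions, not figures
    # from the MIT/Intel/ETH Zurich work.

    def effective_bandwidth(hit_rate, cache_bw_gbs, dram_bw_gbs, tag_overhead):
        """Rough effective bandwidth seen by the cores.

        hit_rate     -- fraction of accesses served by the in-package DRAM cache
        cache_bw_gbs -- raw bandwidth of the in-package DRAM (GB/s)
        dram_bw_gbs  -- bandwidth of off-package DRAM (GB/s)
        tag_overhead -- fraction of cache bandwidth spent on tag/metadata traffic
        """
        useful_cache_bw = cache_bw_gbs * (1.0 - tag_overhead)
        return hit_rate * useful_cache_bw + (1.0 - hit_rate) * dram_bw_gbs

    # A scheme that trims tag traffic and raises the hit rate lifts the data
    # rate without adding a single pin.
    baseline = effective_bandwidth(0.70, 400, 80, 0.30)
    improved = effective_bandwidth(0.80, 400, 80, 0.10)
    print(f"baseline {baseline:.0f} GB/s, improved {improved:.0f} GB/s "
          f"({improved / baseline - 1:.0%} gain)")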
Sticking a bigger/better cache on a CPU doesn’t fix the fundamental problems with the architecture. Really you want to do processing-in-memory (PIM). Most NVMe storage cards already have ARM (or similar) CPUs managing the storage, and it’s a lot quicker to ask them to do the work than to shovel the data back and forth to a central x86 CPU.
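A toy illustration of that point, with made-up dataset size and filter selectivity, counting the bytes that have to cross the host link in each case:

    # Bytes moved over the host link for a simple filter query:
    # ship-everything-to-the-x86-CPU vs. filtering on the drive's own controller.
    # Sizes and selectivity are assumptions for illustration only.

    DATASET_BYTES = 100 * 2**30      # 100 GiB stored on the drive
    SELECTIVITY = 0.01               # 1% of records match the filter

    # Host-side: ship everything across PCIe, filter on the central CPU.
    host_side_traffic = DATASET_BYTES

    # Near-data: the drive's embedded cores run the filter, only matches move.
    near_data_traffic = int(DATASET_BYTES * SELECTIVITY)

    print(f"host-side filter moves {host_side_traffic / 2**30:8.1f} GiB")
    print(f"near-data filter moves {near_data_traffic / 2**30:8.1f} GiB")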
This gets worse as you start die-stacking memory for density but don’t get any more I/O pins (just as with the pins on a PCIe card), so you need to move the memory controller and the compute into the die stacks.
Surfaces and edges are a dimension down from volumes and areas.
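To make that "dimension down" point concrete, here is a toy calculation (every number is an assumption picked only for illustration): capacity grows with the number of stacked dies, while the pin budget, and hence external bandwidth per byte, stays flat.

    # Capacity scales with stacked dies (volume); I/O pins sit on the package
    # edge/area and stay fixed. All numbers are illustrative assumptions.

    PINS = 1000                      # fixed package I/O budget
    BW_PER_PIN_GBS = 0.05            # GB/s per pin (assumed)
    DIE_CAPACITY_GB = 8              # capacity of one DRAM die (assumed)

    for dies in (1, 2, 4, 8):
        capacity = dies * DIE_CAPACITY_GB
        ext_bw = PINS * BW_PER_PIN_GBS
        print(f"{dies} dies: {capacity:3d} GB, external BW {ext_bw:.0f} GB/s, "
              f"{ext_bw / capacity:.2f} GB/s per GB")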
Cache-management schemes become more important as core counts go up and as chiplet CPUs create more complex overall cache hierarchies.
Chiplet CPUs could really benefit from an L4 cache. For example, if we have a 64-core (128-thread) CPU built from 8 chiplets, each holding two 4-core clusters that each have their own L3 eviction cache, things become complex.
It would mean 64 x L1 data, 64 x L1 instruction, 64 x L2, and 8 or 16 L3 eviction caches. We could add a 512 MB L4 cache and maybe create a hybrid L3 that is neither a normal cache nor a pure eviction (victim) cache, but something in between. The L4 could act as a “separate entity” with some logic to predict which data to preload.
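Tallying the structures in that hypothetical layout (the per-cache sizes are my own assumptions, chosen only to show the bookkeeping):

    # Cache inventory for the hypothetical 64-core chiplet CPU described above.
    # Per-cache sizes are assumptions picked only to illustrate the totals.

    CORES = 64
    CHIPLETS = 8
    CLUSTERS_PER_CHIPLET = 2

    caches = {
        "L1 data":                          (CORES, 32),         # (count, KiB each)
        "L1 instruction":                   (CORES, 32),
        "L2":                               (CORES, 512),
        "L3 (per-cluster eviction/victim)": (CHIPLETS * CLUSTERS_PER_CHIPLET, 16 * 1024),
        "L4 (shared, hypothetical)":        (1, 512 * 1024),
    }

    for name, (count, kib_each) in caches.items():
        total_mib = count * kib_each / 1024
        print(f"{count:3d} x {name:<34s} = {total_mib:8.1f} MiB")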
I would like to see more research on caching. In the next decade things are going to change pretty drastically. APUs are becoming more popular and powerful, which both complicates L4 caches and opens up new possibilities for them.
Stacked chips could change things and allow much bigger L3 caches, which could benefit from very different schemes. On the GPU front, machine learning/AI and GPGPU workloads might also benefit from rethinking caches.
Caches have been a hobby of mine for a long time, and I have spent countless hours wondering about possible improvements. Larger shared caches with higher latency almost certainly have optimizations available. I just don’t have quite enough “transistor-level” knowledge to be sure whether those are feasible in real life.
Anyway, this article made me happy.
I know AMD is changing the cache structure for Epyc Milan, but they are also working with Micron and Samsung to put HBM2e memory stacks on NVMe-style PCIe drives that connect directly to 8 or 16 of the CPU's PCIe lanes, bypassing the shared PCIe bus. On a server rack these memory drives would sit between the NVMe M.2 storage ports, so each NVMe PCIe M.2 memory-drive port could take 3 or 4 stacked drive connectors with standoffs between the modules to allow for heat dissipation. This would do away with the much slower DRAM slots, and the new approach is expected to give roughly a 10x speed improvement, pushing massive bandwidth at blazing speeds. As Forrest Norrod recently stated, NVMe is the future for new hardware applications, so I see no reason this is not absolutely doable.