Cache-management scheme that researchers say improves the data rate of in-package DRAM caches
Source: Researchers from MIT, Intel, and ETH Zurich
Xiangyao Yu (MIT), Christopher J. Hughes (Intel), Nadathur Satish (Intel), Onur Mutlu (ETH Zurich), Srinivas Devadas (MIT)
As transistor counts in processors have gone up, the relatively slow connection between the processor and main memory has become the chief impediment to improving computing performance. Now, researchers from MIT, Intel, and ETH Zurich have created a cache-management scheme that they say improves the data rate of in-package DRAM caches by 33 to 50 percent.
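One way a DRAM-cache controller can save bandwidth is by filling the cache only with pages that have proven hot, so that cache fills stop competing with demand traffic for the in-package links. Below is a minimal sketch of such a frequency-based admission policy; the sizes, thresholds, and the policy itself are illustrative assumptions, not the researchers' published mechanism.

```python
# Minimal sketch of a frequency-based, bandwidth-aware admission policy for an
# in-package DRAM cache. Sizes, thresholds, and the policy itself are
# illustrative assumptions, not the published design.
from collections import defaultdict

class DramCache:
    def __init__(self, capacity_pages=4, admit_threshold=3):
        self.capacity = capacity_pages          # pages the stacked DRAM can hold
        self.admit_threshold = admit_threshold  # misses before a page earns a fill
        self.counts = defaultdict(int)          # per-page miss counters (hotness proxy)
        self.cached = set()                     # pages currently in the DRAM cache
        self.hits = self.misses = self.fills = 0

    def access(self, page):
        if page in self.cached:
            self.hits += 1
            return
        self.misses += 1                        # served from off-package memory
        self.counts[page] += 1
        if self.counts[page] < self.admit_threshold:
            return                              # not hot enough to spend fill bandwidth
        if len(self.cached) >= self.capacity:
            victim = min(self.cached, key=self.counts.__getitem__)
            if self.counts[victim] >= self.counts[page]:
                return                          # resident pages are at least as hot
            self.cached.discard(victim)
        self.cached.add(page)
        self.fills += 1                         # one page of fill traffic

cache = DramCache()
for page in [1, 2, 1, 3, 1, 2, 4, 1, 2, 1, 2, 1]:
    cache.access(page)
print(f"hits={cache.hits} misses={cache.misses} fills={cache.fills}")
```

The point of the threshold is that cold pages never consume fill bandwidth at all, and a resident page is only replaced by something demonstrably hotter.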
Sticking a bigger/better cache on a CPU doesn’t fix the fundamental problems with the architecture. Really you want processor-in-memory (PiM). Most of the NVMe cards for storage have Arm (or similar) CPUs managing the storage, and it’s a lot quicker to ask them to do the work than to shovel the data back and forth to a central x86 CPU.
This gets worse as you start die-stacking memory for density, but don’t get any more I/O pins (as with pins on a PCIe card), so you need to move the memory controller and compute into the die stacks.
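To make the data-movement argument concrete, here is a toy count of the bytes that cross the host link when a simple filter runs on the host versus on a drive's embedded cores; the record size, data volume, and the offload interface are invented for illustration.

```python
# Toy comparison of bytes moved over the host link for a simple filter query,
# host-side vs. offloaded to a (hypothetical) compute-capable storage device.
RECORD_SIZE = 128          # bytes per record (assumption)
NUM_RECORDS = 10_000_000   # records stored on the device (assumption)

def host_side_filter():
    # Host pulls every record across the link, then filters locally.
    return NUM_RECORDS * RECORD_SIZE

def in_storage_filter(selectivity=0.01):
    # Device's embedded cores run the predicate; only matches cross the link.
    return int(NUM_RECORDS * selectivity) * RECORD_SIZE

print(f"host-side:  {host_side_filter() / 1e9:.2f} GB over the link")
print(f"in-storage: {in_storage_filter() / 1e9:.2f} GB over the link")
```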
Surfaces and edges are a dimension down from volumes and areas: the memory you can stack grows with volume, but the pins you can bring out grow only with the surface or edge.
I checked Apple iPhone Support and read something about how high-capacity data caches are made. It is very important, and efficient for the user as well. We should know about this in more detail.
Cache management schemes become more important as core counts go up and as chiplet CPUs create more complex overall cache hierarchies.
Chiplet CPUs could really benefit from an L4 cache. For example, if we have a 64-core (128-thread) CPU with 8 chiplets, each having two 4-core clusters that both have their own L3 eviction caches, things become complex.
It would mean 64 x L1 data, 64 x L1 instruction, 64 x L2, and 8 or 16 L3 eviction caches. We could add a 512 MB L4 cache and maybe create a hybrid L3 that is neither a normal cache nor a pure eviction cache, but a hybrid. The L4 could act as a “separate entity” with some logic to predict which data to preload.
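As a thought experiment, here is a toy model of an L4 that acts as a separate entity and preloads data it predicts will be reused; every parameter and the prediction heuristic are made up for illustration.

```python
# Toy model of an L4 that preloads lines from pages that have missed often
# recently. All parameters (sizes, window, threshold) are made-up values.
from collections import deque, Counter

class PredictiveL4:
    def __init__(self, capacity_lines=256, window=32, hot_threshold=2):
        self.capacity = capacity_lines
        self.recent_misses = deque(maxlen=window)  # sliding window of missed pages
        self.hot_threshold = hot_threshold
        self.lines = set()
        self.hits = self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.hits += 1
            return
        self.misses += 1
        page = line >> 6                           # assume 64 lines per page
        self.recent_misses.append(page)
        # If this page keeps missing, speculatively preload the whole page.
        if Counter(self.recent_misses)[page] >= self.hot_threshold:
            for l in range(page << 6, (page << 6) + 64):
                if len(self.lines) >= self.capacity:
                    self.lines.pop()               # evict an arbitrary line (toy policy)
                self.lines.add(l)

l4 = PredictiveL4()
for line in [5, 70, 5, 6, 7, 8, 5, 6, 7, 8]:
    l4.access(line)
print(f"hits={l4.hits} misses={l4.misses}")
```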
I would like to see more research on caching. In the next decade things are going to change pretty drastically. APUs are becoming more popular and powerful, which complicates, and/or adds possibilities for, L4 caches.
Stacked chips could change things and allow much bigger L3 caches, which could benefit from very different schemes. And on the GPU front, machine learning/AI and GPGPU might benefit from changed cache designs.
Caches have been a hobby of mine for a long time. I have spent countless hours pondering possible improvements. Larger shared caches with higher latency almost certainly have optimizations available. I just don’t have quite enough “transistor level” knowledge to be sure whether those are feasible in real life.
Anyway, this article made me happy.
I know AMD is changing the cache structure for Epyc Milan, but they are also working with Micron and Samsung on HBM2e memory stacks on NVMe PCIe drives connected in direct parallel to the CPU over 8 or 16 PCIe lanes, bypassing the PCIe bus. On the server rack these memory drives would sit between the NVMe M.2 storage ports, so each NVMe PCIe M.2 memory-drive port could take 3 or 4 stacked drive connectors with standoffs between the modules to allow for heat dissipation. This would do away with the much slower DRAM slots. The new approach is expected to give a 10x speed improvement and push massive bandwidth at blazing speeds. As Forrest Norrod recently stated, NVMe is the future for new hardware applications, so I see no reason this is not absolutely doable.
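For scale, a rough back-of-envelope comparison of nominal peak bandwidths (published reference figures, not measurements of the setup described above):

```python
# Back-of-envelope peak-bandwidth comparison using nominal published figures.
GB = 1e9

links = {
    "PCIe 4.0 x16 (per direction)": 16 * 2.0 * GB,  # ~2 GB/s per lane after encoding
    "DDR4-3200 channel":            25.6 * GB,      # 64-bit bus at 3200 MT/s
    "HBM2e stack":                  460 * GB,       # 1024-bit bus at ~3.6 Gb/s per pin
}

for name, bw in links.items():
    print(f"{name:30s} ~{bw / GB:6.1f} GB/s")
```

Whether a 10x gain materializes depends on where the stacks sit: HBM2e itself is fast, but anything reached over 16 PCIe 4.0 lanes is capped at roughly the link rate shown above.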