Memory On Logic: The Good And Bad

Do the benefits outweigh the costs of using memory on logic as a stepping-stone toward 3D-ICs?


The chip industry is progressing rapidly toward 3D-ICs, but a simpler step has been shown to provide gains equivalent to a whole node advancement — extracting distributed memories and placing them on top of logic.

Memory on logic significantly reduces the distance between logic and directly associated memory. This can increase performance by 22% and reduce power by 36%, according to one research program.[1] But several problems still need to be solved before it becomes a simple solution.

There are two versions of memory on logic that have become quite commonplace and serve as a proof of concept in the commercial world. There is HBM, which stacks DRAM on a small logic die that is connected to the main system across an interposer. The second application places a large L3 cache directly on top of a processor. While this brings memory closer to the processor, it does not make use of the vast interconnect potential between the two dies.

The real opportunity is when a large number of distributed memories are moved from the main logic die and placed directly on top of the logic associated with them. This is true 3D integration, but it does not have all of the complexity associated with distributing logic across multiple stacked dies.

“HBM is technically memory on logic,” says Joe Reynick, manager in the Tessent Division at Siemens EDA. “You have the base die, and then a DRAM stack on top of that. But taking an SoC, removing the memories from that SoC, and using a second chip that’s comprised of pure memories is a big step from that. We are making connectivity through copper pillars, or TSVs, or whatever the technology is to get from one die to the other, and that brings a host of new problems and advantages.”

Andy Heinig, head of department for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division, agrees. “The logic in an HBM is not really compute logic. It only serves to orchestrate the signals coming from the processor and going to the memory tiles, and vice versa. The current approach of cache on logic is more in the direction of real memory on logic. However, in the case of cache on logic, there are not so many architectural changes compared to a side-by-side approach. Real memory on logic will enable a dramatic performance boost in the future, but only if new architectures are developed.”

Processing has been limited by memory bandwidth for a long time, and this trend is not improving. “At some point processing will become limited by the bandwidth of the bus,” says Pradeep Thiagarajan, principal product manager in the Custom IC Verification Division at Siemens EDA. “This gets even more constrained as you go to higher data rates. You build more complex modulation schemes on the interface for sending it and receiving it, and your signal integrity for these various interconnects has to be maintained — especially as it goes up the memory stack.”

By many accounts, 50% of the area in a chip is consumed by memory. “It has also been shown that if you have an interconnect of more than 100 microns on the x,y plane, it is cheaper to go up into the z plane,” says Marc Swinnen, director of product marketing at Ansys. “Anything closer than 100 microns, it is cheaper to stay on the same level. By going up to the z plane, you could have a shorter and faster electrical connection.”

Many of the new architectures being developed are composed of arrays of processors, each of which has associated memory. “We need processing to be close to memory,” says Renxin Xia, vice president of hardware at Untether AI. “There are only a few ways to be close to memory if you are constrained to a flat surface — two dimensional. The logical next step is to start looking at the problem three dimensionally. You can then vertically integrate or tightly integrate to a lot more memory.”

But there are always problems that have to be overcome. “There have been a number of studies attempting to put DRAM on top of logic,” says Kenneth Larsen, senior director of product management at Synopsys. “But DRAM is very sensitive to temperature and requires adjustment of the refresh rates. While there are ways to deal with this in software, it is difficult to not have a performance impact. There are new physical considerations that you need to take into consideration. That’s why I hope we can begin to move the discussion away from assembly, where you stick things together, and maybe talk more about integration, where things can be developed together.”

Yield is a mixed story. “For assembly you have a lot more connections,” says Siemens’ Reynick. “If you’re taking 100,000 memory instances and mapping those to a chiplet, you have all those connections for data, address, and control that need to be made. That could have a yield impact. You might need to think about redundant connections. But the other side is yield improvement due to lower process complexity. If you look at the yield equation, there’s area, there’s defect density, and there’s a parameter called process complexity. Process complexity is basically the number of metal layers you are using. If you’re removing the memories from the design, then the overall process complexity is lower. That has a yield improvement effect on both the memory die and on the logic die.”
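The two opposing yield effects Reynick describes can be sketched with a simple Poisson defect model. The defect density, complexity factors, and bonding yield below are illustrative assumptions, not foundry data — the point is only that splitting a die can raise die yield even after paying an assembly-yield penalty.

```python
import math

def poisson_yield(area_cm2, d0_per_cm2, complexity):
    """Poisson yield model Y = exp(-A * D0 * c), where the complexity
    factor c stands in for process complexity such as metal layer
    count (illustrative, not a calibrated fab model)."""
    return math.exp(-area_cm2 * d0_per_cm2 * complexity)

# All numbers below are assumptions for illustration only.
D0 = 0.1                                          # defects per cm^2
mono    = poisson_yield(1.0, D0, complexity=1.0)  # one die, full metal stack
logic   = poisson_yield(0.5, D0, complexity=0.8)  # logic die, fewer layers
memory  = poisson_yield(0.5, D0, complexity=0.6)  # memory die, simpler process
bond    = 0.99                                    # assumed die-stacking yield
stacked = logic * memory * bond

print(f"monolithic: {mono:.3f}  stacked: {stacked:.3f}")
```

With these numbers the stacked pair comes out ahead of the monolithic die, but a worse bonding yield or redundant-connection overhead could easily flip the result.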

Variation becomes a larger problem, especially if multiple processes or nodes are used to manufacture each die. “We could implant a p-type ring oscillator and an n-type ring oscillator into the chip,” says Reynick. “When doing the characterization, you could see the relative speed of each device. After we dice it, known-good-die testing is done. Then, using an OTP (one-time programmable) or e-fuse, you could identify each specific part as to whether it is a slow-fast, a fast-slow, slow-slow, fast-fast, or typical-typical part. Customers might say, ‘I only want slow-fast parts.’ You have to be careful that you don’t throw away most of your parts.”
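The corner-binning step Reynick describes can be sketched as a simple classification from the two on-die ring-oscillator frequencies. The typical frequencies and tolerance band here are hypothetical placeholders; a real flow would use characterized limits burned into the OTP or e-fuse.

```python
def classify_corner(p_freq_mhz, n_freq_mhz,
                    typ_p=1000.0, typ_n=1000.0, tol=0.05):
    """Bin a die into a process corner from its p-type and n-type
    ring-oscillator frequencies. Typical frequencies and the 5%
    tolerance are illustrative assumptions."""
    def bucket(freq, typ):
        if freq < typ * (1.0 - tol):
            return "slow"
        if freq > typ * (1.0 + tol):
            return "fast"
        return "typical"
    return f"{bucket(p_freq_mhz, typ_p)}-{bucket(n_freq_mhz, typ_n)}"

print(classify_corner(930.0, 1080.0))  # a slow PMOS / fast NMOS part
```

A customer restriction such as “only slow-fast parts” then becomes a filter on this label, which is exactly where the risk of discarding most of the population comes from.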

The pressure is growing to separate SRAM from logic because it no longer scales. “Another constraint on Vdd is SRAM Vmin, which sets the lowest possible supply voltage for a given error rate for embedded SRAMs,” says Robert Mears, CTO for Atomera. “Since the embedded SRAMs are typically the first blocks to fail as voltages are lowered, Vmin often sets the minimum supply voltage. Process technology can provide reduced variability, improve PMOS reliability, and increase drive currents that reduce Vmin by 100mV.”

There could be some new thermal density challenges, though. “The circuit activity factor for 3nm finFET technology is about 1%,” says Victor Moroz, fellow in the TCAD product group of Synopsys. “You cannot have more than about 1% of your transistors switch simultaneously, because it would overheat and melt. But if half of your chip is SRAM, SRAM is extremely lazy. Its activity factor is much less than 1%. It’s almost zero in a rounded perspective. If you remove the SRAM, you may have to reconsider the activity factor within the logic.”
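Moroz’s point can be made concrete with an area-weighted average of switching activity. The activity factors below are rough illustrative values, not measured data, but they show how removing a nearly idle SRAM half roughly doubles the average activity, and hence the power density, of what remains.

```python
def avg_activity(frac_sram, af_logic, af_sram):
    """Area-weighted average switching activity of a mixed die, versus
    the logic-only die left after SRAM moves to its own die (activity
    factors are illustrative assumptions)."""
    mixed = frac_sram * af_sram + (1.0 - frac_sram) * af_logic
    return mixed, af_logic

# Half the die is SRAM; logic switches at ~1%, SRAM close to zero.
mixed, logic_only = avg_activity(frac_sram=0.5, af_logic=0.01, af_sram=0.0005)
print(f"mixed die: {mixed:.4%}  logic-only die: {logic_only:.4%}")
```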

Going vertical has other benefits. “By going vertical, going across different die, we can use different memory technology,” says Untether’s Xia. “We can take advantage of denser memory technologies like DRAM or others. We’re not constrained to SRAM as we would be on the logic die. That could give us an order of magnitude denser memory.”

Takeo Tomine, product manager at Ansys, also pointed out that heat is an issue with ReRAM.[2] “Typically, for advanced technology nodes lower than 7nm, the device size shrinks while supply voltage (Vdd) has been constant — resulting in higher power density and greater metal density, which produce more heat. The self-heating effect is a critical factor influencing the reliability and accuracy of ReRAM. Self-heating becomes most severe where heat is trapped in the transistor device. For ReRAM, temperature variations decrease the Ron/Roff ratio, which is bad for accuracy and reliability in many applications, including AI processing. Careful thermal management is a must, especially in designs with uneven power consumption across different devices. And then, the spreading of this generated heat to nearby layers and devices must be modeled to capture the full-chip heat picture as it evolves over time.”

Thermal becomes a major problem for all such memory layers. “You typically have the processor on the bottom and the memory die above it,” says John Parry, industry director for Electronics & Semiconductor for the Simcenter portfolio within Siemens EDA. “But memory die have a lower temperature limit than logic die. Usually, it’s around 120°C or 125°C for the logic die. It depends a bit on the manufacturing process and the technology that’s used, but the high bandwidth memory will have a temperature limit of 80°C. You typically take heat upward through the memory die. The problem with having memory above the processor is that the processor has to conduct its heat away through something that is itself being heated.”
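A crude one-dimensional thermal-resistance stack illustrates why the memory die’s 80°C limit, not the logic die’s 125°C limit, often binds first. All resistances and power numbers here are assumed for illustration; real stacks need the full transient, layer-by-layer modeling Tomine describes.

```python
def stack_temps(p_logic_w, p_mem_w, r_mem_cw, r_hs_cw, t_amb_c):
    """1D sketch of a memory-on-logic stack cooled from the top.
    All heat exits through the heatsink path (r_hs_cw, degC/W); the
    logic die's heat must additionally cross the memory die
    (r_mem_cw). Values are illustrative assumptions."""
    t_mem = t_amb_c + (p_logic_w + p_mem_w) * r_hs_cw   # memory sits on top
    t_logic = t_mem + p_logic_w * r_mem_cw              # logic heats through it
    return t_logic, t_mem

t_logic, t_mem = stack_temps(p_logic_w=60.0, p_mem_w=5.0,
                             r_mem_cw=0.3, r_hs_cw=0.7, t_amb_c=25.0)
print(f"logic: {t_logic:.1f}C  memory: {t_mem:.1f}C")
```

With these numbers the logic die runs well below its 125°C limit, but the memory die is already within 10°C of its 80°C limit, so the memory’s headroom caps how hard the processor underneath can be pushed.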

Some people have looked at flipping everything so that the processor is on top and the memory underneath. “You don’t just have processing logic in the die, you also have the I/O,” says Reynick. “That I/O has to make the connection to the outside world. There is also a sort of heatsink from the substrate, which it connects to through the balls to the PCB, and so memory on logic is more popular because you might have to do feed-throughs if you wanted to put your I/Os or memory on the bottom.”

It gets more expensive when you also consider power. “TSVs are expensive, they are large, and they have an intrinsic yield issue,” says Ansys’ Swinnen. “The logic chip can talk to the memory, but the logic chip still needs to get to the substrate somehow. The signals and the power have to come up through the memory to the chip. If your chip is using 100 watts, that’s a lot of power to route through a memory. Prosaic problems like this have to be considered. In the z direction, there are thousands of micro-bumps per square millimeter, and while they’re very tiny, that is still a much lower interconnect density than on the chip itself. The z direction doesn’t have the same number of wires per inch as the x and y do.”
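A back-of-the-envelope calculation shows the scale of Swinnen’s power-delivery concern. The per-TSV current limit and safety margin below are hypothetical assumptions, not foundry figures, but they make clear that 100 watts at a sub-1V supply demands thousands of through-memory connections for power alone.

```python
import math

def tsvs_for_power(power_w, vdd_v, i_max_per_tsv_a, margin=2.0):
    """Rough count of power TSVs needed to feed the logic die through
    the memory die above it. The per-TSV current limit and margin are
    illustrative assumptions, not foundry data."""
    total_current_a = power_w / vdd_v          # 100 W / 0.8 V = 125 A
    return math.ceil(total_current_a * margin / i_max_per_tsv_a)

n = tsvs_for_power(power_w=100.0, vdd_v=0.8, i_max_per_tsv_a=0.05)
print(n)  # power TSVs only; the ground return roughly doubles the count
```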

Test also becomes a bigger issue. “You are going to have to create new test benches that incorporate sections of circuit from multiple process technologies,” says Thiagarajan. “You have to take into account the connectivity, including any extraction of a channel or a wire route in terms of S parameters, and then connect that to the recipient design, which could be in a different process technology. You will have multiple PDKs, which include the variation of the respective process technologies, and then you simulate that together. You would also need capabilities to do a co-variation-aware design on top of the typical simulator tools. You have to take into account a bigger subsystem on the pre-silicon side to prepare it for the testing once the hardware comes out.”

Interconnect testing becomes a new issue, Reynick notes. “How do we test the interconnect and verify that it is working? We can still test the dies themselves using known good die testing and using wafer probes. If you’re using a PHY, then you need to have a loopback test so you can go all the way to the pad and back in and verify that the test is working. Even if it’s a uni-directional signal, we still recommend making them bi-directional so you can do an internal loop back to the pad and back into the die. We can still do SCAN. We probably still need sacrificial pads. Your test signals and a sampling of power and ground need to be brought out to pads that can be probed, because there aren’t reliable probe cards that can meet the micro-bump pitches for 3D. You need standard spacing for the probe card so that you can do test. There needs to be some test logic on the memory die, as well. When you do memory BiST, we have wrappers around each memory. Those wrappers need to be on the memory die so that we can actually do a memory BiST test of those memories.”

Separating memory and logic onto two dies that are stacked on top of each other holds a lot of promise, while presenting some considerable challenges. But those challenges are not as extreme as the ones to be encountered with logic on logic. That may make it both a good learning exercise for full 3D-ICs and a way to gain the equivalent of a full node advance.

What is learned by doing this will carry into the future, because separating memory technology from logic will provide much higher-density solutions, and re-architecting a processing system will make better use of memory bandwidth. “If only existing architectures were to be adapted to a 3D approach, this would lead to an increase in costs with very little improvement in performance,” says Fraunhofer’s Heinig. “However, finding truly new architectures will take some time and also some research at universities. New tools for efficiently exploring different options also need to be researched and developed.”

1. Power, Performance, Area and Cost Analysis of Memory-on-Logic Face-to-Face Bonded 3D Processor Designs, Agnesina et al. ISLPED 2021.
2. ReRAM Seeks To Replace NOR, Karen Heyman, Semiconductor Engineering Sept. 2023.
