DRAM Thermal Issues Reach Crisis Point

Increased transistor density and utilization are creating memory performance issues.


Within the DRAM world, thermal issues are at a crisis point. At 14nm and below, and in the most advanced packaging schemes, an entirely new metric may be needed to address the multiplier effect of how thermal density increasingly turns minor issues into major problems.

A few overheated transistors may not greatly affect reliability, but the heat generated by a few billion transistors does. This is particularly true for AI/ML/DL designs, where high utilization increases thermal dissipation. But thermal density affects every advanced-node chip and package, whether in smartphones, server chips, AR/VR, or any number of other high-performance devices. For all of them, DRAM placement and performance are now top design considerations.

“There are degradations where we say, ‘From zero to 85°C, it operates one way, and at 85° to 90°C, it starts to change,'” noted Bill Gervasi, principal systems architect at Nantero and author of the JEDEC DDR5 NVRAM spec. “From 90° to 95°C, it starts to panic. Above 95°C, you’re going to start losing data, so you’d better start shutting the system down.”

Those numbers are based on 14nm technology, Gervasi said. He anticipates even worse to come for advanced nodes and advanced packaging. “As you scale down to 10nm, or 7nm, 5nm, or 3nm, what’s going to happen? Your linkages are out of control. You’re making yourself more susceptible to crosstalk, so rowhammer starts to become even more of a crisis. It is a very significant problem.”

One of the primary reasons for this is the basic design of DRAM. Despite the rising number of DRAM interfaces — whether it is DDR5, LPDDR5, GDDR6, HBM, or others — they all retain a fundamentally similar structure.

“The guts of a DRAM chip is basically a very tiny capacitor connected to a switch,” said Marc Greenberg, group director, product marketing at Cadence. “To write data into that cell, you allow current to flow into that capacitor. To read data from that cell, you sense whether there’s charge on that capacitor or not.”

Unfortunately, that leads to a well-known drawback. “The charge that’s stored on those tiny capacitors is a relatively small amount of charge,” Greenberg said. “It is very sensitive to leaking away when it gets hot.”

No matter how novel the architecture, most DRAM-based memory is still at risk of performance degradation from overheating. The refresh requirements of volatile memory (typically about once every 64 milliseconds) intensify the risk. “As you raise the temperature over about 85°C, you need to refresh that charge on the capacitors more often,” said Greenberg. “So, you’ll start moving to a more frequent refresh cycle to account for the fact that the charge is leaking out of those capacitors faster because the device is getting hotter. Unfortunately, the operation of refreshing that charge also is a current-intensive operation, which generates heat inside the DRAM. The hotter it gets, the more you have to refresh it, but then you’re going to continue to make it hotter, and the whole thing kind of falls apart.”
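
To put rough numbers on that tradeoff, the sketch below estimates how much DRAM bandwidth refresh consumes as temperature rises. The timings are illustrative DDR4/DDR5-class values (a 64 ms refresh window split into 8,192 commands, roughly 350 ns per refresh, and a common 2x refresh derating above 85°C); actual figures vary by device and vendor.

```python
# Rough estimate of the bandwidth cost of DRAM refresh vs. temperature.
# Timings are illustrative DDR4/DDR5-class values, not from any specific datasheet.

REFRESH_WINDOW_MS = 64.0   # standard retention window (tREFW)
REFRESH_COMMANDS = 8192    # refresh commands per window -> tREFI ~= 7.8 us
T_RFC_NS = 350.0           # time the DRAM is busy per refresh (tRFC)

def refresh_overhead(temp_c: float) -> float:
    """Fraction of DRAM time spent refreshing at a given temperature."""
    trefi_ns = (REFRESH_WINDOW_MS * 1e6) / REFRESH_COMMANDS  # ~7812 ns
    if temp_c > 85.0:
        trefi_ns /= 2.0  # common 2x refresh derating above 85 degC
    return T_RFC_NS / trefi_ns

for t in (60, 85, 95):
    print(f"{t:>3} degC: ~{refresh_overhead(t) * 100:.1f}% of DRAM time spent in refresh")
# 60 degC: ~4.5%, 95 degC: ~9.0% -- which is why running hot costs bandwidth.
```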

This is the point of no return. “If one DRAM fails due to heat/thermals, it’s very likely that others will also fail,” said Steven Woo, fellow and distinguished inventor at Rambus. “The reason is that all the DRAMs are typically near each other, so if the temperature is high, then it’s a danger to all DRAMs. Even with robust server memory systems, the loss of just a couple of DRAMs due to heat-related failures can mean that the entire system fails. So heat and thermals are a really big deal for memory systems.”

And it’s not just servers. With about 8 billion transistors on a single die, a mobile phone can get hot enough that apps stop functioning correctly. In extreme cases, the device may need to spend a few minutes in a refrigerator before it recovers.

The same holds true for increasingly dense advanced packages. “Heat becomes an issue for memory, particularly when stacking techniques are used, such as SRAM on top of logic,” said Victor Moroz, a Synopsys fellow. “When you do that, there are implications because that’s when it gets heat overflow from adjacent logic, and that’s a bad thing for memory — for SRAM, not so much, but for DRAM it’s a big deal because this refresh time is exponentially dependent on temperature because it’s a junction leakage. When you put DRAM in the same package as logic, and if it is logic for high-performance computing, then the DRAM will suffer. Your refresh time shrinks and you have to refresh it more often.”
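
As a rough illustration of that exponential dependence, the sketch below uses the common rule of thumb that junction leakage roughly doubles for every 10°C rise, so retention time halves at the same rate. Real retention curves are device-specific; this is only a first-order model.

```python
# First-order model of DRAM retention vs. temperature.
# Assumes leakage roughly doubles every ~10 degC (a common rule of thumb),
# so retention time halves at the same rate. Real devices differ.

BASE_TEMP_C = 85.0        # reference temperature
BASE_RETENTION_MS = 64.0  # retention at the reference temperature
DOUBLING_STEP_C = 10.0    # leakage ~doubles per this many degrees

def retention_ms(temp_c: float) -> float:
    """Estimated retention time at temp_c under the doubling-per-10-degC model."""
    return BASE_RETENTION_MS / (2.0 ** ((temp_c - BASE_TEMP_C) / DOUBLING_STEP_C))

for t in (65, 85, 95, 105):
    print(f"{t:>3} degC: retention ~{retention_ms(t):.0f} ms")
# 65 degC: ~256 ms, 95 degC: ~32 ms, 105 degC: ~16 ms -- the hotter the package,
# the more often the controller has to refresh.
```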

Demands for heat tolerances have gone up over the years. “When I first joined the company, 0° or minus 40°C might have been the low end, with the high end at 100° or 110°C,” Woo said. “But these days, the automotive industry requires some of the most extreme temperature guarantees.”

Hotter temperatures can lead to higher refresh rates, which degrade performance, especially in data-heavy applications. “In some cases, if the temperature gets to be near the top-end of the accepted operating range, a system might choose to increase the refresh rate of the DRAMs,” he said. “The time that a DRAM retains its data depends on the temperature, and at higher temperatures the refresh rate might need to increase in order to make sure data isn’t lost. A higher refresh rate means that we’re taking away some of the bandwidth of the DRAM, so performance of the system may be impacted at higher refresh rates.”

This has to be baked into the design. “For example, if you’re designing an I/O controller, you have this data stream that’s being flung at you that you need to absorb,” Gervasi explained. “In the DRAM world, which is where all line cards are designed today, if that DRAM is in refresh for 350 nanoseconds, that memory is offline. But the data stream is not going to stop. That means you have to design your whole architecture around buffering data for 350 nanoseconds before you can start to empty that buffer again.”
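
A quick back-of-the-envelope version of that buffering requirement, using the 350 ns refresh blackout described above and a hypothetical 400 Gb/s line rate:

```python
# How much data must be buffered while the DRAM is offline in refresh?
# The 350 ns blackout comes from the article; the line rate is hypothetical.

REFRESH_BLACKOUT_NS = 350.0
LINE_RATE_GBPS = 400.0  # example line-card rate in gigabits per second

bits_buffered = LINE_RATE_GBPS * 1e9 * (REFRESH_BLACKOUT_NS * 1e-9)
print(f"~{bits_buffered / 8 / 1024:.1f} KiB buffered per 350 ns refresh at {LINE_RATE_GBPS:.0f} Gb/s")
# ~17.1 KiB -- the architecture has to absorb this much traffic every time the
# DRAM goes into refresh, then drain it before the next refresh arrives.
```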

Attempts to adjust refresh rates lead to unhappy tradeoffs. “Five percent of system performance is now dedicated just to keeping what you already wrote,” said Gervasi. “Is that a solution? Apparently it is, because that’s what people have to do if they want to run north of 85°C — give up some system performance in order to get data integrity.”

Memory choice matters
In response to these concerns, the semiconductor ecosystem is trying a number of solutions to minimize thermal issues and increase reliability. LPDDR tackles the refresh problem by incorporating a feature called “temperature compensated self-refresh,” noted Randy White, memory solutions program manager at Keysight. “As you need to refresh your memory banks, you have a built-in temperature sensor on the die. There’s a lookup table that says, ‘For every one degree that you increase in core temperature, you need to increase the frequency of your refresh cycle proportionately.’ Similarly, DDR5 DRAMs now include an internal temperature sensor. It’s difficult to design a precise, on-die temp sensor, so it’s only accurate to about ±5°C. But it’s better than nothing, which is what existed for DDR4. This will at least help to know when to turn fans on, and generally show how effective the airflow design is.”
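
A simplified sketch of the temperature-compensated self-refresh idea: an on-die sensor reading is indexed into a table of refresh-rate multipliers. The thresholds and multipliers below are illustrative only, and real devices typically step in coarser temperature buckets rather than strictly per degree.

```python
# Simplified model of temperature-compensated self-refresh (TCSR):
# the on-die temperature sensor reading selects a refresh-rate multiplier.
# Thresholds and multipliers are illustrative, not from any datasheet.

# (upper temperature bound in degC, refresh-rate multiplier)
TCSR_TABLE = [
    (45.0, 0.5),   # cool: refresh can be relaxed
    (85.0, 1.0),   # normal operating range: nominal refresh
    (95.0, 2.0),   # warm: double the refresh rate
    (105.0, 4.0),  # hot: quadruple the refresh rate
]

def refresh_multiplier(sensor_temp_c: float) -> float:
    """Pick a refresh-rate multiplier from the on-die sensor reading."""
    for upper_bound_c, multiplier in TCSR_TABLE:
        if sensor_temp_c <= upper_bound_c:
            return multiplier
    raise RuntimeError("Above maximum rated temperature: throttle hard or shut down")

print(refresh_multiplier(70.0))   # 1.0 -> nominal refresh
print(refresh_multiplier(92.0))   # 2.0 -> refresh twice as often
```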

At the standards level, JEDEC has been experimenting with possible fixes, Gervasi said. “We’ve put in thermal trip points inside of the DRAMs, and have discussed the possibility of having a backdoor access port in the next generation, where the DRAM can say, ‘I’m getting way too hot here. You need to do something. Either slow down the data access or speed up the fans.’”
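
Tying that to the temperature bands Gervasi described earlier, a host-side response might look something like the sketch below. The thresholds echo those bands, but the specific actions are hypothetical and purely illustrative.

```python
# Hypothetical host response to DRAM thermal trip points, based on the
# temperature bands described earlier in the article. Actions are illustrative.

def thermal_policy(dram_temp_c: float) -> str:
    if dram_temp_c < 85.0:
        return "normal operation"
    if dram_temp_c < 90.0:
        return "increase refresh rate; ramp up fans"
    if dram_temp_c < 95.0:
        return "throttle memory traffic; fans to maximum"
    return "risk of data loss: flush critical data and shut down"

for t in (70, 87, 93, 97):
    print(f"{t} degC -> {thermal_policy(t)}")
```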

A popular method already in the market is to build error correction into chips, Greenberg said. “In the more advanced DRAM types, the very dense ones like LPDDR5 and DDR5, the memory manufacturers are implementing on-die error correction. When a bit becomes unreadable because its charge leaked out, there’s error correction circuitry on the DRAM device that’s capable of correcting that error by piecing together what data should have been in that bit cell from the other bit cells around it, as well as some error correction bits, which are also included in the DRAM die.”

This technique has allowed memory manufacturers to offer extended-temperature-range DRAM. Many approaches are based on Hamming codes, an error-correcting scheme that dates back to the era of punched tape but can still correct a single-bit error and detect a double-bit error. More advanced approaches also have entered the marketplace. Of course, no one is going to reveal their proprietary algorithms, but in a previous blog post, Vadhiraj Sankaranarayanan, senior technical marketing manager at Synopsys, gave a high-level overview of DRAM error correction.
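
For illustration, here is a minimal extended-Hamming (SECDED) sketch over a 4-bit data word. It demonstrates the same single-error-correct, double-error-detect principle, though real on-die ECC operates on much wider words (typically on the order of 128 data bits), and this is not any vendor’s proprietary scheme.

```python
# Minimal extended Hamming (SECDED) code over a 4-bit data word:
# corrects any single-bit error and detects double-bit errors.
# Real on-die ECC uses much wider words; this is only a teaching-sized example.

def encode(data):                      # data: list of 4 bits [d1, d2, d3, d4]
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4                  # covers codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4                  # covers codeword positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4                  # covers codeword positions 5, 6, 7
    codeword = [p1, p2, d1, p4, d2, d3, d4]   # positions 1..7
    overall = 0
    for bit in codeword:
        overall ^= bit                 # extra parity bit enables double-error detection
    return codeword + [overall]

def decode(received):                  # received: 8 bits, possibly corrupted
    c = received[:7]
    syndrome = 0
    for pos in range(1, 8):            # recompute which parity checks fail
        if c[pos - 1]:
            syndrome ^= pos
    overall = 0
    for bit in received:
        overall ^= bit                 # 0 if overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "no error"
    elif overall == 1:                 # parity broken -> single error, correctable
        if syndrome != 0:
            c[syndrome - 1] ^= 1       # flip the bad bit inside the 7-bit code
        status = "corrected single-bit error"
    else:                              # parity holds but syndrome != 0 -> two errors
        status = "double-bit error detected (uncorrectable)"
    data = [c[2], c[4], c[5], c[6]]    # positions 3, 5, 6, 7 hold the data bits
    return data, status

word = encode([1, 0, 1, 1])
word[5] ^= 1                           # simulate one leaked bit cell
print(decode(word))                    # -> ([1, 0, 1, 1], 'corrected single-bit error')
```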

Cadence and others also offer additional error correction, on top of the DRAM’s on-die ECC, for high-reliability applications.

One technique that has tantalized the industry for more than a decade is microfluidic cooling. Beyond the standard commercial elements of heat sinks, fans, and external liquid cooling, academic labs continue to experiment with building cooling directly into the chip. In this approach, called integrated microfluidic cooling, microchannels are etched into the die so that coolant can flow through it.

Although it sounds like a near-perfect solution in theory, and has been shown to work in labs, John Parry, industry lead, electronics and semiconductor at Siemens Digital Industries Software, noted that it’s unlikely to work in commercial production. “You’ve got everything from erosion by the fluid to issues with, of course, leaks because you’re dealing with extremely small, very fine physical geometry. And they are pumped. One of the features that we typically find has the lowest reliability associated with it are electromechanical devices like fans and pumps, so you end up with complexity in a number of different directions.”

Different approaches
A radically rethought memory design that did successfully make it out of the lab is Nantero’s NRAM. It isn’t a DRAM, but rather a nonvolatile memory made from carbon nanotubes, and it has been shown to tolerate extreme thermal conditions. The proof of concept: It was tested in space, on the Space Shuttle mission that repaired the Hubble telescope, Gervasi noted.

For JEDEC, Gervasi is developing specs that should allow NRAM chips to seamlessly slot in for DRAM. But regardless of NRAM’s ultimate success, he believes that carbon, at least, offers a way out of the thermal conundrum. “Carbon nanotubes are rolled up diamonds. They’re almost a thermal distribution. They are actually going to be deployed, even if they don’t use them as memory cells, because it’s such a great way to do heat spreading and heat distribution. Carbon nanotubes also are being discussed for use in printed circuit board routing, or routing on chips, because it is so perfect in terms of thermal distribution.”

Whatever chip and other components are chosen, it’s essential to shift left and simulate thermal issues in the design phase, and not treat them as an inconvenience that can be fixed later, Greenberg said. “You definitely have to consider how hot things are going to get. That’s often an afterthought. People just assume that to do the computation job that you have to do, there’s always a bigger heatsink that you can buy. Those making battery-operated devices, cell phones, tablets, and watches are concerned about the power consumption, not so much the heat. A lot of simulation techniques can be employed both to improve power consumption and improve the thermal situation, as well.”

Pre-production simulation, of course, needs to be paired with post-production physical analysis, specifically testing chips and binning them according to how they perform. “You really want to build one design, if you can, because that allows you to have economies of scale,” Rambus’ Woo said. “And then you might want to test it according to different specs. The test flow is when you get a chance to say, ‘This device actually covers a really broad range, so maybe we can sell it into the automotive market.’”
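
A toy version of that binning step might classify tested parts by the temperature range they actually passed. The grade boundaries below follow common commercial/industrial/automotive conventions but are illustrative, not an actual production test flow.

```python
# Toy post-production binning: assign a tested part to a temperature grade
# based on the range it passed at. Grade limits are illustrative
# (roughly commercial / industrial / automotive conventions).

GRADES = [
    # (grade name, required min temp degC, required max temp degC)
    ("automotive", -40.0, 105.0),
    ("industrial", -40.0, 85.0),
    ("commercial", 0.0, 70.0),
]

def bin_part(passed_min_c: float, passed_max_c: float) -> str:
    """Return the most demanding grade this part qualifies for, or 'reject'."""
    for grade, need_min, need_max in GRADES:
        if passed_min_c <= need_min and passed_max_c >= need_max:
            return grade
    return "reject"

print(bin_part(-45.0, 110.0))   # automotive
print(bin_part(-40.0, 90.0))    # industrial
print(bin_part(-5.0, 60.0))     # reject (did not cover the full commercial range)
```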

Finally, if worst comes to worst, the spec itself can be changed, though that could be a disaster for certain use cases, such as mobile devices. By contrast, allowing a temperature increase for chips in large data centers could have surprising environmental benefits. On this point, Keysight’s White recalled that a company once asked JEDEC to raise the spec for an operating temperature by five degrees. The estimate of the potential savings was stunning: based on how much energy the company consumed annually for cooling, it calculated that a five-degree change could translate to shutting down the equivalent of three coal power plants per year. JEDEC ultimately compromised on the suggestion.

5 comments

Simon says:

Considering we’re not even really trying to actively cool down DRAM modules atm, I don’t see the problem being such a big deal yet.

This might complicate GPU cooler designs tho if it affects VRAM too.

Obviously silly says:

OK. So start selling dimms with heatpipes and heatsink fans

Geeeeeee says:

I was struggling with an RTX 3080 TI for a while. I replaced the cheap thermal pads on the GDDR6X chips with a thin layer of quality thermal paste and best-quality thin thermal pads. The reported temps even during stress like ETH mining went from 105°C to 85°C. The GPU manufacturers need to read this article and understand that they need to improve their VRAM cooling designs even if it costs them a couple of dollars.
Perhaps the chip manufacturers need to upgrade the cooling of the chips themselves with a heat spreader?

David Leary says:

An interesting and informative article, thank you Karen. From my experience there is a broadly held opinion in the industry that DRAM bit cell capacitor leakage at elevated temperature invalidates bias-temperature stress testing (HTOL) and may even damage DRAM bit cells. This has not been my experience. Modern DRAM, including HBM, can be safely functionally stress-tested at elevated temperature (e.g., >125°C), applying standard memory patterns running above 10 MHz.

Cox says:

As someone who has worked on DRAM architecture for the last 25+ years, this last statement, “Modern DRAM, including HBM, can be safely functionally stress-tested at elevated temperature (e.g., >125°C),” is very misleading. While a DRAM may not actually stop operating, its primary purpose (which is to retain data) would be SIGNIFICANTLY, and I mean SIGNIFICANTLY, impacted above 105°C, let alone 125°C. That is, of course, unless you are doing nothing but refreshing the memory non-stop… which makes it a little hard to read or write to.
