Retaining data in memories and processors becomes more difficult as temperatures rise in advanced packages and under heavy workloads.
Heat is becoming a much bigger problem in advanced-node chips and packages, causing DRAM cells to leak the electrons that store data, creating timing and reliability issues in 3D-ICs, and accelerating aging in ways that vary from one workload to the next.
All types of circuitry are vulnerable to thermal effects. Heat can slow the movement of electrons through wires, cause electromigration that shortens a device's lifespan, and alter the switching behavior of transistors. In 3D-ICs and other densely packaged devices, it can even disrupt computations.
“Each of these impacts can have impacts on the ability to properly transmit the intended signals across a chiplet and assembly package, typically resulting in the signals coming in slower than the system was designed for,” noted John Ferguson, product management director of Calibre nmDRC applications at Siemens EDA. “Ultimately, this leads to unintended information being transmitted.”
In DRAM, heat can limit the ability of capacitors to hold electrons, which determine whether a cell stores a one or a zero.
“Certain cells having electrons means it’s a one, and certain cells having electrons means it’s a zero,” explained Steven Woo, fellow and distinguished inventor at Rambus. “DRAMs require refresh operations because it turns out that when you just put electrons on a capacitor, over time they’re going to leave the capacitor. You can imagine that if I stored a bunch of electrons on this capacitor, as the temperature goes up, it makes it easier for these electrons to leave. So these days when you’re talking about having billions of capacitors on a chip to store all of your data, each of those capacitors is now smaller, and there’s actually fewer electrons that are used to represent a one or a zero.”
This wasn’t a big deal 30 years ago when the capacitors were big, and there were many thousands of electrons on a bit cell.
“You could lose a couple electrons and you wouldn’t really notice,” Woo said. “These days, you’ve got dozens of electrons in some cells. If you start to lose a few, you’re potentially getting closer to the critical level of electrons that you need. As the temperature goes up, it becomes easier to lose these electrons. As a result, some memories — like LPDDR, for example, along with other memories — have to be refreshed more often as the temperature gets above a certain level, because it becomes easier for the electrons to come off. This is part of the reason why DRAMs have thermal sensors now. It’s to help figure out whether the refresh rate needs to be increased or not. That thermal sensor/controller is monitoring what’s going on to figure out if I should be worried.”
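Woo’s description maps to a simple control loop: read the DRAM’s thermal sensor and shorten the refresh interval once the die crosses a temperature threshold. The sketch below illustrates that kind of policy; the thresholds and intervals are assumptions loosely modeled on common DDR derating behavior, not values for any specific device.

```python
# Minimal sketch of the kind of refresh policy a memory controller might apply
# based on the DRAM's temperature readout. The thresholds and intervals are
# illustrative only; a real controller follows the refresh multipliers defined
# for the specific DRAM device it is driving.

NORMAL_TREFI_US = 7.8   # nominal average refresh interval, microseconds (illustrative)

def refresh_interval_us(die_temp_c: float) -> float:
    """Return the average refresh interval to use for a given die temperature."""
    if die_temp_c <= 85.0:
        return NORMAL_TREFI_US          # normal operating range
    elif die_temp_c <= 95.0:
        return NORMAL_TREFI_US / 2.0    # refresh twice as often in the extended range
    else:
        return NORMAL_TREFI_US / 4.0    # derate further near the upper limit

if __name__ == "__main__":
    for t in (45.0, 88.0, 100.0):
        print(f"{t:5.1f} C -> refresh every {refresh_interval_us(t):.2f} us")
```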
Lang Lin, principal product manager at Ansys, agreed. “If you heat up the memory, it’s a piece of silicon — it’s circuits. Under high temperature, all electrical properties are changed. You’re going to have very high leakage because as temperature increases, leakage exponentially increases. With that, the memory retention is based on some latch flops. The latch holds the data, so thermal could cause a lot of reliability issues. It might make a latch malfunction. That’s the thermal impact. There’s also a mechanical impact. Where you have a memory chip under very high temperature it could cause stress. Any material under high temperature is impacted. Electromigration could be a factor, as well. High current flowing in the wire would cause reliability issues, and that might make the memory fail.”
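A common rule of thumb behind Lin’s point is that subthreshold leakage grows roughly exponentially with temperature, often approximated as doubling every 10°C or so. The sketch below uses that assumption purely for illustration; the actual scaling depends on process, threshold voltage, and supply.

```python
# Back-of-the-envelope illustration of leakage growing roughly exponentially with
# temperature. The "doubles every ~10°C" factor is a rule of thumb used here only
# as an assumption, not a measured value for any particular process.

def leakage_scale(temp_c: float, ref_temp_c: float = 25.0, doubling_c: float = 10.0) -> float:
    """Leakage relative to its value at ref_temp_c, assuming exponential growth."""
    return 2.0 ** ((temp_c - ref_temp_c) / doubling_c)

if __name__ == "__main__":
    for t in (25, 55, 85, 105):
        print(f"{t:3d} C -> ~{leakage_scale(t):5.1f}x leakage vs. 25 C")
```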
Lin said these types of extreme conditions require reliability simulation. “A particular condition might not be always happening, but under high temperature and high pressure — even under high workload situations — it may. The phone was dropped, it’s a big shock. Extreme conditions like this could be the cause of many problems that impact the performance of the chip.”
One of the solutions to reduced electron retention is on-die error correction. In the past, this was an option that added to the cost of a computer. In many applications, it’s now a requirement.
DRAMs today include additional storage cells that the rest of the system doesn’t get access to. “These extra bits are stored on the DRAM,” Woo said. “When it comes time for you to read some data from the DRAM, it reads the data, but it also reads some of these extra bits, which are error-correcting bits. They could do things like calculate checksums or syndromes that decide, in the data packet, if any of the bits flipped by accident. If there are enough extra bits, you can determine if there’s an error, and you can correct the error in the vast majority of cases. Also, there’s always the chance that every bit flips. That could happen, in theory, and then you would overwhelm the additional bits that you have. So you look at your device and figure out, in the typical worst case, how many additional bits you will need, and then have that many additional extra bits stored in the DRAM.”
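The extra bits and syndromes Woo describes are the mechanics of a Hamming-style error-correcting code. The toy single-error-correcting encoder and decoder below is meant only to show how parity bits and a syndrome locate a flipped bit; production on-die ECC uses wider, vendor-specific codes over much larger data words.

```python
# Minimal sketch of the idea behind on-die ECC: store extra parity bits alongside
# the data, recompute a syndrome on read, and use it to locate and flip a single
# bad bit. This Hamming-style toy is only an illustration, not a real DRAM code.

def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits into a codeword with parity at power-of-two positions."""
    n, d = 0, 0
    while d < len(data_bits):          # find codeword length: non-power-of-two slots hold data
        n += 1
        if n & (n - 1):
            d += 1
    code = [0] * (n + 1)               # index 0 unused; positions run 1..n
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):            # data position
            code[pos] = next(it)
    syndrome = 0                       # XOR of positions of all set bits must end up zero
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    p = 1
    while p <= n:                      # each parity bit cancels one bit of the syndrome
        code[p] = (syndrome >> (p.bit_length() - 1)) & 1
        p <<= 1
    return code[1:]

def hamming_decode(codeword):
    """Return (data bits, syndrome). A nonzero syndrome is the position of a single flipped bit."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    if 0 < syndrome < len(code):       # correct a single-bit error; larger syndromes mean multi-bit damage
        code[syndrome] ^= 1
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return data, syndrome

if __name__ == "__main__":
    data = [1, 0, 1, 1]                # 4 data bits -> 3 parity bits (Hamming(7,4))
    cw = hamming_encode(data)
    cw[2] ^= 1                         # flip one stored bit, as a weak cell might
    recovered, fixed_at = hamming_decode(cw)
    print(recovered == data, "corrected position:", fixed_at)
```

Running it prints `True`, confirming the single flipped bit was located and corrected; wider data words simply add parity positions at higher powers of two.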
Heat and resistivity
As chips are powered up, electrons are pushed through very thin wires. The longer or narrower the wires, the greater the resistance and the more power required to drive signals. That creates heat, which in turn slows the movement of electrons and causes timing issues. Getting rid of that heat has become an increasingly thorny problem, particularly in advanced packages.
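The relationship is straightforward to sketch: resistance scales with wire length divided by cross-sectional area, and resistivity itself climbs with temperature. The example below uses bulk-copper numbers (roughly 1.68e-8 ohm-meters at 20°C and a temperature coefficient of about 0.0039 per °C); real nanoscale interconnects are considerably more resistive because of surface and grain-boundary scattering, so treat it only as an order-of-magnitude illustration.

```python
# Rough illustration of wire resistance and its temperature dependence using
# R = rho * L / A with a linear temperature coefficient. Bulk-copper values are
# assumed; the wire dimensions below are made-up illustrative numbers.

RHO_20C = 1.68e-8      # bulk copper resistivity at 20°C, ohm-meters
ALPHA   = 0.0039       # approximate temperature coefficient of resistivity, 1/°C

def wire_resistance_ohms(length_um, width_nm, thickness_nm, temp_c):
    """Resistance of a straight wire, with resistivity rising linearly with temperature."""
    rho = RHO_20C * (1.0 + ALPHA * (temp_c - 20.0))
    area_m2 = (width_nm * 1e-9) * (thickness_nm * 1e-9)
    return rho * (length_um * 1e-6) / area_m2

if __name__ == "__main__":
    for t in (25, 85, 110):
        r = wire_resistance_ohms(length_um=100, width_nm=20, thickness_nm=40, temp_c=t)
        print(f"{t:3d} C -> {r:8.1f} ohms for a 100 um x 20 nm x 40 nm wire")
```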
“In traditional IC design, this challenge is not too difficult,” said Ferguson. “Typically, the path for heat dissipation is relatively close. Generally, careful use of design rules and constraints can help ensure that the wires are wide enough and spaced far enough from one another to prevent serious problems. This is one of the reasons why there will typically be different design rule constraints if targeting a high-performance chip, where you are willing to sacrifice on power, versus chips targeting low-power where you are willing to sacrifice on performance.”
As more chips are decomposed into chiplets and assembled both horizontally and vertically into 3D-ICs, heat dissipation becomes more complex.
“We know that stacking chiplets can be used as a way to reduce the power through closer vertical connections, potentially lowering the heat generated,” Ferguson said. “On the other hand, some of the power paths may be longer, stretching to chiplets placed laterally, or paths may be narrower such as when passing through the narrow wires between stacked dies using hybrid bonding techniques. Also, new materials are introduced in the form of package laminates, TSVs, bumps, balls, and thermal insulator materials. Each of these materials will have different responses to the generated heat. There is also the increase in the paths for thermal dissipation. This is especially true for dies that are stacked, making it difficult for heat on an upper layer chiplet to escape. The result is that it is no longer possible to rely on the historically simple solutions that applied to the standalone IC.”
That adds some new challenges. How do you know if there is a heat-related problem, and what can you do to fix it? And how do you prevent that from happening in the first place?
Thermal analysis is one approach. “Thermal analysis capabilities have been around for decades and have been a relied-upon workhorse for package- and board-level designs,” Ferguson said. “Modern thermal analysis adds another level of insight by bringing in the details of the chiplet, allowing more accurate thermal analysis that can account for the heat generated in the active dies and providing the ability to tie the thermal impacts back to electrical behavior to understand whether the design will behave electrically as intended.”
This is hardly a simple analysis, though. “There is a tremendous amount of data that chip designers have to deal with when they are designing a chip and making sure that the thermal concerns are addressed,” said Suhail Saif, principal product manager at Ansys. “They must also make sure all the hotspots are found early on and addressed before it’s too late. When that whole rigorous design process is going on in chip design houses, the amount of data they have to deal with really depends on how versatile that chip is. In a very simplistic example, say a chip is being designed for a particular application in your home refrigerator. It’s a comparatively simple chip, and it has very limited functions to do in a very stable environment, meaning the environment is not going to change much. It’s going to be in a house, and the ambient temperature ranges are going to be within the limits. For such a simple application chip, the application workloads are going to be limited and few and simple. The data requirements include the data you need to carry from vector to power analysis, different transients, as well as average power analysis or long window, then di/dt analysis, thermal analysis or mechanical stress analysis, etc. For this application, all the data is very limited, and the chip is typically not that thermally challenging.”
Now compare that with a chip used in an automobile. “In an extreme scenario of an automotive chip, like the microcontroller which controls the main motherboard of a car, ambient temperatures are going to be everywhere,” Saif said. “You will drive the car from Alaska to Arizona. And the chip has to perform very-high-speed applications if it is processing the video from the camera on the dashboard, to all the sensors of the car, to everything else. Particularly with self-driving cars, those applications are super data-hungry. The chip has to perform many tasks at the same time, and that’s why it’s so much bigger too. Here, the thermal constraints are going to be through the roof, and the real-world applications that you need to consider in order to cover the range of scenarios are going to be numerous compared to the simple scenario. Here, there is a lot of data to deal with, and a lot of analysis has to be done. Data processing will heavily leverage the network. There will be EDA tools processing and crunching that data in and out. Everything is so data-prone now. You have to deploy the solutions at each stage that are the most efficient in terms of data handling, because you’re dealing with so much data due to the application of your chip.”
Fig. 1: Heat distribution in a multi-die 3D-IC system. Source: Ansys
Indeed, it’s completely application-dependent. “Designers would love to minimize data because it is the biggest enemy as well as the biggest friend,” Saif said. “It helps them make the right decision. But then, as the amount of data increases exponentially, it becomes very difficult to handle that data and still keep the analysis meaningful. You can get lost in data really fast. I’ve seen some of the power and thermal data files from cutting-edge GPU designs. The human mind can’t grasp the amount of data that’s there. You have to rely on high-end EDA solutions to distill that data into meaningful reports a team member can understand, and then take the next step.”
What to do about thermal problems
Once thermal issues are identified, the next question is what to do about them. With 3D-ICs, there are several approaches.
In some cases, users have choices with respect to the materials used in their package assembly. “When this is the case, some tradeoffs can be made by swapping one material for another in order to better address the heat impacts,” Ferguson said. “As the industry moves toward heterogeneous design kits, however, this may become more difficult, given a reduction in the choices available. Another approach is to reconsider the placements of the individual chiplets. Making decisions on which chiplets are lower in the stack versus higher, and/or making placement decisions to best reduce the paths of power, can help. Finally, there are also steps that can be taken in the chiplets themselves, such as introducing copper pillars to help more quickly pull heat from an upper-level chiplet out to a heat sink.”
The biggest problem with all of these approaches is that it can take months of careful design to detect problems, and by then it may be too late to make corrections in time to launch the targeted product. This is why there is so much emphasis on early analysis and iteration. Whatever can be shifted left may help considerably.
“In initial stages, just as has been done historically by package designers, chiplets can be considered as uniform materials and users can make simplified settings for the power,” Ferguson noted. “Doing this at the early floor-planning stages can help to prevent obvious thermal hotspots in the 3D-IC. This approach also can be used to make decisions between competing 3D-IC approaches and stacking configurations. Various forms of automation, including different AI approaches, could be applied at this level. As the design begins to mature, more information on the chiplet geometries and materials can be applied. And as the power maps start to mature, more and more accurate assessments can be made to provide feedback to the designer before a catastrophe hits. When chiplets have full circuitry in place, then post-layout netlists can be extracted with thermal properties associated, allowing for initial assessment of the electrical impacts due to the heat to determine if any signals have been pushed out of spec. By following these approaches, various course corrections can be made along the way, ultimately leading to final sign-off to confirm both temperature and electrical specifications have all been properly met.”
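The early floor-planning step Ferguson describes, with chiplets treated as uniform power-density blocks, can be approximated with a very coarse grid model. The sketch below solves a simplified 2D steady-state heat equation over a made-up two-chiplet floorplan to flag the obvious hotspot; every number in it is an illustrative assumption, not a substitute for a real thermal solver.

```python
# Very coarse sketch of early floor-planning thermal screening: each "chiplet" is
# a uniform block of power density on a grid, and a simplified steady-state heat
# equation with a fixed boundary temperature is solved to locate the hotspot.
# Geometry, power values, and the lumped conductance constant are all made up.
import numpy as np

GRID = 64                  # cells per side of the package floorplan
T_BOUNDARY_C = 45.0        # assumed temperature at the package edge / heatsink
K = 4.0                    # lumped conductance term (illustrative units)

def power_map():
    """Place two 'chiplets' as uniform power-density blocks (arbitrary units)."""
    p = np.zeros((GRID, GRID))
    p[10:30, 10:30] = 1.0      # e.g., a compute chiplet
    p[35:55, 30:60] = 0.4      # e.g., an I/O or memory chiplet
    return p

def solve_temperature(p, iters=5000):
    """Jacobi iteration for temperature with fixed-temperature boundaries."""
    t = np.full((GRID, GRID), T_BOUNDARY_C)
    for _ in range(iters):
        interior = (0.25 * (t[:-2, 1:-1] + t[2:, 1:-1] + t[1:-1, :-2] + t[1:-1, 2:])
                    + p[1:-1, 1:-1] / K)
        t[1:-1, 1:-1] = interior
    return t

if __name__ == "__main__":
    temps = solve_temperature(power_map())
    hot = np.unravel_index(np.argmax(temps), temps.shape)
    print(f"hottest cell {hot}, ~{temps.max():.1f} C (boundary {T_BOUNDARY_C} C)")
```

Even a toy model like this shows why placement matters: moving the high-power block toward the boundary, or splitting it, visibly lowers the peak.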
System tradeoffs
Tradeoffs can vary significantly, depending on the application. Power-hungry systems, such as NVIDIA Grace Blackwell-type servers, are being developed largely to deal with huge amounts of data and rapid AI multiply/accumulate computations. In those cases, getting enough power consistently to the MAC processing elements is critical, which in turn requires a system-level understanding of how the thermal gradients will shift.
“When you’re really pushing performance, you have to worry not only about the correct transmission of data, but you also have to worry about delivering the right amount of power,” Rambus’ Woo said. “Power delivery is becoming a much more difficult problem as you have to deliver more and more power. But you also have to deliver high-quality power, which means the voltage can’t be jumping around. It’s got to be very consistent. Sometimes what you have to do on the power side of things is convert the voltages, but in the act of converting a voltage, you end up losing some power, because it’s just inefficient. That is another concern.”
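The conversion-loss point is simple arithmetic: at a given converter efficiency, a fixed fraction of the delivered power is dissipated as heat in the conversion itself. The load and efficiency figures below are assumed purely for illustration.

```python
# Quick arithmetic on conversion loss: delivering load power at a given converter
# efficiency burns load * (1/efficiency - 1) watts in the conversion stage itself.
# The 700 W load and the efficiency values are assumed example numbers.

def conversion_loss_w(load_power_w: float, efficiency: float) -> float:
    """Power dissipated in the converter to deliver load_power_w at the given efficiency."""
    return load_power_w * (1.0 / efficiency - 1.0)

if __name__ == "__main__":
    load = 700.0                                                  # watts at the point of load
    print(f"{conversion_loss_w(load, 0.90):.0f} W lost at 90% efficiency")   # ~78 W
    print(f"{conversion_loss_w(load, 0.95):.0f} W lost at 95% efficiency")   # ~37 W
```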
As chips run faster and faster, the correct transmission of data becomes more difficult. Heat can disrupt or distort timing, and memories can leak electrons. So in addition to on-die error correction, there can be error correction between the DRAM and the host using additional bits. “Extra bits will help you figure out if something went wrong in the transmission of the data back to the host,” Woo said. “It’s the host or the memory controller’s responsibility to look at all of these bits and say, ‘I know I encoded it with this particular error correction algorithm. The data that comes back, I sent through a decoder. It had better look right, and if it doesn’t, I have enough extra bits to correct some number of errors, the maximum I would reasonably expect in a system.’ There are cases where you exceed that. In some cases, you can actually detect when you’ve gone over the maximum number of errors. In other cases, you can’t.”
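On the link itself, the detection half of what Woo describes can be sketched with any checksum scheme: the sender appends check bits computed over the payload, and the receiver recomputes them to see whether anything flipped in flight. A CRC is used below only as a familiar stand-in for those extra bits; actual memory interfaces use the CRC and ECC schemes defined by the relevant DRAM standard.

```python
# Sketch of link-level error detection: append check bits at the sender,
# recompute them at the receiver. A CRC only detects errors; recovery comes
# from a retry or a correcting code, which this toy does not model.
import zlib

def send(payload: bytes) -> bytes:
    """Append a 4-byte CRC to the payload before it goes over the link."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")

def receive(frame: bytes) -> tuple[bytes, bool]:
    """Return (payload, ok). ok is False if the recomputed CRC doesn't match."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "little")
    return payload, zlib.crc32(payload) == crc

if __name__ == "__main__":
    frame = bytearray(send(b"\x10\x20\x30\x40"))
    frame[1] ^= 0x04                      # a single bit flipped in transit
    _, ok = receive(bytes(frame))
    print("error detected:", not ok)
```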
The big concern there is silent data corruption. “That can happen on die, and it can happen when the data is transmitted back across the data link and even on the CPU,” Woo said. “When the CPU is operating on data, there can be issues where the data gets corrupted even on the processor and with a lot of these things involving temperature. When you’re communicating data between a DRAM back to the host, the host has the circuits and is trying to figure out when to sample the wire for a low or a high voltage to determine if it’s a one or a zero.”
The time to sample moves around with temperature and voltage variations, transforming this into a mammoth challenge. “It gets even worse, because if there are lots of other things happening on your processor, they could be dragging the voltage or generating a lot of heat,” Woo explained. “Then, the receiving circuits, even though they’re not generating that much heat or they’re not taking up much voltage, could be affected by the neighboring circuits. That’s another way temperature can affect the transmission of the data back and forth, and that affects the DRAM as well as the processor side of things.”
Related Reading
Chip Aging Becoming Key Factor In Data Center Economics
Rising thermal density, higher compute demands, and more custom devices can impact processing at multiple levels.
Design Flow Challenged By 3D-IC Process, Thermal Variation
Rethinking traditional workflows by shifting left can help solve persistent problems caused by process and thermal variations.