Dealing With Heat In Near-Memory Compute Architectures

Shortening data paths between processor and memory may help, but not always.


The explosion in data is forcing chipmakers to get much more granular about where logic and memory are placed on a die, how data is partitioned and prioritized to utilize those resources, and what the thermal impact will be if they are moved closer together on a die or in a package.

For more than a decade, the industry has faced a basic problem: moving data can be more resource-intensive than actually computing on it. Several key variables need to be considered:

  • Higher data volume and longer distances between memory and processor often result in lower performance and more heat;
  • On-chip and in-package temperature can vary based upon how many processing elements are used and how often data needs to move back and forth between processor and memory; and
  • Wire diameters and different materials can either speed up or slow the movement of data.

In most cases, shortening the distance between processing and memory can have a significant impact on performance and on heat. Still, none of this comes for free.

“Those two go hand-in-hand,” said Ron Lowman, strategic marketing manager for IoT at Synopsys. “The reason people are looking at in-memory compute is for exactly those reasons. Still, everybody’s realized that they’re going to have to liquid cool these AI accelerators in the data center. From a thermal perspective, there’s a big cost there. But there’s also huge power savings for in-memory compute. The whole idea is pervasive in the industry, because depending on the algorithm you’re using and the processing that you’re using for AI, well over 20% of the power budget can be just accessing memory, and that impacts the power consumption as well as the cost of the total implementation.”

The question design teams need to consider is what they are trying to optimize for a particular application or use case. For example, the needs of an AI system are very different from those of a system that contains some AI functionality. And they are much different in a smart phone, where a reaction time of a few milliseconds may be acceptable, than in a safety-critical application such as an autonomous vehicle or missile guidance system, where real-time response is essential.

“If you’re doing any kind of compute in a memory device, it’s going to cause some heating, so you have to get to the right balance,” said Steven Woo, fellow and distinguished inventor at Rambus. “But people are lured in. In a CPU today, the aggregate amount of silicon area across all the DRAM chips is so much larger than the area of a CPU. So you’re tempted, because you have so much more area to do all this in. But this thing you think you can take advantage of actually becomes harder because airflow is notoriously difficult in or near memory.”

Put simply, shrinking the distance between processor and memory involves a series of tradeoffs, and those tradeoffs can be highly domain-specific. “I look at near-memory compute as how close you can get traditional logic and memory process technologies together,” said Steve Pawlowski, vice president of advanced computing solutions at Micron. “The closest would be you slap them together and hybrid bond. Pins are expensive, so if you can get the near-memory compute where there is a piece of silicon with the memory right on top, you can take advantage of the width of the memory and minimize the amount of data movement between the memory and the logic to get extremely high bandwidth and low efficiencies.”

But the total cost on a chip, or a system, needs to be fully understood. “If I have to normalize the power consumption between memory accesses versus compute, I know for a fact that it is orders of magnitude higher to get the data to the compute elements than the actual compute operation itself,” said Ramesh Chettuvetty, senior director at Infineon Technologies. “So that would mean that people have to find ways to reduce the number of data movements back and forth between memory and compute elements. But even with HBM, or other architectures that interface with HBM, they still have hundreds of watts of power consumed for the peak operations, so it mandates cooling techniques.”

Other issues
This adds a whole new element to floor-planning. In advanced packages, it frequently requires an understanding of proximity effects, how much heat or noise each component will generate, and how that will be dealt with in the context of use cases and other elements.

“You don’t want to put data in one corner of the die, and then have it go completely across to the other side of the die to be used. Then you’re burning power, and wires don’t scale,” Pawlowski said. “But there’s also 40 years of software all over the planet that wasn’t written to near-memory compute specs. With AI, by comparison, architecture and software are optimized for each other.”

Even in systems built from the ground up for near-memory compute, which theoretically should be able to deal with all of these issues, there are challenges. “AI chips require massive amounts of memories that are closely integrated,” said Preeti Gupta, director, product management at Ansys. “The only way that these systems can achieve their goals is by integrating multiple dies closer together, whether 2.5D, which uses an interposer substrate, or 3D, with dies stacked on top of each other. With 3D, the thermal problem exacerbates because the heat gets trapped between the dies and cannot escape as readily.”

For design and thermal engineers, the resulting Jenga-like structures are painful to contemplate, with their coupled and cascading physics effects. “There’s a lot of modeling that is required to understand the temperature profiles,” Gupta said. “To be able to understand airflow around the system, you have to be able to model the effect of temperature, not just on power consumption, but also on the mechanical part at the time. For example, the package may warp. So, there is electrical, there is mechanical, there is computational fluid dynamics. You cannot just look at mechanical in isolation. You have to also incorporate the impact of thermal on mechanical stress, encompassing multi-physics.”

New approaches
These issues are creating a to-do list for engineers and physicists. In fact, because engineering teams keep trying to place components closer together, which causes thermal problems, do architectures need to be reconsidered?

“The rethinking of architectures is what is getting us to the near-memory or in-memory compute in the first place,” said Chettuvetty. “There are several architecture techniques that are already being deployed, like cache-coherent architectures. You want to partition the caches, such that multiple cores can share the caches. And these caches are synchronized, by architecture, to make sure that the data dependencies are already taken care of. Those architectural-level changes are being deployed currently, in multi-core environments. But there are still bottlenecks.”

For example, in AI inferencing, there is no way to store the number of weights required on an SoC, which could be as many as 80 million, in an embedded fashion, so connected memory must be used.

“Most of the time, we should have a very efficient data flow architecture in items using a memory controller,” Chettuvetty said. “If you rely on traditional conventional monolithic architectures, the compute and storage elements are separate. In that case, enormous amounts of memory are needed, which cannot be realized by embedded means at this point, so we will have to rely on external memory. There, the only option is to bring it as close as possible so that the capacitive drag on the interfaces is extremely low. That means I can lower the voltages on the lines. If I bring down the voltages and the swings are limited, then I consume that much less power. I can bring down the capacitance, and if I can bring down the voltage swings on these interfaces, I can reduce the power consumption that much. Those are techniques design teams are exploring on most of the high speed interfaces.”
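The link Chettuvetty draws between capacitance, voltage swing, and power matches the standard first-order switching-power approximation, P ≈ α·C·V²·f. The sketch below is only an illustration of that trend. The function name and every numeric value are hypothetical and are not taken from any real interface.

```python
def switching_power_watts(capacitance_f, voltage_swing_v, frequency_hz, activity=1.0):
    """First-order dynamic power of one signal line: P = activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_swing_v ** 2 * frequency_hz


# Hypothetical values, chosen only to show the trend.
# A longer off-package trace: more capacitance, larger swing.
long_trace_w = switching_power_watts(
    capacitance_f=2e-12, voltage_swing_v=1.2, frequency_hz=2e9
)
# A shorter in-package connection: less capacitance, reduced swing.
short_trace_w = switching_power_watts(
    capacitance_f=0.5e-12, voltage_swing_v=0.4, frequency_hz=2e9
)

# Power scales linearly with capacitance but with the square of the swing,
# so a 4x capacitance cut and a 3x swing cut combine to roughly 4 * 9 = 36x.
reduction = long_trace_w / short_trace_w
print(f"per-line power reduction: {reduction:.1f}x")
```

Because the voltage term is squared, limiting the swing tends to matter more than an equal proportional cut in capacitance, which is consistent with the emphasis in the quote.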

Stacking effects
Across the industry, there is increasing focus at the leading edge on stacking die, whether that is done in 2.5D, 3D, or in fan-outs with pillars. In some cases, there are even 2.5D and 3D-ICs being packaged together. In all of these, the goal is to shorten distances of critical data paths and to improve throughput.

“Thermal issues are going to become more prevalent as we adopt 2.5D and 3D packaging,” said Synopsys’ Lowman. “We’ve seen a big uptick. We’ve introduced technologies, like high-bandwidth memory. It’s hugely advantageous because we’re running out of pins to increase bandwidth on traditional DDR and GDDR. HBM provides parallelism. So being able to put a stacked memory on top of that already has proven extremely beneficial, and we’re going to continue to see that adoption increase. While it is costly technology to implement, if you need performance, that’s where you’re going to have to go. For AI, you have to adopt technologies like that. We also have die-to-die technologies, because on-chip SRAM is very important for AI or on chip memories. They’ve forgone having DRAM in the system, so what they do is connect chips with lots of on-chip memory, and lots of AI compute elements together. They do that via die-to-die technology to increase performance. While this started with AI, we are seeing it migrate to server chips, as well as on the latest PC architectures. That will continue to expand, but there are thermal issues with 3D packaging. It’s an engineering field that should continue to grow.”

Further, AI/ML power requirements and the consequent architectures may usher in more thinking about how to actively cool DIMM modules. “In the past, we’ve seen a lot of forced air used for cooling, but the heat capacity of water is just so much better than air,” said Rambus’ Woo. “There will likely be broader adoption of liquid cooling, but the immersion ingredients are expensive because they’ve got to be non-corrosive.”
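Woo's point about heat capacity can be made concrete with the basic heat-transport relation Q = ṁ·c_p·ΔT. The sketch below compares the volumetric flow of air and water needed to carry away the same heat load. The fluid properties are approximate room-temperature textbook values, and the 300 W load and 10 K coolant temperature rise are hypothetical.

```python
# Approximate room-temperature fluid properties (textbook values).
AIR = {"density_kg_m3": 1.2, "cp_j_per_kg_k": 1005.0}
WATER = {"density_kg_m3": 997.0, "cp_j_per_kg_k": 4180.0}


def volumetric_flow_m3_per_s(heat_w, fluid, delta_t_k):
    """Coolant flow needed to remove heat_w with a temperature rise of delta_t_k.

    From Q = m_dot * cp * dT, where m_dot = density * volumetric flow.
    """
    return heat_w / (fluid["density_kg_m3"] * fluid["cp_j_per_kg_k"] * delta_t_k)


heat_w = 300.0   # hypothetical heat load
delta_t_k = 10.0  # hypothetical allowed coolant temperature rise

air_flow = volumetric_flow_m3_per_s(heat_w, AIR, delta_t_k)
water_flow = volumetric_flow_m3_per_s(heat_w, WATER, delta_t_k)

# Ratio of the volumes of air and water that must move per second.
ratio = air_flow / water_flow
print(f"air needs about {ratio:.0f}x the volumetric flow of water")
```

Under these assumptions the gap is several thousand to one by volume. The model ignores pumping power, heat-transfer coefficients, and plumbing, so it only indicates why liquid is attractive, not what a real system would require.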

Different riffs
Indeed, thermal re-thinking extends not only from cooling to basic architecture, but also to who’s invited to create new approaches.

“The lines are blurring between chips and system design,” said Gupta. “These are not two disparate teams anymore. They must work together, which leads to a need for open, extensible platforms.”

For example, the 7nm IBM Telum microprocessor, which integrated AI capabilities, presented a re-defined cache architecture. The microprocessor contains 8 processor cores, clocked at over 5GHz, with each core supported by a redesigned 32MB private Level-2 cache. The Level-2 caches interact to form a 256MB virtual Level-3 and 2GB Level-4 cache.
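The cache figures above are consistent with simple pooling arithmetic, sketched here. The per-core and per-chip numbers come from the paragraph above. The chip count is not stated in this article and is only inferred from dividing the Level-4 size by the Level-3 size.

```python
MB = 1
GB = 1024 * MB

cores_per_chip = 8
l2_per_core = 32 * MB

# The private L2 caches on one chip pool into the virtual L3.
virtual_l3 = cores_per_chip * l2_per_core

# The article gives 2GB for the virtual L4. Dividing by the per-chip L3
# gives the number of chips whose caches would have to be pooled.
# This count is an inference, not a figure from the article.
virtual_l4 = 2 * GB
implied_chips = virtual_l4 // virtual_l3

print(f"virtual L3: {virtual_l3} MB, implied chips sharing L4: {implied_chips}")
```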

“AI is a very compute intensive activity and therefore a power intensive activity,” said Christian Jacobi, distinguished engineer and chief architect for microprocessors at IBM. “The way we’re doing this on these systems is by integrating it into the processor chip. We reduce some of the energy cost of doing AI, because we can access the data where it already lives. I don’t need to take that data and move it somewhere else, move it to a different device, or move it across a network or move it through a PCI interface to an I/O attached adapter. Instead, I have my localized AI engine, and I can access the data there, so at least we can reduce that overhead of getting the data to the computer and back. Thus, there’s a power efficiency that comes from being able to run massive amounts of workload consolidated on z16 and LinuxONE systems, and how the integrated AI accelerator helps with the power efficiency of those new workload components in the context of the traditional workload components.”

According to Jacobi, this achievement required working closely with the power supply team and the thermal team to develop advanced power supply and thermal solutions. “We’re investigating and developing new technology to extract heat from the chips. We have a heatsink on the processor, extract the heat with water, and then exchange that water heat with the data center. For the future, we’re optimizing the thermal interface between the chip and heat sinks to get more efficient cooling capabilities.”

Other ideas under consideration include shifting workloads among interconnected data centers, depending on both internal circumstances like processing overloads, and external circumstances like heatwaves. There are also approaches in place, like power system management, which turns off or down those parts of chips that are not actively needed. This strategy is very visible in smart phones, where the display powers off when the user is not looking at it.

But even the most well-balanced systems could be vulnerable to power viruses and the consequent thermal stress, Jacobi noted.

While near-memory compute reduces the distance that data travels, and can reduce the amount of data that needs to be sent longer distances, it’s not the only solution. And in some cases, it may not be the best solution.

The challenge is that so many pieces can potentially interact in a complex design that they need to be considered in the context of the entire system.

“If you look at this holistic system with multiple components included, and then fire each of them with their power and check the thermal conduction and follow the physics, that gives at least a first-order approximation of how the heat is generated, where it goes, and how much temperature there’s going to be on a particular surface,” said Lang Lin, principal product manager for Ansys. “Simulation can at least estimate it in the right way.”
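The kind of first-order estimate Lin describes can be illustrated with a one-dimensional series thermal-resistance model of a die stack. This is a deliberately simplified sketch, not how a commercial simulator works. It assumes steady state, a single heat path out through the heat sink, and hypothetical power and resistance values. The function and die names are made up for illustration.

```python
def stack_temperatures(ambient_c, dies):
    """Steady-state die temperatures for a 1D stack with one heat path.

    dies: list of (name, power_w, r_k_per_w), ordered from the die nearest
    the heat sink to the die farthest from it. r_k_per_w is the thermal
    resistance between that die and the previous element toward the sink
    (for the first die, the path through the sink to ambient).
    """
    temps = {}
    temp = ambient_c
    # Heat crossing a given resistance is the power of every die beyond it.
    heat_through_w = sum(power_w for _, power_w, _ in dies)
    for name, power_w, r_k_per_w in dies:
        temp += heat_through_w * r_k_per_w
        temps[name] = temp
        heat_through_w -= power_w
    return temps


# Hypothetical three-die stack; all values are illustrative assumptions.
dies = [
    ("top_die", 50.0, 0.2),     # 0.2 K/W from this die through the sink to ambient
    ("middle_die", 5.0, 0.5),   # 0.5 K/W across the bond to the die above
    ("bottom_die", 5.0, 0.5),
]
temps = stack_temperatures(ambient_c=40.0, dies=dies)
for name, temp_c in temps.items():
    print(f"{name}: {temp_c:.1f} C")
```

Even in this toy model, the die farthest from the heat sink ends up hottest although it dissipates little power, because its heat has to cross every layer above it. Lateral spreading, transients, and the mechanical coupling described earlier are all ignored here.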
