New Approaches For Dealing With Thermal Problems

Keeping chips operating within acceptable limits is becoming more difficult and expensive.


New thermal monitoring, simulation and analysis techniques are beginning to coalesce in chips developed at leading-edge nodes and in advanced packages in order to keep those devices running at optimal temperatures.

This is particularly important in applications such as AI, automotive, data centers and 5G. Heat can kill a chip, but it also can cause more subtle effects such as premature aging over time. The problem is that it is difficult to account for all of the use cases and conditions under which those chips might operate and all of the potential interactions in complex systems.

Power management techniques have existed for years to keep chips within power and thermal budgets, but they are expensive, time-consuming, and sometimes difficult to work with. But they also are becoming more necessary, particularly in safety- and mission-critical applications. A slowdown in Moore’s Law also has brought a slowdown in power scaling. For most applications, voltages have stopped scaling altogether.

“This means that increasingly we have dark silicon,” said Steven Woo, fellow and distinguished inventor at Rambus. “You really can’t turn on every transistor on the chip and use it at the same time. This is a trend that will continue. As chips become more functional with more transistors, there will be more dark transistors, and this requires active monitoring.”

At the top end of the high-performance computing segment, there may be more use of advanced cooling technologies. This includes water or liquid cooling of some kind, where liquids are being sent over cooling plates, as well as immersion cooling techniques in inert fluids. This is still rare for mainstream data centers, but that likely will change.

Thermal issues also have driven companies to rethink chip architectures. What used to fit on one chip increasingly is being spread out across multiple chips in a package. Heat is still an issue, but it’s potentially more manageable.

“3D-IC is happening because Moore’s Law is coming to a physical limit,” said CT Kao, product management director, multi-physics system analysis at Cadence. “We cannot go to 2nm or 3nm, so different functions have been added as chips and IP on a common substrate. Instead of building multiple chips in a 2D planar layout, we’ve started going in the third direction, and when you put all those things on one small substrate within the silicon chip, everything’s so tiny and you have hot neighbors in the next building. You want to analyze how to dissipate all that heat, no matter if it is transient (real-time) or static. Vias with high thermal conductivity must be built up in between the buildings, so the heat will be transferred more effectively downward, through the substrate, through something beneath it such as the PCB, and then out. Here, the questions include how many thermal vias need to be placed and the best spacing of those vias. These considerations matter, especially with signals switching very fast in the design and things clustered together.”

The number of possible interactions and configurations is almost limitless. “There are so many configurations and so many different partitions of the power — different blocks, and different placement of the blocks in the chip in the top die, in the bottom die, in the center die,” said Norman Chang, chief technologist for the semiconductor business unit of Ansys. “Engineering teams need a tool that can provide an almost real-time thermal analysis view of the 3D-IC design. It doesn’t have to be a very detailed view that uses millions of finite element meshes on the chip, but for 3D-IC design for the interposer, the number of thermal vias, the placement coordinates, and the diameter of the thermal vias, it needs to be done in almost real-time simulation.”
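The physics behind those via questions is simple at first order: each thermal via acts as a thermal resistor (R = L / kA), and vias in parallel divide that resistance. A back-of-the-envelope sketch, using illustrative via counts and dimensions rather than figures from any real design:

```python
# Rough estimate of the thermal resistance of an array of copper-filled
# thermal vias. Via count, diameter, and length are illustrative values.
import math

K_COPPER = 400.0  # thermal conductivity of copper, W/(m*K)

def via_array_resistance(n_vias: int, diameter_m: float, length_m: float,
                         k: float = K_COPPER) -> float:
    """Thermal resistance (K/W) of n identical vias in parallel."""
    area = math.pi * (diameter_m / 2) ** 2   # cross-section of one via
    r_single = length_m / (k * area)         # R = L / (k * A)
    return r_single / n_vias                 # parallel combination

# 100 vias, 10 um diameter, 100 um long
r = via_array_resistance(100, 10e-6, 100e-6)
```

Doubling the via count halves the array's thermal resistance, which is why via count and spacing dominate the trade-off Kao describes.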

Fig. 1: Predicting heat on a chip. Source: Ansys

Real-world use cases
The monitoring doesn’t end there, however. With different applications and potential interactions, thermal simulation is just the starting point. Devices need to be monitored in the field, and that involves more than just sending an alert when the temperature reaches a maximum set point. Data also need to be collected, analyzed, and looped back to engineering teams to make adjustments in existing devices, where possible through programmable logic, firmware or software, as well as in future designs.

This isn’t always as straightforward as it looks. Thermal monitors need to be scattered across a die, and they all need to be working reliably and in concert. “If you sense the temperature at one area of the die, and the data converters that turn it into a digital value are far away, you lose accuracy,” said Suresh Andani, senior director of IP cores at Rambus. “In order to measure hotspots at several places on the chip, you absolutely need to have that.”

Rambus has been building temperature monitoring diodes into all of its IP. “You just hook an ADC to it, and then you can read out the temperature,” Andani said. “A lot of that is fed into the BMC (baseboard management controller). If it’s going into a server, the baseboard management controller reads it and takes necessary action, whether it’s voltage scaling or frequency scaling, to cool it down.”
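As a rough illustration of that flow, the sketch below converts a diode reading into a temperature and picks a management response. The forward-voltage reference, temperature coefficient, and thresholds are assumed typical values, not Rambus or BMC specifics:

```python
# Sketch of turning a temperature-diode ADC reading into degrees C and
# letting a management controller choose a response. All constants here
# are illustrative assumptions.

V_REF_25C = 0.700   # diode forward voltage at 25 C (assumed), volts
TEMPCO = -0.002     # roughly -2 mV/C, typical for a silicon diode

def diode_to_celsius(v_diode: float) -> float:
    """First-order linear conversion of diode voltage to temperature."""
    return 25.0 + (v_diode - V_REF_25C) / TEMPCO

def bmc_action(temp_c: float) -> str:
    """Crude policy: scale voltage/frequency down as temperature rises."""
    if temp_c >= 105.0:
        return "shutdown"
    if temp_c >= 90.0:
        return "scale_voltage_and_frequency"
    return "nominal"

temp = diode_to_celsius(0.560)   # a hotter diode has lower forward voltage
action = bmc_action(temp)
```

In a real server, the read-and-act loop runs in the BMC firmware; the point of the sketch is only the shape of the conversion and the policy.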

Stephen Crosher, CEO of Moortec, contends that thermal monitoring is absolutely critical for finFET devices. “For 28nm and 22nm, it’s highly desirable. You go to 40, 65, or 90nm, you can probably design carefully enough to try and avoid having monitoring, although it depends on the application. Low power applications at large nodes may be less sensitive. But the demand from the design community is not just having one or two sensors on a finFET chip. It’s having hundreds spread across the die so they can make better assessments of power and activity across the die.”

Changes in thermal monitoring
Thermal sensing and thermal analysis have been real-time for quite some time, but they have improved significantly. Sampling rates are up and measurements are now more precise, allowing chip and system design teams to make adjustments during mission mode.

“Distributed thermal sensors can be placed quite ubiquitously across the die,” said Crosher. “Because the sense points are so tiny, you can place them strategically and in fairly high numbers across the die, which allows thermal mapping and profiling across the die and becomes very helpful for the applications.”

Where thermal sensors are placed depends on the application, he said. “In a high-performance computing application, there are likely a few processor cores within the chip. It might be 4, it might be 16, and thermal sensors can be placed within each of those cores. And because the sense points are so small, they actually can sit within the processor cores themselves. So you get a really localized understanding of heating effects as the software makes the cores work harder and they heat up. You can be sensing very closely.”

In AI applications, where hundreds, thousands, or even tens of thousands of AI cores or accelerators are placed, it makes sense to group these monitors into clusters because the cores are so numerous, Crosher continued. “They’re much smaller than a standard processor because they only need to do computations that are very specialized for short periods of time to run the learning algorithms or the AI algorithms, and monitoring clusters can cover anywhere from four of these repeated elements to 32 or 128.”

Automotive is another area in which engineering teams are looking for the hot areas of the die that are going to exhibit the highest activity and the highest temperature. This could be an interface block, a high-speed SerDes, or a USB interface. “The requirement there is to monitor the point that’s going to be at the highest temperature for the longest, because that’s where you get the thermal stresses, and those are the areas that will shorten the lifetime of the chip. For automotive, that’s especially undesirable,” he explained.

Once the measurements are made, the data is sent by telemetry. In mission mode, it goes to the main processor, or to the main software running on the device. That allows the device, via software, to slow the clock frequency or scale back activity. When the activity is pulled back, the temperature drops back down again. The supply voltage also can be lowered if the device is getting too hot.
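That feedback can be sketched as a simple hysteresis loop: throttle the clock when a high watermark is crossed, and restore it only once the die has cooled past a lower one. The thresholds and frequencies below are illustrative:

```python
# Minimal thermal throttling loop with hysteresis. Thresholds and clock
# frequencies are illustrative values, not from any real device.

T_HIGH, T_LOW = 95.0, 85.0        # throttle / restore thresholds, C
F_FULL, F_THROTTLED = 3.0, 1.5    # clock frequencies, GHz

def next_frequency(temp_c: float, current_ghz: float) -> float:
    if temp_c >= T_HIGH:
        return F_THROTTLED        # too hot: pull the clock back
    if temp_c <= T_LOW:
        return F_FULL             # cooled down: restore full speed
    return current_ghz            # inside the band: hold (hysteresis)

freq = F_FULL
for reading in [80.0, 97.0, 92.0, 84.0]:   # simulated sensor telemetry
    freq = next_frequency(reading, freq)
```

The hysteresis band keeps the clock from oscillating when the temperature hovers near a single threshold.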

Additionally, temperature can be monitored over long periods of time within the die, Crosher said. “If you can monitor not just one chip, but all chips across the product range over long periods of time, then you start to see patterns emerge. That’s where things get quite powerful. You can look across a product range and see this bunch of devices — or this product over here in this particular application, or maybe certain devices in a data center, or part of the fleet of cars in a certain country, whatever it may be — are running at high die temperatures for long periods of time, which means reliability is going to be impacted. At that point you can do something about it, which may be early maintenance, recalling the product, or treating those devices in a different way such as slowing clocks down or taking some mitigating action so that the devices can run for longer in their lifetime. That’s one of the really powerful things about this, and there are big opportunities in terms of that analysis. But the data has to be reliable and it has to be precise. Then you can make better decisions with that information.”
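A minimal sketch of the fleet-level pattern detection Crosher describes, flagging devices that spend too large a fraction of their logged life above a temperature limit. The limit, fraction, and device IDs are assumed policy values and hypothetical names:

```python
# Flag devices whose logged die temperatures exceed a limit for too
# large a fraction of samples. Threshold and fraction are illustrative.

def flag_hot_devices(fleet_logs: dict,
                     limit_c: float = 95.0,
                     max_fraction: float = 0.25) -> list:
    """Return device IDs spending > max_fraction of samples above limit_c."""
    flagged = []
    for device_id, temps in fleet_logs.items():
        hot = sum(1 for t in temps if t > limit_c)
        if temps and hot / len(temps) > max_fraction:
            flagged.append(device_id)
    return flagged

logs = {
    "car-0031": [88, 97, 99, 96, 98],   # mostly hot: candidate for action
    "car-0042": [72, 75, 80, 78, 74],   # comfortably cool
}
hot_list = flag_hot_devices(logs)
```

Devices on the list would be candidates for the mitigations mentioned above: early maintenance, slower clocks, or a recall.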

The more that is understood about how a device is impacted by the use case, the better the understanding of where to place the sensors for the next generation design. That can include everything from understanding where the actual center point is in a hotspot to how future chips should be designed to maintain the optimal operating temperature.

Applying machine learning
Another change involves real-time modeling to verify against the thermal sensor readings on the chip. “If we can build a model-based digital twin for thermal sensing, we can perform power calculation very quickly, as well as detailed thermal analysis,” said Ansys’ Chang. “If you look at the traditional finite element method running on a chip with a mesh of millions of elements, it may take anywhere from several hours to 15 or 20 hours for a big AI chip. How do we bypass this requirement?”

What could work here, he said, is a machine learning-based method to perform on-chip thermal analysis within several seconds, or maybe a minute or two, for each specific workload. In this case, a particular workload means a particular vector scenario exercising different blocks of the chip.

“This is the first need to be filled — a real-time thermal simulation solution that can produce, within a second, a reading to check against the thermal sensor on the chip,” said Chang. “And if we can do a thermal analysis for the different scenarios happening during that one second, we possibly can push down to 100 milliseconds, because the thermal time constant is longer than the electrical time constant and is usually in the 100-millisecond range. This thermal simulation is the first step, and we are trying to build a real-time thermal simulation capability using a deep neural network to construct the model.”
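Chang's approach uses a deep neural network; as a stand-in, the sketch below fits the simplest possible reduced-order model, a linear map from block power to temperature, against a few synthetic "solver" data points, then evaluates it instantly instead of rerunning a slow finite-element simulation. All data and coefficients are synthetic:

```python
# Toy reduced-order thermal surrogate: least-squares fit of T = a*P + b
# over (power, temperature) pairs that stand in for detailed solver
# results. A production model would be a trained neural network.

def fit_linear(samples: list) -> tuple:
    """Least-squares fit T = a * P + b over (power_w, temp_c) pairs."""
    n = len(samples)
    sp = sum(p for p, _ in samples)
    st = sum(t for _, t in samples)
    spp = sum(p * p for p, _ in samples)
    spt = sum(p * t for p, t in samples)
    a = (n * spt - sp * st) / (n * spp - sp * sp)
    b = (st - a * sp) / n
    return a, b

# Synthetic "solver" results: ambient 45 C, about 2.5 C of rise per watt
solver_data = [(0.0, 45.0), (2.0, 50.0), (4.0, 55.0), (8.0, 65.0)]
a, b = fit_linear(solver_data)

def predict(power_w: float) -> float:
    """Near-instant surrogate evaluation for a workload's power draw."""
    return a * power_w + b

t_est = predict(6.0)   # surrogate estimate for a 6 W workload
```

The fitted model evaluates in microseconds; that speed gap versus hours of finite-element solving is the whole motivation for the surrogate.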

With a machine learning-based thermal solver approach, how then should anomaly detection be performed to check against the thermal readings coming from the chip? AI may offer some help.

Using a neural network model, the design can be checked to determine whether there is thermal runaway because, for example, the heatsink connection is loose. It also could catch runaway in which a temperature increase drives up the leakage current, which raises the temperature further and puts the device into a thermal runaway loop.

“You will observe a local thermal runaway that can be picked up at a thermal sensor,” said Chang. “However, you don’t know if the problem is due to the thermal sensor malfunctioning or due to another problem. If you have a thermal model, a reduced-order model or a deep neural network model, you can check it against the thermal reading under a specific workload. Then, for example, if the thermal sensor says this should be 85°C, and the model says it should be only 65°C, you know there is something wrong with the thermal sensor and you can look into the problem. This is the value of a real-time thermal model/thermal analysis to check against the thermal reading from the thermal sensor.”
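The model-versus-sensor check reduces to comparing two temperatures against a tolerance. A sketch using the 85°C/65°C example from the quote, with an assumed tolerance value:

```python
# Model-vs-sensor anomaly detection: flag a sensor (or the system) for
# investigation when its reading disagrees with the thermal model's
# prediction by more than a tolerance. The tolerance is an assumption.

TOLERANCE_C = 10.0

def check_sensor(sensor_c: float, model_c: float) -> str:
    delta = abs(sensor_c - model_c)
    if delta > TOLERANCE_C:
        return "mismatch: suspect faulty sensor or unmodeled behavior"
    return "consistent"

# The example above: sensor says 85 C, model predicts 65 C
status = check_sensor(85.0, 65.0)
```

A mismatch alone cannot say whether the sensor or the model is wrong, which is why Chang frames it as a trigger to "look into the problem" rather than an automatic diagnosis.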

Another scenario involves a periodic pattern caused by high-power and low-power phases interleaved with each other. The resulting thermal drift also can push the device toward thermal runaway, which shows up in the periodic thermal readings.

Again, the data aspect of this type of engineering could be valuable going forward to ensure better design.

“Once you have in-chip thermal monitoring using thermal sensors, you can use that data to get a better design of your chip in terms of the power constraints, where to scale the power down or up, so that if your thermal envelope has not been reached, you can increase your operating frequency and get a better handle on the workload,” Chang explained. “You can optimize the workload handling and make real-time adjustments to the VDD, as well as the frequency. For AI chips, this could help with partitioning of the workload, because you cannot fire up all the GPU cores at the same time. Due to the different workloads, you need to determine which part of the GPU will be turned on, and which part of the GPU or CPU needs to be turned down, because you don’t want to violate your thermal profile/constraint. With thermal sensors and better thermal models you can optimize the design, and maximize performance in handling the workload.”
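One simple way to realize that kind of partitioning is a greedy selection: power on the most efficient cores first and stop before the power budget, standing in for the thermal envelope, is exceeded. The core names and figures below are hypothetical:

```python
# Greedy, thermally aware workload partitioning sketch: choose which
# cores to enable by performance per watt, within a power budget that
# serves as a proxy for the thermal constraint. Figures are illustrative.

def select_cores(cores: list, power_budget_w: float) -> list:
    """cores: (name, perf, power_w) tuples. Pick best perf/W within budget."""
    chosen, used = [], 0.0
    for name, perf, power in sorted(cores, key=lambda c: c[1] / c[2],
                                    reverse=True):
        if used + power <= power_budget_w:
            chosen.append(name)
            used += power
    return chosen

cores = [("gpu0", 100.0, 50.0), ("gpu1", 90.0, 60.0),
         ("gpu2", 80.0, 30.0), ("cpu0", 40.0, 20.0)]
active = select_cores(cores, power_budget_w=100.0)
```

A real scheduler also would weigh placement, so that the enabled cores are not all clustered into one hotspot, but the budget-limited selection is the core idea.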

It would seem ideal to combine the best of thermal simulation and thermal in-chip monitors. The answer at present seems to be a strong, “maybe.”

“Once we have a machine learning-based model, and the ability to provide real-time performance and thermal simulation data, you are no longer constrained by the long simulation time needed for millions of finite element meshes, and you can run different kinds of workloads for different partitions of the workload,” he said. “Then the thermal profile can be checked immediately, which will be very beneficial for designers. If you are using the traditional finite element method, it takes five to 10 hours for one workload, and that’s too long.”

All of this needs to be set in the context of other changes underway across the industry, such as the hardening of IP in packages in order to improve time to market for semi-custom designs. Chiplets are the best-known example of this approach, and what impact they will have on thermal management is still unknown.

Chiplets developed by a single vendor already are in use by Intel and AMD. Chiplets developed by multiple vendors are on the roadmap for a variety of companies and foundries. There may be more than a dozen chiplets in a package, connected over some interconnect, but not necessarily to each other. How those chiplets are characterized and where they are placed can have a big impact on heat.

“While the chiplet may take less power compared to a bigger chip, the area is very small, so the heat dissipation is limited,” said Rambus’ Andani. “However, the power density or thermal density for a chiplet could be a concern. Overall power may be less than the base die, but the chiplet, since it’s such a tiny piece, could become really hot. If it is an interface chiplet, and it is running 24 channels at 100Gbps, for instance, the ability to measure the temperature and to take some action on it is very critical as we go into more system-in-package types of architectures.”
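The arithmetic behind that concern is just watts per unit area. With illustrative numbers (not from any real product), a small chiplet can dissipate far less total power than a big die yet run at a much higher power density:

```python
# Power density comparison between a large SoC die and a small interface
# chiplet. All areas and power figures are illustrative assumptions.

def power_density(power_w: float, area_mm2: float) -> float:
    return power_w / area_mm2      # W/mm^2

soc_density = power_density(150.0, 400.0)    # big die: 150 W over 400 mm^2
chiplet_density = power_density(25.0, 25.0)  # chiplet: 25 W over 25 mm^2
```

Here the chiplet draws a sixth of the SoC's power but runs at more than double its power density, which is exactly the situation where a hotspot forms and per-chiplet sensing pays off.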
