Thermal effects are now a critical part of design, but how to deal with them isn’t always obvious or straightforward.
Power management has been talked about a lot recently, especially when it comes to mobile devices. But power is only a part of the issue—and perhaps not even the most important part.
Heat is the ultimate limiter. If you cannot comfortably place the device on your face or wrist, then you will not have a successful product. Controlling heat, at the micro and macro levels, is an important aspect of the overall product design.
The limitation is created at the macro level. “Heat has to get out,” says John Wilson, application engineering manager at Mentor Graphics. “When we think about handheld devices, or small form factor devices, there is only so much that can be done in terms of thermal management. All you can really do is spread the heat out or isolate it.”
Wilson stresses the importance of considering thermal management as an integral part of the design process. “The thermal engineer has to interact with the design team. It cannot be considered downstream. Historically, you could just consider worst-case power dissipation and design a cooling solution for that, but for anything that is size limited, such as a mobile device, we don’t have that luxury. We have to understand how the time constants affect the way it can be used. We have to understand how fast the device will heat up given a certain type of condition. We have to design the control scheme around this dynamic response and software has to be used to control it.”
Temperature sensors are placed inside devices for one of two reasons. “There are individual circuits that are very heat sensitive and their behavior may be variable with temperature,” says Drew Wingard, chief technology officer at Sonics. “Then there are those that are there for the safety of the chip. These are shared across the digital circuits. Alarms go off when temperatures exceed a threshold. They cause evasive action to be taken such as voltage reduction or to slow down clocks so that you can prevent the chip from burning up.”
The sensors themselves are fairly small blocks of hard IP customized for each process node. “Sensor IP takes advantage of the inverse temperature properties of a PN junction so that you can get a pretty good temperature curve out of it,” says Steve Carlson, low power solutions architect at Cadence. “Then it needs to be linearized and a register interface provided. Calibration capabilities may also be included.”
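As a rough illustration of the linearization and calibration step Carlson describes, the sketch below converts a raw sensor code to degrees Celsius using a two-point calibration. The class name, raw codes, and calibration temperatures are hypothetical. Real sensor IP defines its own register format and calibration procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoPointCalibration:
    """Two (raw code, temperature) pairs measured at known temperatures.

    All values are illustrative assumptions, not taken from any real sensor IP.
    """
    raw_cold: int
    temp_cold_c: float
    raw_hot: int
    temp_hot_c: float

    def to_celsius(self, raw: int) -> float:
        """Linearly interpolate (or extrapolate) a raw code to degrees C."""
        if self.raw_hot == self.raw_cold:
            raise ValueError("calibration points must use different raw codes")
        slope = (self.temp_hot_c - self.temp_cold_c) / (self.raw_hot - self.raw_cold)
        return self.temp_cold_c + slope * (raw - self.raw_cold)


# A PN junction's forward voltage falls as temperature rises, so the raw
# code is assumed here to decrease with temperature.
cal = TwoPointCalibration(raw_cold=900, temp_cold_c=25.0, raw_hot=700, temp_hot_c=85.0)
reading_c = cal.to_celsius(800)  # raw code midway between the calibration points
```

A straight-line fit is the simplest possible choice. A production sensor may need a higher-order correction if its curve is not linear enough over the operating range.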
The sensors form part of a thermal management stack. “At the heart of it are the on-board temperature sensors which may be read individually or there could be an aggregation point on the chip,” continues Carlson. “At the hardware level you have the fail-safe level of thermal management where you shut down right now or the chip will go into thermal run-away and suffer irreversible damage.” When you hit that trigger point, the system will shut down, and it may not be graceful.
Then there is a software stack ranging from firmware to the application. “In the firmware and OS level, you typically have the trigger point set at some temperature before the thermal runaway level or at a temperature that is predicted based on a trajectory,” he explains. “Now you have time to save data and resume the system after the chip has cooled. In the OS itself there are thermal mitigation strategies that may include thread migration, register remapping, using different memories that may be physically segmented, or moving activity around from either a memory or compute standpoint to spread the heat around.”
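The layering Carlson describes, with software trip points set below the hardware fail-safe, might be sketched as follows. The action names and every threshold value are hypothetical. Real trip points depend on the process, the package, and the product.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "normal"
    MITIGATE = "mitigate"          # OS level: migrate threads, move activity
    SAVE_AND_SUSPEND = "suspend"   # firmware/OS level: save data, let the chip cool
    HARD_SHUTDOWN = "shutdown"     # hardware fail-safe, may not be graceful


# Hypothetical trip points in degrees C, ordered from most to least severe.
TRIP_POINTS = [
    (120.0, Action.HARD_SHUTDOWN),
    (110.0, Action.SAVE_AND_SUSPEND),
    (100.0, Action.MITIGATE),
]


def select_action(temp_c: float) -> Action:
    """Return the most severe action whose trip point has been reached."""
    for threshold_c, action in TRIP_POINTS:
        if temp_c >= threshold_c:
            return action
    return Action.NORMAL
```

In this sketch the hardware shutdown is only a backstop. The lower software thresholds exist so that the system has time to save state before the fail-safe is ever needed.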
While the accuracy of the sensors may seem to be important, this is not always the case. “It is nice to have several sensors spread out, but they don’t have to be very precise,” says Wingard. “The ones that need more precision are those where the correct operation of the circuit is dependent on it.”
Navraj Nandra, senior director of marketing for the DesignWare Analog and Mixed-Signal IP group at Synopsys, agrees. “Today’s sensors can achieve accuracies on the order of sub- to a few degrees Centigrade. This is adequate accuracy for this type of application.”
If heat is the limiting factor, then you really need the sensors where that heat is felt. “You cannot put a temperature sensor on the surface of the user-level device (the outside of the phone for example), so you have to put the sensors on the board,” points out Mentor’s Wilson. “These will provide information about what you expect to be happening on the outer surface of the device. You need to consider certain power scenarios and determine where the potential hotspots on the phone are and then place sensors on the board that will capture that information.”
That isn’t always so simple, though. “You may have eight cores in the same area, so there is no room to add those sensors,” says Aveek Sarkar, vice president of product engineering and support at Ansys. “So you put it near the L2 cache with the expectation that heat will spread to the temperature sensor. But it may not happen that fast. And with wire sizes getting thinner, you may not be able to fix an issue before it causes a local hot spot.”
“Every device implemented at 28nm and below has temperature sensors,” says Prasad Subramaniam, vice president of design technology and R&D at eSilicon. “We use Process, Voltage, Temperature (PVT) sensors that monitor process conditions including temperature. Typically, they are placed at multiple points in the chip and used in multiple ways. We also use them during manufacturing test so that we can understand the key parameters of each piece of silicon.”
On-chip, there are various placement strategies that can be used. “If you have a block that is very sensitive to temperature, then you want to put the sensor as close as you can,” says Wingard. “This does not really apply to digital logic, but it is applicable to analog and mixed-signal systems. The other kind is based on activity, and layout will impact where the circuits generating the heat are. There is not a whole lot of guidance that an IP provider can give. Even if they did provide it, it would still be the system integrator who has to run the thermal analysis of the chip.”
Another factor affecting placement is related to the temperature control mechanism to be used. “Complex implementations of Dynamic Voltage and Frequency Scaling (DVFS) techniques rely on distributed temperature sensing,” says Nandra. “Using this strategy, sensors are embedded close to each heat generating processing unit (GPU, CPU, etc.) such that supply levels, clock frequencies or even loading of the individual processors can be adjusted with respect to performance and temperature. Up to tens of distributed sensors may be implemented in these cases. Often they are chained together and read out by a single temperature processing unit.”
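A minimal sketch of the strategy Nandra outlines, assuming a single controller that collects readings from the sensors near each processing unit and picks a frequency and voltage per unit. The operating-point table, unit names, and temperature ceilings are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class OperatingPoint(NamedTuple):
    max_temp_c: float   # highest temperature at which this point may be used
    freq_mhz: int
    voltage_v: float


# Hypothetical table, ordered from fastest to slowest.
OPERATING_POINTS = [
    OperatingPoint(85.0, 2000, 1.00),
    OperatingPoint(95.0, 1500, 0.90),
    OperatingPoint(105.0, 1000, 0.80),
]


def choose_operating_point(sensor_temps_c: List[float]) -> OperatingPoint:
    """Pick the fastest point allowed by the hottest sensor near a unit."""
    if not sensor_temps_c:
        raise ValueError("at least one sensor reading is required")
    hottest = max(sensor_temps_c)
    for point in OPERATING_POINTS:
        if hottest <= point.max_temp_c:
            return point
    # Hotter than every ceiling: fall back to the slowest point and rely on
    # the hardware fail-safe for anything beyond that.
    return OPERATING_POINTS[-1]


def plan(readings: Dict[str, List[float]]) -> Dict[str, OperatingPoint]:
    """Map each processing unit to an operating point from its own sensors."""
    return {unit: choose_operating_point(temps) for unit, temps in readings.items()}
```

For example, `plan({"cpu": [70.0, 90.0], "gpu": [60.0]})` would slow the CPU while leaving the GPU at full speed, which is the per-unit adjustment that distributed sensing makes possible.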
Time constants and thermal gradients
Part of the issue with placement is related to time constants and thermal gradients. “Placement is critical because you don’t want to have a temperature sensor for everything that could get hot,” says Carlson. “There are many different operating scenarios, and the data matters. Video playback may be hot, but it is dependent on the video stream. You also need to predict trajectory. If the design could operate at 120˚ C, you may kick in thermal mitigation at 100˚ — long before it reaches the critical level.”
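One crude way to predict a trajectory is to extrapolate linearly from recent sensor samples and estimate how long remains before a limit is reached. The sketch below does that. The function name and sample values are hypothetical, and because real thermal responses are not linear, the result is only a rough early warning.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_limit(
    samples: Sequence[Tuple[float, float]], limit_c: float
) -> Optional[float]:
    """Estimate the time until limit_c is reached.

    samples holds (time in seconds, temperature in C) pairs, oldest first.
    Returns 0.0 if the latest sample is already at or above the limit, and
    None if the temperature is flat or falling.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, temp0), (t1, temp1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if temp1 >= limit_c:
        return 0.0
    rate_c_per_s = (temp1 - temp0) / (t1 - t0)
    if rate_c_per_s <= 0:
        return None
    return (limit_c - temp1) / rate_c_per_s


# Rising 10 C in 2 s with 20 C of headroom left before a 120 C limit.
remaining_s = seconds_until_limit([(0.0, 90.0), (2.0, 100.0)], limit_c=120.0)
```

A policy could start mitigation whenever the estimate drops below the time needed to react, rather than waiting for a fixed threshold.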
How long you have to react is, in part, a matter of how far you are from the sensor. “If you are worried about heat as a destructive mechanism on a chip, and given the clock rates we are working at, then relatively small changes in temperature take thousands of clock cycles,” says Wingard. “This is bad because it means the sensor is a historical view of the activity that went on before. But it is good in that the time to react is not measured in terms of tens or hundreds of cycles. Nothing changes quickly in the thermal domain.”
The timescales at the transistor level are picoseconds, but at the chip level it could be up to seconds or even minutes. There are many orders of magnitude that may be spanned depending on what you are trying to do. Metal conducts a lot better than silicon. “Some people consider using a diamond substrate because it is a great thermal conductor and a great electrical insulator,” adds Carlson.
When extending that out to the package and user-level device, the time constants could be even longer. “As the device generates heat, it will take a certain amount of time before that will be felt on the outer surface of the device,” says Wilson. “If you start driving the phone hard, it would be seconds before you start to feel anything on the outer surface.”
Thermal maps for major processors have been published that show the hottest regions and the coolest. “These are only separated by a few degrees,” says Wingard. “Local heating does not mean that this part of the chip is 20 degrees hotter than the rest. It is not that extreme. If that were the case, there would be other problems. For example, the thermal coefficients of expansion would mean that the packages would start to break, or you would have stress problems with the connections.”
Subramaniam agrees: “Placement is an interesting problem, but in reality the temperature gradient across the die is not that much. We typically put them in the four corners and one in the center. Another approach is to divide the chip into segments and place one in each segment of the grid. It all depends on what you intend to do with the data that you get from the sensors.”
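The two placement schemes Subramaniam mentions can be written down as simple coordinate generators. This is only a geometric sketch with hypothetical function names and units. It ignores floorplan constraints, which in practice decide where a sensor can actually go.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def corners_and_center(width: float, height: float, margin: float = 0.0) -> List[Point]:
    """Four corner positions, inset by margin, plus the die center."""
    left, right = margin, width - margin
    bottom, top = margin, height - margin
    return [
        (left, bottom),
        (right, bottom),
        (left, top),
        (right, top),
        (width / 2, height / 2),
    ]


def grid_centers(width: float, height: float, cols: int, rows: int) -> List[Point]:
    """One position at the center of each cell of a cols x rows grid."""
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    cell_w, cell_h = width / cols, height / rows
    return [
        ((col + 0.5) * cell_w, (row + 0.5) * cell_h)
        for row in range(rows)
        for col in range(cols)
    ]
```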
Wilson explains there are maximum power dissipation figures related to form factors and usage. “If we turn the device on and let it run full-bore, there will be a temperature that renders the device unusable in terms of internal component temperatures and the outer surface temperature. You are always going to be constrained by the outer surface temperature, which is probably 45˚C or less for comfort purposes. You will hit that limit more quickly than any of the internal component temperatures. For a mobile phone it is on the order of 5 watts. That assumes 100% efficiency in spreading the heat. In reality there will be hot spots on there. If the device needs to generate more than that, it can only do it for a short amount of time. It is not sustainable.”
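Wilson's distinction between sustained and short-burst power can be illustrated with a single lumped thermal resistance and time constant. This first-order model is a simplification introduced here, not something the interviewees describe. The 25 C ambient, the 4 C/W resistance, and the 60 s time constant are assumptions chosen so that the sustainable figure comes out near the 5 W and 45 C values quoted above.

```python
import math


def sustainable_power_w(limit_c: float, ambient_c: float, r_th_c_per_w: float) -> float:
    """Power at which the steady-state surface temperature equals the limit."""
    return (limit_c - ambient_c) / r_th_c_per_w


def burst_time_s(
    power_w: float,
    limit_c: float,
    ambient_c: float,
    r_th_c_per_w: float,
    tau_s: float,
) -> float:
    """Time for a surface starting at ambient to reach the limit.

    Uses T(t) = ambient + P * R_th * (1 - exp(-t / tau)).
    Returns math.inf if the limit is never reached at this power.
    """
    allowed_rise_c = limit_c - ambient_c
    if allowed_rise_c <= 0:
        return 0.0
    steady_rise_c = power_w * r_th_c_per_w
    if steady_rise_c <= allowed_rise_c:
        return math.inf
    return -tau_s * math.log(1.0 - allowed_rise_c / steady_rise_c)


# Assumed values: 45 C limit, 25 C ambient, 4 C/W, 60 s time constant.
p_sustained = sustainable_power_w(45.0, 25.0, 4.0)
t_burst = burst_time_s(10.0, 45.0, 25.0, 4.0, 60.0)
```

With these assumed numbers, running at twice the sustainable power is possible only for a time comparable to the time constant, which is why the control scheme has to be designed around the dynamic response.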
While everyone agrees that thermal analysis has become mandatory, the level of detail is open to interpretation. “You have to examine a lot of data, and if you look at where the wires run you can figure out how fast the temperature will travel from one region of the chip to another,” says Carlson. “There are some simplifications that can be done such as looking at metal density calculations and they work okay if you are working at the SoC level rather than at the sub-system level.”
Related issues also on the rise
There are other downsides with running the chip hot all the time. “Device reliability is important and is exponentially correlated to the temperature that you run at,” says Carlson. “If you always run hot, your wires will wear out faster than if you run at a lower temperature.”
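Carlson does not name a model, but one commonly used way to express an exponential dependence of wear-out on temperature is the Arrhenius acceleration factor. The sketch below assumes that model. The activation energy is a placeholder that varies by failure mechanism, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(
    temp_cool_c: float, temp_hot_c: float, activation_energy_ev: float
) -> float:
    """How many times faster wear-out proceeds at temp_hot_c than at temp_cool_c.

    Assumes a single failure mechanism that follows the Arrhenius relation.
    """
    cool_k = temp_cool_c + 273.15
    hot_k = temp_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / cool_k - 1.0 / hot_k)
    return math.exp(exponent)


# Placeholder activation energy of 0.7 eV; the real value depends on the mechanism.
factor = arrhenius_acceleration(85.0, 105.0, activation_energy_ev=0.7)
```

A factor greater than 1 means the hotter condition ages the part faster. The exact ratio is very sensitive to the activation energy chosen.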
These issues are growing in magnitude for smaller device geometries and adding to total design costs. “Thermal challenge stems from two trends,” says Tobias Bjerregaard, CEO for Teklatech. “First, scaling leads to increasing power density, which means that more heat is dissipated in a smaller area. Second, scaling also leads to faster switching times, which means that the power is used in shorter bursts. Power consumed by a resistive path is P = I²R, meaning that a short burst of high current dissipates more heat in the device than a longer burst of less power. Fast devices create more heat than slow devices, even though they are moving the same amount of electrical charge.”
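Bjerregaard's point can be made concrete with an idealized case, assuming a fixed charge Q is moved at constant current through a fixed resistance R in time t. The symbols Q, t, and E are introduced here for illustration.

```latex
I = \frac{Q}{t}, \qquad
E = I^{2} R \, t = \left(\frac{Q}{t}\right)^{2} R \, t = \frac{Q^{2} R}{t}
```

Under these assumptions, halving the switching time doubles the energy dissipated as heat, even though the charge moved is the same.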
Power gating is one of the most commonly used power management strategies, but this creates a design dilemma. “Being able to take evasive action in hardware means that you could run to a higher temperature before you have to take any action,” says Wingard. “The response times are shorter and guaranteed, and this is a lot easier to do in hardware than with software. Doing it with distributed hardware also means that we can handle a larger number or wider variety of sensors without generating too much overhead.”
Bjerregaard agrees, but with a caveat. “Fine-grained power gating gives you better control over thermal and total power. But finer granularity power gating also requires faster power-up times. The normative approach to handling in-rush currents is to open the power gates step-wise, but this takes longer and goes directly against the needs of fine-grained power gating.”
In-rush currents are a byproduct of solving another problem. “We basically have in-rush current spikes because we need to charge all the decaps that were added to handle dynamic power voltage drop (DVD),” explains Bjerregaard. He contends that the only way around this is more intelligent handling of DVD. “This can be done in two complementary ways—either by shaping the dynamic power waveform to reduce dynamic voltage drop without adding decap, or to add decaps only where it counts the most.”
With new packaging technologies being increasingly investigated these days, thermal issues will become even more difficult to analyze. “For 2.5D and 3D ICs you may also have to look at the things above and below you in the stack,” says Wingard. “Power pin placement and the kinds of heat removal systems that exist can have a big impact on not just where the heat is being dissipated, but also how effectively the heat can be removed.”
Many related problems also get worse. “With higher levels of die integration come the growing challenges of package-level current spikes,” says Bjerregaard. “It becomes increasingly important to control dynamic power integrity at the die level because integrating several dies more tightly together just exaggerates the package-level dynamic power issues.”
In order to perform thermal analysis, you first have to work out what will create the worst-case conditions, something that is perhaps easier said than done. “People are fairly good at worst-case exploration, but you tend to over-design when you do this,” says Carlson. “That is expensive. For military systems, you may do a lot of this for insurance.”
Carlson explains why so many people have problems with thermal management: “Some people think that they have vectors handy from simulation and they run them for thermal analysis, but nobody quite knows what the testcases are doing. Increasingly, people want to be sure, and so they will look at macro scenarios. To do those you need a lot of throughput capability.”
Adds Carlson: “Just because thermal management kicks in is not the end of the story. You may have thermal management kicking in at certain times and in different parts of the chip, but if you wait too late and there are adjacent issues, then those temperature gradients may align and you could be too late. The mitigation schemes should let you run up to that point but ensure that you do not cross the line.”
Software-driven verification is important. “It is difficult to provide guidance for soft IP,” says Subramaniam. “One does not know how it will get hardened and it is also not clear what the activity will be in the IP because that is dependent on use model. Different applications will exercise the IP differently and thus the heat may be different.”
It is clear that dynamic analysis has to be performed, but what execution engines should be used? “Virtual platform models just do not have the necessary level of accuracy,” says Carlson. “This is because the interconnect hierarchies and the memories are missing and that prevents you from getting an accurate picture of power dissipation and what the thermal gradients are going to look like. People are swinging back towards doing cycle-accurate emulation and running billions of cycles so that you can boot up Angry Birds.”
The environment that the system operates in can also create complications. “If you look at automotive, they often have to operate in a much more extreme environment because their ambient temperatures may be very different,” points out Wingard. “They may also be close to other heat sources, which means that I may not be able to control my temperature by turning myself off. I may not be the biggest heat generator.”
Wilson adds, “If you have a 125˚C limit and the device is in a 100˚C environment, then you have a much smaller margin. That means you either have to sink that heat using a conductive path to a lower temperature device, such as a cold plate, or blow a lot of air across it.”
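The margin Wilson describes translates directly into how good the heat path has to be. A small sketch, assuming a steady-state model with a single thermal resistance from the device to its surroundings. The 5 W power figure and the 25 C comparison ambient are hypothetical.

```python
def max_thermal_resistance_c_per_w(limit_c: float, ambient_c: float, power_w: float) -> float:
    """Largest device-to-ambient thermal resistance that keeps the device at or below its limit."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    margin_c = limit_c - ambient_c
    if margin_c <= 0:
        raise ValueError("ambient is already at or above the limit")
    return margin_c / power_w


# Same hypothetical 5 W device in a hot and a mild environment.
hot_env = max_thermal_resistance_c_per_w(125.0, 100.0, 5.0)
mild_env = max_thermal_resistance_c_per_w(125.0, 25.0, 5.0)
```

With these numbers the hot environment demands a heat path four times better than the mild one, which is what drives the need for a cold plate or forced air.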