Can thermal effects be brought under control? If so, how, and at what cost?
Modeling on-chip thermal characteristics and chip-package interactions is becoming much more critical for advanced designs, but how to get there isn’t always clear.
Every chip, based on its target application, has a thermal design power (TDP) target. This is the typical power it can consume without exceeding the acceptable thermal limits in its intended environment. But to rate the TDP of a chip, factors such as maximum allowable temperature and the thermal impact on reliability must be well understood.
“Die temperature affects several factors in the chip such as interconnect resistance, mechanical stress, mean time to failure (electromigration) and maximum operating frequency,” said Arvind Shanmugavel, senior director, application engineering at Ansys. “Understanding the impact of temperature on these parameters requires complex multi-physics simulations. For example, the mean time to failure has an exponential dependence on temperature. Electromigration (EM) closure needs to be done with a proper awareness of the spatial temperature of the die. Similarly, temperature affects the resistance of the metal interconnects on the chip.”
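The exponential dependence of lifetime on temperature can be illustrated with the Arrhenius term of Black's equation, which is commonly used for electromigration estimates. The sketch below is not taken from any tool mentioned here. The function name and the 0.9 eV activation energy are assumptions chosen for illustration, and real values depend on the metal and the process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(temp_c, ref_temp_c, activation_energy_ev=0.9):
    """Return MTTF(temp_c) / MTTF(ref_temp_c) at equal current density.

    Uses only the Arrhenius term of Black's equation, MTTF ~ exp(Ea / (k * T)).
    The 0.9 eV default is an assumed, illustrative activation energy.
    """
    temp_k = temp_c + 273.15
    ref_temp_k = ref_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / temp_k - 1.0 / ref_temp_k
    )
    return math.exp(exponent)


# Compare a few hypothetical die temperatures against a 105 C reference.
for temp_c in (105, 115, 125):
    ratio = em_lifetime_ratio(temp_c, 105)
    print(f"{temp_c} C: modeled lifetime is {ratio:.2f}x the 105 C value")
```

With these assumed numbers, a 20°C rise from 105°C to 125°C cuts the modeled lifetime to roughly a quarter, which is why a single chip-wide temperature can badly misjudge a local hot spot.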
Shanmugavel said that performing a true thermal-aware resistance extraction helps predict voltage drops on the chip accurately. And because temperature also affects device performance and long-term reliability, understanding these interactions is necessary for simulating reliable circuit models and for better predictive analysis.
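A rough sketch of why thermal-aware extraction matters for voltage drop is shown below. It uses the standard linear model R(T) = R_ref * (1 + alpha * (T - T_ref)). The temperature coefficient is the textbook value for bulk copper near 20°C and is an assumption here, since thin on-chip wires behave differently. The segment resistances, temperatures and current are hypothetical.

```python
COPPER_TCR_PER_C = 0.0039  # assumed: bulk copper near 20 C, not a process value


def resistance_at(r_ref_ohm, temp_c, ref_temp_c=20.0, tcr=COPPER_TCR_PER_C):
    """Scale a reference resistance to temp_c with a linear temperature model."""
    return r_ref_ohm * (1.0 + tcr * (temp_c - ref_temp_c))


def ir_drop(segments, current_a):
    """Voltage drop across series segments given as (r_ref_ohm, temp_c) pairs."""
    return current_a * sum(resistance_at(r, t) for r, t in segments)


# Hypothetical power-rail path: three 10 milliohm segments carrying 2 A.
uniform = [(0.010, 85.0), (0.010, 85.0), (0.010, 85.0)]
hot_spot = [(0.010, 85.0), (0.010, 125.0), (0.010, 85.0)]

print(f"uniform 85 C:   {ir_drop(uniform, 2.0) * 1000:.1f} mV")
print(f"125 C hot spot: {ir_drop(hot_spot, 2.0) * 1000:.1f} mV")
```

The hot-spot case shows a larger drop than the uniform-temperature case, which is the kind of difference a single-temperature extraction would miss.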
In addition, application-specific temperature requirements must be understood when designing an IC. Mobile ICs have a short operating lifespan, typically two to three years, with a very slim thermal operating margin due to passively cooled enclosures. High-performance computing ICs have a longer operating lifetime, sometimes up to 10 years, but they have active cooling with heat sinks and fans. In contrast, automotive ICs operate in the harshest thermal environments, especially the ECU electronic components that are exposed to more than 150°C.
“Understanding the end application is critical to performing reliability simulations for these ICs,” Shanmugavel said. But bringing temperature under control is not a trivial task, and he believes a systematic approach should be taken to understanding power consumption, operating environments and reliability requirements. “Starting from an early architectural stage, chip designers need to predict the power profile of the chip for real time vectors. Using these power profiles as guidance, more accurate thermal simulations can be performed with the appropriate activities.”
On one hand, accurate on-chip thermal simulations require proper modeling of the dielectrics and metal interconnect properties, along with the package layout. On the other hand, system-level thermal analysis requires computational fluid dynamics (CFD) simulations to model air flow and proper boundary conditions.
Coping, rather than controlling
Thermal effects will never go away, no matter how good the design. That makes bringing them under control the best option.
“Previous thermal drivers such as increasing clock speed and switching losses have now been surpassed by increasing functional density and corresponding increases in power density,” said Robin Bornoff, market development manager at Mentor Graphics. “These trends have kept thermal management consideration at the forefront of design and will continue to do so.”
Until such time as viable superconductors and superconducting semiconductor materials become available, which is not likely anytime soon, power will be dissipated as part of electrical and electronic processes, Bornoff said. “This power leads to an increase in temperature. Ensuring that temperatures do not exceed maximum rated values involves channeling that heat power away from its source as effectively as possible. There will always be a biggest thermal bottleneck in the heat removal path. Thermal design seeks to remove those bottlenecks in a cost-effective manner, be it employing more thermally efficient thermal interface materials, using heat pipes to transfer the heat to a peripheral heatsink, ducting air through heatsinks, or using a fan to increase the heat transfer effectiveness of heatsinks. There are a plethora of thermal management approaches that can be adopted.”
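The idea of a biggest bottleneck in the heat removal path can be sketched with a one-dimensional, steady-state thermal resistance stack, where junction temperature is ambient temperature plus power times the sum of the series resistances. This is a simplification rather than a substitute for full simulation, and every stage name and value below is hypothetical.

```python
def junction_temperature(power_w, ambient_c, path_k_per_w):
    """Steady-state junction temperature for series thermal resistances (K/W)."""
    return ambient_c + power_w * sum(path_k_per_w.values())


def biggest_bottleneck(path_k_per_w):
    """Name of the stage with the largest thermal resistance."""
    return max(path_k_per_w, key=path_k_per_w.get)


# Hypothetical heat removal path for a 20 W device in 35 C ambient air.
path = {
    "die_to_case": 0.3,
    "thermal_interface": 0.5,
    "heatsink_to_air": 1.2,
}
print(biggest_bottleneck(path), junction_temperature(20.0, 35.0, path))

# Assume a fan halves the heatsink-to-air resistance, then re-evaluate.
with_fan = dict(path, heatsink_to_air=0.6)
print(biggest_bottleneck(with_fan), junction_temperature(20.0, 35.0, with_fan))
```

In this toy stack the heatsink-to-air stage dominates, so improving it lowers the junction temperature more than an equal percentage improvement anywhere else in the path.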
On the materials front, there is an alternative approach—using materials with higher rated maximum temperatures.
“For power semiconductors, GaN and SiC can operate at much higher temperatures than Si,” Bornoff said. “Apart from the increased cost and difficulties of such emerging technologies, they themselves require subsequent modifications to their packaging due to these increased temperatures, such as sintered die attaching. Bond wire thermo-mechanical failure at increased temperatures and temperature gradients might be addressed by clip bonding. PCB laminates start to transform into a plastic state once their glass transition temperature (Tg) is reached. High Tg alternatives to standard FR-4 are available that delay this transition to higher temperatures.”
Realistically, despite methods that can cope with increasing temperatures, there will continue to be reliability and human comfort constraints that thermally cap what is achievable, he noted. As such, thermal management will continue to be about coping with heat rather than controlling it.
That puts a heavy burden on the initial design, but solving all of this through design isn’t so simple. “Logic simulation-based approaches to try to estimate how much heat I’m generating in the chip is a pretty tough art, especially when a lot of it is dependent upon software that hasn’t been written yet,” said Drew Wingard, CTO of Sonics. “That’s a real issue. But clearly, people are building chips that have to live within thermal constraints, so I wouldn’t say it’s out of control. I would say that a lot of people think they can do better than what they are doing now from a modeling perspective. The tools for doing the analysis given a design, and given a realistic set of stimulus (be that vectors, or emulator-based activity factors, or something like that), they can tell you how bad off you are. I don’t think that’s the worry at this point.”
The bigger concern is how to factor in thermal issues ahead of time in order to do better prediction during the architectural phase. This is because architects need relatively good models, which typically arrive late in the process, he noted.
Further, there can be a big benefit to doing energy control in hardware.
“We can keep more of the chip off more of the time, which helps us generate less heat,” Wingard said. “The fact that this methodology essentially results in things automatically turning themselves off when they are done with what they are doing means there is a lot less analysis necessary to try to figure out what the worst case really looks like. You need to convince yourself that in the use cases you care about, you’re not using all the transistors on the chip, which you clearly can’t do for very long. If you are using all the transistors on the chip for any length of time, then it’s a very simple calculation to say that you’re toast from a thermal perspective. We know that. That’s the original insight behind this whole dark silicon phenomenon, which is that the transistors use so much energy that we can’t afford to turn them all on at the same time.”
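One way to picture the simple calculation behind dark silicon is a per-use-case power budget check: sum the power of the blocks that are on, add the residual power of the blocks that are gated off, and compare the total with the TDP. The sketch below assumes hypothetical block names, power figures and a 4 W budget. It is not drawn from any particular chip.

```python
def chip_power(blocks, active):
    """Total power in watts when only the blocks named in `active` are on."""
    return sum(
        on_w if name in active else off_w
        for name, (on_w, off_w) in blocks.items()
    )


def fits_budget(blocks, active, tdp_w):
    """True if the use case stays at or under the thermal design power."""
    return chip_power(blocks, active) <= tdp_w


# Hypothetical (active W, power-gated W) figures per block.
BLOCKS = {
    "cpu": (2.0, 0.05),
    "gpu": (3.0, 0.05),
    "modem": (1.5, 0.02),
    "isp": (1.0, 0.01),
}
TDP_W = 4.0
USE_CASES = {
    "everything_on": set(BLOCKS),
    "camera": {"cpu", "isp"},
    "gaming": {"cpu", "gpu"},
}

for name, active in USE_CASES.items():
    verdict = "ok" if fits_budget(BLOCKS, active, TDP_W) else "over budget"
    print(f"{name}: {chip_power(BLOCKS, active):.2f} W, {verdict}")
```

Under these assumptions, turning everything on blows well past the budget, while a use case that leaves most blocks gated fits. Checking each use case of interest this way is far less work than hunting for a global worst case.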
Building not just the processors, but also the subsystems and the other pieces, around a model in which they turn themselves off when they are done with what they are doing is a great way to start with a much more stable base, he asserted. “Then you can say, ‘I know in this use case I need the following resources to be available and running. How bad is it going to be here?’ And then what can I do?”
The answer often involves a lot of fine tuning.
“The impact of running at a certain speed at a system level, studying that, understanding the thermal profile, and then designing your architecture based on that — that could use a lot of help,” said Krishna Balachandran, product management director for low power at Cadence. “That’s a big area that is still very new and nascent.”
Shanmugavel is settled on the matter. “The key to understanding thermal behavior is to have proper simulation coverage across the chip, package and the system. We live in the golden age of simulation-driven product development. Applying the right simulations to the right problem is the difference between success and failure.”