Thermal Is Still Simmering

While thermal is a more pressing concern than ever, uptake by engineering teams has yet to pick up steam.


With the ever-increasing sophistication of today’s high-performance SoCs, on top of the sheer physics of device manufacturing, thermal is a much bigger concern than ever before.

It is well understood that thermal and power are closely related, and that a vicious cycle exists between leakage power and temperature: leakage goes up, temperature goes up; temperature goes up, leakage goes up again. This can damage a semiconductor device. The phenomenon is independent of technology, and it will only get worse as more transistors are packed into tighter die space.

FinFET technology was expected to solve the leakage problem, and to some extent it has. But FinFETs cause their own set of problems because the fins in those devices, by virtue of their structure, are not great at dissipating heat, Krishna Balachandran, product marketing director for low power at Cadence, pointed out.

The self-heating effect here is well documented too — the device heats up, and thermal coupling occurs with the interconnect layer that is above the device. “There is an exponential relationship between leakage and temperature so temperature can rapidly go up and there is a point at which in spite of all the heat spreaders and whatever you may have on the device, they may not be sufficient to cool the device enough. There might be a huge temperature gradient on the chip, and that could cause some kind of malfunctioning, even total failure. Not to mention that, long term, there is an electromigration and IR drop impact, which can cause the metal wires to have cracks in them, open up, driving increased resistance over time that will cause failures to happen faster in the field. There’s a lot of negative relationships like that between power and temperature.”
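The feedback loop described above can be sketched with a lumped, single-node thermal model. This is only an illustration under stated assumptions, not a sign-off method: the thermal resistance and capacitance, the reference leakage, and the rule that leakage doubles every 20 degrees are all hypothetical values chosen to make the behavior visible.

```python
def simulate(p_dynamic, r_theta, c_theta=0.5, t_ambient=25.0,
             leak_ref=0.5, t_ref=25.0, t_double=20.0,
             dt=0.01, steps=20000, t_limit=150.0):
    """Forward-Euler integration of C * dT/dt = P_total - (T - T_amb) / R.

    All parameter values are hypothetical. Leakage is assumed to grow
    exponentially, doubling every `t_double` degrees above `t_ref`.
    Returns (final_temperature, runaway_flag).
    """
    temp = t_ambient
    for _ in range(steps):
        # Leakage depends on the current temperature (the feedback path).
        p_leak = leak_ref * 2.0 ** ((temp - t_ref) / t_double)
        p_total = p_dynamic + p_leak
        # Heat in minus heat removed through the thermal resistance.
        temp += dt * (p_total - (temp - t_ambient) / r_theta) / c_theta
        if temp > t_limit:
            return temp, True   # cooling can no longer keep up
    return temp, False


stable_temp, stable_runaway = simulate(p_dynamic=2.0, r_theta=10.0)
hot_temp, hot_runaway = simulate(p_dynamic=4.0, r_theta=10.0)
print(round(stable_temp, 1), stable_runaway)
print(hot_runaway)
```

With these assumed numbers, 2 W of dynamic power settles at a stable temperature, while 4 W through the same cooling path has no equilibrium at all and the temperature keeps climbing until it hits the limit.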

Considering this phenomenon happens across the board from mobile to high-performance applications and into the datacenter where outages due to overheating can cost upwards of $5,000 per minute, thermal issues must be solved.

There are a number of schools of thought on what the solutions look like.

Anand Iyer, director of marketing at Calypto, stressed that SoC designers need to review their methodology as well as how they are building SoCs today. “Traditionally, one thing people used to do generation to generation is RTL IP reuse. Today the IP reuse itself may be a problem because some IP may be built without power in mind, and the same IP was reused over and over. Suddenly we need to look at these IPs and see whether they fit in the newer power profile, which is very different from the previous generations.”

Another thing he said SoC designers need to be aware of is that in the context of IP, previously the power problems used to exist in the leakage areas where engineering teams employed specialized power designers to look at them and determine how they could be fixed. Now, the IP itself is having a problem — it goes back to the core itself and designers need to take ownership of these issues. “Sometimes it may mean that it needs to be re-started from scratch but more often than not actually we need to look at potentially where the power is being wasted, then go and modify those portions. It is an interplay so they need to re-microarchitect the overall chip. It’s not that the whole architecture probably needs to be changed but at least the microarchitecture needs to be looked at and modified — it’s not just an RTL change, but a microarchitecture change.”

Iyer feels the tool side is lacking. “What people are doing is running power analysis which gives them some vision of what could be an issue but they are just modifying that blindly and seeing whether it can have an impact, and they are going through long loops. In fact, with the RTL design schedule shrinking, you don’t have a lot of time to actually do this kind of analysis. A lot of thermal issues are creeping up mainly because they don’t have much time to fix the issues. They’ll fix it in the software, move on for the time being but they lose their competitiveness in terms of either providing all of the functionality or making the chip slow.”

As far as thermal analysis potentially being built into tools going forward, he asserted there are good proxies for it. For example, the activity could be one proxy. This is essentially the activity integration over time that determines the thermal issues today, meaning, if there is a lot of activity for a longer duration that’s when it can translate into the temperature going up and causing thermal issues. But if there is a lot of activity for a shorter time, it may not cause thermal issues because it will even out over the length of time. “It all goes back to the cost effective way of managing that issue; if we can actually model the thermal issues into design activity and durations, then designers will be looking at that as those are the metrics they can understand much more clearly as opposed to dealing with the heat equation and things like that. Solving the heat equation is one of the complex tasks.”
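The idea of using activity integrated over time as a thermal proxy can be sketched as a sliding-window average. This is a minimal illustration, not a description of any vendor's tool: the function name `hot_windows`, the window length, and the threshold are all hypothetical, and the activity trace is assumed to be a per-interval toggle rate between 0 and 1.

```python
from collections import deque


def hot_windows(activity, window, threshold):
    """Return sample indices where the mean activity over the last
    `window` samples exceeds `threshold` (a stand-in for thermal risk)."""
    recent = deque(maxlen=window)
    running = 0.0
    flagged = []
    for i, sample in enumerate(activity):
        if len(recent) == window:
            running -= recent[0]      # oldest sample is about to be evicted
        recent.append(sample)
        running += sample
        if len(recent) == window and running / window > threshold:
            flagged.append(i)
    return flagged


# A short burst of high activity versus a sustained one (made-up traces).
short_burst = [0.1] * 20 + [0.9] * 3 + [0.1] * 20
sustained = [0.1] * 20 + [0.9] * 15 + [0.1] * 20

print(hot_windows(short_burst, window=10, threshold=0.6))
print(len(hot_windows(sustained, window=10, threshold=0.6)) > 0)
```

The short burst never lifts the windowed average above the threshold, so nothing is flagged, while the sustained run does. That matches the point above: high activity matters thermally only when it lasts long enough.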

Norman Chang, vice president and senior product strategist at Ansys-Apache, noted that because we are living in the mobile age, thermal has to be considered right from the beginning of the design. “When you are designing a mobile product the envelope of the maximum power consumption is completely controlled by the thermal characteristics of the system. For example, the iPhone or the iPad, the maximum temperature they can tolerate may be 55 degrees for humans; it’s not so much the junction will be melted but a human will feel uncomfortable when they have a device in their hand or on their lap which is 55 degrees or higher. This means the main chip maybe cannot be over 4 watts and the total power consumption in the mobile device may not exceed 20 watts, for example. Then you need to design a heat dissipation channel and because you cannot put a heat sink in a mobile device, you need to consider your case for thermal dissipation and make use of all the components in the system for thermal dissipation.”
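How a temperature limit turns into a power envelope can be shown with a first-order estimate. The 55-degree limit and the 4 W and 20 W figures come from the example above. The 25-degree ambient, the case-to-ambient thermal resistance of 1.5 degrees per watt, and the reading of the temperatures as Celsius are assumptions made purely for illustration.

```python
def max_power(t_limit_c, t_ambient_c, r_theta_c_per_w):
    """Steady-state power that keeps the surface at or below t_limit_c,
    using the lumped relation T = T_ambient + R_theta * P."""
    return (t_limit_c - t_ambient_c) / r_theta_c_per_w


# Hypothetical ambient and thermal resistance; 55 degrees is from the text.
system_budget_w = max_power(t_limit_c=55.0, t_ambient_c=25.0,
                            r_theta_c_per_w=1.5)
main_chip_w = 4.0                        # figure quoted in the text
rest_of_system_w = system_budget_w - main_chip_w

print(system_budget_w, rest_of_system_w)
```

Under these assumptions the whole device can dissipate about 20 W, leaving roughly 16 W for everything other than the main chip. A lower thermal resistance, achieved by spreading heat through the case and other components, raises that ceiling.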

In addition to addressing thermal from the architectural stage, simulation is also needed right from the start, all the way through to SoC sign-off, he said. “In the beginning you have to consider doing the thermal characteristics of the simulation given the scenario that the combination of multiple chips and their maximum distribution of the power envelope, and then use that for thermal simulation, all the way down to SoC sign-off to determine the device temperature, and the wire temperature due to thermal coupling and the device coupling into the wire.”

Latest challenges
Some of the most recent challenges in managing thermal today, according to Ridha Hamza, vice president of sales and marketing at Docea Power, stem from engineering teams that are promising their customers a certain amount of performance but cannot reach it due to thermal issues. “Since the IC suppliers were before application processors, they don’t own the use case and they don’t own the housing; they reach the performance on development boards which don’t have all the thermal constraints, and they don’t have the actual code running on their IC. The trend we see now is to make sure they give their customers the right information on what performance they can reach. They are building up models for the end use application and they are asking for real software to run. How that’s related to thermal is that they are building thermal models for the complete system and that model is simulated with real life use case, they are applying thermal management algorithms, optimizing this algorithm for the real use case. That is the case for everybody who is developing an application processor and has to deliver multimedia and high performance stuff: phones, automotive, etc.”

If they don’t do this, they have to wait until they get the system in the housing, in its package to make trials on what phone management policy would work, and what wouldn’t. But mostly, they just limit the performance to avoid thermal issues. They sell performance which they don’t really deliver, he observed.

And this is a mounting problem because in those cases, software must be relied on for changes, and there’s only so much that can be done. “That’s actually the really big issue. Software development takes a lot of time. Even for software that is very close to the hardware, any development you are asking for thermal management takes months, and the market windows do not allow that.”

Another trend is occurring with engineering teams who do own the software, Hamza noted. “They have to provide some performance and they don’t have to play with thermal mitigation but there is more and more a need for more realistic use cases. Thermal is done usually by just doing some steady state or step response. For steady state they will not do any coupling between power and thermal; for step response they might have some coupling between power and thermal. They would inject some power and see how the temperature is and try to do the coupling between power and thermal which means they don’t end up with the power they injected, they have to calculate how much power they reach at the end. When this is done, they are assuming a very pessimistic scenario — they are taking a lot of margin. They do it to be safe, but that is not very competitive.”
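The difference between an uncoupled and a coupled steady-state estimate can be sketched with a simple fixed-point iteration. This is an illustrative sketch only: the thermal resistance, reference leakage, and the leakage doubling every 20 degrees are hypothetical, and real flows use far more detailed models.

```python
def coupled_steady_state(p_injected, r_theta, t_ambient=25.0,
                         leak_ref=0.5, t_ref=25.0, t_double=20.0,
                         tol=1e-6, max_iter=1000, t_limit=150.0):
    """Iterate T -> T_amb + R * (P_injected + P_leak(T)) until it settles.

    Hypothetical parameters throughout. Returns (temperature, total_power).
    Raises RuntimeError if no equilibrium is found below t_limit.
    """
    temp = t_ambient
    for _ in range(max_iter):
        p_total = p_injected + leak_ref * 2.0 ** ((temp - t_ref) / t_double)
        new_temp = t_ambient + r_theta * p_total
        if new_temp > t_limit:
            raise RuntimeError("no equilibrium below limit: thermal runaway")
        if abs(new_temp - temp) < tol:
            return new_temp, p_total
        temp = new_temp
    raise RuntimeError("did not converge")


# Uncoupled estimate: leakage is frozen at its ambient-temperature value.
uncoupled_temp = 25.0 + 10.0 * (2.0 + 0.5)

coupled_temp, coupled_power = coupled_steady_state(p_injected=2.0,
                                                   r_theta=10.0)
print(uncoupled_temp, round(coupled_temp, 1), round(coupled_power, 2))
```

With these assumed values, 2 W injected ends up as about 4 W of total power once leakage is allowed to respond to temperature, and the coupled temperature is noticeably higher than the uncoupled estimate. That is the effect described above: the power reached at the end is not the power that was injected.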

Interestingly, there are engineering teams that are capable of getting a power consumption profile of the chip from an emulator or other means, and are then looking for thermal models that they can inject into the profiles, which represent a more realistic use case. This gives a picture of the leakage power and temperature distribution and helps them stay competitive, he offered.

Even with all of the activity among tool providers, thermal is still an emerging field, Hamza admitted. “There is a lot of culture to change — the thermal guys and the design guys have always been separated. They haven’t had a chance to work together yet. And the design guys and software guys are just getting to work together, so in order for our customer to mitigate performance and thermal, they have to put together the power guys, the software guys, the thermal guys — and not all companies can do that.”

“In many companies it’s so well clustered these guys don’t talk together and what we are seeing is that you still have to have some enlightened VP who says, ‘I have this problem, whatever needs to be done, put people together.’ That’s why it’s not mainstream yet,” he concluded.