The Growing Challenge Of Thermal Guard-Banding

Margin is still necessary, but it needs to be applied more precisely than in the past.


Guard-banding for heat is becoming more difficult as chips are used across a variety of new and existing applications, forcing chipmakers to architect their way through increasingly complex interactions.

Chips are designed to operate at certain temperatures, and it is common practice to develop designs with some margin to ensure correct functionality and performance throughout the operating temperature range over a chip’s expected lifetime. That approach is becoming less effective, though, particularly at advanced nodes and in designs where some processing elements are always on or chips are running at full speed, such as in AI chips inside of data centers or in edge devices that rely on a battery. And in the automotive space, carmakers are pushing suppliers to mitigate stress and electromigration, which can shorten the lifespan of parts.

As a result, design teams are beginning to shift from just throwing more circuitry into a design. While guard-banding is still used, it needs to be more carefully defined and precise, and it needs to be used in conjunction with more accurate thermal sensing and different thermal management schemes.

“We know that the operating temperature of a design is directly dependent on its power consumption, but obtaining accurate power profiles is complex and highly challenging—especially if there are multiple potential use cases or scenarios,” said Lee Wang, technical marketing engineer at Mentor, a Siemens Business. “Being able to more accurately model the workload and temperature profile of a design is crucial for optimizing the performance and reliability of a design. Thermal analysis solutions that are integrated into the chip design, implementation and verification environments would be a key enabler.”

Thermal guard-banding comes into play in multiple places in the design. For example, thermal monitoring circuitry has a certain amount of error, so designers frequently add circuitry to compensate for these errors. But that reduces performance and throughput as frequency throttling, circuit shutdown and other thermal management techniques are applied at more conservative temperatures. And it raises the overall cost due to over-design.

“There are a number of benefits if you are able to accurately sense and control your die temperature,” noted Ramsay Allen, vice president of marketing at Moortec Semiconductor. “These benefits come about through power saving, optimization of the device, and reliability.”

Allen cited an example involving two uncalibrated temperature sensors, one slightly more accurate than the other, with a target die temperature of 85°C. Power management software can be set to take action at a given temperature, either by slowing clock frequencies to bring the device temperature down or by raising a thermal alert within the software, he explained.

“If you have a target temperature of 85°C and your temperature sensor is say, +/- 5°C accurate, you then have a set point or a range of temperatures that can vary between 90°C and 80°C. Your software will need to be set at its lowest or worst case point to 80°C. Then, allowing for the inaccuracy of the temperature sensor again, we are compounding this issue as we still have to allow for the +/- 5°C, thereby making the lowest point actually 75°C. If we now take a more accurate temperature sensor with an accuracy of +/-2°C, again uncalibrated, you then have a temperature range of 87°C and the lower part of the range being 83°C. This means if you are setting your software to take action at that level, you still have to take into account the inaccuracy which is +/-2°C bringing it down to 81°C.”

With the more accurate sensor, the lowest software set point can be 81°C, while for the less accurate sensor it would be 75°C. Using a slightly better temperature sensor in this case recovers 6°C of die temperature headroom. And depending on architecture and application, that 6°C could equate to between 5 and 10 watts of power savings, Allen said.
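The arithmetic in Allen’s example can be written out as a short Python sketch. The function name `software_set_point` is made up for illustration, and the model simply subtracts the sensor error twice from the target, as the quote describes. It is not taken from any vendor’s power management software.

```python
def software_set_point(target_c: float, sensor_error_c: float) -> float:
    """Worst-case software trip point for an uncalibrated sensor.

    The error is subtracted twice, following the example in the text:
    once to reach the low end of the range around the target, and once
    more to allow for the sensor error when the software acts.
    """
    return target_c - 2 * sensor_error_c


coarse = software_set_point(85.0, 5.0)  # sensor accurate to +/-5 C
fine = software_set_point(85.0, 2.0)    # sensor accurate to +/-2 C

print(f"+/-5 C sensor: act at {coarse} C")
print(f"+/-2 C sensor: act at {fine} C")
print(f"headroom recovered: {fine - coarse} C")
```

Under these assumptions the two set points come out at 75°C and 81°C, matching the 6°C difference cited above.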

Origins of guard-banding
Accounting for a certain margin of error is not a new concept. In fact, the approach dates back to the early days of RF design.

“In the design of an RF device, it can be applied to several frequency bands,” said CT Kao, software engineering director at Cadence. “To avoid interference among those application frequency bands, small frequency bands can be put in between those application frequency bands. It’s a kind of safety factor. In that way, guard-banding [in SoC design] is like a design margin that serves to ensure the quality and reliability of a design.”

Applying this concept to temperature variation is a more recent innovation. Prior to the introduction of finFETs, the biggest heat-related problem in design was from current leakage. The first generation of finFETs plugged that problem, but created another—dynamic power density. As heat becomes trapped in the fins of these 3D transistors, it can create thermal runaway, a problem made worse at 10/7nm as leakage current begins to creep up again and resistance/capacitance raises the temperature in wires. Add to that various sources of variation, new application areas, increasingly heterogeneous architectures and new use cases and thermal management becomes a much more difficult problem to solve. But in most of these cases, simply adding more circuitry doesn’t help, and in some cases it can make the problem worse.

Heat also has a big impact on the reliability of chips. Researchers at the German Karlsruhe Institute of Technology recently published a technical paper, “Dynamic Guardband Selection: Thermal-Aware Optimization for Unreliable Multi-Core Systems,” in which they asserted that circuit aging has become the major concern in existing and upcoming technology nodes. According to the paper, bias temperature instability (BTI) leads to an increase in the threshold voltage of a transistor, which in turn may prolong the critical path delay of the processor and could lead to timing errors. The researchers determined that in order to avoid aging-induced timing errors, designers should insert guard-bands either with respect to voltage or frequency.

In effect, guard-banding is still useful, but it needs to be applied differently and more opportunistically. The reason is that different workloads have different impacts on heat and ultimately system performance, which in turn can require different types of guard-bands. The researchers propose that guard-band types should be selected on-the-fly with respect to the workload-induced temperatures aiming at optimizing for performance under temperature and reliability constraints. Also, different guard-band types for different cores can be selected simultaneously when multiple applications with diverse properties indicate this can be useful. The researchers believe their dynamic guard-band selection allows for a higher performance compared to techniques that employ a fixed (at design time) guard-band type throughout.
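As a rough illustration of choosing a guard-band type per core at run time, the following Python sketch picks, for each core, the best-performing option that keeps the predicted temperature under a limit. It is not the researchers’ algorithm. The `GuardBand` fields, the performance and temperature numbers, and the 85°C limit are all hypothetical values chosen to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardBand:
    name: str           # guard-band type, e.g. "voltage" or "frequency"
    perf: float         # hypothetical relative performance (higher is better)
    temp_rise_c: float  # hypothetical extra core temperature it causes


def select_guard_band(core_temp_c, options, limit_c):
    """Return the fastest option whose predicted temperature is within limit_c."""
    feasible = [o for o in options if core_temp_c + o.temp_rise_c <= limit_c]
    if not feasible:
        # Nothing meets the limit: fall back to the coolest option.
        return min(options, key=lambda o: o.temp_rise_c)
    return max(feasible, key=lambda o: o.perf)


# Illustrative assumption: a voltage guard-band keeps full speed but runs
# hotter, while a frequency guard-band runs cooler but slower.
OPTIONS = [
    GuardBand("voltage", perf=1.00, temp_rise_c=8.0),
    GuardBand("frequency", perf=0.92, temp_rise_c=2.0),
]

# Workload-induced temperatures per core (made-up values).
core_temps = {"core0": 70.0, "core1": 80.0}
choices = {
    core: select_guard_band(temp, OPTIONS, limit_c=85.0).name
    for core, temp in core_temps.items()
}
print(choices)
```

With these made-up inputs, the cooler core gets the voltage guard-band and the hotter core gets the frequency guard-band, so different cores can hold different guard-band types at the same time.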

Cadence’s Kao believes this paper raises a very important point about circuit aging. As a transistor ages, its threshold voltage rises and its drain current decreases, which can lengthen path delays and limit the achievable clock rate, introducing reliability issues. “In the very beginning of the device design, the designer has to consider this effect so they add a voltage guard-band. If it’s meant to operate at 1.3 volts, they add maybe 1.38 volts, and the same thing for the frequency. It basically adds a safety net.”

It also adds complexity. “Voltage and frequency guard-band is related to temperature, which is where the term ‘thermal-aware’ optimization comes from,” Kao said. “Think about it this way: no matter if you have hardware to sense the temperature on-chip, or you want to have a reduced margin in voltage or frequency, which are temperature-related, the key thing you want to know more accurately is the temperature distribution on chip. Temperature sensors are one way of doing that, of course, but more fundamentally, what if the simulation tool can predict or simulate a temperature distribution more accurately? That’s the foundation to implement different guard-band approaches. But a temperature sensor on-chip occupies area by introducing extra circuitry. From an EDA tool point of view, we’re developing our simulation tool with those capabilities to predict the temperature on chip more accurately considering the environment effect, which means the package, the board, the whole environment.”

This introduces another aspect of transistor design, namely the physics of transistor aging and transient power impacts.

“The transient response means you have multiple IP blocks on your chips currently, such as in a cell phone,” Kao explained. “You watch videos, you talk, you do different things on your cell phone, and the chips inside operate dynamically as a function of time. The designer wants to take into account the time-varying effect of all of those operating IP blocks or chips. They want to optimize the power input to different IP blocks or different chips to get the best performance with minimum power consumption. It’s not just static anymore because it’s a function of time. If we have the right software we could develop a transient analysis or simulation without much difficulty, but we do need something to verify it to show our customer that our simulation agrees with this measurement.”
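To make the time-varying point concrete, here is a minimal Python sketch of a transient thermal simulation. It assumes a single lumped thermal resistance and capacitance for the whole die, which is far simpler than a chip-package-system tool. The values of `r_th`, `c_th` and `t_amb`, and the idle/video power profile, are invented for illustration.

```python
def simulate_die_temp(power_w, dt_s, r_th=2.0, c_th=0.5, t_amb=25.0):
    """Forward-Euler integration of C * dT/dt = P(t) - (T - T_amb) / R.

    power_w: power samples in watts, one per time step
    dt_s:    time step in seconds (should be much smaller than r_th * c_th)
    r_th:    assumed thermal resistance in C/W
    c_th:    assumed thermal capacitance in J/C
    """
    temp = t_amb
    trace = []
    for p in power_w:
        temp += dt_s * (p - (temp - t_amb) / r_th) / c_th
        trace.append(temp)
    return trace


# Hypothetical phone-like workload: 2 s idle at 1 W, 4 s of video at 10 W,
# then 4 s idle again, sampled every 10 ms.
dt = 0.01
profile = [1.0] * 200 + [10.0] * 400 + [1.0] * 400
trace = simulate_die_temp(profile, dt)

print(f"peak {max(trace):.1f} C, final {trace[-1]:.1f} C")
```

In this model the static steady-state temperature at 10 W would be 45°C, but the die never quite reaches it during the 4-second burst. That gap between the static worst case and the transient peak is the kind of margin a time-aware analysis can recover.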

This is where an on-chip temperature sensor would be very useful. Cadence is now working on the engineering side to address all of it together in a single simulation tool: transient impact, dynamic power and thermal effects, accounting for everything from the chip to the package to the system, Kao added.

Similarly, Wang noted that Mentor’s chip-package-system thermal solution allows designers to simulate the thermal impact on a design early in the design phase through to design sign-off.

“To get accurate results within the die, detailed boundary conditions that correctly represent the package and system construction can be applied,” Wang said. “Similarly, accurate die/package models can be passed outward to the system level for accurate system modeling. Fine-grain power is supported to capture hotspot effects and thermal gradients that could adversely impact wire delays and thermally sensitive circuits.”

Conclusion
Guard-banding is a well-proven way of accounting for different use models, variation and unexpected interactions in a design. But as architectures and priorities change, and as chips migrate to the most advanced nodes or into complex packages where not everything is fully characterized, just throwing more margin at a problem often doesn’t produce the desired result.

Margin is additive, and guard-banding increasingly is part of the overall budget for power, performance and area for an entire system. As such, it needs to be applied with more precision and in exactly the right place at the right time. That makes it much harder to guard-band a design, while at the same time also making it more essential in some cases.




1 comment

Pete Clarke says:

As a temperature measurement equipment supplier, using Non-Contact Infra-Red, this is very interesting.
Using thermal imaging, be aware that we only “see” the outer skin temperature. Even so, this can indicate localised heat build-up to the resolution of sub-mm “pixels”.
Be careful of convection effects though. Be very careful of “shiny” metal surfaces too.
