Many factors influence how hot a die or IP block will get, and skipping thermal analysis can lead to dead or under-performing systems.
Heat management is becoming crucial to an increasing number of chips, and it’s one of a growing number of interconnected factors that must be considered throughout the entire development flow.
At the same time, design requirements are exacerbating thermal problems. Designs either have to increase margins or become more intelligent about the way heat is generated, distributed, and dissipated. The problem is complex enough that handling it well is becoming a way to differentiate products.
Power consumed by the devices within a chip is turned into thermal energy. The amount of thermal energy contained within an area dictates its temperature. Temperature, in turn, is controlled by the flow of thermal energy, or heat, to adjacent areas and ultimately out of the device. But that takes time — often on the order of milliseconds to seconds — which is longer than most simulations performed today.
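To make those orders of magnitude concrete, consider a first-order lumped thermal model of a die region. This is a rough sketch with assumed, illustrative values, not any vendor's methodology:

```python
# Minimal first-order lumped thermal model (illustrative values only).
# Temperature rise toward steady state: dT(t) = P * R_th * (1 - exp(-t / (R_th * C_th)))
import math

P = 5.0        # power dissipated in the region (W) -- assumed value
R_th = 10.0    # thermal resistance to ambient (K/W) -- assumed value
C_th = 0.02    # thermal capacitance of the region (J/K) -- assumed value

tau = R_th * C_th          # thermal time constant (s); here 0.2 s
for t in (0.001, 0.01, 0.1, 1.0):
    dT = P * R_th * (1.0 - math.exp(-t / tau))
    print(f"t = {t:>6.3f} s  rise = {dT:5.1f} K")
```

Even with this toy model, the temperature is still climbing hundreds of milliseconds after the power turns on, far beyond the window most electrical simulations cover.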
Smartphones have led the way in thermal management. “Everybody wanted more processing capabilities in their smartphone, which means power was going up significantly,” says Melika Roshandell, product marketing director at Cadence. “At the same time, they started getting thinner, and that caused a thermal problem. Power mitigation strategies were not enough, and it got to the point that the chip would burn before they could even run a benchmark. They had to think smarter.”
In other areas, reliability is a driving requirement. “Broadly speaking, over the lifetime of the device, an additional 10 degrees Celsius increases the failure rate by a factor that ranges from 2 to around 10,” says John Parry, strategic business development manager for Siemens EDA. “A relatively high temperature might be tolerated for a short period of time on a small region of the die, but how many times can that be repeated before the device fails?”
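That rule of thumb is commonly modeled with an Arrhenius-style acceleration factor. A quick sketch, assuming an activation energy of 0.7eV (the real value depends on the failure mechanism):

```python
# Arrhenius acceleration factor between two junction temperatures (a common
# reliability rule-of-thumb model; the activation energy is an assumed value).
import math

K_B = 8.617e-5          # Boltzmann constant (eV/K)
E_A = 0.7               # activation energy (eV) -- assumption; varies by failure mechanism

def accel_factor(t1_c, t2_c):
    """Failure-rate acceleration going from t1_c to t2_c (degrees C)."""
    t1, t2 = t1_c + 273.15, t2_c + 273.15
    return math.exp((E_A / K_B) * (1.0 / t1 - 1.0 / t2))

print(accel_factor(85.0, 95.0))   # ~1.9x for a 10 degree rise at E_A = 0.7 eV
```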
Some companies learned the hard way. “We definitely know of a couple of companies that had designs fail because they didn’t consider the thermal impacts, and they literally got burned because of it,” says John Ferguson, product marketing director, Calibre DRC applications at Siemens EDA. “How much planning happens today? I would say very little, but it’s starting to change. Temperature impacts stress, timing, and power, and power impacts temperature. They all impact each other, which makes it really difficult to fully understand some of the conditions within a chip.”
Removing heat
The most common way to deal with the thermal problem is by removing heat from the device as rapidly as possible. That has been the domain of package designers and system architects. “A lot of companies are just consumers of a chip,” says Cadence’s Roshandell. “They don’t have the ability to change the thermal performance of the chip. They have to stick to outside solutions, and this can be thermal interface materials or heat sinks.”
Even that can require some significant analysis. “It is not just the package construction that affects how much heat can be removed,” says Siemens’ Parry. “The environment also affects how much heat flows into or out of the board, or is removed from the top of the package via a heat sink. Attaching a heat sink, or using an internal heat spreader, makes the temperature distribution across the top of the package more uniform.”
A lot of progress has been made on packaging technology. “Consider a CPU chip dissipating 200W,” says Bernie Siegal, president and CEO of Thermal Engineering Associates. “This dictates that the path from the back of the chip to the thermal management solution has the lowest possible thermal resistance to maintain the chip junction temperature at an acceptable level. Hence, the prevalence of bare-die, flip-chip BGA packages for high-power digital chips.”
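The arithmetic behind that statement is the familiar Tj = Ta + P × θja relation. A sketch with assumed resistance values shows why every fraction of a K/W in the path matters at 200W:

```python
# Junction temperature from a series thermal-resistance path (standard
# Tj = Ta + P * theta_ja relation; the resistance values below are assumptions).
P = 200.0            # chip power (W), per the example above
T_AMBIENT = 45.0     # worst-case ambient (C) -- assumed
THETA_JC = 0.10      # junction-to-case resistance (K/W) -- assumed
THETA_TIM = 0.05     # thermal interface material (K/W) -- assumed
THETA_HS = 0.15      # heat sink to ambient (K/W) -- assumed

theta_ja = THETA_JC + THETA_TIM + THETA_HS
tj = T_AMBIENT + P * theta_ja
print(f"theta_ja = {theta_ja:.2f} K/W, Tj = {tj:.0f} C")   # 0.30 K/W -> 105 C
```

At 200W, shaving even 0.05 K/W off the stack is worth 10 degrees at the junction, which is why eliminating the lid makes such a difference.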
Solutions vary by industry. “High-performance computing (HPC) systems draw a lot of power, but designers are free to choose any type of heat sink,” says Sooyong Kim, director and product specialist in 3D-IC chip package systems and multiphysics at Ansys. “Sometimes people use liquid cooling. They can use any number of fans. These heat sinks are very bulky. But in a cellphone, you can’t afford to have fans or liquid cooling. It becomes more important to verify that heat is properly dissipated from your chips. You will have to consider the entire structure, and that requires considering the flows. They also need to balance the cost of manufacturing.”
Uniformity no longer can be assumed. “Most semiconductor packages were designed with the assumption of uniform heating (i.e., uniform distribution of power and temperature),” says TEA’s Siegal. “However, this assumption is not valid for a wide range of chips. Many chips have one or more topological hot spots that will cause non-uniform temperature and heat flow contours, causing limits on the chip/package power handling capability and requiring more attention to package design.”
Spreading heat
In the past, the industry focus has been on stationary computing applications. “In these applications, die have been quite thick so the bulk silicon acts to spread its own heat, reducing temperature variation across the die surface,” says Parry. “Individual die were designed in isolation, with the total power dissipation, maximum ambient temperature, and a single thermal-resistance metric used to estimate the die temperature. The package was relied upon to hold the die temperature adequately uniform.”
This means understanding the thermal time constant. “From a thermal time perspective, excess heating takes longer to impact a device because heat takes a finite amount of time to propagate,” says Siegal. “At the instant a hot spot is formed, the environment surrounding it appears as an ‘infinite’ heat sink. The hot spot temperature will increase only as the heat is sustained and propagates into the environment, causing the junction temperature to rise as it does. The time frame is usually in the microsecond to millisecond range.”
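A rough way to see where those numbers come from is the heat-diffusion time t ≈ L²/α for heat spreading a distance L through silicon. An order-of-magnitude sketch:

```python
# Rough heat-diffusion time t ~ L^2 / alpha for a hot spot spreading into
# surrounding silicon (order-of-magnitude sketch; alpha is approximate bulk silicon).
ALPHA_SI = 8.8e-5    # thermal diffusivity of silicon (m^2/s), approximate

for label, length_m in (("10 um", 10e-6), ("100 um", 100e-6), ("1 mm", 1e-3)):
    t = length_m ** 2 / ALPHA_SI
    print(f"{label:>6}: ~{t:.1e} s")
# ~1e-6 s, ~1e-4 s, ~1e-2 s -- the microsecond-to-millisecond range quoted above
```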
But that has been changing in the cell phone industry. “Cell phones have tight package constraints, and in order to meet height limits you have to reduce anywhere you can,” says Cadence’s Roshandell. “The first place is die thickness, and that is the worst thing you can do for thermal. Silicon is the best heat sink that you can have. If you shrink die thickness from 400 microns to 200 microns, you will get to the point where you have to trigger thermal mitigation strategies much quicker. That means performance drops because you have to reduce frequency.”
The industry is struggling with these issues. “A lot of effort goes into thermal power budgeting across the die as an approach to ensure that the power density in different functional regions of the die are similar,” says Siemens’ Parry. “That would result in power distribution as close to uniform as possible. But the problem is more temperature maldistribution than heat source maldistribution, although the two do correlate. A simplistic explanation is that to function at the highest performance, different functional units that communicate with one another need to operate at the same speed. Speed is adversely affected by temperature. Timing issues arise in high-speed circuits when the temperature is not close to uniform.”
The tipping point
When the issues impact cost, it gets people’s attention. “Non-uniform heating may produce logic errors, analog chips may not have the gain or frequency performance expected, or light-emitting diodes (LEDs) may have wavelength and light power output shifts,” says Siegal. “In some cases, temperature differences across the chip can be compensated for in the circuit design of the chip. The effort to temperature-compensate can lead to higher product costs, including more design effort, increased chip size, and lower product performance.”
An increasing number of designs are becoming multi-die. “There are a lot of challenges with one die in one package,” says Roshandell. “Imagine if you have multiple of them, and they’re constantly talking to each other. Not only do you have thermal challenges, but you also have introduced significant stress challenges. You also have to avoid some of the thermal resistance within your package, which is air. What are the materials that you can use to avoid that?”
That takes us back to those lidless flip-chip packages. “They offer a 10 to 20 degrees C lower junction temperature just by efficiency of the thermal implementation,” says Mike Thompson, senior product line manager at Xilinx. “But it also requires co-planarity. Co-planar means the tops of the silicon are all in one plane; they’re all at the same height. If the dies within a lidless package are not at the same height, it is either challenging or impossible to get a heat sink to make contact with the dies, reducing thermal dissipation.”
Die stacking creates new issues. “Today in high-performance computing, artificial intelligence, and machine learning, temperature-sensitive memory die are stacked on top of, or located very close to, power-hungry processors as part of a 2.5D/3D package,” says Parry. “These die interact thermally. We are close to the point where both die and package design communities have to move toward a co-design workflow.”
Thermal floorplanning
The earlier that analysis starts the better. “They generally start from their previous designs and some early power estimations,” says Ansys’ Kim. “We are beginning to see 3D-IC designs where they have to decide the locations of dies, or distance between the dies. This exploration stage can be well before any silicon design, so they start with very basic information such as technology, bump numbers, thickness of the package, type of PCB, and a specification for the heat sink. They want to do ‘what-if’ analysis for optimization of placement and the density of the TSVs, or decide how many layers to go with, choose materials, etc. These all have to be determined in the exploration stage. Then they can decide if this is a good die and explore further. Fine tuning happens from the exploration stage through the design cycle and up to sign-off.”
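In highly simplified form, a what-if exploration loop of the kind Kim describes might look like the following sketch. The parameters, resistance model, and limits are all assumptions chosen for illustration, not any tool's actual flow:

```python
# Hypothetical "what-if" sweep over a few package parameters, scoring each
# candidate with a crude series-resistance model. Real exploration tools use
# far richer models; every value here is an assumption.
from itertools import product

DIE_THICKNESS_UM = (200, 400, 775)
TSV_DENSITY = (0.5, 1.0, 2.0)         # relative TSV density -- hypothetical unit
HEAT_SINK_KW = (0.2, 0.5)             # heat-sink resistance options (K/W)

P, T_AMB, T_MAX = 150.0, 45.0, 105.0  # power (W), ambient (C), junction limit (C)

def junction_temp(thickness_um, tsv_density, hs_kw):
    r_spread = 0.02 * 400 / thickness_um   # thinner die: worse lateral spreading
    r_tsv = 0.10 / tsv_density             # more TSVs: more parallel heat paths
    return T_AMB + P * (r_spread + r_tsv + hs_kw)

for combo in product(DIE_THICKNESS_UM, TSV_DENSITY, HEAT_SINK_KW):
    tj = junction_temp(*combo)
    flag = "ok" if tj <= T_MAX else "FAIL"
    print(f"die={combo[0]}um tsv={combo[1]} hs={combo[2]}K/W -> Tj={tj:5.1f}C {flag}")
```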
This analysis cannot be done in a vacuum, though. “Designers are getting smarter and they are doing a lot of floorplan optimization, not only with the big IPs, such as where to put the CPUs, where to put the GPU, where to put the modem, but even within the IPs,” says Roshandell. “They now look at where to put logic and where to put the memory within a CPU core. There are a lot of challenges in doing this floorplan optimization, but thermal is only one of them. There is always timing involved, which is a lot more important than thermal. There are also PDN (power delivery network) challenges and routing challenges. And the most important optimization criterion is cost. A lot of times when you do floorplan optimization, the area of the die increases. That increase means money. Then you have to evaluate whether the performance gain from this floorplan optimization, under the thermal constraint, is worth it.”
Over time companies can improve the accuracy of their optimization process. “Once the general package configuration is established, back-of-the-envelope calculations and/or thermal modeling is used to determine if the product goals can be met,” says Siegal. “Typically, these estimates are in the ±10% to ±20% range. The accuracy is dependent on several things, such as knowledge of material properties, especially as a function of temperature, physical dimensions, final chip power dissipation (PDiss), and where on the chip power is being dissipated — plus, manufacturing repeatability. Once the product has been made, model iterations can be calibrated against actual thermal measurements, and this can improve the resultant model accuracy significantly.”
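The calibration step Siegal describes can be as simple as fitting one effective resistance to bench measurements. A minimal sketch, with fabricated measurement data used purely for illustration:

```python
# Calibrating one model parameter (an effective junction-to-ambient resistance)
# against bench measurements -- a minimal sketch of the model-tuning step
# described above. The measurement data is fabricated for illustration.
measurements = [   # (power W, measured junction temp C), ambient = 25 C
    (10.0, 29.8), (20.0, 35.1), (40.0, 45.2), (80.0, 64.9),
]
T_AMB = 25.0

# Least-squares fit of Tj = T_AMB + P * R_th (slope through the origin in dT).
num = sum(p * (tj - T_AMB) for p, tj in measurements)
den = sum(p * p for p, tj in measurements)
r_th = num / den
print(f"calibrated R_th = {r_th:.3f} K/W")       # ~0.50 K/W for this data
for p, tj in measurements:
    pred = T_AMB + p * r_th
    print(f"P={p:5.1f} W  measured={tj:5.1f} C  model={pred:5.1f} C")
```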
Subsequent designs are likely to add to the challenges. “With each new technology node, transistors get smaller, and while the power consumption per transistor reduces, they are packed more densely together,” says Parry. “The metallization above the die is smaller and so more electrically resistant, which can lead to higher Joule heating, and the dielectric under the transistors is thinner, leading to higher leakage power. The challenges are increasing over time and becoming harder to understand. The proportion of the total power dissipation that is not within the transistors is increasing, driving the need for chip-package thermal co-simulation.”
Problem adjacency
Apart from cost, floorplanning must consider more than just thermal issues. “Some aspects of system design have more priority in terms of decision making than others,” says Roshandell. “The power delivery network is one of them. If you cannot meet the PDN requirements, the chip doesn’t work. More important than the PDN is timing. There are some obstacles where even though you know one floorplan is better than the other, you cannot make it happen. PDN and timing are the two factors that will stop a floorplan moving forward, even if we know it is better.”
Many issues are tied together, too. “The emergence of 3D integrated designs creates conflicting constraints between efficient power delivery and thermal management,” says Shidhartha Das, distinguished engineer at Arm. “Integrated regulation can provide better control of voltage transients in 3D stacks, although at the expense of creating potential hotspots around power FETs. Careful floorplanning of power FETs and accurate runtime power introspection are some of the key tools that need to be developed for high-performance system designs that are increasingly limited by power-delivery and thermal concerns.”
Some of these issues are cross-discipline. “The PDN is an area where electrical and mechanical requirements work against each other,” adds Parry. “In the package substrate the stiffness of copper and its CTE mismatch with other packaging materials presents challenges. Power delivery within the die surface metallization, normally aluminum, increases as the current draw increases, while the wires closest to the transistors shrink at each process node. The thinner wires offer more electrical resistance to the current, and so dissipate more heat. Getting the required amount of current into and out of a package, and to where it is needed on the die surface, with an acceptable voltage drop is now an extremely tough issue, and one that will get worse if the supply voltage is further reduced.”
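The underlying arithmetic is Ohm's law and Joule heating, V = IR and P = I²R. A sketch with assumed wire geometries and per-wire current shows how shrinking metal hurts on both counts:

```python
# IR drop and Joule heating in an on-die power rail (V = I*R, P = I^2 * R).
# Geometry and current are assumed values chosen only to show the scaling.
RHO_CU = 1.7e-8          # copper resistivity (ohm*m); on-die aluminum is higher

def rail(width_nm, thickness_nm, length_um, current_a):
    area = (width_nm * 1e-9) * (thickness_nm * 1e-9)
    r = RHO_CU * (length_um * 1e-6) / area
    return r, current_a * r, current_a ** 2 * r   # resistance, IR drop, heat

# Per-wire current is an assumption; real grids spread it over many wires.
for w, t in ((200, 100), (100, 50)):              # wire shrinks with the node
    r, drop, heat = rail(w, t, length_um=100, current_a=0.001)
    print(f"{w}x{t}nm wire: R={r:5.1f} ohm, drop={drop * 1e3:5.1f} mV, "
          f"heat={heat * 1e3:5.3f} mW")
```

Halving the wire dimensions quadruples the resistance, and with it both the voltage drop and the heat dissipated in the wire itself.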
Defining the scenario
Few chips work under a constant workload, meaning that static analysis is unlikely to be acceptable. “Temperature gradients within the chip vary significantly depending on what IPs you’re using,” says Roshandell. “Some benchmarks are heavy CPU users, and so they run a lot hotter than, let’s say, the modem. You may see close to a 30°C gradient between cores depending on the use case, or power profile, that they’re running. The hotspots within a core may get to the maximum temperature of 60°C, while the memory may be at 30°C. When you do FaceTime with your phone, all the IPs in your cellphone are active. The GPU is active, your modem is active, your CPU is active, your camera is active. Everything is active during that benchmark.”
Running these benchmarks in simulation is a challenge. “Everything is dynamic,” says Ferguson. “Your power is changing, your temperatures are changing, and you need to see how this is going to affect everything over a significant period of time. You may be interested in the temperature at a given point, and view that over time with a waveform viewer. Or you may want an MPEG video of a temperature color map that changes over time. This gives the users an ability to zoom in or to pause at certain times, or look for the maximum temperatures. The more data that you have to feed into it, the more accurate those things are going to be. But it’s tough to pull it all together until you have the actual data of every piece that’s going in there.”
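In highly simplified form, such a trace-driven transient analysis reduces to time-stepping a thermal model with a dynamic power input. A sketch reusing the first-order lumped model from earlier, with all values illustrative:

```python
# Time-stepped transient thermal sketch: a dynamic power trace drives a
# first-order lumped model, producing the temperature-vs-time "waveform"
# described above. All values are illustrative assumptions.
R_TH, C_TH = 8.0, 0.05       # K/W and J/K -- assumed; tau = 0.4 s
DT = 0.01                    # time step (s)
T_AMB = 25.0

def power_at(t):             # bursty workload: 12 W bursts over a 2 W idle floor
    return 12.0 if int(t) % 2 == 0 else 2.0

temp = T_AMB
trace = []
for step in range(int(10.0 / DT)):           # simulate 10 s of activity
    t = step * DT
    p = power_at(t)
    # Explicit Euler update of C * dT/dt = P - (T - T_amb) / R
    temp += DT * (p - (temp - T_AMB) / R_TH) / C_TH
    trace.append((t, temp))

print("peak temp: %.1f C" % max(temp for _, temp in trace))
```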
Mitigation
Today, large digital ICs come with an array of on-chip temperature sensors, so that an active thermal management strategy can be used to throttle the power, usually by reducing the clock speed, when an upper temperature threshold is exceeded. “Modern processors are susceptible to a large dynamic range in current consumption with concomitant voltage-noise and thermal hotspot concerns, particularly around vector execution units,” says Arm’s Das. “Thermal hotspots have slower time-constants in comparison with relatively faster voltage-noise transients. However, mitigation of thermal hotspots often relies upon throttling mechanisms that have undesirable performance consequences.”
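In outline, such a reactive throttling policy is a control loop with hysteresis around the temperature sensors: drop the clock when a sensor crosses the hot threshold, and restore it only after the part has cooled past a lower one. A sketch, with hypothetical thresholds and DVFS states:

```python
# Sketch of a reactive thermal-throttling policy with hysteresis.
# Thresholds and frequency table are hypothetical assumptions.
HOT_C, COOL_C = 95.0, 85.0             # throttle / recovery thresholds (C)
FREQS_MHZ = [3000, 2400, 1800, 1200]   # descending DVFS states -- hypothetical

class ThermalGovernor:
    def __init__(self):
        self.level = 0                  # index into FREQS_MHZ

    def update(self, sensor_c):
        if sensor_c >= HOT_C and self.level < len(FREQS_MHZ) - 1:
            self.level += 1             # throttle one DVFS step
        elif sensor_c <= COOL_C and self.level > 0:
            self.level -= 1             # recover one step once cooled
        return FREQS_MHZ[self.level]

gov = ThermalGovernor()
for reading in (80, 92, 96, 97, 90, 84, 83):
    print(f"sensor={reading}C -> {gov.update(reading)} MHz")
```

The gap between the two thresholds prevents the clock from oscillating when the temperature hovers near the limit, but every step down is a direct performance loss, which is the consequence Das points to.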
It also comes with significant challenges. “Temperature-aware dynamic analysis can use either simulation or emulation,” says Kim. “It is necessary to confirm that temperature is actually controlled as expected. If the policies fail, they have to find better temperature-sensor locations or refine the control logic, making sure that these situations do not occur.”
In general, the community is getting smarter when considering these policies. “Initially, benchmarks were defined that did not consider thermal mitigation policies,” says Roshandell. “Then the customers of the chip became smarter and wanted to know the performance under thermal constraints. So thermal became more important because they knew that when thermal mitigation kicks in, the frequency will drop, and that frequency drop will impact the benchmark score.”
Some analogies can help put this in perspective. “Dealing with thermal challenges for chips has a lot in common with dealing with challenges in life,” says Ming Zhang, distinguished architect at Synopsys. “First, one would want to plan and explore options thoroughly, i.e., conducting architecture exploration to get an early understanding of design metrics such as performance, power, and size, in order to drive selection of manufacturing and IP options before lots of expensive work is done. Second, one should have a big-picture view when executing, i.e., designing and optimizing with a holistic view of the entire chip or subsystem, which is especially critical for 3D-ICs where thermal impedances and workload per chiplet are interlinked. Third, turn lemons into lemonade. There is only so much that can be done with planning and optimization. Chip designers should be ready to leverage on-chip monitors, dynamic voltage/frequency scaling, and workload scheduling to mitigate the thermal impact after manufacturing is done.”
Conclusion
Thermal management is not a critical issue for every design today, but an increasing number of designs are being affected by it. The complexity of the analysis, and how early in the flow it is required, make it a huge challenge, not only because of the number of interconnected physical factors, but also because of the elapsed time that must be simulated to fully understand the implications.
While thermal mitigation can safeguard a chip, good thermal design lessens the likelihood that chips will have to be throttled. Both have to be weighed against overarching cost concerns.