Electromigration and questions about which problems to fix first are raising red flags in leading-edge designs.
From high-end consumer devices to rack-mounted arrays inside of data centers, thermal issues are becoming more serious—and getting much more attention. Driving this shift is the move from single chips to 3D ICs, whether they are interposer-based or stacked die.
It’s a well-understood challenge: Die stacking can cause thermal issues because a stack lacks the readily accessible thermal dissipation channel that a single chip in a package on the PCB has. As such, thermal vias or additional through-silicon vias (TSVs) are necessary to conduct heat out of 3D IC architectures.
Even finFETs generate more localized heat. Given that a 16nm finFET has 25% more drive capability compared with a 20nm planar transistor, plus a higher gate density, this results in 25% to 30% more power density in a local area, according to Norman Chang, vice president of product strategy at Ansys-Apache. “That translates to a higher local self-heat. Self-heat means that the temperature increases in the local area due to the power. That’s combined with the thermal boundary conditions for the device and also for the chip.”
2D versus 3D
Historically, IC design has considered the die temperature to be uniform, but this is no longer a valid assumption in many cases, pointed out Michael Buehler-Garcia, senior director of marketing, Calibre Design Solutions at Mentor Graphics. “Heating due to current leakage, which is temperature dependent, is making power dissipation less uniform. Moreover, the use of thinner die, now below 50µm, has reduced the heat spreading capability of the die itself. Both of these effects contribute to potentially greater on-die temperature variation.”
When you add these factors into a stacked 3D IC design, chip-package co-design becomes essential. “Chip design cannot be done completely independently due to thermal interaction when dies are combined in a stack. One of the effects that need to be considered is inter-die interconnections (TSVs), which help get heat out of the die stack to counter thermal issues. However, TSV placement relative to high power regions on the die can have a marked effect on the overall thermal performance. Likewise, conductive heat transfer is a highly 3D phenomenon as the package temperature distribution affects the temperature distribution on each die and vice versa. [As such,] taking very thin, advanced process node chips with their potential thermal issues and then stacking those together means you need to understand all the potential thermal interactions,” he explained.
Accompanying these effects, the via electromigration (EM) limit decreases 20% per generation, Chang said. “With all these factors combined together we will have more severe thermal issues, regardless if it is on a single chip for 16nm or below, or for 3D IC designs.”
Brad Griffin, a product marketing director at Cadence, agreed. “One of the key issues with silicon interposers and 3D ICs has always been the thermal aspect of it. With the package and board being made up of more metal material, the resistance and the power dissipation increases at a higher temperature, so you’re going to do an IR drop analysis, you need to understand what those temperatures are going to be.”
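Griffin's point about temperature and IR drop can be illustrated with a small sketch. It assumes a linear temperature coefficient of resistance for copper of roughly 0.0039 per degree Celsius, referenced to 20°C, and a hypothetical power rail made of three series segments. The segment resistances, temperatures, and current are invented for illustration and do not come from any tool or design mentioned in the article.

```python
# Minimal sketch: how local temperature changes the IR drop on a power rail.
# ALPHA_CU is an assumed bulk-copper value; thin on-chip or package wires may differ.
ALPHA_CU = 0.0039   # temperature coefficient of resistance, per degC (assumption)
T_REF_C = 20.0      # temperature at which the reference resistances are quoted


def segment_resistance(r_ref_ohm, temp_c):
    """Resistance of one segment at temp_c, using a linear TCR model."""
    return r_ref_ohm * (1.0 + ALPHA_CU * (temp_c - T_REF_C))


def ir_drop(current_a, segments):
    """Total voltage drop across series segments given as (r_ref_ohm, temp_c)."""
    return sum(current_a * segment_resistance(r, t) for r, t in segments)


# Hypothetical rail: (reference resistance in ohms, local temperature in degC).
rail = [(0.002, 60.0), (0.003, 85.0), (0.001, 105.0)]

hot = ir_drop(5.0, rail)                                   # with the thermal map
nominal = ir_drop(5.0, [(r, T_REF_C) for r, _ in rail])    # isothermal at T_REF_C

print(f"IR drop at reference temperature: {nominal * 1000:.2f} mV")
print(f"IR drop with local temperatures:  {hot * 1000:.2f} mV")
```

In this made-up case the drop grows by roughly a quarter once local temperatures are applied, which is why the temperature map has to be known before the IR drop analysis can be trusted.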
Cadence’s technology performs electrical/thermal co-simulation, iterating between the electrical and thermal analyses until the results converge, he explained. The same approach extends to the chip side, where power and thermal maps are passed back and forth iteratively.
Chang also pointed out that thermal-aware EM issues must be dealt with. Those are caused by a number of factors. First, leakage power is a function of temperature, so when the temperature increases, the leakage power also increases. This means thermal-power convergence is needed to arrive at the final temperature, that is, the thermal distribution on chip. Second, because of super low-power techniques, which are very popular in today’s designs, the thermal gradient on the chip will increase. “This means for a large SoC we can easily see more than 10 to 20 degree thermal gradients.” A third factor contributing to thermal-aware EM issues is the self-heat on the wires and vias.
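The leakage and temperature feedback Chang describes can be sketched as a fixed-point iteration. The model below is a deliberate simplification and not any vendor's algorithm: it treats the whole chip as one thermal node with a single junction-to-ambient thermal resistance, and it assumes leakage doubles for every fixed temperature step, which is a common rule of thumb. All names and numbers are hypothetical.

```python
# Minimal sketch of thermal-power convergence for a single lumped thermal node.
# Assumptions (not from the article): leakage doubles every `doubling_c` degrees,
# and temperature = ambient + theta * total power.

def converge_power_and_temperature(p_dynamic_w, p_leak_ref_w, t_ref_c, doubling_c,
                                   theta_c_per_w, t_ambient_c,
                                   tol_c=1e-4, max_iter=200, t_limit_c=250.0):
    """Iterate leakage -> power -> temperature until the temperature settles.

    Returns (temperature in degC, total power in W).
    Raises RuntimeError if the loop runs away or fails to converge.
    """
    temp = t_ambient_c
    for _ in range(max_iter):
        # Leakage at the current temperature estimate.
        p_leak = p_leak_ref_w * 2.0 ** ((temp - t_ref_c) / doubling_c)
        total = p_dynamic_w + p_leak
        # Temperature implied by that power.
        new_temp = t_ambient_c + theta_c_per_w * total
        if new_temp > t_limit_c:
            raise RuntimeError("thermal runaway: temperature exceeds limit")
        if abs(new_temp - temp) < tol_c:
            return new_temp, total
        temp = new_temp
    raise RuntimeError("did not converge within max_iter iterations")


temp_c, power_w = converge_power_and_temperature(
    p_dynamic_w=2.0, p_leak_ref_w=0.5, t_ref_c=25.0, doubling_c=25.0,
    theta_c_per_w=10.0, t_ambient_c=25.0)
print(f"converged at {temp_c:.1f} degC with {power_w:.2f} W total")
```

A real flow would solve this per tile across a full chip, package, and board thermal model instead of one node, but the structure is the same: power and temperature are recomputed in turn until neither changes. With a poor enough thermal path the loop never settles, which the sketch reports as runaway.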
“When we look at the traditional way of doing EM signoff it is done at 110 degrees, but at advanced nodes if you sign off at 110 degrees you will see a lot of EM violations and you probably do not have the resources to fix all of them. So we have to prioritize these EM violations and determine which one to fix first. That’s putting the temperature effect into this problem. If you sign off at 110 degrees and you have a ranking of EM violations, if you put in the real temperature, which is the thermal distribution on chip, then your EM ranking changes a lot because the EM limit is also a strong function of temperature. So when temperature increases, the EM limit will decrease, which is from Black’s equation. In order to meet the same mean time to failure, Black’s equation says if you increase temperature, the current limit should decrease. That translates to the decrease in the EM limit. People need to start thinking about which EM violation to fix first considering the thermal effects,” he continued.
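For reference, Black's equation in its usual textbook form, and the limit scaling that follows from it, can be written out as below. Here A is a process-dependent constant, J is current density, n is the current-density exponent, E_a is the activation energy, k is Boltzmann's constant, and T is absolute temperature. T_0 denotes the signoff temperature, which is the 110 degrees Chang mentions, assumed here to be Celsius and converted to kelvin. Holding the mean time to failure fixed gives the ratio of allowed current densities:

```latex
\mathrm{MTTF} = \frac{A}{J^{n}} \exp\!\left(\frac{E_a}{k T}\right)
\quad\Longrightarrow\quad
\frac{J_{\max}(T)}{J_{\max}(T_0)}
  = \exp\!\left[\frac{E_a}{n k}\left(\frac{1}{T} - \frac{1}{T_0}\right)\right]
```

For T above T_0 the bracketed term is negative, so the allowed current density falls below the signoff limit. For T below T_0 it rises, which is why some violations flagged at the signoff temperature can drop down the priority list once the real temperature is applied.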
Once thermal effects are factored into the ranking, designers have more flexibility and a more realistic view of which EM violations most need to be fixed.
The EM violations are prioritized by first calculating what the final thermal distribution is going to be, including the chip, the package and the PCB environment, along with the thermal boundary conditions from the ambient, Chang added. “After you calculate the final thermal distribution on-chip on every layer of the metal, including the device layer and every metal layer, then you will have a realistic view of the temperature distribution. Then each wire location will be recalculated against the new EM limit considering the temperature. That will give a new ranking of the EM violations considering the temperature effects. That’s where you get to decide where are the violations that you get to fix first.”
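The re-ranking step can be sketched in a few lines. This is an illustration under stated assumptions and not the Ansys-Apache implementation: it scales each wire's EM limit using Black's equation relative to a 110°C signoff temperature, with an activation energy of 0.9 eV and a current-density exponent of 2 chosen purely for illustration. The wire names, current densities, and temperatures are hypothetical.

```python
# Minimal sketch: re-rank EM violations once local temperatures are known.
# EA_EV and N_EXP are illustrative assumptions; real values come from the foundry.
import math
from dataclasses import dataclass

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K
EA_EV = 0.9                      # assumed activation energy, eV
N_EXP = 2.0                      # assumed current-density exponent
SIGNOFF_C = 110.0                # signoff temperature, assumed to be Celsius


@dataclass
class Wire:
    name: str
    j_norm: float   # current density divided by the EM limit at SIGNOFF_C
    temp_c: float   # local temperature from the thermal analysis, degC


def em_limit_scale(temp_c):
    """EM limit at temp_c relative to the limit at SIGNOFF_C, for equal MTTF."""
    t = temp_c + 273.15
    t0 = SIGNOFF_C + 273.15
    return math.exp(EA_EV / (N_EXP * K_BOLTZMANN_EV) * (1.0 / t - 1.0 / t0))


def rank(wires, thermal_aware):
    """Sort wires from most to least severe; severity above 1.0 is a violation."""
    def severity(w):
        scale = em_limit_scale(w.temp_c) if thermal_aware else 1.0
        return w.j_norm / scale
    return sorted(wires, key=severity, reverse=True)


wires = [
    Wire("via_a", 1.30, 90.0),    # worst at signoff, but sits in a cool region
    Wire("wire_b", 1.10, 125.0),  # mild at signoff, but sits in a hot spot
    Wire("wire_c", 1.20, 110.0),  # exactly at the signoff temperature
]

signoff_order = [w.name for w in rank(wires, thermal_aware=False)]
thermal_order = [w.name for w in rank(wires, thermal_aware=True)]
print("signoff ranking:      ", signoff_order)
print("thermal-aware ranking:", thermal_order)
```

In this made-up case the ranking inverts: the via that looked worst at the uniform signoff temperature drops to the bottom because it runs cool, while the wire in the hot spot moves to the top of the fix list.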
More than the chip
Jerry Zhao, director of product marketing for power sign-off at Cadence, noted that thermal issues are not just a chip-level concern. Failures happen on the chip, but they are related to the package, the board, and the surrounding temperatures. “If you just look at a die individually, if you don’t do anything it will still have leakage and the leakage will consume power and it will drive up the temperature. If the temperature goes up, the leakage will also go up. If you put them together, especially with 3D, those power consumptions will increase the temperature. So, we analyze the chip with the package models to see which area of the chip is consuming how much of the power and create a power map.”
He noted that at the chip level, thermal and power consumption are linked together. “Thermal actually will not just contribute to the power problems, it will also contribute to the timing problems. We can’t just say, ‘I signed off my timing, let’s go to tape-out.’ We have to go through the power sign-off also. Therefore, electrical design sign-off has two pieces: timing and power. They are highly related. If you have a voltage drop where the power consumption is huge and/or your IR drop is beyond a certain range, your timing is not going to close. Thermal and power are tied together. Power and timing are also tied together. That’s also a challenge moving forward to IC designs. How are you going to do that without putting too much guard-band on the table?”
A blueprint for electrical design signoff
The best way to approach electrical design signoff, including thermal, is holistically, Buehler-Garcia asserted. But he cautioned that while these steps are all rational, it is hard to have all of the data available when it is needed and in the hands of the right people. Why? Because for 3D, the data, models, and expertise require integration of functions that have classically been independent of each other.
With that in mind, the first place to start is with the package to get the correct temperature distribution within the die. “It’s essential to include the package construction in the thermal model, mounted on a typical PCB, and where appropriate, with a representation of a heat sink solution, so that the effect of heat spreading in the board and into the heat sink are accounted for in the predicted package temperature distribution. At this stage, the number of dies, and the intended size and budgeted power for each die should be known, and the chip-package architecture can be optimized. The thermal model of the candidate package(s) can then be used to investigate the influence of thermal performance on different die arrangements, etc.,” Buehler-Garcia explained.
Once a candidate package is defined, this model provides the thermal environment that will allow temperature data to be back-annotated into the IC design flow. Providing the IC design team with information about the average die temperature and temperature variation for each die before the IC design process starts can greatly help floor planning. This is critical to the quality of the design, as decisions made during floor planning can either alleviate or exacerbate potential temperature variation. For example, it may be important to ensure that two or more different functional blocks operate at very similar temperatures to eliminate timing issues, he said.
“Once floor planning starts, the engineer performing thermal analysis should get a high-level power map from the IC design team and import that into the thermal model of the package. For 3D ICs, this should be accounted for when partitioning the design among the various dies, and during inter-die and intra-die floorplanning. Once the floor planning step has closed, the thermal design effort needs to focus on the detail of the thermal interaction between dies as the design is further elaborated during the placement phase of the design. The power map for the die then becomes much more detailed. Moreover, for stacked die, the number and location of TSVs needs to be defined as part of the electrical design. Finally, sign-off verification should include detailed thermal effects to ensure desired chip performance and yield are achievable,” Buehler-Garcia concluded.
While this may seem daunting at first glance, it is critical for designs at 20nm, 16nm and below. Automated tools from Mentor Graphics, Cadence, Ansys-Apache and others do much of the heavy lifting.