The package is playing an increasingly important role in cooling HPC systems.
As the semiconductor industry reaches lower process nodes, silicon designers are struggling to extract the gains from Moore’s Law that earlier generations delivered. Increasing the die size of a monolithic system-on-chip (SoC) design is no longer economically viable. Breaking monolithic SoCs down into specialized chips, referred to as chiplets, offers significant benefits in cost, yield, and performance. Chiplets allow manufacturers to shrink the process node of only specific components while keeping others at more economical nodes. Several power-related factors also come into play. Heterogeneous integration has provided an alternate path to keep pace with Moore’s Law rather than relying on traditional node-shrink advancement. Compared to traditional monolithic construction, however, heterogeneous packages present unique thermal management challenges due to their increased power density, physical size, and geometry.
Heterogeneous integration allows the packaging of components of different process nodes and functionalities into a single module. There are numerous approaches, including but not limited to multi-chip modules (MCM), System-in-Package (SiP), 2.5D through-silicon via (TSV) silicon interposers, and high-density fan-out (HDFO). Some of these technologies have been around for quite some time, but only recently has there been a rise in popularity of breaking large monolithic SoC die down into smaller subcomponents, or chiplets, that are packaged into a single module. This type of heterogeneous packaging has allowed companies to keep pace with Moore’s Law scaling and economics. The benefits of heterogeneous packaging are many. First, designers are no longer constrained to a single node technology: individual functions can use legacy nodes where it does not make financial sense to move to the latest node. Second, smaller die sizes allow more die per wafer and less wasted area around the wafer perimeter. Yield also improves because a failed subcomponent no longer requires an entire SoC to be rejected. While there are many applications of heterogeneous packaging, this blog post will focus specifically on high-performance computing (HPC).
A complementary metal-oxide-semiconductor (CMOS) device dissipates power in three primary ways: dynamic switching, short-circuit dissipation, and leakage. Dynamic power dissipation occurs due to switching activity within the circuit, as load capacitances draw energy each time they charge. Historically, this was the largest source of power consumption in a CMOS device; however, at today's smaller nodes, leakage current plays a much more substantial role. Leakage power has grown to account for up to 50% of total power consumption for devices in advanced technology nodes, 65 nm and below.
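As a rough illustration of the dynamic term, the classic first-order approximation is P ≈ αCV²f. The sketch below uses purely hypothetical values for activity factor, switched capacitance, supply voltage, and clock frequency; it is not modeled on any specific device.

```python
# First-order CMOS dynamic (switching) power estimate: P = alpha * C * V^2 * f.
# All input values below are illustrative assumptions, not real device data.

def dynamic_power(alpha, c_switched_f, v_dd, freq_hz):
    """Dynamic power in watts: activity factor * capacitance * V^2 * frequency."""
    return alpha * c_switched_f * v_dd**2 * freq_hz

# Hypothetical example: 10% activity, 50 nF total switched capacitance,
# 0.9 V supply, 3 GHz clock.
p_dyn = dynamic_power(alpha=0.1, c_switched_f=50e-9, v_dd=0.9, freq_hz=3e9)
print(f"Dynamic power ~ {p_dyn:.1f} W")
```

Note how the quadratic dependence on supply voltage makes voltage scaling the most effective lever on dynamic power, which is one reason leakage (largely independent of activity) now dominates at low voltages.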
Power dissipation is considered the single most limiting constraint in chip design. Although ever more cores can be squeezed into a single piece of silicon, running all of them at full performance simultaneously is not feasible due to thermal constraints; instead, cores must be throttled or deactivated to mitigate overheating. This phenomenon of unusable silicon, often referred to as “dark silicon,” in which regions of the chip must be deactivated due to thermal issues, limits overall device performance and efficiency because the device cannot operate at its full potential. Power dissipation is not a problem that is going away: even as silicon’s power-per-computation efficiency continues to improve, overall package power density is increasing. Increased package power density requires careful design and material considerations to optimize thermal performance.
Before the rise of hyperscale data centers and artificial intelligence (AI) computing, 3-5 kW of power per rack was considered the norm. At this level, chips in the rack could be cooled using air-cooled heatsinks, which exhausted heated air into the data center aisles where it was eventually extracted by a chiller or refrigeration unit. Today, AI and other new HPC applications demand more power per chip, with some devices exceeding 500 W. Given a standard rack size, it is no longer efficient, or even feasible, to cool the rack by moving air. In fact, data center rack power density is expected to continue rising and is anticipated to reach 15 to 30 kW per rack in the near future. This level of power density requires alternate forms of cooling.
Many advanced cooling solutions are being developed today. The ideal cooling solution could be implemented using existing infrastructure without radically altering the data center environment. Heat pipes and vapor chambers use phase-change heat transfer in a closed loop to achieve effective thermal conductivities dramatically higher than copper or aluminum. These technologies are already widely used today but face the same fundamental limitation: they are built into heatsinks that must still be cooled by moving air. The next progression is liquid cooling, which can be realized in two forms: indirect cooling using cold plates, or direct cooling by immersion, the latter being a far more exotic approach. Liquid-cooled cold plates can handle much higher power densities than air and usually allow more of the rack space to be filled, because they can be very low profile and do not require clearance for airflow.
Google Tensor Processing Unit (TPUv3) AI machine learning board with liquid cold plates.
Cray’s Shasta direct liquid-cooling system used in some of the first exascale supercomputers.
In immersion-cooled systems, the devices come into direct contact with a dielectric coolant. Depending on the coolant and configuration, immersion systems can operate with the liquid in a single phase or in two phases. Two-phase immersion cooling offers the advantage of a constant temperature at each device in the coolant bath, but these systems are more challenging to implement than single-phase systems. Immersion cooling also requires a dramatically different data center environment, because virtually the entire rack must be sealed to contain the coolant fluid. Since this differs so much from what is customary today, significant hurdles must be overcome to make immersion cooling economically viable. Regardless, there is still much interest in both direct and indirect liquid cooling: the Open Compute Project (OCP) has two projects focused on developing standardized solutions for immersion and cold plate cooling. The mainstream air-cooled solutions widely used today will not be able to support the future demands of HPC and AI.
Two-phase immersion-cooled system. (Source: AnandTech – Gigabyte)
From a thermal conduction perspective, there is little difference among most forms of heterogeneous packaging. In high-power packaging, it is uncommon to stack die or components on top of high-power die, so only 2.5D or MCM-style designs are considered in this discussion. Nearly all of these configurations share the same basic heat flow path through the top of the package: starting at the junction, heat conducts through the silicon and thermal interface material (TIM), then into a heat spreader before dissipating into the system cooling solution. However, each of the many packaging options for heterogeneous integration has its own process and physical characteristics that can indirectly affect thermal performance through package warpage and its impact on the thermal interface material.
For most semiconductor packages, heat spreaders offer thermal performance advantages as well as protection of the silicon and control of warpage. However, there are situations where directly exposing the silicon to the system cooling solution offers better thermal performance than a heat spreader lid. When a package interfaces with a very low resistance TIM II (the thermal interface material between the package and the system heatsink) and a high-performance heatsink, such as direct liquid cooling, very little heat actually spreads laterally within the lid. In this scenario, heat mostly conducts directly up from the silicon, so it may be beneficial to remove the thermal resistances of the heat spreader and TIM I from this heat flow path. However, exposed silicon comes with its own challenges, primarily the risk of damage to the silicon during system assembly as well as the reliability and performance of the selected TIM II material.
Considering the alternate scenario of a high-resistance TIM II and a lower-performance heatsink, such as a simple air-cooled aluminum heatsink, a heat spreader will typically offer a thermal benefit because heat is spread over a larger area before exiting the package. The higher the thermal resistance of the system cooling solution, the more heat spreads within the package. Since thermal resistance is inversely proportional to area, a larger heat transfer area into the TIM II and heatsink effectively “lowers” their resistance. With heterogeneous packaging, there is often a large power-density disparity across the total area of the package, which corresponds to even more potential upside from using a heat spreader. In addition, the larger the temperature gradient across the package, the greater the potential benefit of increasing the heat spreader thickness.
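The area effect described above can be sketched with the flat-layer conduction formula R = t/(kA). The footprint sizes, bondline thickness, and TIM II conductivity below are assumed, representative values, not data for any real package.

```python
# Sketch of why a lid helps when the TIM II is poor: the TIM II layer's
# conduction resistance scales as 1/area, so spreading heat over the lid
# footprint before it crosses the TIM II "lowers" that resistance.
# All numbers are illustrative assumptions.

def tim2_resistance(thickness_m, k_w_mk, area_m2):
    """Conduction resistance of a flat TIM II layer, in K/W."""
    return thickness_m / (k_w_mk * area_m2)

t_tim2, k_tim2 = 100e-6, 3.0      # assumed 100 um grease, ~3 W/m-K
die_area = 20e-3 * 20e-3          # bare-die footprint (20 mm x 20 mm)
lid_area = 50e-3 * 50e-3          # heat-spreader footprint (50 mm x 50 mm)

ratio = (tim2_resistance(t_tim2, k_tim2, die_area) /
         tim2_resistance(t_tim2, k_tim2, lid_area))
print(f"TIM II resistance reduction from spreading: {ratio:.2f}x")
```

With these assumed footprints, the ratio is simply the area ratio (about 6.25x), which is why the benefit of a lid grows with how much larger it is than the high-power die beneath it.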
Package with integrated heat spreader.
Heterogeneous packaging also involves the challenge of dissimilar heights of individual chiplets or components, whether due to manufacturing variability or simply different component types (e.g. a chiplet versus a high bandwidth memory (HBM) module). An integrated heat spreader can be manufactured with varying cavity depths to compensate for these height differences. When considering tolerance accumulation with stacked die, it is most critical to maintain a minimum TIM bondline over the highest-powered die, and the heat spreader cavities should be designed accordingly.
For most high-performance computing cases, greater than 95% of a device’s total power dissipates through the top of the package and into a system-level cooling solution. Within a package (excluding 3D), the only materials along this path are silicon, thermal interface material, and copper (the heat spreader), except in exposed-die packages where it is only silicon. Since silicon is the required semiconductor, and copper is already one of the best thermally conductive materials, the only real variable in material selection is the thermal interface material. Although the thermal interface material is at least an order of magnitude thinner than the silicon and heat spreader, it typically contributes more than 50% of the thermal resistance along this path.
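That >50% figure can be sanity-checked with a one-dimensional conduction estimate, R = t/(kA), for each layer in the stack. The layer thicknesses, conductivities, and die area below are assumed, representative values only.

```python
# One-dimensional sketch of the junction-to-lid conduction stack.
# R = t / (k * A) per layer; all values are illustrative assumptions.

def layer_resistance(thickness_m, k_w_mk, area_m2):
    """Conduction resistance of a flat layer, in K/W."""
    return thickness_m / (k_w_mk * area_m2)

area  = 20e-3 * 20e-3                            # assumed 20 mm x 20 mm die
r_si  = layer_resistance(0.7e-3, 130.0, area)    # silicon, ~130 W/m-K
r_tim = layer_resistance(50e-6,    4.0, area)    # polymer TIM I, ~4 W/m-K
r_cu  = layer_resistance(2.0e-3, 390.0, area)    # copper lid, ~390 W/m-K

total = r_si + r_tim + r_cu
print(f"TIM share of conduction stack: {r_tim / total:.0%}")
```

Even though the TIM layer here is over an order of magnitude thinner than the silicon or lid, its low conductivity makes it roughly half of the stack's resistance, matching the claim above.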
TIM I selection is critical for high-power packages. Not only does the material need low thermal resistance, it must also withstand the conditions a package experiences during assembly and over its operational life. When a device is heated or cooled, whether during reflow or operation, a TIM experiences considerable stress due to the coefficient of thermal expansion (CTE) mismatch between the copper, silicon, and organics within the package. Maintaining adhesion and cohesion through these stress cycles is just as important as bulk thermal conductivity. Obtaining a balance of these properties is challenging, and currently it is most common to find materials only at opposite ends of the spectrum. Gel and grease-type TIMs are composed of a polymer-based matrix loaded with thermally conductive particles, such as aluminum or silver. These materials benefit from a low elastic modulus; however, compared to metals, they still have low thermal conductivities. Metal solder TIMs, such as indium, offer very high thermal conductivity at the expense of a very high modulus, which challenges the TIM’s workability and reliability.
Heterogeneous packaging presents a unique environment for TIMs. Not only does the TIM interface multiple components, it may also interface multiple material types depending on the package. In addition, the stresses the TIM experiences may be different compared to a large monolithic die. A benefit of a heterogeneous package regarding TIMs is that different TIMs can be used on different components. For example, a central processing unit (CPU) die could have a high-performance TIM while lower-power HBM modules could use an adhesive TIM to reduce warpage of the package.
TIM resistance is a function of thickness, bulk thermal conductivity, and contact resistance at its interfaces. By nature, heterogeneous packages are typically quite large, which equates to a large TIM surface area. Compared to that contact area, the TIM’s thickness is orders of magnitude smaller, so the bulk term of the resistance is small and the contact resistance at the interfaces dominates; the bulk conductivity of the material therefore plays a relatively lesser role in the total thermal resistance of the TIM. So, while advanced metal solder TIMs offer extremely high thermal conductivities relative to polymer-based materials, over the large surface area of heterogeneous package types the thermal benefit is only incremental. In addition, these high-modulus metal TIMs are heavily stressed by the warpage of such large packages.
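This bulk-versus-contact tradeoff can be sketched with area-normalized impedances, θ = t/k + 2·θ_contact. The bondline thicknesses, conductivities, and especially the per-interface contact impedance below are assumed values for illustration only.

```python
# Area-normalized TIM impedance sketch (mm^2*K/W):
#   theta_total = bondline/k + 2 * theta_contact (one contact per interface).
# All numbers are illustrative assumptions, not measured material data.

def tim_impedance(t_mm, k_w_mm_k, theta_contact):
    """Bulk term plus two contact terms, in mm^2*K/W."""
    return t_mm / k_w_mm_k + 2.0 * theta_contact

# Polymer gel: k ~ 4 W/m-K (0.004 W/mm-K), 50 um bondline.
polymer = tim_impedance(0.050, 0.004, theta_contact=10.0)
# Indium solder: k ~ 86 W/m-K (0.086 W/mm-K), 150 um bondline.
indium  = tim_impedance(0.150, 0.086, theta_contact=10.0)

print(f"polymer: {polymer:.1f}  indium: {indium:.1f} (mm^2*K/W)")
```

With these assumptions, the bulk conductivities differ by roughly 20x, yet the total impedances differ by well under 2x once the fixed contact terms are included, which is the "only incremental" benefit described above.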
Detailed view of a particle laden polymer thermal interface material.
The first defense in thermal management is the silicon itself. Silicon has a relatively high thermal conductivity that is excellent at mitigating hot spots. Since heterogeneous packaging breaks down functionality to individual components, the heat spreading advantage of a large piece of silicon is lost. However, more often than not, this actually benefits thermal performance because heat generating components are spread apart, thus reducing their thermal crosstalk.
Thermally aware component or chiplet placement presents a major opportunity for thermal optimization of a package. Chip and package designers should carefully consider the electrical and thermal tradeoffs of component placement, especially at high power. When possible, high-power components should be spread apart to distribute power more evenly across the area of the package. However, the edges and corners of the package constrain heat spreading, so high power density should not be located too close to the perimeter.
Without the limitation of reticle size, heterogeneous packages can include much more silicon in terms of area, so their overall body size tends to grow as well. Today, it is not uncommon for an MCM package to exceed 70 mm x 70 mm. This relatively large package size presents challenges at the system level in relation to the TIM II and heatsink. To achieve adequately low thermal resistance at the TIM II interface, pressure is needed, and over the surface area of a large MCM, considerable force is required to meet this pressure requirement. This stresses not only the package but also the system motherboard. The high forces may require additional strengthening of the motherboard and/or heatsink mounting hardware, driving up costs. If sufficient pressure cannot be applied to the TIM II, the device will suffer degraded thermal performance. This issue presents yet another benefit of implementing a heat spreader in large heterogeneous packages: the variability in heatsink and TIM II application can be compensated for with a heat spreader to achieve more consistent thermal performance.
As discussed in the earlier sections, with advances in system-level cooling solutions, the package is contributing a more significant portion to the total system thermal resistance. There are many options for taking advantage of a thermally enhanced package. At the most basic level, improvements to the thermal resistance of a package will correspondingly lower the junction temperature of the device. A commonly referenced rule of thumb for semiconductor devices is that for every 10°C rise in junction temperature, operating life is reduced by half. Therefore, by lowering junction temperature through package-level thermal enhancements, the theoretical operating life of the device can be significantly improved.
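The 10°C rule of thumb above implies an exponential relationship between junction temperature and theoretical operating life, which can be sketched directly. This is a rough Arrhenius-style approximation, not a datasheet model for any specific device.

```python
# Rule-of-thumb sketch: operating life halves for every 10 C rise in
# junction temperature (and doubles for every 10 C drop). This is an
# approximation quoted in the text, not a qualified reliability model.

def relative_life(tj_new_c, tj_ref_c):
    """Life multiplier relative to a reference junction temperature."""
    return 2.0 ** ((tj_ref_c - tj_new_c) / 10.0)

# Lowering Tj from 100 C to 90 C via package-level thermal enhancements:
print(relative_life(90.0, 100.0))   # -> 2.0, i.e. doubled theoretical life
# Running 10 C hotter instead:
print(relative_life(110.0, 100.0))  # -> 0.5, i.e. half the theoretical life
```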
Alternatively, a thermally enhanced package can operate at a higher power, since the system-level cooling solution can support the additional heat load, while maintaining the same junction temperature. With junction temperature typically limiting chip performance, this is a clear choice to take advantage of a package with improved thermal resistance.
A package with features added to enhance its thermal performance still dissipates the same amount of heat as the original design; however, the temperature delta between junction and ambient is reduced. At the system level, there are many benefits to lowering the thermal resistance of the package. Instead of lowering junction temperature, the ambient temperature of the environment or cooling solution can be increased while still maintaining the original junction temperature. In air-cooled data centers, this can represent significant cost savings: approximately every 1°C increase in ambient temperature translates to ~2% cooling cost savings. A second option is to relax heatsink or airflow requirements: with a lower package thermal resistance, smaller heatsinks or lower airflow can be used while still maintaining the same junction temperature.
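The ~2%-per-degree savings figure quoted above can be turned into a quick estimate. The sketch below simply compounds that rate over a setpoint increase, assuming the rule of thumb holds over a small temperature range.

```python
# Sketch of the "~2% cooling cost savings per 1 C ambient increase" rule
# quoted in the text, compounded over a setpoint change. The 2% rate and
# linear applicability over the range are assumptions.

def cooling_cost_factor(delta_ambient_c, savings_per_deg=0.02):
    """Fraction of the original cooling cost after raising ambient setpoint."""
    return (1.0 - savings_per_deg) ** delta_ambient_c

# Raising the data center ambient setpoint by 5 C:
print(f"{1 - cooling_cost_factor(5):.1%} cooling cost reduction")  # -> 9.6%
```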
One forward-looking approach is to put heat transfer surfaces directly on the backs of chips and circulate the coolant within the MCM itself. This allows much more flexible distribution of components, including passives and irregular components in the power conversion path. Labs have demonstrated the ability to draw more than 1 kW/cm^2 off the back of a chip using textured copper and circulating liquid.
By comparison, routing the heat through a rigid mechanical connection to the back of the package seems literally “inside the box.”