Finding Hotspots In AI Chips

Considerations when determining where to place thermal monitoring sensors on a chip.


Things are getting far more complicated as we move down to 7nm and 5nm, and the tolerances on some of the physical effects we have been measuring in the past are much tighter than they were at the older nodes. How do we track all of that?

What we see is that as we descend through the advanced nodes, say from 16nm down to 12nm, 7nm, and more recently 5nm, gate density starts to affect many aspects of the dynamic conditions on the chip. Quite often there are more hot spots, they are more localized, there is more variation in the supplies, and there is the added challenge of process variability across the device.

In order to measure these variations in the dynamic conditions, you need to decide where on the device you are going to monitor them, along with the variations in the process. Of course, there is the familiar rule that whenever you try to monitor something, you perturb some of the very thing you are trying to measure in the first place.

A typical AI chip has the general structure of a multi-core architecture, which can vary tremendously in scale from a data center-type AI chip down to something at the edge, perhaps in automotive.

What tends to happen on a typical AI chip is that much of the device is always on and you are moving data through at very high speed. That makes monitoring temperature differences across the chip even more important than on a smaller device that is simply on or off. The conditions are much more extreme. One of the things we find is that a multi-CPU, multi-core architecture can consist of hundreds, or maybe even hundreds of thousands, of cores. The workloads are very bursty as the algorithms run and execute on the compute, so there is often a case where you cannot quite deliver enough power to have all cores operating at once. Since you never reach 100% utilization, you have to make the most of the power that you can deliver to the device.
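
As a rough illustration of that power constraint, here is a minimal back-of-envelope sketch. The per-core power figures, core count, and package budget are hypothetical, not taken from any specific device:

```python
# Hypothetical numbers: estimate how many cores a fixed power budget can sustain.
CORE_ACTIVE_POWER_W = 0.35   # assumed dynamic power per active core
CORE_IDLE_POWER_W = 0.05     # assumed leakage/idle power per core
TOTAL_CORES = 1024           # assumed core count
PACKAGE_BUDGET_W = 250.0     # assumed deliverable power for the package

def max_active_cores(budget_w: float = PACKAGE_BUDGET_W) -> int:
    """Upper bound on simultaneously active cores under the power budget."""
    # Every core draws at least its idle power; active cores add the difference on top.
    idle_floor = TOTAL_CORES * CORE_IDLE_POWER_W
    headroom = budget_w - idle_floor
    active = int(headroom // (CORE_ACTIVE_POWER_W - CORE_IDLE_POWER_W))
    return max(0, min(active, TOTAL_CORES))

print(max_active_cores())  # well below TOTAL_CORES, so utilization has to be managed
```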

Essentially, what we are looking at is load balancing across the chip, which needs to happen dynamically. If you start with an uneven balance of workload across the cores, that can stress the areas of the chip that are overworked, mainly because of the heat generated when certain regions of the chip are overactive.
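
A minimal sketch of that thermally driven rebalancing, assuming per-cluster temperature readings are available; the threshold value and the decision to migrate from hottest to coolest cluster are illustrative assumptions, not a description of any particular scheduler:

```python
# Hypothetical thermal-aware balancing check: shift work away from the hottest
# cluster whenever its sensor reads too far above the coolest one.
HOT_DELTA_C = 15.0  # assumed imbalance threshold, in degrees C

def rebalance(cluster_temps: dict[int, float]) -> tuple[int, int] | None:
    """Return (source, destination) cluster IDs if a migration is warranted."""
    hottest = max(cluster_temps, key=cluster_temps.get)
    coolest = min(cluster_temps, key=cluster_temps.get)
    if cluster_temps[hottest] - cluster_temps[coolest] > HOT_DELTA_C:
        return hottest, coolest  # caller would migrate work from hottest to coolest
    return None

# Example: cluster 2 is running 22 degrees hotter than cluster 0.
print(rebalance({0: 61.0, 1: 70.5, 2: 83.0, 3: 68.0}))  # -> (2, 0)
```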

When placing the monitors on the chip it is important, first, to consider that you are working with repeated structures: multiple cores, perhaps grouped within clusters. What you tend to see is that monitors are placed per cluster, and that placement is then repeated along with the clusters. It becomes quite uniform, which also makes the repetitive nature of the monitor placement easier to handle for the design teams and those doing the floorplans.
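
A small sketch of how that repetition can be expressed, assuming each cluster is an identical tile and the sensor sits at the same offset within every tile; the dimensions and offset below are illustrative, not from a real floorplan:

```python
# Replicate one per-cluster sensor placement across a grid of identical cluster tiles.
CLUSTER_W_UM = 800.0               # assumed cluster tile width in microns
CLUSTER_H_UM = 600.0               # assumed cluster tile height in microns
SENSOR_OFFSET_UM = (400.0, 450.0)  # assumed sensor position inside each tile

def sensor_placements(cols: int, rows: int) -> list[tuple[float, float]]:
    """Absolute (x, y) sensor coordinates for a cols x rows array of clusters."""
    ox, oy = SENSOR_OFFSET_UM
    return [
        (c * CLUSTER_W_UM + ox, r * CLUSTER_H_UM + oy)
        for r in range(rows)
        for c in range(cols)
    ]

print(sensor_placements(4, 2))  # one sensor per cluster, uniformly repeated
```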

Sometimes, as these devices heat up, the heat can drift across the chip because of the conductivity of the silicon. After some amount of time there will be thermal dissipation, or thermal flow, through the silicon; after all, the silicon die itself is very thin, so that drift inevitably happens. But we also see, anecdotally from our customer base, that you can get hot spots that are maybe 20 to 30 degrees higher than other areas of the chip, which is quite significant.

If we talk about accuracy, it is very desirable to have the entire die monitored thermally, but of course there is an overhead to any sensor you put into the chip, so you have to distribute them carefully and in some sort of granular way. You need to be aware that there will be a distance between where the actual hot spot lies, within a particular core that is being overworked, and where the sensor is actually placed, so a degree of correlation is required between that hot spot and the sensor reading.

In terms of where to place the sensor to make sure you are getting an accurate reading, there are a few tools that can be used as part of the development flow, plus basic good practice. There are a lot of thermal analysis tools out there that, when run over the chip for different workloads, different software, and different activity profiles, show where the hot spots are and can give you some general guidance as to where to place the thermal sensors. It can also often come down to the floorplanning and where there is available space, but quite often we recommend placing the sensors as close as possible to the cores, and to the highest-density grouping of cores.
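
One simple way to turn such thermal-analysis output into candidate sensor sites is a greedy pass over a worst-case temperature map: keep picking the hottest remaining grid cell that is not too close to an already chosen site. This is only an illustrative heuristic under assumed data, not a vendor flow; the map and spacing value are made up:

```python
# Greedy pick of sensor sites from a worst-case thermal map (hypothetical data).
# temps_c[y][x] holds the peak simulated temperature of each floorplan grid cell.
MIN_SPACING_CELLS = 2.0  # assumed minimum spacing between sensors, in grid cells

def pick_sensor_sites(temps_c: list[list[float]], count: int) -> list[tuple[int, int]]:
    cells = [(t, x, y) for y, row in enumerate(temps_c) for x, t in enumerate(row)]
    cells.sort(reverse=True)  # hottest cells first
    chosen: list[tuple[int, int]] = []
    for _, x, y in cells:
        if all((x - cx) ** 2 + (y - cy) ** 2 >= MIN_SPACING_CELLS ** 2
               for cx, cy in chosen):
            chosen.append((x, y))
            if len(chosen) == count:
                break
    return chosen

# Tiny example map: the two hottest, sufficiently separated cells are selected.
demo_map = [[60, 62, 90],
            [61, 88, 92],
            [59, 63, 65]]
print(pick_sensor_sites(demo_map, 2))  # -> [(2, 1), (0, 1)]
```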

It is important to plan the placement of monitors up front, early in the design, to maximize their capability within the architecture. It is all about forethought and forward planning, and perhaps using some of the simulation tools to help you in that design flow.

The interesting thing about AI architectures is that, depending on how they are developed and designed, large-scale designs for data center environments can be quite large, approaching reticle size and perhaps drawing hundreds of watts of power. On the other side, you can put AI at the edge, and the way it is being applied to automotive means server-grade systems are being placed within the car itself. Obviously those have to be scaled down, and you also have to think about the longevity of the devices, reaching 10, 15, maybe even 20-year lifetimes in an automotive context, whereas in a data center the equipment may only last 3 or 4 years.

If you have AI devices in a data center context, and hence the device is quite large, the supply level at the input pin may not be at the same level in the center of the chip, so you do have that IR drop effect, and trying to monitor it is quite an interesting area. Dynamic IR drop issues are especially relevant to AI applications and architectures because the bursty way in which the cores are utilized means demand can rise very quickly, with an in-rush of current into the device pulling down on the supply. This results in a droop on the supply, and that can be quite important to monitor and capture so that you can compensate for it with the voltage regulators that supply the chip in the first place.
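
A minimal sketch of that droop-capture idea, assuming a stream of supply-voltage samples from an on-die monitor; the nominal rail, droop limit, and example trace are all illustrative assumptions:

```python
# Hypothetical droop watcher: flag samples where the monitored supply sags
# below the nominal rail by more than the allowed margin, so a regulator
# (or the workload scheduler) could be asked to compensate.
NOMINAL_V = 0.75      # assumed nominal core supply
DROOP_LIMIT_V = 0.03  # assumed tolerated droop before compensation is requested

def droop_events(samples_v: list[float]) -> list[int]:
    """Indices of samples where the supply droops beyond the allowed margin."""
    return [i for i, v in enumerate(samples_v) if NOMINAL_V - v > DROOP_LIMIT_V]

# A bursty core group switching on pulls the rail down for a few samples.
trace = [0.751, 0.748, 0.712, 0.705, 0.709, 0.733, 0.747]
print(droop_events(trace))  # -> [2, 3, 4]
```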

In terms of the longevity of the chip and how well it functions, we categorize things into two areas. There is monitoring of the dynamic conditions, the live conditions that depend on what sort of activity profiles are running on the device, and then there are static conditions based on how the chip is made, built-in conditions that are also relevant to monitoring.


