Delivering On Power During HPC Test

1 volt is not the problem. It’s the 1,000+ amps.

The industry’s insatiable need for power in high-performance computing (HPC) is creating problems for test cells, which need to deliver very high currents at very consistent voltage levels through the power delivery network (PDN). In response, ATE, wafer probe, and contactor vendors are introducing some innovative approaches and test procedures that can ensure robust power delivery to ATE probe cards and packaged device load boards.

HPC devices require a PDN that can supply 1,000 to 1,500 amps of current. For the IC, package substrate, and ultimately the end-system board, this creates non-trivial power integrity (PI) design challenges for engineering teams. These PI challenges increase in manufacturing test systems due to the temporary connection in the path from the factory wall power to the device under test (DUT), as well as a significant increase in interconnect length.

During test, the point of contact between the DUT and the test fixture is temporary, which has implications because the mechanical force affects contact resistance (Cres). More importantly, compared to the microns of interconnect at the HPC board level, a test system’s PDN can be several meters in length, and the number of interconnect discontinuities creates many more opportunities for impedance mismatch. Thus, the electromagnetic effects that govern power integrity are amplified.

In addition, exercising CPU/GPU/TPU circuitry during test creates significant variation in current draw, and the inductance and capacitance along the path affect the dynamic power response. As a result, maintaining a stable power source through this temporary connection is an ongoing technical challenge.

These issues all get worse with AI/ML-specific computing devices, whose power trends have accelerated compared to CPU/GPUs.

“When you track out the GPU power draw trend over the last 20 years, the power doubles every three to five years. Now, with AI applications, we’ve seen a significant increase in that power draw trend. It’s doubling in less than every three years. This represents a dramatic increase in the power demand,” said Michael Keene, senior systems engineer at Teradyne. “The voltage isn’t really that high. But because the current is so high, every link in that chain needs to be optimized so we don’t dissipate a lot of power in those links. It takes a lot of copper to connect current from an ATE digital power supply to the DUT. It takes a lot of load board layers to get the current to the DUT. And it takes a lot of pins to get that current between the load board and the DUT.”
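To put the "every link in that chain" point in numbers, here is a minimal Python sketch that applies Ohm's law to each link of a delivery path. The link names and resistance values are hypothetical placeholders chosen for illustration, not figures from any vendor quoted here.

```python
def link_loss(current_a: float, resistance_ohm: float) -> tuple[float, float]:
    """Return (voltage drop in volts, dissipated power in watts) for one link."""
    v_drop = current_a * resistance_ohm        # Ohm's law: V = I * R
    p_loss = current_a ** 2 * resistance_ohm   # Joule heating: P = I^2 * R
    return v_drop, p_loss

# Hypothetical series resistance of each link, in ohms (assumed values).
path = {"cabling": 100e-6, "load_board": 50e-6, "contacts": 50e-6}
current = 1000.0  # amps, the order of magnitude discussed in the article

for name, r in path.items():
    v, p = link_loss(current, r)
    print(f"{name}: drop {v * 1e3:.0f} mV, loss {p:.0f} W")

total_drop = sum(link_loss(current, r)[0] for r in path.values())
print(f"total drop: {total_drop * 1e3:.0f} mV")
```

With these assumed values, a total of only 200 micro-ohms already costs about 0.2 V on a nominal 1 V rail, which is why so much copper, so many board layers, and so many pins are needed.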

During test, power travels from the factory floor’s wall socket to the test platform’s power supply instruments, then to the load board or probe card. From there, it travels out to the probe tips or contactor pins, and finally to the on-die or on-package power delivery network.

Fig. 1:  Conceptual diagram of manufacturing test power delivery path from wall to die. Source: A. Meixner/Semiconductor Engineering

Without proper power delivery to the HPC device, the test process can suffer several costly consequences:

  • Test escapes will affect quality;
  • Marking good parts as bad will impact yield; and
  • Incorrect performance binning for power and frequency envelopes will decrease profitability.

“Test accuracy, and therefore effectiveness, cannot be achieved without a robust power delivery concept,” says Davide Appello, vice president for the Center of Excellence at Technoprobe. “There are multiple combined factors that motivate this statement. For one, the testing for complex fault models — such as transition delay, small delay defects, and cell-aware faults — requires very accurate power supply conditions to ensure accurate detection of true defects and avoid overkills.”

Each major link in the power delivery network has its own sub-components, each with specific tradeoffs and performance requirements that engineers need to consider. These include aligning ATE resources to deliver high current at multiple power settings across multiple sites, and designing a DUT test board with a limited number of layers to reduce overall cost. Such engineering objectives drive innovations in ATE power efficiency, test fixture design, and probing/contacting technology.

Fig. 2: Components of power delivery network. Source: A. Meixner/Semiconductor Engineering

For HPC testing, multiple attributes of the test cell (ATE, DUT test boards, software) complicate the delivery of consistent power. As Trang Nguyen, vice president of NI R&D at Emerson, explains, “When the ATE has parallel source measurement units (SMUs) for high currents, it can make the software control complicated. The test cell design needs to minimize signal path from instrumentation to DUT because of inductance. The ATE also needs room for precision test equipment, which takes up rack space. Load board design is different for high power DUTs since layouts need short and wide traces. If it is a high current application, then to reduce coupling, wide spacing between traces is preferred.”

ATE power delivery considerations
Achieving maximum power and dynamically responding to DUT current fluctuations are important ATE attributes for delivering high current to HPC devices. The maximum power that an ATE system can consistently deliver is controlled by the power efficiency of digital power supplies and remaining instrumentation, the power density per instrument, and the system’s thermal dissipation capability.

“While the power for HPC has grown, the input voltages remain low — around one volt or less,” said Ian Mazsa, senior director of product marketing for the Semiconductor Test Group at Cohu. “But the currents have increased up to hundreds of amperes in a single package. So physical delivery of power is a challenge. The large amount of current must be delivered over multiple pins specifically positioned around the package. That requires flexibility to physically configure resources in the tester to optimize the impedance path so that voltage and current are accurately delivered during test.”

 

Fig. 3: ATE specific contributions to the power delivery network. Source: A. Meixner/Semiconductor Engineering

Focusing just on the ATE subcomponents highlights where the bottlenecks can occur on the way to delivering maximum power to HPC devices.

“Considering only the tester, the main factors that limit the amount of available power an ATE can supply to a device are the maximum power ratings of the ATE system and the single instruments (e.g., digital power supplies),” said Simondavide Tritto, worldwide performance digital COE manager at Advantest. “Each test system has its global maximum ratings for the power it can handle/supply, but also local ratings for each single instrument in the infrastructure. These ratings impose some global/local restrictions to the maximum available power during a device test. Of course, these ratings generally account for several aspects of the tester hardware — the overall tester supply infrastructure, the cooling capabilities, the components, the current carrying capacity of cables, and connectors up to the docking mechanical interface with the test fixture.”
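The global and local ratings described above can be pictured as a simple budget check. The following Python sketch is illustrative only: the `Instrument` class, the instrument names, and all wattage figures are hypothetical and do not describe any real tester.

```python
from dataclasses import dataclass

@dataclass
class Instrument:
    name: str
    max_power_w: float   # local rating of this instrument (assumed value)
    requested_w: float   # power the test plan asks this instrument to supply

def check_power_budget(instruments, system_max_w):
    """Return a list of violations; an empty list means the plan fits."""
    problems = []
    for inst in instruments:
        if inst.requested_w > inst.max_power_w:
            problems.append(f"{inst.name}: exceeds local rating")
    total_w = sum(inst.requested_w for inst in instruments)
    if total_w > system_max_w:
        problems.append("system: exceeds global rating")
    return problems

# Hypothetical plan: each supply is within its own limit,
# but together they exceed the system-wide limit.
plan = [
    Instrument("dps_core", max_power_w=800.0, requested_w=750.0),
    Instrument("dps_io", max_power_w=200.0, requested_w=150.0),
    Instrument("dps_mem", max_power_w=400.0, requested_w=300.0),
]
print(check_power_budget(plan, system_max_w=1000.0))
```

The example shows why both kinds of limit matter: a plan can satisfy every per-instrument rating and still be rejected at the system level.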

Focusing on the digital power supply design illustrates how incremental improvements can significantly influence the maximum available power.

“The power supply itself has conversion losses due to its functionality – measurements, forcing current, test program communications. So, optimizing all functional blocks, i.e. making them more efficient, allows us to deliver more current to the DUT,” explained Teradyne’s Keene. “If you had an increase in efficiency for your instrumentation, where you maybe had been limited to some amount of output current — let’s just say a few hundred amps — and if you boosted that efficiency by 10% to 30%, that could have a really big impact on how much power you can deliver to the DUT in that same form factor. We see the efficiency of the instrument itself as a big contributor to the overall power delivery network, because with a more efficient instrument you can drive more current into the DUT for the same draw from the wall.”
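As a rough illustration of the efficiency argument, the Python sketch below assumes a fixed power budget available to one instrument and computes how much current it can deliver at a given output voltage. The budget and efficiency figures are hypothetical, and the model ignores losses downstream of the instrument.

```python
def deliverable_current(input_power_w: float, efficiency: float, v_out: float) -> float:
    """Output current in amps for a fixed input power budget.

    Simplified model: P_out = efficiency * P_in and I_out = P_out / V_out.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    if v_out <= 0.0:
        raise ValueError("output voltage must be positive")
    return input_power_w * efficiency / v_out

# Hypothetical 1 kW input budget feeding a 1 V rail.
for eff in (0.70, 0.80, 0.90):
    amps = deliverable_current(1000.0, eff, 1.0)
    print(f"efficiency {eff:.0%}: about {amps:.0f} A")
```

Under these assumptions, every ten points of efficiency gained is worth roughly another 100 A into the DUT for the same draw from the wall.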

Others agree that a digital power supply’s power density can significantly limit power levels delivered to the DUT test board.

“The number of available DPS resources and their density is a key limit. The overall ratings influence the maximum digital power supply resource count that can coexist in a single ATE,” said Advantest’s Tritto. “Their density and layout are also parameters to consider when analyzing the power requirements, in order to avoid routing congestion and potential thermal runaway spots when designing the test fixture (probe card or load board).”

HPC device makers care about a consistent voltage delivered during test. This then dictates the critical need for an ATE and the test fixture to rapidly respond to current draw fluctuations. In addition, during a test program the power supply voltage levels are modified for performance binning purposes (voltage and clock frequency combinations). Changing power supply settings requires careful test program implementation to ensure correct values.

“The power demand of HPC devices often varies very rapidly, not only across different tests, but even within one single test execution. A typical example is the sharp power variation that occurs in the various phases of a scan test. For this reason, the tester must be capable of responding to fast changes in power demand without causing instability or delays,” noted Advantest’s Tritto. “And specifically, the response of the power supplies must be adequately fast to suit the power variations. A big impact on this fast response requirement comes from the power delivery network, and more particularly, from the relevant capacitor network and the right combination of bulk-versus-filtering and ceramic-versus-polymer components.”
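One common rule of thumb for sizing a PDN and its capacitor network is the target impedance: the allowed voltage ripple divided by the largest expected current step. This is a general power integrity heuristic, not a method attributed to Advantest, and the numbers in the Python sketch below are hypothetical.

```python
def target_impedance(v_rail: float, tolerance: float, delta_i: float) -> float:
    """Target PDN impedance in ohms.

    v_rail    : nominal rail voltage in volts
    tolerance : allowed deviation as a fraction (0.05 means +/-5%)
    delta_i   : worst-case current step in amps
    """
    if delta_i <= 0.0:
        raise ValueError("current step must be positive")
    return v_rail * tolerance / delta_i

# Hypothetical case: 1 V core rail, 5% tolerance, 500 A step during scan test.
z = target_impedance(1.0, 0.05, 500.0)
print(f"target impedance: {z * 1e6:.0f} micro-ohms")
```

In this assumed case the PDN would need to stay near 100 micro-ohms across the frequency range of the current step, which is what drives the mix of bulk and filtering capacitors.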

Others also emphasize the interplay between current fluctuations and the PDN across the various test cell components. “At the highest current levels there are voltage drops from the instrument (e.g., digital power supply) to the DUT. If that power delivery network isn’t managed carefully, then the voltage deviation on the die could vary more than the specified range. For example, in feeding 1 volt to a device core as we run test patterns, the current that core draws will change dramatically,” cautioned Teradyne’s Keene. “When those dramatic current changes occur, there is a resistive drop component that influences how the voltage deviates and there’s a high frequency component (capacitance and inductance). As those current levels change, the entire system needs to be optimized because the customer wants to have a repeatable accurate voltage at the important point for their test. And as these current loads increase, and maybe you’re pin-limited, or you’re limited in terms of your total load board area and where you can place capacitance and power planes, that all becomes more and more important to consider.”
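The two components Keene describes can be estimated separately to first order: a resistive drop of ΔI·R and an inductive drop of L·ΔI/Δt. The Python sketch below does exactly that. All values are hypothetical, and the model deliberately ignores decoupling capacitance, which in a real PDN would absorb part of the transient.

```python
def voltage_deviation(delta_i: float, r_path: float, l_loop: float, rise_time: float):
    """Return (resistive, inductive) voltage deviation in volts for a current step.

    delta_i   : size of the current step in amps
    r_path    : series resistance of the delivery path in ohms
    l_loop    : effective loop inductance in henries
    rise_time : time over which the current changes, in seconds
    """
    if rise_time <= 0.0:
        raise ValueError("rise time must be positive")
    resistive = delta_i * r_path              # V = dI * R
    inductive = l_loop * delta_i / rise_time  # V = L * dI/dt
    return resistive, inductive

# Hypothetical step: 200 A in 100 ns through 100 micro-ohms and 10 pH.
res_v, ind_v = voltage_deviation(200.0, 100e-6, 10e-12, 100e-9)
print(f"resistive: {res_v * 1e3:.0f} mV, inductive: {ind_v * 1e3:.0f} mV")
```

With these assumed values each component contributes about 20 mV, so even picohenries of loop inductance matter as much as the DC resistance once the current changes quickly.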

Manufacturing test becomes more complicated when a device has separate power rails for different portions of the HPC device, sometimes at different supply voltages. For instance, the power rail(s) for I/O would be separated from those of the internal cores because I/O switching creates ground bounce.

“Powering on and off devices is one critical phase of any device test program. The supply ramp for each power rail should follow pre-defined swim-lanes to ensure the device effectively reaches the operational condition. In other words, the test program must ensure that the general-purpose power supply instrument of an ATE behaves coherently with the specification,” said Technoprobe’s Appello. “For example, in case the native slew-rate of the instrument is too fast compared with device requirement, the instrument may require a step-by-step control from the test program to generate a ladder leading to the target voltage.”
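The "ladder" Appello mentions can be sketched as a list of timed voltage setpoints whose average slope stays within the device's slew-rate limit. The Python function below is a hypothetical illustration, not any ATE vendor's API. The name `ramp_ladder` and its parameters are assumptions, and a real test program would still need to send each setpoint to the instrument.

```python
import math

def ramp_ladder(v_start, v_target, max_slew_v_per_s, step_time_s):
    """Return [(time_s, voltage_v), ...] setpoints from v_start to v_target.

    Each step changes the voltage by at most max_slew_v_per_s * step_time_s,
    so the average slew rate over the ladder respects the device limit.
    """
    if max_slew_v_per_s <= 0 or step_time_s <= 0:
        raise ValueError("slew rate and step time must be positive")
    span = v_target - v_start
    if span == 0:
        return []
    max_step_v = max_slew_v_per_s * step_time_s
    # Small guard so floating-point noise does not add a spurious extra step.
    n_steps = max(1, math.ceil(abs(span) / max_step_v - 1e-9))
    return [(i * step_time_s, v_start + span * i / n_steps)
            for i in range(1, n_steps + 1)]

# Hypothetical rail: ramp 0 V to 1 V, limit 1000 V/s, new setpoint every 100 us.
ladder = ramp_ladder(0.0, 1.0, 1000.0, 100e-6)
for t, v in ladder:
    print(f"t = {t * 1e3:.1f} ms -> {v:.2f} V")
```

The same function handles power-down by swapping the start and target voltages. It also makes the test time cost visible: this assumed ramp takes ten steps, or 1 ms, per rail per power-up.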

There are other factors to weigh, as well. “Considering that in a test program there might be numerous power-on and off sequences, the overall impact on test application time (TAT) might be not at all irrelevant,” said Appello. “It may easily represent 10% to 20% of the TAT and is therefore one of the most targeted and critical activities for test time optimization.”

Probe cards, load boards, sockets, and probe tips
The manufacturing test fixture represents the last link in the test power delivery network. It includes the probe card with its probe tips, or the load board with its contactor pins. HPC devices have 10,000+ power and ground connections that are temporarily connected to an ATE. This connection is complicated by mechanical force and by contamination build-up on probe tips and contactor pins over time, both of which affect contact resistance and add to the overall electrical and thermal considerations.

After the power leaves the digital power supply, it arrives at either a probe card for wafer test or a load board for package test. Akin to a package substrate RDL, these test fixture boards redistribute signals, power, and ground from the top (less dense array) to the bottom (denser array of pins). As HPC devices trend toward higher current draw, an increasingly large percentage of pins, about 60%, must be devoted to power and ground. Thus, power pins are spread over multiple PCB layers and their traces. Power integrity engineers need to carefully design the traces to avoid power congestion and thermal hot spots.

Fig. 4: Board-specific contributions to the power delivery network. Source: A. Meixner/Semiconductor Engineering

Probe tip counts of 20,000 are not uncommon in HPC probe cards. This, in turn, results in a higher total contact force across the large array of needles between the DUT and probe head during wafer test, which requires lower-force needles (on the order of 2 g/needle). Each contact also must carry more current.

“The growing power needs raise the demand for needles with higher CCC (current carrying capability),” said Advantest’s Tritto. “This requires a careful design of the whole PDN, starting from the power and ground planes in the PCB, with an effect on the overall thickness and weight, and then all the way through the probe card components, to ensure the needed power integrity up to the DUT pads.”

Contact resistance is the resistance at the connection between a DUT’s pad or bump and the probe needle at wafer test, or between its lead or ball and the contactor pin at package test, and it affects power integrity. In manufacturing test, contact resistance increases after multiple test cycles, eventually necessitating periodic cleaning. Yet in between cleaning cycles, Cres may not increase uniformly over the array (likely due to planarity issues). This can lead to problems.

“It is not sufficient for a contact to deliver relatively low Cres performances. The key point is the ability to deliver uniform Cres value across the entire set of contacts used by a given device power rail. With inadequate uniformity, the current will cluster to the subset of contacts showing lower Cres. Evidently, if the resulting current per contact exceeds the maximum tolerated, it is likely that the contact itself may get damaged (i.e., burned) and/or eventually remain stitched (i.e., soldered) to the DUT contact,” explained Technoprobe’s Appello. “For both probe card and sockets, achieving low and uniform Cres is a function of three key parameters — the planarity of the tips, head material used to realize contacts elements, and the mechanical force they actuate. And all three parameters should demonstrate limited variance.”
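The clustering effect follows from basic current division: contacts in parallel share current in proportion to their conductance. The Python sketch below assumes all contacts on a rail see the same voltage at both ends (ignoring plane and bump resistance) and uses hypothetical Cres values and a hypothetical per-contact current limit.

```python
def contact_currents(total_current_a, cres_ohm):
    """Split a rail's total current across parallel contacts.

    Each contact carries current proportional to its conductance (1 / Cres).
    """
    conductances = [1.0 / r for r in cres_ohm]
    g_total = sum(conductances)
    return [total_current_a * g / g_total for g in conductances]

# Hypothetical rail: 4 A over four contacts, one with half the Cres of the rest.
cres = [0.05, 0.10, 0.10, 0.10]   # ohms (assumed values)
max_per_contact_a = 1.0           # assumed current carrying capability per contact

currents = contact_currents(4.0, cres)
overloaded = [i for i, amps in enumerate(currents) if amps > max_per_contact_a]
for i, amps in enumerate(currents):
    print(f"contact {i}: {amps:.2f} A")
print("overloaded contacts:", overloaded)
```

With uniform Cres each contact would carry exactly 1 A. In this assumed case the single cleaner contact takes 1.6 A and exceeds its limit, even though the average current per contact is unchanged.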

Another aspect of probe card interface design is the inductance and capacitance present in the power and ground traces. As highlighted earlier, the ability to both change voltage levels and maintain a stable voltage level during test of HPC devices requires a rapid response from the power delivery network, i.e., a good frequency response. Hence, the PDN’s capacitance and inductance properties affect that response.

“HPC demands an incredibly high quantity of power at a relative low voltage, nowadays in the range of 0.7V. The device’s sensitivity to power fluctuations requires satisfying the power distribution network built on-chip. At wafer level, this of course requires contacting thousands of bumps/pillars,” said Appello. “This should happen with very short needles to minimize loop inductance. Concurrently, within the MLO (multi-layer organic) /interposer, we developed technologies (e.g. Capaciball) that permit deploying a large quantity of small capacitances to buffer and filter the high frequency noise in the very close proximity to bump locations (<1mm distance).”
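A first-order way to see why a large amount of local capacitance helps is a charge balance: the capacitors must supply the current step until the rest of the PDN catches up, without the voltage sagging past the allowed droop. The Python sketch below uses that simplification, C = ΔI·Δt/ΔV. It ignores capacitor ESR and ESL, all values are hypothetical, and it does not describe how any specific product is dimensioned.

```python
def required_capacitance(delta_i: float, hold_time_s: float, max_droop_v: float) -> float:
    """Capacitance in farads needed to supply delta_i for hold_time_s
    while limiting the voltage droop to max_droop_v (charge balance only)."""
    if max_droop_v <= 0.0:
        raise ValueError("allowed droop must be positive")
    return delta_i * hold_time_s / max_droop_v

# Hypothetical case: 0.7 V rail with 5% allowed droop (35 mV),
# 100 A step that local capacitors must cover for 10 ns.
c_needed = required_capacitance(100.0, 10e-9, 0.05 * 0.7)
print(f"required capacitance: about {c_needed * 1e6:.1f} uF")
```

Under these assumptions roughly 29 uF is needed close to the bumps, and the lower the rail voltage, the smaller the allowed droop and the more capacitance is required for the same current step.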

Fig. 5: Capaciball top view and cross section. Source: Technoprobe

Conclusion
Testing HPC devices presents several significant power delivery network design challenges, which if not carefully considered can impact test processes. Efficiency of ATE instrumentation is one area that engineers strive to improve. Even a 10% increase in power efficiency can be significant.

Handling dynamic current draws from HPC devices stresses a digital power supply, as well as the ability of the subsequent power delivery path to maintain a consistent voltage during test. Finally, the last mile of the network, the temporary connection between the DUT and the test fixture, requires innovative test fixture technology.

With the arrival of AI applications, the increasing power trends for HPC have accelerated current draws, and thus exacerbated all the PDN challenges. Innovation will be required to resolve the situation.
