Better test quality is required as devices become denser and more heterogeneous and use cases become more critical; the tradeoffs are cost and time.
Early results of using device-aware testing on alternative memories show expanded test coverage, but this is just the start.
Once the semiconductor industry realized that it was suffering from device failures even when test programs achieved 100% fault coverage, it went about addressing this disconnect between the way defects manifest themselves inside devices and the commonly used fault models. This is how cell-aware testing (CAT) came about some 15 years ago, dramatically improving the defect coverage inside the standard cells of logic devices.
The recent foray into device-aware testing (DAT) is the next-generation effort in the ongoing push to reduce test escapes. Like CAT, DAT strives to detect as many defects as possible in transistors and interconnects. However, DAT differs in that failures are not necessarily modeled as linear resistive defects (resistive shorts and opens), as is the case with most fault models to date. “If you look at realistic defects, then definitely not all defects can be modeled as resistive shorts or resistive opens,” said Erik Jan Marinissen, scientific director at imec.
Fig. 1 shows an example of technology process parameters and the resulting electrical parameters for an RRAM device.
Fig. 1: Device-aware testing maps the effects of variation in process technology parameters and their resulting electrical parameters (CF = conductive filament). Source: TU Delft
Others insist DAT technology is not new, especially from the design side. “It’s a new word because we are trying to make the new technologies work, especially on the memory side. But for a long time, we have looked at memories differently because they are denser and more tool-dependent,” said Yervant Zorian at Synopsys. “And the defect densities are at least twice that of logic devices. We now look at transistor behavior not just electrically, but physically and environmentally because of specific nuances. For instance, in any one fin of a three-fin transistor, there can be a resistive change, but it’s not occurring in the entire transistor. Fins can be physically broken by a certain percentage, and we model them 10%, 20% smaller than nominal and look at the manifestation of the defects in the cell, at the different corners, temperatures, voltages, etc.”
The influence of such defects began perhaps at 14nm, Zorian said, but transistors definitely have become more susceptible to subtle fin height and width change defects with each node change (Fig. 2). And as automotive and high-reliability chipmakers strive toward parts-per-billion defect levels, with chips now containing many thousands of microbumps, DAT is expected to offer unique capabilities in interface verification.
Fig. 2: The same fault behaves differently at different nodes. Source: Synopsys
Another key area where device-aware testing may prove particularly useful is in bringing to market advanced packages containing multiple chips, perhaps involving different vendors. The strategies currently in place can lead to over-testing.
“What happens all the time — and this happened when we went from the transistor level to the cell level — is that the tests were regenerated comprehensively, meaning we did what we did before but we just made it bigger. With multiple devices getting integrated together, we are likely to over-test because we’re not dividing and conquering,” said Dave Armstrong, director of business development at Advantest America.
This challenge is compounded by the combination of AC and DC signals, as well as a plethora of interconnects within and between chips. Once chips are powered up, thermal effects also can become formidable.
“Each device is a black box with a periphery that needs monitoring,” Armstrong said. “It’s about interoperability and controllability of interfaces between HBM and ASIC, for instance, which are sort of ‘no touch.’ So with ACs, DCs, redundancy repair, and the increasing number of microbumps, you can’t rely on every one of these connections to work flawlessly at all corners. How do you manage that? I think that’s where device-aware testing needs to evolve to.”
Cell awareness inspires device awareness
Cell-aware testing was developed at NXP Semiconductors in 2006; the technology was later acquired by Mentor Graphics (now Siemens EDA). Today, all of the top EDA vendors offer cell-aware modeling and testing.
“Traditionally in testing, the standard cells were considered black boxes. We knew they had ‘AND’ and ‘OR’ functionality, but we abstracted from that and took the theoretical ANDs and ORs. The faults that we did go after were modeled at the inputs and outputs of the standard cells,” explained imec’s Marinissen. “Cell-aware testing takes the lid off the box and does library characterization once per library. You exhaust all possible input combinations, so two inputs give you four test patterns, but 11 inputs give you 2,048 test patterns. Detailed analog simulation then determines which cell-internal defects are covered by which cell-level test pattern, and cell-aware ATPG finds out which cell-level patterns can be expanded to the chip level and which combinations of test patterns best cover the most defects. The library characterization is very time-consuming, but it’s a one-time effort and you can re-use the information.” He noted that in cell-aware testing, transistors are treated as black boxes.
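To make the pattern-count scaling concrete, the following minimal Python sketch simply enumerates every input combination for a cell; the cells themselves are abstracted away, and only the combinatorics Marinissen describes (2^n patterns for n inputs) is shown.

```python
from itertools import product

def exhaustive_patterns(num_inputs):
    """Enumerate all 2^n input combinations for a standard cell with n inputs."""
    return list(product([0, 1], repeat=num_inputs))

# Two inputs yield 4 test patterns; 11 inputs yield 2,048, as noted above.
for n in (2, 11):
    print(n, "inputs ->", len(exhaustive_patterns(n)), "cell-level test patterns")
```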
The ‘device’ in device-aware testing refers to the individual transistor, or to a storage element such as the magnetic tunnel junction (MTJ) in a magnetic RAM or the resistive cell in a resistive RAM. It is any device that requires improved fault modeling and testing to catch manufacturing defects that currently escape test programs. As relatively new terminology, device-aware testing can cause confusion, because in general speech “device” can mean anything from a TV remote control to a mobile phone or system-on-chip.
Nonetheless, there are justifiable reasons for the methodology and a whole new name. Scaling to the 10nm node and beyond introduces many failure mechanisms not captured by existing fault models. Increasing levels of process variation and the 3D nature of finFETs and nanosheet transistors mean testing must address new potential faults between gates, sources, and drains.
The attraction of resistive and magnetic memory chips is their inherent hysteresis. But defects and extreme variation can push a cell’s resistance outside the pre-defined ranges that determine its logic state. For instance, RRAMs can intermittently change their switching mechanism from bipolar to complementary switching, resulting in what is called an intermittent undefined state fault.[1] Engineers also uncovered at least two defect mechanisms that cause faults during operation and are not captured by existing models: capping layer doping defects and forming process defects. The defects and fault mechanisms are dictated by the device, its processes, and its physics of operation.
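As a rough illustration of how an out-of-range read becomes an undefined state, consider the minimal sketch below; the resistance windows are hypothetical placeholders, not characterized device data.

```python
# Hypothetical resistance windows for an RRAM cell (ohms). Real values depend
# on the technology and would come from device characterization.
LRS_MAX = 20_000    # upper bound of the low-resistance (SET) window
HRS_MIN = 100_000   # lower bound of the high-resistance (RESET) window

def read_state(resistance_ohms):
    """Map a measured cell resistance to a logic state."""
    if resistance_ohms <= LRS_MAX:
        return "1"          # low-resistance state
    if resistance_ohms >= HRS_MIN:
        return "0"          # high-resistance state
    return "undefined"      # outside both windows: an undefined state fault

print(read_state(12_000))   # "1"
print(read_state(55_000))   # "undefined" -- the kind of fault DAT aims to catch
```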
First successes
Said Hamdioui, head of the Computer Engineering Lab at TU Delft, together with imec, has performed pioneering work in device-aware testing, including on RRAM and spin transfer torque (STT) MRAM devices fabricated in the back end of line of standard CMOS.[2] “These devices have unique defect mechanisms which are non-linear by nature. Moreover, using linear resistors is even misleading in the sense that it can result in incorrect fault models that have nothing in common with the actual fault behavior of the device — leading to test escapes and waste of test time,” he said. In other words, the only test coverage or fault models that matter are those tied to real defects. One example involves silent data errors (SDEs) — inaccurate mathematical computations that Meta and Google engineers identified in their operational data centers and that arise only under certain usage conditions. “DAT can play an important role here,” said Hamdioui.
“Device-aware testing looks at the physical properties of a transistor. Now if you have to do that for a single library cell, you might already have four transistors, but some big library cells have 1,500 transistors, so it becomes an enormous task to look at every library cell at the transistor level,” said Marinissen. “This is different for memories, where a DRAM cell has one transistor. You look at a cell and its neighbors, so a three-by-three block of cells covers all the direct neighbors to include for testing. But in logic, there are too many library cells and too much variety, so this is why I see the concept as more applicable to memories than logic.”
Zorian agrees that DAT is a better fit for memory devices. “What’s happening now is the amount of redundancy is increasing, so simulations include multiple rows — columns at the block level in MRAMs, for instance — and there’s much more granularity, so the simulation is much more extensive. But it’s not possible to do every short, so we do automated inductive fault analysis (AIFA) to find new defects to test for at each node, and there’s huge automation and algorithms around this to identify the critical defects. For automotive, you can go up to eight corners, and chips are tested at different temperatures and voltages.”
In fabricated STT-MRAMs, Marinissen described pinhole-type defects in the barrier layer between the two magnetic layers of the storage device (the magnetic tunnel junction). “The pinhole starts out small but can deteriorate over time,” he said. “This defect manifests itself as a stuck-at-zero fault if the defect is large enough. So in that case, the pinhole defect analysis did not lead to different testing, but it did contribute to better understanding of the failure.”
DAT in practice
Device-aware testing, according to Hamdioui, consists of three steps — defect modeling, fault modeling, and test development.
Physical defect modeling. The device model incorporates the way a defect impacts the technology parameters (e.g., length, width, density), and thereafter the electrical parameters (e.g., the switching time) of the device. [3] The defect modeling approach takes the electrical model of a device and the defect under investigation as inputs, and delivers an optimized (parameterized) model of the defective device as output.
Defects can be ‘weak’ or latent, meaning they manifest with aging or under certain operating conditions. DAT modeling requires a deep understanding of the defects and how they manifest at the electrical/functional level for different stimuli — especially voltage and temperature.
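To make the defect-modeling step more concrete, here is a minimal sketch of a parameterized defective-device model along the lines described above; the parameter names, nominal values, and the way the defect scales them are hypothetical placeholders, not a calibrated model.

```python
from dataclasses import dataclass, replace

@dataclass
class DeviceParams:
    """Hypothetical technology parameters of a resistive storage device."""
    cf_length_nm: float              # conductive filament length
    cf_width_nm: float               # conductive filament width
    oxygen_vacancy_density: float    # relative density, 1.0 = nominal

def apply_defect(nominal: DeviceParams, defect: dict) -> DeviceParams:
    """Return a defective-device model by scaling the technology parameters."""
    return replace(
        nominal,
        cf_width_nm=nominal.cf_width_nm * defect.get("cf_width_scale", 1.0),
        oxygen_vacancy_density=nominal.oxygen_vacancy_density
        * defect.get("vacancy_scale", 1.0),
    )

def switching_time_ns(p: DeviceParams) -> float:
    """Toy electrical parameter derived from the technology parameters."""
    return 10.0 * (p.cf_length_nm / p.cf_width_nm) / p.oxygen_vacancy_density

nominal = DeviceParams(cf_length_nm=5.0, cf_width_nm=2.0, oxygen_vacancy_density=1.0)
defective = apply_defect(nominal, {"cf_width_scale": 0.5})   # e.g., a narrowed filament
print(switching_time_ns(nominal), switching_time_ns(defective))
```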
Engineers want to keep the focus on targeting true faults, not chasing false ones. This requires detailed fault modeling for the particular device type, because actual defect signatures can vary by node, device type, and use conditions.
Fault modeling. This effectively analyzes the behavior of a design in the presence of defects. First, the engineering team identifies the fault space that describes all possible faults. “Then, systematic fault analysis is performed based on SPICE simulation in order to derive a realistic fault space — i.e., the actual faults that are sensitized in the presence of the defect,” Hamdioui explained.
In practice, most engineers use a combination of fault models — layout-aware, cell-aware, embedded multi-defect, etc., in both static and dynamic modes.
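As a rough sketch of the systematic fault analysis Hamdioui describes, the loop below compares defective behavior against fault-free behavior per memory operation; a toy lookup table stands in for the SPICE runs, and the defect names and responses are illustrative only.

```python
# Fault-free expected responses per operation (write/read 0 and 1).
FAULT_FREE = {"w0": 0, "r0": 0, "w1": 1, "r1": 1}

# Toy stand-in for circuit simulation results with each injected defect.
DEFECTIVE_BEHAVIOR = {
    "pinhole_large":  {"w0": 0, "r0": 0, "w1": 0, "r1": 0},  # behaves stuck-at-0
    "forming_defect": {"w0": 0, "r0": 0, "w1": 1, "r1": 0},  # read-1 disturbed
}

def derive_realistic_fault_space():
    """Keep only the faults that some operation actually sensitizes."""
    fault_space = {}
    for defect, behavior in DEFECTIVE_BEHAVIOR.items():
        sensitized = [(op, got) for op, got in behavior.items()
                      if got != FAULT_FREE[op]]
        if sensitized:
            fault_space[defect] = sensitized
    return fault_space

print(derive_realistic_fault_space())
```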
Test development. This takes the understanding of the nature of realistic faults to develop appropriate tests. For memories, Hamdioui said these could be a march test, a DFT scheme, special stress combinations, schemes that monitor a particular parameter, etc.
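For illustration, here is a minimal sketch of one such march test, the classic March C-, applied to a simple software memory model that stands in for the device under test.

```python
# March C-: six march elements, each an address order plus a sequence of
# read/write operations applied at every address.
MARCH_C_MINUS = [
    ("up",   ["w0"]),
    ("up",   ["r0", "w1"]),
    ("up",   ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("down", ["r0"]),
]

def run_march(mem):
    """Apply March C- to a list-based memory model; return miscompares."""
    failures = []
    for direction, ops in MARCH_C_MINUS:
        addresses = range(len(mem)) if direction == "up" else reversed(range(len(mem)))
        for addr in addresses:
            for op in ops:
                if op[0] == "w":
                    mem[addr] = int(op[1])          # write 0 or 1
                elif mem[addr] != int(op[1]):       # read and compare
                    failures.append((addr, op))
    return failures

memory = [0] * 16            # fault-free memory model
print(run_march(memory))     # [] -> no failures detected
```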
System-level testing
Testing content has been moving progressively upstream in efforts to detect failures earlier in chip manufacturing, providing more immediate feedback to previous processes. “Functional test content is moving to wafer probe. So is some structural content. But a lot of it is moving to functional test. And there are two wafer sorts — hot and cold — and the second is typically hot because with large high-power devices, there’s a risk of burning them up. You only do that on the ones you really care about,” said Armstrong.
He added that it’s doubtful there will be additional test steps associated with DAT. More likely, existing test insertions will have additional test patterns for DAT.
At this time, it’s unclear to what extent DAT will help with interface validation in multichip packages, for instance. Meanwhile, the industry is integrating more chips together — not just HBMs and processors, but multiple configurations, especially for consumer devices. The need to test interfaces and interactions between devices is clear.
“System-level test is an early version of this,” Armstrong said. “It has been designed specifically with this subdividing and focusing on the interface levels, making sure the interaction between the devices works correctly in terms of the thermal signatures and the signals are working at the right frequency. With new tools, we’ll be able to replace portions of SLT to make it simpler. But device-aware testing is likely to be more encompassing. I believe structural test has run its course and has contributed mightily. But functional test is the old/new rising star. In five years, my crystal ball says we’re going to have different silos of activity — one for structural test, one for functional test, and one for device-aware test or interface-fault test. They might have their own standards and DFT methodologies, but we’ll have to divide the silos to avoid the over-test problem.”
There are tradeoffs to consider for all of them. “The advantage of SLT is that the device activation is much closer to the actual mission mode. The disadvantage is the non-availability of realistic coverage metrics,” said Davide Appello, product engineering director, Automotive Digital Products at STMicroelectronics. “In quality-sensitive market segments where the targeted escape defectivity level is measured in parts per million or lower, the estimation can only happen through statistics. At the same time, we shall also consider that (in my experience/opinion) in most cases we are really not in front of ‘physical defects,’ and definitely not ‘hard’ defects. The problem is often dependent on the combined effects of variations, with the actual performance centering of the specific device and the workload driven by the software for processor ICs.”
While DAT focuses on test quality improvement, chipmakers, as always, will strive to reduce test times and cost. “The biggest impact here from a test point of view is a possible increase in test times. At first this could be for devices with embedded RRAM or STT-MRAM memories,” said Ken Lanier, principal technologist at Teradyne. “If this extends to testing logic circuitry, then it could lead to yet another family of structural tests that would once again drive longer test times and the need to reduce test costs through higher site count, whether at wafer probe, package test, or system-level test.”
Existing fault modeling and testing procedures will likely change to accommodate increased levels of process variability, especially at sub-10nm nodes. “We progressively observe an increased gap between the characteristics of fault activation offered by design for test (ATPG/LBIST) and mission mode. This evidence suggests, on one side, the introduction of applicative test methods (e.g., system-level test), and meanwhile, along the DFT path, cell-aware and possibly device-aware testing,” said Appello.
And finally, device makers are incorporating lifecycle monitoring into their designs, which will become particularly essential in autonomous driving scenarios. “For aging, we look at NBTI and electromigration,” said Synopsys’ Zorian. “NBTI is for the big cells, and electromigration susceptibility is in long word lines. In memories there is periodic self-testing that you want to perform every 500 milliseconds, for instance. So we can monitor what is being tested and repaired, and we do that even on the error correction circuit (ECC).”
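As a rough sketch of the periodic in-field self-test cadence Zorian mentions, the loop below triggers a memory self-test every 500 milliseconds and logs whether repair succeeded; the test and repair hooks are hypothetical placeholders, not a vendor API.

```python
import time

SELF_TEST_PERIOD_S = 0.5       # the 500 ms cadence mentioned above

def run_memory_self_test():
    """Hypothetical hook: run MBIST on a memory block, return failing addresses."""
    return []                  # stand-in: no failures detected

def attempt_repair(failures):
    """Hypothetical hook: map failing addresses onto spare rows/columns."""
    return len(failures) == 0  # True if everything is (or already was) good

def lifecycle_monitor(cycles=3):
    log = []
    for _ in range(cycles):
        start = time.monotonic()
        failures = run_memory_self_test()
        log.append({"failures": failures, "repaired": attempt_repair(failures)})
        # Sleep out the remainder of the 500 ms window before the next test.
        time.sleep(max(0.0, SELF_TEST_PERIOD_S - (time.monotonic() - start)))
    return log

print(lifecycle_monitor())
```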
Conclusion
Even with the progress in device-aware testing on RRAM, MRAM, and some structured logic devices, there is still serious work ahead. “[There’s a need to] abstract a model, which most likely should be parametrizable, to fit with fault induction methods applied to the circuit,” said STMicroelectronics’ Appello. “Or, in a different perspective, an ATPG-equivalent algorithm which is capable of targeting a fault list on a given circuit.” DAT also potentially allows a better understanding of defect activation mechanisms, which are device-specific. But until engineering teams begin to do such analyses for the complex devices produced in high-volume manufacturing today, it is unclear how, when, and where DAT will be applied.
For memory devices, work on fabricated MRAM and RRAM devices and their DAT optimization shows that unique defects can be isolated and built into test programs. The challenge of integrating an increasing number of chips or cores, along with all the related interfaces, implies a significant need for a methodology to better characterize and validate those interfaces.
At least for now, there are many unknowns around device-aware testing working their way through the design, test, and chipmaking communities. “In substance, regarding DAT I may have more questions than answers, especially if we expand the application horizon from structured to unstructured logic and in general to SoC architectures,” said Appello.
References