DRAM Test And Inspection Just Gets Tougher

Increased size, faster interfaces, and 2.5D/3D packages put the squeeze on inspection and test methods.


DRAM manufacturers continue to demand cost-effective solutions for screening and process improvement amid growing concerns over defects and process variability, but meeting that demand is becoming much more difficult with the rollout of faster interfaces and multi-chip packages.

DRAM plays a key role in a wide variety of electronic devices, from phones and PCs to ECUs in cars and servers inside hyperscaler data centers. Likewise, it is a necessary component in AI/ML, where the amount of data that needs to be processed and stored is rapidly expanding. What was once a simple memory is now a whole family of options, from inexpensive parts to high-capacity DRAMs (512GB to 1TB) with faster read/write transactions.

DRAM has some unique challenges, though. Despite the historic low cost of this type of memory, there is a growing demand for quality and reliability, particularly when used in safety- and mission-critical applications. This is at odds with the continued shrinking of bit cells, which makes them more vulnerable to defects and process variability. And while self-repair and error correction code (ECC) schemes can address manufacturing imperfections, engineering teams still rely on inspection and electrical test to do the heavy lifting.

Quality and RAS (reliability, availability and serviceability) expectations vary by application. There are widely different data rates and interfaces, depending upon data volume and transfer speeds between memory and processing. The high-bandwidth memory (HBM) interface adds its own level of complexity, because it enables DRAMs to migrate from circuit boards and DIMM slots to 2.5D and 3D packaging.

“The DRAM industry has been evolving their memory product lines to fit diversifying applications,” said Tadashi Oda, memory and system engineering senior director, at Advantest America. “DRAM was only one type, at the beginning, for a computer. But today, the application ranges from PC, server, mobile/tablet, to IoT, AI/ML, automotive. As the result, we have DDR, LPDDR, GDDR, and HBM. Each DRAM has unique technology challenges, and we are expecting more and more segmentation and challenges.”

Wafer inspection and process control
Yield ramping of new DRAMs is most effective when meaningful data is captured closest to the process steps in which defects and process variation occur. Yield, process, and device engineers rely on inspection during wafer production to screen defects and provide feedback on process variability. In addition, engineering teams measure key device electrical parameters during wafer fabrication and tie process variation or defects to the bitmap failures from a wafer.

Higher defectivity levels are a direct result of the continual shrinking of the DRAM cell. “Modern DRAM technology uses a buried wordline structure with a stacked capacitor, where the wordline is below the silicon surface to reduce short-channel effects,” explained Peter Pöchmüller, CEO of Neumonda. “A number of defects can occur during manufacturing that impact DRAM performance and which memory manufacturers commonly test for. One example is contamination during the etch process, where particles can block the lithography or etch process. Another source of defects is contamination in the wafer. If a wafer is contaminated, copper or other ions can move within the wafer and cause effects like variable retention times. Then, of course, there are variations in the manufacturing process itself that cause marginalities in the thickness of isolation layers, which can result in capacitive coupling variations of the memory. Imperfections in the crystal lattice of the wafer can cause temporary or permanent leakage in DRAM cell mechanism.”

Higher sampling rates, which can be as high as 100%, help catch defects sooner. Those are particularly important in sectors such as automotive, where customer demands for extremely high quality — as few as 10 defective parts per billion (ppb) — also drive higher inspection rates.

“In general, the inspection rate really depends on which process steps in wafer/assembly manufacturing, and the purpose,” said Nathan Peng, product marketing manager at Onto Innovation. “For example, for front-end after lithography inspection (after development inspection, or ADI) the inspection step does not necessarily need to be 100% sampling. If the customer wants to monitor process excursions during lithography, the sampling could be relatively low, especially for mature nodes. On the other hand, for process steps like outgoing quality assurance, 100% sampling is a must-have, since the purpose is for outgoing wafer quality control where a decision will be made for acceptance or rejection of the wafer or die based on the defectivity requirement.”

Device and yield engineers use electrical tests as early as FEOL to gain insights into process variation. “Parametric test is a kind of inspection that measures electrical characteristics (threshold voltage, on/off current, capacitance, etc.) of basic device elements of memory (e.g., memory cell, transistor),” said Yasuhiko Iguchi, marketing manager for wafer test solutions at Keysight Technologies. “As DRAM process node progresses and the number of NAND stacking layers increases, design margins of key memory elements keep shrinking. Due to that shrinkage, memory devices become vulnerable to process variation which affect retention, performance, or reliability. Process control monitoring using the electrical test of basic memory circuit elements becomes more important in wafer production for inspecting the change of electrical characteristics of those basic elements. Typically, the test is performed after fabricating transistor layers on a wafer (inline) and fabricating contact pads on a wafer (end of line).”

After wafer test, yield engineers use bit maps of failing DRAM cells to zero in on probable defect sources. This bit map comes from rastering the pass/fail data from all the tests.
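As a rough illustration of the rastering step, the sketch below turns a list of failing cell addresses into a bit map and sorts the failures into row, column, and single-bit signatures. The function names, the 8x8 array, and the 50% line threshold are assumptions made for this example, not part of any production classifier.

```python
def build_bitmap(fails, rows, cols):
    """Raster a list of (row, col) failing addresses into a 2D pass/fail map."""
    bitmap = [[0] * cols for _ in range(rows)]
    for r, c in fails:
        bitmap[r][c] = 1
    return bitmap


def classify(bitmap, line_threshold=0.5):
    """Label failures as row, column, or single-bit signatures.

    line_threshold is an assumed cutoff: a row (or column) counts as a
    line failure when at least this fraction of its cells fail.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    bad_rows = [r for r in range(rows)
                if sum(bitmap[r]) >= line_threshold * cols]
    bad_cols = [c for c in range(cols)
                if sum(bitmap[r][c] for r in range(rows)) >= line_threshold * rows]
    # Anything failing outside a bad row or column is treated as an isolated bit.
    single_bits = [(r, c) for r in range(rows) for c in range(cols)
                   if bitmap[r][c] and r not in bad_rows and c not in bad_cols]
    return {"row": bad_rows, "column": bad_cols, "single_bit": single_bits}


# Toy 8x8 array: row 2 fails completely, plus two isolated cells.
fails = [(2, c) for c in range(8)] + [(5, 1), (7, 6)]
result = classify(build_bitmap(fails, 8, 8))
print(result)
```

A real classifier would also look for clusters, partial lines, and repeating patterns across dies, which is where matching against in-line defect data becomes useful.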

“Built-in self-test (BiST) is still the foundation and the gold standard that creates memory results best suited for matching in-line defects to device impact at a cell level,” said Mike McIntyre, director of software product management at Onto Innovation. “When coupled with a bit map pattern classifier, BiST enables the device engineer to truly look at the sensitivities of the device as it ramps.”

Manufacturing test requirements
DRAM testing occurs at wafer probe and packaged test. The final assembled package, end-system requirements, and cost considerations drive the test flow, including ATE requirements and the associated test content.

Engineers use two insertions to manage the long test times for enormous DRAM arrays and the requirements for high-speed interfaces. In the first, all the memory test algorithms are applied at lower speed. For the second (a.k.a., known good die) the die’s high-speed interface is utilized, and memory test algorithms are run at operational speeds. These two insertions are performed during wafer test, and both may be repeated for package test.

“Wafer tests are conducted at relatively low frequencies of around 100MHz to identify weak cells and then repair them. For cost reasons, parallelism needs to be high and is achieved by about four touchdowns per wafer,” said Neumonda’s Pöchmüller. “This requires high-cost probe cards with 20,000 needles and with 2.5g per needle. It adds a high pressure of about 50kg per wafer. KGD tests require higher-speed arrays and full-speed testing at the back end, which needs to be executed through the probe card. This requires low parallelism and high-speed probe cards. For this reason, KGD typically doesn’t support the highest speed classes.”
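The figures in that quote are self-consistent, as a quick calculation shows. The variable names are illustrative only. The needle count and per-needle force come from the quote.

```python
# Total probe force from the figures quoted above.
needles = 20_000          # probe needles per card (from the quote)
grams_per_needle = 2.5    # contact force per needle, in grams (from the quote)

total_kg = needles * grams_per_needle / 1000  # grams -> kilograms
print(f"Total probe force: {total_kg} kg")
```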

A burn-in process can be used to accelerate reliability-related defect discovery, applying high temperatures and voltages to the DRAM while toggling circuit signals. Both wafer-level burn-in and package-level burn-in systems are available. In some instances a system-level test can be added, in which actual workloads are exercised.

There are several reasons for wafer-level burn-in. One is to boost the reliability of DRAM die destined for 3D packages. Selling a wafer is another reason. “Some DRAM vendors sell DRAM as a wafer, not the package,” said Advantest’s Oda. “The buyer then packages and tests the DRAM. In this case the DRAM vendor should guarantee the wafer-level quality. They cannot do package burn-in, so they need to do this on a wafer before sending to their customer.”

“Wafer level test and burn-in (WLTBI) allows memory suppliers the ability to stress all the die on the wafer to identify marginal and failing units. This enables the removal of infant mortality die to improve the multi-chip/heterogeneous module yield. Modules are built by using several different technologies such as microprocessors, memory, silicon photonics, etc. where each of the technologies necessitates a different stress requirement in voltage, time, and temperature that can only be done at the wafer for the specific technology,” said Vernon Rogers, executive vice president of sales and marketing at Aehr Test Systems. “In addition, the thermal requirements for memory during WLTBI is growing, with increased density and frequency driving the need for higher thermal capability at the wafer, and driving the system supplier to deliver higher performance.”

Fig. 1: Two DRAM test flows, with optional steps highlighted by dashes. Source: A. Meixner/Semiconductor Engineering

What’s different about memory
Memory ATE systems differ from logic ATE systems in a couple of ways. First, there are pattern generation and self-repair requirements.

“The major difference is that ATE solution for memory needs to equip APG (algorithmic pattern generator) and fail capture memory (or error catch RAM) to store fail information, access,” said Advantest’s Oda. “At wafer sort, failure analysis for memory repair is a must, and a key for process feedback purposes. The memory repair analysis on-the-fly is very high computing power, and ATE hardware accessibility is intensive.”

High-speed interfaces are a second difference. “For SoC devices, including memory controllers, high-speed I/O faults are covered by integrated DFT capability,” said Ken Lanier, director of strategic business development at Teradyne. “This means SoC testers are no longer tied to memory I/O rates. A 10-year-old SoC tester would probably do a fine job of testing a processor with DDR5 interface. Memory testers, on the other hand, must still do at-speed test to guarantee that a part operates at full speed, including extremely demanding timing tests. Memory ATE must also do this at extremely high site counts to offset the impact of longer test times for larger memories. This introduces huge challenges for ATE designers that require the development of specialized tester electronics that can implement this high-speed capability with an incredibly high level of integration. This also means new memory standards drive the need for new test equipment, so the useful life of the equipment becomes shorter.”

DRAM test content spans a wide range. It includes memory-specific patterns that test refresh capability and cell leakage, along with a long list of patterns that target cell faults. Many of these require specific data in neighboring bit cells.[1] For example, the infamous row hammer test stresses the ability of a bit cell to retain its data while neighboring rows are accessed repeatedly.[2] And as previously highlighted, high-speed tests of memory performance and interface performance check operation timing specs (e.g., tAC, tRCD) and I/O timing specs (e.g., tVB, tVA), respectively.
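One well-known family of cell-fault patterns is the march test, covered in [1]. The sketch below runs the March C- sequence against a small simulated array with two injected stuck-at faults. The FaultyMemory class, the 16-cell array size, and the fault locations are hypothetical and chosen only for illustration. A real memory tester applies such patterns through an algorithmic pattern generator or on-chip test logic, not through software loops.

```python
class FaultyMemory:
    """Bit-addressable memory model with optional stuck-at cells (hypothetical)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})  # address -> forced value

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


# March C-: each element is (address order, operations applied per cell).
# The first and last elements may run in either order; "up" is used here.
MARCH_C_MINUS = [
    ("up",   ["w0"]),
    ("up",   ["r0", "w1"]),
    ("up",   ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("up",   ["r0"]),
]


def run_march(mem, size, elements=MARCH_C_MINUS):
    """Apply a march test and return the sorted list of failing addresses."""
    failing = set()
    for order, ops in elements:
        addrs = range(size) if order == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op in ops:
                value = int(op[1])
                if op[0] == "w":
                    mem.write(addr, value)
                elif mem.read(addr) != value:  # read op: compare to expected
                    failing.add(addr)
    return sorted(failing)


good = run_march(FaultyMemory(16), 16)
bad = run_march(FaultyMemory(16, stuck_at={5: 1, 11: 0}), 16)
print(good, bad)
```

The failing addresses collected this way are the raw material for the bit maps and repair decisions discussed elsewhere in this article.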

“DRAM requires massive usage of internal tests that I like to call ‘secret test modes,’” said Neumonda’s Pöchmüller. “These are proprietary and used to achieve the high parallelism during tests, for example, by compressing multiple I/O into single I/O. Or, timing parameters may be modified to more critical values than natural operations to find weak memory cells.”

While some wafer defect/fault mechanisms are similar to those seen in logic technologies, the nature of reliable bit storage requires some extra attention. Due to DRAM cell density, the susceptibility to defects is extremely high. If every failing bit cell, row, or column were marked as a failure, product yield would be low. Decades ago, these realities prompted design engineers to add spare rows and columns, and the associated methods to execute repairs during testing. At wafer level, repair can be performed by a laser or electric fusing (e-fuse), but only e-fuse can be done at the package level. Error correction code (ECC) circuitry can manage single-bit failures during manufacturing test and system use. The chip area devoted to repair can be anywhere from 5% to 10% of the total area.
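Deciding which spare rows and columns to use is a redundancy allocation problem. The sketch below is a simple heuristic under stated assumptions. It first handles "must-repair" lines (a row with more failing cells than there are spare columns left can only be fixed by a spare row, and vice versa), then greedily picks the line that covers the most remaining failures. The function name and the toy failure list are hypothetical, and this heuristic can miss a valid solution that a more thorough search would find.

```python
from collections import Counter


def allocate_spares(fails, spare_rows, spare_cols):
    """Return (rows_to_replace, cols_to_replace), or None if no repair is found."""
    remaining = set(fails)
    use_rows, use_cols = [], []
    while remaining:
        rows_left = spare_rows - len(use_rows)
        cols_left = spare_cols - len(use_cols)
        if rows_left == 0 and cols_left == 0:
            return None  # failures remain but no spares are left
        ordered = sorted(remaining)  # sorted for deterministic tie-breaking
        row, row_n = Counter(r for r, _ in ordered).most_common(1)[0]
        col, col_n = Counter(c for _, c in ordered).most_common(1)[0]
        # Must-repair: too many fails on one line for the other spare type.
        must_row = row_n > cols_left
        must_col = col_n > rows_left
        if (must_row and rows_left == 0) or (must_col and cols_left == 0):
            return None
        if must_row or (not must_col and row_n >= col_n):
            use_rows.append(row)
            remaining = {(r, c) for r, c in remaining if r != row}
        else:
            use_cols.append(col)
            remaining = {(r, c) for r, c in remaining if c != col}
    return sorted(use_rows), sorted(use_cols)


# Toy example: one fully failing row plus two isolated failing cells.
fails = [(2, c) for c in range(8)] + [(5, 1), (7, 6)]
repairable = allocate_spares(fails, spare_rows=1, spare_cols=2)
unrepairable = allocate_spares(fails, spare_rows=1, spare_cols=1)
print(repairable, unrepairable)
```

In the first call the failing row consumes the single spare row and the two isolated cells each take a spare column. With only one spare column, the die cannot be repaired and would be counted as a yield loss.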

Parallel testing, meanwhile, is deployed at both wafer and package test to lower cost. “Because memory is a commodity, test cost is king,” said Oda. “Therefore, DRAM uses massive parallelism. For wafer test, the technology trend is one touchdown (1TD). NAND 1TD became common years ago and DRAM 1TD is also becoming an imminent requirement. To achieve this an ATE needs to populate many pin electronics and device power supplies. Advantest developed a new innovative test cell to enhance parallelism per given floor space by integrating ATE and handler in a compact cell.”

DRAM testing would not be possible without design for test (DFT), where a programmable memory BiST engine is the workhorse. By providing the ability to run numerous memory test algorithms, it enables engineering teams to trade off test time with test coverage during each phase of the device’s lifecycle. As an IP block, memory BiST also needs to accommodate the various DRAM I/O interfaces, whether that’s LPDDR, DDR, GDDR, or HBM. Each of those has different latencies, data rates, and protocols. Other DFT schemes enable burn-in test insertion and permit test parallelism with I/O compression. To find weak memory cells, DFT changes internal DRAM timings. DFT supports memory array self-repair and HBM lane repair.

DRAM repair primarily has been performed according to manufacturing test standards. In the past decade, JEDEC DRAM standards defined a post-package repair (PPR), which provides one row repair per bank.[3] The standardized methods enable repair upon boot up.

“The DRAM programmable test engine is rarely available to the end customers,” said Faisal Goriawalla, senior staff product management manager at Synopsys. “With the increase in the requirements associated with reliability and in-field operation, this is changing. With extended lifespans, any in-field replacement of a DRAM in a server would be very expensive. DRAM vendors are now providing system available redundancy, referred to as post-package repair (PPR). JEDEC has standardized this post-package repair for the DRAM vendors. System manufacturers are looking to have these solutions deployed in their SoCs to take advantage of the spare capacity in the DRAM and improve the in-field reliability.”

Challenges with DRAM in multi-chip packages
Whether a DRAM is connected to computing SoCs in a 2.5D package or stacked using HBM in a 3D package, yield and test engineers need to address challenges beyond those of monolithic DRAM parts. Some are unique to DRAMs, while others are not.

Modern DRAM pitch size and high micro-bump counts create challenges in both inspection and test.

“More 2D/3D inspection and metrology solutions are required for multi-chip packaging products. For multi-chip packaging there will be extra steps of interconnection (fan-out, micro-bump, direct bonding, etc.) between chips, and the extra interconnects between chips requires more process steps,” said Onto’s Peng. “Defect inspection for broken or bridged RDL lines is required for fan-out. For micro-bumps, CD and height metrology is required, along with residue defect detection in the bump top. For direct bonding, inspection is needed to detect cracks, voids, and delamination.”

The HBM interface offers significantly higher data rates at lower power. It was designed as a wide interface (1,024 lanes), to be used in 2.5D and 3D package solutions. Yet the bump-pitch size and numerous connections for stacked die necessitate a lane repair option to accommodate faulty bonding between dies. This need multiplies as die stacks become deeper (ranging from 4 to 16 dies). It calls for specific test strategies to find failing lanes and enable lane repair, which DFT facilitates. Inspection methods, however, are known to have a higher probability of screening latent defective bonds caused by poor metallurgical contact.

Through-silicon vias (TSVs) and micro-bump connections enable stacked HBM DRAM die. As part of known good die expectations, test flows typically screen defective TSVs prior to die-thinning steps. Then, once the bonds between all the stacked dies are connected, they can be inspected and tested. Inspection plays a role in assessing bonding quality, die alignment (overlay), and die warpage.

“Currently they are using a copper pillar with solder cap,” said Frank Chen, director of applications and product management at Bruker. “Gradually it will transition to hybrid bonding as the pitch shrinks. With vertical stacking, die placement accuracy is critical to monitor and maintain. Excess shift causes bumps to stretch and disconnect, resulting in non-wets. Equally important is the compression which is measured as the bond line thickness (BLT). A large BLT can result in non-wets and small BLT can result in solder squeeze out. In some cases, the solder gets squeezed all the way out, i.e., not within the bond pad area.”

Fig. 2: Stacked die micro bump bonding defect identification and review. Source: Bruker

Hard short and open failures are the simplest to isolate. In the case of partial or marginal bonds, there is a concern that not all of them will be identified. Yet some of these connections eventually will fail in the field. In these cases, detailed inspection can help, but it is not going to be easy.

“Currently, there are some gaps on a tool that is fast enough and has the sensitivity to differentiate between layers and to assess the attributes that affect bonding — especially when you go to 8-, 12-, or 16-high stacks. With these multiple layers, as dies become thinner, warpage and alignment become problematic. All that needs to be monitored,” noted Chen.

The incorporation of DRAM into multi-chip packaging raises the bar for test and DFT.

“In multi-chip modules you have extra challenges associated with testability, accessibility and multi-die diagnostics,” said Synopsys’ Goriawalla. “Also, HBM presents a challenge, as you cannot separately test logic die and memory die. You must test the two together. You need to test the interconnect between them. With this stack of die you need to be able to access it, which IEEE standard 1500 enables. And then, of course, your diagnosis needs to differentiate and isolate to do physical failure analysis. The DRAM-based engine needs to be on the logic chiplet. It sits between the controller and a PHY on the main chip on the DFI bus. In the test mode, it takes control of the PHY to run read/write instructions through the PHY to test the external memory and interconnect.”

Fig. 3: A 2.5D multi die configuration that supports DRAM testing, diagnosis and repair. Source: Synopsys

For multichip assembly manufacturing, traceability is key to enabling operational feedback and die performance-matching.

“Once the die is deemed ‘good to go,’ the traceability and analytic techniques to ensure multi-chip module (MCM) compatibility and monitoring revision control are generally the same between memory and any other component type being used in an MCM,” said Onto’s McIntyre. “The best example is matching memory performance to the performance of the other chips in a package. This is a must. The consequences of placing a slow memory chip in a high-performance package could result in a downgrade of the entire package. This usually results in lower ASPs and profitability. Another example would be placing memory with compromised dynamic operating range into a package that will be deployed in an environment expected to be highly variable.”

Many observers perceive DRAM as a commodity product because of the cost. But lumping all DRAM into the same bucket is a misconception. Memory performance is essential for the overall performance of some of the most complex compute systems.

But given computing’s ever-expanding appetite for terabytes of data and faster execution in ML and AI applications, DRAM’s migration to denser bit cells and stacked-die implementations is complicating both test and inspection. This problem will only become more difficult to solve as faster memory and more complex packaging become the norm.

[1] “Testing Semiconductor Memories: Theory and Practice” by A. J. Van De Goor, publisher John Wiley and Sons Inc, 1998
[2] https://en.wikipedia.org/wiki/Row_hammer
[3] HBM JEDEC 2021 standard https://www.jedec.org/standards-documents/docs/jesd235a
