New Memories Add New Faults

Why existing test approaches don’t always work, and what still needs to be done to ensure reliability.


New non-volatile memories (NVM) bring new opportunities for changing how we use memory in systems-on-chip (SoCs), but they also add new challenges for making sure they will work as expected.

These new memory types – primarily MRAM and ReRAM – rely on unique physical phenomena for storing data. That means that new test sequences and fault models may be needed before they can be released to high-volume production. In a similar fashion, new ways of building old memories like flash may introduce new faults, as well.

“In order to develop proper test algorithms, whether it’s for MRAM, SRAM, 5nm, 3nm, or RF, each needs to be looked at separately,” said Yervant Zorian, fellow and chief technologist, hardware analytics and test at Synopsys. “One must do fault injections in them, extract fault models, and develop algorithms.”

Understanding the nuances of the different storage mechanisms helps to motivate any new tests or inspections that might be needed to ensure faulty bit cells aren’t shipped. It also highlights the fact that any new memory cell must be scoured for novel faults prior to release.

In many cases, looking for faults in a new process is standard practice. For example, when a standard logic cell is moved to another process node, it must be checked to make sure that any potential faults due to the new layout are being caught by tests – and that any tests that are no longer needed with the new layout can be dropped.

These are cases where the unknowns are mostly known, and best practices can help to ensure good product out the door. But some changes are more significant, and they present more of the unknown unknowns. This can be the case when a radical new way of building an existing circuit is used or, even more so, when some new physics that has never been leveraged before is invoked.

“It’s tricky to figure out a new technology node or a new generation of devices because it hasn’t been done before,” said Samuel Lesko, director of technology and applications development at Bruker. While that may sound obvious, it’s a real challenge that must be met head-on.

Finding faults with new NVMs
NVM developers are exploring new phenomena for use in storing data in a manner superior to mainstream flash technology. Phase-change memory (PCRAM) is relatively well understood based on its prior lives, but STT-MRAM and ReRAM are still making their way onto the market.

MRAM, which had an earlier start, is slowly moving into SoCs and ASICs as an embedded NVM. ReRAM, meanwhile, has been highly anticipated for the last few years. But neither of them has yet achieved the production volume needed for yield learning and lower cost.

In fact, there are still new mechanisms that may need to be tested in production. These represent situations that are unique to the technology and may not be discovered until later in the development process.

At that point, work is required to understand underlying mechanisms so that, in the best case, the issue can be eliminated outright. But very often the result of the work is a test that can be economically applied to ensure good material.

Testing of these NVMs may introduce some fundamental differences. In the case of ReRAM, there is a “forming” step that establishes where future filaments or channels will be. That happens at test. And both ReRAM and MRAM have calibration requirements. So what we blithely call “test” may include calibration – as well as repair if the device has redundancy.

“We always say that SRAM needs test and repair,” said Zorian. “For MRAMs and ReRAMs, you need calibration, test, and then repair.”

Fig. 1: New faults must be considered in particular when disruptive changes happen. On the left, flash is shown with a new implementation, 3D NAND. New faults may be found at launch, or when scaling to add more layers, or even during production learning. On the right, flash is replaced by two new memory technologies that need to be checked for new failure mechanisms related to their physics. Source: Bryon Moyer/Semiconductor Engineering

A ReRAM switching glitch
One ReRAM example [1] came from this year’s European Test Symposium (ETS). Presented by Moritz Fieback from TU Delft, it deals with the forming step. ReRAM operates by establishing or removing a conductive path through a dielectric. Exactly how that path works can vary, though. Some move metal ions, some move oxygen vacancies, but they all create a channel that may be slightly different each time it is created.

There are a couple of ways this channel can be established, with so-called bipolar switching or complementary switching. Bipolar switching is the desired behavior, where the caps at the end of the cell absorb or donate ions or oxygen vacancies. But occasionally, for reasons that are still being sorted, that cap may be saturated, making it unable to accept any more ions. That leads to complementary switching.

The problem is this happens unpredictably. “You can have 100 cycles in which no fault occurs, and three cycles at most in which this fault occurs, and then 100 cycles in which everything goes smoothly again,” said Fieback.

That, in turn, can reduce the strength of the logic 1 state, even over time. And this isn’t a rare problem. In TU Delft’s sample of devices, 40% showed this effect to some degree.

The fault can be detected by looking for cells with an undefined state – that is, somewhere between 1 and 0. But since it doesn’t happen very often, testing each bit in an array makes for a very long and uneconomical test. Stressing the device can make the effect more prominent, but may have other negative consequences.

While ECC sometimes can be a blanket solution to underlying failure modes, it doesn’t work in this case. “You could use ECC to prevent a failing cell that switches from zero to one,” said Fieback. “But the ECC would need to be able to detect undefined states, as well. And I’m not really aware of ECC that does it for ReRAM.”

The suggestion provided was for designers to give the sense amp a test mode that makes detection of the undefined state easier and faster.
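One way to picture such a test mode is a read that compares the cell against two references rather than one, and flags anything that lands in between. The sketch below is illustrative only. The two-reference scheme, the resistance values, the mapping of low resistance to logic 1, and all names are assumptions made for this example, not details from the paper.

```python
from enum import Enum


class CellState(Enum):
    ZERO = "0"
    ONE = "1"
    UNDEFINED = "U"


# Hypothetical reference resistances in ohms; real values depend on the device.
R_LOW_REF = 10_000.0   # at or below this, the cell reads as a solid 1 (assumed low-resistance state)
R_HIGH_REF = 50_000.0  # at or above this, the cell reads as a solid 0 (assumed high-resistance state)


def classify_read(resistance_ohms: float) -> CellState:
    """Classify one read using two references instead of a single threshold."""
    if resistance_ohms <= R_LOW_REF:
        return CellState.ONE
    if resistance_ohms >= R_HIGH_REF:
        return CellState.ZERO
    return CellState.UNDEFINED


def find_undefined_cells(resistances):
    """Return the addresses (list indices) of cells that fall between the references."""
    return [
        addr
        for addr, r in enumerate(resistances)
        if classify_read(r) is CellState.UNDEFINED
    ]
```

A single-threshold read would silently report the in-between cells as either 1 or 0, which is why a dedicated mode is needed to expose them.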

Calibrating MRAMs
Two other ETS papers talked about challenges trimming reference voltages for the sense amps that determine a 1 or a 0 for each cell. Both approaches leverage memory built-in self-tests (MBIST) to figure out where the trim setting should be.

The first MRAM paper [2], presented by Christopher Münch, a researcher at Karlsruhe Institute of Technology, focused on temperature variation amongst bit cells in an array. These cells vary by temperature more than standard CMOS transistors would.

“The resistive behavior of the MTJ (magnetic tunnel junction) is temperature-dependent, meaning that, for interesting operating ranges from around -40 to 125 °C, the resistance of a cell in the P (parallel) state is nearly constant, whereas the resistance of a cell in AP (anti-parallel) state is decreasing with increasing temperature,” said Münch. “If we look at the resistive behavior of the transistor and the MTJ together, we can see that the resistive behavior of the transistor cancels the shift of the MTJ in a P state and additionally introduces a shift on the cell AP state. And one cell state varies more than the other – with transistors compensating in one direction and exacerbating in the other.”

The goal is to make sure that high-temperature behavior can be modeled without the need for an additional expensive high-temp test. They were able to model the extrapolation from low temperature to high temperature so that the MBiST runs used to identify the cell distribution could be run at low temperatures and still provide the information necessary to determine a trim voltage that would work at high temperature.
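The idea of extrapolating to the hot corner can be sketched in a few lines. This is a deliberately simplified model and not the one from the paper: it assumes the P-state resistance stays constant, that the AP-state resistance falls linearly with temperature, and that the reference is placed midway between the two. The function names and the temperature coefficient are hypothetical.

```python
def extrapolate_ap_resistance(r_ap_measured, t_measured_c, t_target_c, tempco_per_c):
    """Assumed linear model: R_AP(T) = R_AP(T0) * (1 + tempco * (T - T0)).

    tempco_per_c is negative because the AP resistance drops as temperature rises.
    """
    return r_ap_measured * (1.0 + tempco_per_c * (t_target_c - t_measured_c))


def pick_reference(r_p_max, r_ap_min_measured, t_measured_c, t_hot_c, tempco_per_c):
    """Pick a reference resistance that still separates P from AP at the hot corner.

    r_p_max:            highest P-state resistance seen at the test temperature
    r_ap_min_measured:  lowest AP-state resistance seen at the test temperature
    """
    r_ap_min_hot = extrapolate_ap_resistance(
        r_ap_min_measured, t_measured_c, t_hot_c, tempco_per_c
    )
    if r_ap_min_hot <= r_p_max:
        raise ValueError("no read window left at the hot corner")
    # Place the reference midway between the two distribution edges.
    return (r_p_max + r_ap_min_hot) / 2.0
```

With a P edge of 5,000 ohms, an AP edge of 10,000 ohms measured at 25 °C, and an assumed coefficient of -0.002 per degree, the AP edge shrinks to about 8,000 ohms at 125 °C, so the reference lands near 6,500 ohms rather than the 7,500 ohms that the test-temperature data alone would suggest.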

The other MRAM paper [3] dealt with one of the challenges of using traditional MBiST for a process like this. Normally, MBiST is used to identify faulty cells, and, when found, the test is paused while data is downloaded for further analysis. But in this application, MBiST is run repeatedly with different trim voltages to search for the edges of the cell distributions. Those edges make themselves known through cell failures.
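The search for distribution edges can be pictured as a sweep over trim settings, counting read failures at each one. The following is a minimal software sketch under stated assumptions: cells are modeled as plain resistance values, logic 1 is assumed to be the high-resistance (AP) state, and the function names are hypothetical. A real MBiST engine does this in hardware.

```python
def count_read_failures(cells, reference):
    """Count cells whose read value disagrees with the value written.

    cells: list of (resistance, expected_bit) pairs, where bit 1 is assumed
           to be the high-resistance state.
    """
    failures = 0
    for resistance, expected_bit in cells:
        read_bit = 1 if resistance > reference else 0
        if read_bit != expected_bit:
            failures += 1
    return failures


def sweep_trim(cells, trim_references):
    """Run one simulated pass per trim setting and collect the failure counts."""
    return [count_read_failures(cells, ref) for ref in trim_references]


def choose_trim(fail_counts):
    """Return the index at the centre of the longest zero-failure run, or None."""
    best_start, best_len = None, 0
    start = None
    for i, n in enumerate(fail_counts + [1]):  # sentinel closes a trailing run
        if n == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2
```

The settings where the failure count first rises above zero on either side mark the edges of the two cell distributions, and the chosen trim sits between them.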

It’s impractical and time-consuming for the MBiST function to stop and download data every time such a failure is encountered. Leaving that capability out, however, removes a potential diagnostic and learning mechanism going forward.

“While it is expected that, during the training procedure, certain memory cells may produce wrong read values, even if the correct reference resistance is selected, it is beneficial to know the exact locations of failing memory cells for multiple reasons such as volume ramp-up, yield learning, on-chip repair, etc.,” explained presenter Artur Pogiel from Siemens.

So it’s useful to make that data available, but not by repeatedly stopping the whole calibration process to do so.

What the team did instead was to note that this calibration is run before any logic tests are done, so some of the logic-testing hardware could be re-used. In particular, embedded deterministic test (EDT) employs compaction of output results for streaming out during the tests.

An additional convolutional compactor, with low die-size impact, can be used in this case for the MBiST results. That allows the information to be downloaded while the calibration continues without slowing down the process.

“The presented scheme assumes reuse of test data compression facilities that are often available on the chip to test the logic,” said Pogiel. “Therefore, the hardware overhead incurred by our scheme is low and boils down to a single test-response compactor.”
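To give a feel for how a compactor lets fail data stream out without halting the test, here is a toy software model. It is not the circuit from the paper. It assumes a short feed-forward shift register in which each input is XORed into a few stages, with one bit leaving the register per cycle. The tap positions, register length, and function name are all made up for illustration.

```python
def compact(stream, taps, register_len):
    """Toy feed-forward compactor.

    stream:       list of per-cycle input bit lists (1 = miscompare on that input)
    taps:         taps[i] lists the register stages that input i is XORed into
    register_len: number of stages; stage 0 is the output stage
    Returns the serial output, including the bits flushed after the last cycle.
    """
    reg = [0] * register_len
    out = []
    for bits in stream:
        out.append(reg[0])           # one compacted bit leaves per cycle
        reg = reg[1:] + [0]          # shift toward the output
        for i, bit in enumerate(bits):
            if bit:
                for stage in taps[i]:
                    reg[stage] ^= 1  # inject the miscompare into its stages
    out.extend(reg)                  # flush what is still in the register
    return out
```

Because there is no feedback, a single miscompare leaves a short, fixed signature in the output stream. With taps of `[[0, 1, 3], [0, 2, 3]]` and a four-stage register, a failure on input 0 in cycle 1 shows up as ones at output positions 2, 3, and 5, which is enough to work back to the failing input and cycle while the test keeps running.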

Other MRAM phenomena
MRAM is subject to other unique faults based on its construction and the vagaries of magnetic fields. Siemens identified a list of MRAM-specific faults while creating a test strategy.

Pinholes in the MRAM tunnel oxide, made out of MgO, can reduce the resistance of the cell, making the read process inaccurate. “Uneven MTJ interfaces (MgO – CoFeB) can degrade the MTJ’s polarized current flow, causing a wider distribution of write-resistance levels, which in turn can cause dynamic fault and reliability issues,” said Jongsin Yun, Tessent memory technologist at Siemens EDA.

Separately, under some circumstances, the magnetic fields from one cell can affect neighboring cells. And specific bridging defects in the cell can cause intermittent read failures. If the bridging has a low enough resistance, then every read fails. But with medium levels, the failures are sporadic.

Meanwhile, after a write operation, stray residual magnetic anomalies can cause an intermediate state between programmed and unprogrammed. A modified write operation can help to clean up the cell state.

There’s another phenomenon called “back-hopping,” where a just-programmed cell may spontaneously flip back to its prior state. It appears to be related to part of the reference layer – which is always supposed to be in one fixed state – becoming flipped itself.

“It is tricky to test, because this undesired flip can lead to resistance of the bit cell into either a P-state or AP-state,” observed Yun. “These defects are concealed in the temperature-dependent stochastic behavior of the NVM memories themselves.”

Backend process steps may cause electrical shifts in the bit cell. “Integration and packaging processes after the construction of the MTJ stack may cause a resistance and TMR (tunneling magnetoresistance) ratio shift,” explained Yun.

These are all considerations that may affect testing and write or read operations in the field, which in turn can affect reliability. But it’s not yet known how many of them will remain relevant for full production.

“While such defects are interesting to a researcher, it remains to be seen what their statistical relevance will be,” said Yun. “Available data on the topic is a well-guarded secret of the manufacturer.”

Not even flash is immune
Meanwhile, NAND flash memory is about as far from being novel as is possible. And yet the means of building it has changed dramatically with 3D NAND. The physics is the same, but the physical arrangements are different. 3D NAND is pushing stacking technology in a way that no other device is.

“Stacking has its own complexities in the NAND world,” noted Subodh Kulkarni, president and CEO of CyberOptics.

The stack of thin films creates a completely new feature as compared to planar versions. Stacks of films mean that flatness is important, but it’s also critical that the films behave well as temperature changes. That has been the case so far, but the number of layers has grown dramatically to a maximum of 176 today, with more layers possible in the future.

“As the number of layers in the 3D NAND device increases, more film and thermal stress is being applied to wafers,” said Woo Young Han, product marketing manager at Onto Innovation. “This leads to increases in wafer breakage.”

While the stresses may be okay across a single die, that effect can be multiplied across a wafer before it’s diced up. “Exposure causes small cracks on the wafer bevel to grow, and then thermal and film stress eventually lead to wafer breakage,” Han said. “Our 3D NAND customers that have experienced increased wafer breakage are looking for a wafer-edge (bevel) inspection solution to prevent wafer breakage.”

Breakage is clearly bad for the broken wafer, but it’s worse for the equipment. “When a wafer breaks inside the chamber of a process tool, the process tool needs to be taken down for several days to be cleaned and re-qualified,” said Han. “It is very expensive, and memory manufacturers try to avoid this.”

This flash situation leads to an inspection requirement rather than a new set of tests. The benefit is that bad material is removed from the line early on, before it can cause more problems. Then, testing remains as it was, much of which has been established by JEDEC.

“We have not seen the need for more than JEDEC testing,” said Ishai Naveh, CTO of Weebit. “Sometimes people want us to test longer or a bit harsher – not necessarily because of physics, but because, ‘You’re new, and you’ll need to prove yourself versus the 40 years or 50 years of flash with all the statistics they’ve gathered along the years.’”

Tough-to-predict faults
The first examples illustrate the process often needed when shepherding a completely new technology toward commercial release. Issues may not be identified until the latter stages of qualification as less-obvious faults are uncovered.

The flash example shows that, even after a new approach has proven itself out, scaling it forward may uncover new issues farther down the road. The new effects might become evident only when the new scaled version is being readied for launch. Or, with a situation such as die cracking, it may not be identified until production volumes are run.

Tools and a methodical process can help to identify unexpected behaviors up to a point. Automation can help to ensure as exhaustive a process as is possible when exploring unknowns.

“Such a tool takes a bit cell and its neighborhood and injects defects at different points, and then plays with those defects in terms of their levels in terms of voltages and in terms of temperature,” said Synopsys’ Zorian. “So the entire environment is being replicated for different conditions. It takes layout into account, and we go even inside the transistor,” he noted, talking broadly about new memories and other new technologies like finFETs. “Let’s say a device is using a certain node with certain fins. We break the fins, we short the fins, and so on in the model.”
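The flow described here, injecting a defect, varying its strength, and checking whether the cell still behaves, can be sketched with a very small model. This is a toy illustration and not how any commercial tool works. It assumes a cell is just a resistance, models only two hypothetical defect types (a series open and a parallel bridge), and leaves out the voltage and temperature sweeps mentioned in the quote.

```python
from itertools import product


def read_cell(r_cell, r_ref):
    """Read as 1 if the cell resistance is above the reference (assumed convention)."""
    return 1 if r_cell > r_ref else 0


def with_defect(r_cell, kind, r_defect):
    """Return the effective resistance of a cell with one injected defect."""
    if kind == "series":   # resistive open in the read path
        return r_cell + r_defect
    if kind == "bridge":   # resistive short in parallel with the cell
        return (r_cell * r_defect) / (r_cell + r_defect)
    raise ValueError(f"unknown defect kind: {kind}")


def inject_and_sweep(r_low, r_high, r_ref, kinds, defect_values):
    """Try every defect kind and strength against both stored values.

    Returns (kind, r_defect, stored_bit) for each combination that reads wrong.
    """
    faults = []
    for kind, r_defect in product(kinds, defect_values):
        for stored_bit, r_nominal in ((0, r_low), (1, r_high)):
            r_faulty = with_defect(r_nominal, kind, r_defect)
            if read_cell(r_faulty, r_ref) != stored_bit:
                faults.append((kind, r_defect, stored_bit))
    return faults
```

The list that comes back is the raw material for a fault model: it shows which defect strengths are benign, which ones corrupt a stored 0, and which ones corrupt a stored 1, and therefore which patterns a test algorithm has to apply to catch them.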

Such methodologies often rely on knowledge accumulated through years of making memories. But the more novel a new physical mechanism is, the harder it may be to anticipate the things that could go wrong.

The more innovative a new approach is, the more unknown unknowns there will be. That’s not necessarily news, but it’s important to remember when several significant new technologies are making their way to market.


  1. M. Fieback et al., “Intermittent Undefined State Fault in ReRAMs,” 2021 26th IEEE European Test Symposium (ETS).
  2. C. Münch et al., “MBIST-supported Trim Adjustment to Compensate Thermal Behavior of MRAM,” 2021 26th IEEE European Test Symposium (ETS).
  3. B. Grzelak, “Convolutional Compaction-Based MRAM Fault Diagnosis,” 2021 26th IEEE European Test Symposium (ETS).


Ali Mahdoum says:

Another cause of faulty data in the memory could be parallel (parasitic) interconnections. Those parasitics (resistances, capacitances and inductances) may affect the normal behavior of the memory cells. A prospective solution addressing this issue (but expensive from an area point of view) would be to insert a wire (connected to ground) between each pair of successive memory words. Such additional wires would also reduce the static power dissipation (which, in turn, would reduce the temperature, thereby enhancing the reliability of the memory).
