More Errors, More Correction in Memories

New technologies increase the cost of accuracy as density increases.


As memory bit cells of any type become smaller, bit error rates increase due to lower margins and process variation. Error correction can account for and fix those bit errors, but more sophisticated error-correction codes (ECC) require more silicon area, which in turn drives up the cost.

Given this trend, the looming question is whether the cost of correcting errors at some point in the future will cancel out the cost savings of the next generation of memory cells. Put simply, demand for more memory capacity continues to increase, but the underlying economics in the memory business are changing, and that could have a significant impact on the types of memories used in designs, as well as the overall chip architectures.

“Cost-per-bit scaling of DRAM is flattening node-over-node, but demand for affordable, high-performance memories has never been greater,” said Scott Hoover, senior director, strategic product marketing at Onto Innovation. “Secular drivers, including IoT, autonomous driving, and 5G communication are exponentially increasing data volume and edge compute demand.”

Others agree. “There will always be a drive to get more bits on each memory device,” said Howard David, senior technical marketing manager for memory interfaces at Synopsys. “And so the vendors will come up with the most cost-effective way of correcting the errors that are occurring due to the size of the cells getting smaller.”

Controllers historically have taken the responsibility for error correction. That’s changing as the memory chip plays a bigger role. As techniques other than bit-cell shrinking are used to increase capacity, controllers must become yet more complex. For the time being, ECC remains a robust technique. But if maintaining reliability is no longer viable with a particular memory technology, there likely will be alternative memory technologies that take over.

All computing architectures assume that whatever is fetched from memory is correct. Since the downstream users of memory data can’t deal with errors, it’s up to the memory subsystem to correct its own errors so that the rest of the system can count on it being right. At the very least, if there’s an uncorrectable error, then the memory subsystem can so inform the data consumer.

Memory errors used to be less common. A big contributor was the so-called “soft error,” or single-event upset (SEU), caused by alpha particles or cosmic rays striking the memory. This source of error still exists, but it is external rather than intrinsic to the memory.

Nowadays, the very process of reading and transmitting the data can create errors. This is a more recent phenomenon, and it’s becoming more of an issue as process nodes and memory cells advance.

While the specific physical causes of errors can vary between memory types, the increasing unreliability of memory operations can affect both volatile and non-volatile memories. It has affected each at different times, and so existing solutions vary by type. But, in theory, no memory is immune from eventually generating errors.

Sources of errors
Broadly speaking, there are two main ways in which internal errors can crop up. The first is when reading the bit cell. The second is when communicating that result to the memory controller.

The reading process involves sensing some physical phenomenon, such as a capacitor charge for DRAM, a number of electrons for flash, and various resistances for the new non-volatile memories (NVMs). As cells shrink, each of those requires detecting ever-finer distinctions between a 1 and a 0. If noise from any source hits at the wrong time, a read value may be perturbed.

These sorts of errors may be temporary. “Maybe there’s a bit that has a transient error such that, if you go and read it again, it will be fine,” said David. “Sophisticated memory controllers have retry capability. If we detect an error, but we can’t correct it, we can give it a second shot.”

There’s one catch with DRAM, however. Because it has a destructive read, its contents have to be restored after the read. If it contains, say, a 1, but it mistakenly reads it as a 0, then it will “restore” the value as a 0, and now the error is permanent.

STT-MRAM has an inherent stochastic component to its physics, so it already needs error correction. That raises the question of whether other memory types face a similar limit. When margins, electron counts, or some other aspect of a read operation become small enough, quantum effects, which are inherently stochastic, may have to be taken into account.

“Pretty soon we’re going to get down into the tens to hundreds of electrons making the difference,” said David Still, senior director, RAM design at Infineon. “Once we get to the point where we get one electron, we’re done.”

Doug Mitchell, vice president, RAM product line at Infineon, noted that it’s hard to predict when that quantum effect limit is going to happen.

Alternatives to shrinking
In some cases, bit cell sizes have leveled out. 3D NAND, for example, is focused on adding capacity not by shrinking the bit cell, but by adding layers to the 3D stack.

In addition, existing cells increasingly are being used to store multiple bits’ worth of data. This is done by taking the sensing range that once represented a single bit and subdividing it. At a time when margins already are shrinking, that subdivision leaves even less margin, making errors more likely.

“The move from TLC (triple-level cells) to QLC (quad-level cells), or from MLC (multi-level cells) to TLC, requires better error correction, since the signal-to-noise ratio worsens as the number of bits per cell increases,” said Jim Handy, memory analyst at Objective Analysis.

Fig. 1: Multi-level cells take a given sense range for a single-bit value and further subdivide it for a two-bit value. Each subdivision needs noise margins, so those margins are reduced from those available with single-bit cells. Source: Bryon Moyer/Semiconductor Engineering
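
To make the squeeze concrete, here is a minimal Python sketch of how splitting one sense window into more levels shrinks the gap between adjacent levels. The 1 V window is a made-up number purely for illustration.

```python
# Hypothetical numbers: assume a fixed 1.0 V sense window, split into
# 2**bits_per_cell levels. The gap between adjacent levels is roughly
# the margin available to distinguish neighboring states.
def per_level_margin(window_volts: float, bits_per_cell: int) -> float:
    levels = 2 ** bits_per_cell
    return window_volts / (levels - 1)

for bits, name in [(1, "SLC"), (2, "MLC"), (3, "TLC"), (4, "QLC")]:
    print(f"{name}: ~{per_level_margin(1.0, bits) * 1000:.0f} mV between levels")
```

Going from one to four bits per cell cuts the spacing between levels by roughly a factor of 15, before any noise margin is even budgeted.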

Process variation also is becoming a greater contributor to the need to protect against errors.

“Process variation needs to be accurately modeled and verified from 3- to 7-sigma,” said Sathish Balasubramanian, head of product management for AMS at Siemens EDA. “Running brute-force Monte-Carlo verification for 3-sigma and above is not feasible, since we will need to run millions/billions of simulations. Designers will need to adopt newer methodologies to verify bit-cell reliability.”

Finally, as any memory is made larger, all else being equal, the overall risk of an error goes up simply because there are more bits that could be misread.
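
A quick back-of-envelope calculation shows the effect. The per-bit error rate below is a placeholder, not a real device figure; the point is only how the chance of at least one bad bit scales with capacity.

```python
import math

# Placeholder per-bit read error probability (not a real device spec).
p = 1e-15

for bits in (2**30, 2**33, 2**36):  # 1 Gbit, 8 Gbit, 64 Gbit
    # P(at least one error) = 1 - (1 - p)**bits, computed stably.
    p_any = -math.expm1(bits * math.log1p(-p))
    print(f"{bits >> 30} Gbit: P(>=1 bad bit per full pass) ~ {p_any:.1e}")
```

The per-bit rate stays constant, but every factor-of-eight increase in capacity multiplies the chance of seeing at least one error per pass by roughly eight.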

Communication errors
Once read, a memory value must be transmitted to the memory controller, which is responsible for taking all of the read and write requests from the consumers or generators of data and ensuring they happen reliably.

But communication bandwidth has been increasing, making it more likely that data will be corrupted in transit. That’s particularly true of some of the high-speed protocols under discussion, which include PAM-4 as a signaling format. Just like multi-bit memory cells, PAM-4 takes the voltage swing that used to carry a single bit and divides it into four levels. That reduces the signaling margin, raising the likelihood that a bit gets corrupted on the way to the controller.

“We see many test challenges with the PAM-4 data modulation proposed by JEDEC to achieve higher-speed interfaces,” said Anthony Lum, U.S. memory market director at Advantest. “PAM-4 drives the need for multi-level voltage comparators and precision at high-speeds, as well as for lower-jitter clocks for write and read operations.”

Fig. 2: PAM-4 signaling takes what used to be two consecutive single-bit symbols and replaces it with a single two-bit symbol. The corresponding eye diagrams are much smaller, making them more of a challenge to keep open. Source: Bryon Moyer/Semiconductor Engineering
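
The mapping itself is simple, as the sketch below shows. The Gray-coded level assignment is the one commonly used for PAM-4 links; the normalized voltage levels are purely illustrative.

```python
# Map bit pairs to one of four Gray-coded PAM-4 levels (normalized).
# Each two-bit symbol replaces what used to be two single-bit symbols.
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}

def to_pam4(bits):
    """Serialize an even-length bit sequence into PAM-4 symbols."""
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

print(to_pam4([1, 0, 1, 1, 0, 0, 0, 1]))  # 8 bits -> 4 symbols: [3, 1, -3, -1]
```

Half as many symbols carry the same data, but each decision now has to distinguish four levels inside the same swing, so each eye is roughly a third the height of the old single-bit eye.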

Some refer to the complete picture — reading a bit cell and then successfully transferring it to the controller — as end-to-end reliability.

Detecting and correcting errors
The best place to detect errors is during chip test. The weakest bits can be eliminated at that point. But even that is getting harder, given the number of bits and the increased communication channel challenges.

That leaves the system to correct errors. In earlier years, simple parity was used. But parity can’t correct errors, and if there is an even number of errors, then it can’t detect them. ECC took over as a more useful approach, despite its greater complexity.
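
A tiny sketch makes the limitation obvious: a single parity bit flags an odd number of flipped bits but can’t say which bit flipped, and an even number of flips cancels out entirely.

```python
def even_parity(word: int) -> int:
    """Return 0 if the word has an even number of 1s, else 1."""
    return bin(word).count("1") & 1

data = 0b1011_0010
stored = (data << 1) | even_parity(data)   # data plus appended parity bit

print(even_parity(stored ^ 0b010))   # 1 -> one flip detected, location unknown
print(even_parity(stored ^ 0b110))   # 0 -> two flips look perfectly clean
```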

ECC includes a wide array of mathematical ways to deal with errors. The most common type uses Hamming codes, which can correct one error and detect two errors. This “single-error-correct, double-error-detect” approach is often abbreviated SECDED.
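
For readers who want to see the mechanics, below is a minimal Python sketch of SECDED using an extended Hamming(8,4) code over a 4-bit value. Production memories apply the same idea to much wider words, such as Hamming(72,64), and implement it in hardware; this toy version only shows how a syndrome locates a single flipped bit while an overall parity bit exposes double errors.

```python
# Toy SECDED: extended Hamming(8,4). Bit positions 1..7 hold a
# Hamming(7,4) codeword (parity at positions 1, 2, 4; data at
# positions 3, 5, 6, 7); bit 0 holds overall parity for the
# double-error-detect extension. Illustrative only.

def encode(nibble: int) -> int:
    d = [(nibble >> i) & 1 for i in range(4)]          # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                            # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                            # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]                            # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]        # codeword positions 1..7
    overall = 0
    for b in bits:
        overall ^= b
    word = overall                                     # overall parity at bit 0
    for pos, b in enumerate(bits, start=1):
        word |= b << pos
    return word

def decode(word: int):
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos                            # points at a bad position
    overall = 0
    for b in bits:
        overall ^= b                                   # 0 if the word is clean
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                                 # odd number of flips: assume one
        if syndrome:
            bits[syndrome] ^= 1                        # correct the flipped bit
        status = "corrected"
    else:                                              # even flips, nonzero syndrome
        return None, "uncorrectable"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status

cw = encode(0b1011)
assert decode(cw) == (0b1011, "ok")
assert decode(cw ^ (1 << 5)) == (0b1011, "corrected")            # single flip fixed
assert decode(cw ^ (1 << 5) ^ (1 << 2))[1] == "uncorrectable"    # double flip flagged
```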

ECC has evolved as the technology has matured. “The first generation of ECC at the SoC level was SECDED,” said Synopsys’ David. “The second generation can correct a whole device. The third generation is adding internal ECC, and now the fourth generation of reliability is bounding the faults [dealing with a mathematical anomaly in older ECC].”

While mainstream memories have standardized ECC approaches to ensure interoperability, a lot of discussion goes into deciding how much ECC to provide. “Do you want to do single-bit correction?” asked Still. “Do you want to do double-bit correction? Double-bit error correction hits almost 25% overhead. And do you want to do this on a 128-bit word or on a 64-bit word?”
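
The arithmetic behind that trade-off is straightforward. The sketch below counts check bits for SECDED over different word widths; it is a rough sizing exercise, not a description of any specific product.

```python
# SECDED check-bit count: need r Hamming bits with 2**r >= k + r + 1,
# plus one extra parity bit for double-error detection.
def secded_check_bits(k: int) -> int:
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r + 1

for k in (32, 64, 128):
    c = secded_check_bits(k)
    print(f"{k}-bit word: {c} check bits ({c / k:.1%} overhead)")
```

Protecting wider words cuts the percentage overhead, roughly 12.5% for 64-bit words versus 7% for 128-bit words, but a wider codeword means handling more bits per access, and stronger double-bit correction pushes the overhead higher still.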

Critically, ECC protects both data and error code bits. “The algorithm will be able to correct a single bit flip or detect if two bits are flipped in any of the bits written to memory,” said Brett Murdock, director of product marketing, memory interface IP at Synopsys. “This is a must-have capability, as we simply can’t predict which of the bits available for storage will be the one with the issue.”

Dividing up the ECC work
A look at DRAM illustrates how the memory chip and the controller can divide up the work, with four different approaches.

The most common approach has been so-called “side-band” ECC. With this approach, each memory chip on a DIMM is fully used to store data, and extra chips are added to the DIMM for storing the error codes. This widens the bus so the data and code can be written at the same time. The controller is responsible for calculating the code when writing data, and for verifying the code when receiving a read value.

While this works for some types of DRAM, LPDDR DRAM needs a different solution because it uses a 16-bit bus. The first concern is that adding side-band memory would make for a proportionally much wider bus. Second, the codes are typically 7 or 8 bits, which makes for inefficient use of a 16-bit memory structure. Instead, the same memory chip is used for both data and codes.

This is referred to as “inline” ECC. The controller has to do two sets of writes or reads — one for the data and one for the code, adding latency to each access. Some controllers can pack multiple codes together for sequential data, making it possible to read or write several at once. If sequential data access is common, that reduces the latency caused by the codes.

In each of the above cases, it’s the controller that handles the ECC calculations. “On-chip” ECC is new with DDR5, and it places the ECC inside the memory chip itself. Single errors can be corrected before being sent to the controller. However, if there is an error in transmission, the on-chip ECC won’t catch it, so side-band ECC may still be used alongside it for end-to-end protection.

Finally, “link” ECC protects just the communicated data. It’s calculated at both ends of the link and doesn’t involve any stored codes. On-chip and link ECC could be combined to cover end-to-end.

A cyclic redundancy check (CRC) is another option for checking whether data arrived reliably. “As we progress into advanced nodes with higher interface speeds like DDR6 and GDDR6/7, CRC is important,” said Lum.
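
As a sketch of the idea, the snippet below uses Python’s built-in CRC-32 to check a transferred burst. Real memory links use their own, shorter CRC polynomials defined by JEDEC, so this only illustrates the detect-and-retry principle.

```python
import zlib

burst = bytes(range(64))                     # pretend 64-byte read burst
sent_crc = zlib.crc32(burst)                 # computed at the sending end

received = bytearray(burst)
received[17] ^= 0x04                         # one bit corrupted in flight

print(zlib.crc32(bytes(received)) == sent_crc)  # False -> request a retry
print(zlib.crc32(burst) == sent_crc)            # True  -> accept the data
```

Unlike ECC, a CRC only detects the corruption; the remedy is to retransmit rather than to correct in place.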

Fig. 3: Four types of DRAM ECC. (a) Side-band ECC, where codes are stored in a memory chip separate from the data. (b) Inline ECC, where the internal memory of each chip is divided between data and code. For both (a) and (b), ECC work is done in the controller. (c) On-chip ECC, where the data as read is checked with ECC before being sent to the controller. By itself, this doesn’t catch transmission errors. (d) Link ECC, which catches transmission errors, but by itself doesn’t detect array errors. (c) and (d) need to be combined with each other or another technique to provide end-to-end coverage. Source: Bryon Moyer/Semiconductor Engineering

Accounting for the costs
ECC approaches can vary widely, but the more capable the approach, the more computationally expensive it is. If done in hardware, that means more silicon area. If done in software, that means more CPU cycles. The cost of that ECC may lie in the memory chip, the controller, or both.

The cost includes the extra memory needed to store the codes. Depending on how that’s done, it means either adding memory or not being able to use an entire memory for data, since part of it will be used for the error codes.

ECC circuits must themselves be tested. Increasingly, that’s being done through the built-in self-test (BiST) port as an extension of the memory array test. “Many ECC techniques are trending toward a BiST implementation,” said Lum. “Others are post-processing the acquired ECC data on the tester.”

Redundancy and repair also help to keep bad bits out of production, although they also come with a die cost. “We’ve done a lot of analysis of repair and redundancy versus ECC to see if we can identify which is better for getting rid of weak bits,” said Still. “For hard failures, repair is the best approach because it’s the easiest to do. We’ve tended to minimize our repair to take care of just the hard bits and then go to a lot more ECC [for softer errors].”

In the past, the cost of the ECC circuitry fell on the controller. It remains a cost there, but because it sits in the controller, that cost is amortized over all of the memory chips under the controller’s purview. With DDR5’s on-chip ECC, some of that cost has moved into each memory chip itself, where it is no longer amortized.

In addition, there is a fundamental question of where the ECC belongs. “A system architect doesn’t want the ECC built into his chip, because he wants to be able to control it and recognize errors at a system level,” said Mitchell.

The need to protect both bit-cell access and data transmission may result in multiple ECC instances, further increasing costs.

Larger memories have a higher overall rate of errors, but because error-correction codes apply to some smaller chunk of memory, such as a 128-bit word, the ECC simply gets replicated and shouldn’t increase the cost as a percentage. In fact, for on-chip ECC, the cost goes down, since a single ECC circuit is amortized over more bits.

This leads to the question of how ECC needs to evolve. As errors become more frequent, either longer codes are needed or codes must protect shorter pieces of data, which has the same effect on cost. If bit-cell access becomes increasingly unreliable, the overhead associated with ECC will grow.

Whither from here?
That leaves process migrations as the most likely source of higher error rates — in addition to faster memory connections. For this purpose, taking a single memory cell and using it to store multiple bits has the same effect as a physical shrink. Errors become more likely due to reduced margins.

At some point, the cost saved by further shrinking the memory may be offset by the increased cost of the stronger ECC that will eventually be needed. With as much as 25% overhead in extreme cases today, and possible growth in the future, it’s conceivable the cost savings and cost increases could cancel each other out in some future generation. Would that be the end of scaling?

To some, this sounds like yet another end to Moore’s Law — something for which an end-around is instead invented. Memory customers really don’t care how the memory they need is made to work. They simply need ever-increasing amounts of reliable memory at a cost that their application can support economically.

ECC techniques are diversifying to provide better protection, in some cases at higher cost or latency, in applications that require it. BCH (Bose-Chaudhuri-Hocquenghem) and low-density parity-check (LDPC) codes are examples that are being used selectively.

Different approaches may impact memory chip power, which can itself impact reliability. “Lower power improves reliability, as it reduces the burden on regulators interfacing with memory die,” said Chetan Sharma, principal engineer, RAM design, at Infineon.

But that can be a double-edged sword. “When we go down the line to save the power, we are playing with the process in the bit cell,” said Sharma. “And once you play with the process, there is a high probability that your margins are going to fall apart. In order to contain them, we try to put circuits around the memory that can boost up the power a bit, boost up the timing, and gain back that margin so that we can still deliver a memory that is reliable. At the same time, we are relaxing some of those specs that the customer may not need so that we can play around with the internal timings to relax read or write cycles and get more reliability.”

NAND flash has addressed the scaling challenge by going vertical. DRAM also may do so in the future, although technologists believe there is still more improvement available in the current architecture. That gives DRAM potentially more room before it hits the wall.

Other techniques also are brought to bear. “In the flash world, people started to do things like wear-leveling,” said Still. “Another way to do it for various memory types is to put in scrubbing or refresh cycles.”

“We have a background scrubbing function that we keep to 0.01% of the bandwidth or so — a few times a day or once every few hours,” said Synopsys’ David. “The entire memory is read, and any bit errors are corrected.”
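
Those numbers line up with simple arithmetic. With made-up but representative figures for capacity and bandwidth, dedicating 0.01% of bandwidth to scrubbing gets through the whole array every few hours:

```python
# Back-of-envelope scrub interval. All numbers are illustrative.
capacity_bytes = 64 * 2**30      # 64 GiB of DRAM behind one controller
peak_bandwidth = 50e9            # 50 GB/s controller bandwidth
scrub_share = 0.0001             # 0.01% of bandwidth reserved for scrubbing

full_pass_s = capacity_bytes / (peak_bandwidth * scrub_share)
print(f"One full scrub pass every {full_pass_s / 3600:.1f} hours")
```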

NOR flash isn’t immune from these issues and will need to address them if it is to move beyond its current technologies. “We can improve things with a vertical two-transistor NOR flash bit cell with low power,” said Macronix’s Chih-Yuan Lu in an ITC presentation. “We can also do 3D stacking. And we can put a micro-heater into this structure to make the endurance as long as 100 million cycles.”

Controllers also may become more sophisticated in learning which memory rows may need more or less frequent refreshes. “Maybe five years from now, most of the DRAM will be getting refreshed every 32 milliseconds, but there will be a list the controller has built up of rows that need to be refreshed twice as often,” said David.
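
A controller could track that with nothing more exotic than a set of flagged row addresses, along the lines of this toy sketch (the row numbers and intervals are invented):

```python
# Toy model: most rows refresh at the nominal interval; rows the
# controller has learned are weak get refreshed twice as often.
NOMINAL_REFRESH_MS = 32
weak_rows = {0x1A3F, 0x2210}     # e.g., rows flagged by ECC/scrub history

def refresh_interval_ms(row: int) -> int:
    return NOMINAL_REFRESH_MS // 2 if row in weak_rows else NOMINAL_REFRESH_MS

print(refresh_interval_ms(0x1A3F))  # 16 ms for a flagged row
print(refresh_interval_ms(0x0004))  # 32 ms for everything else
```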

Infineon’s Sharma suggested a few other approaches to dealing with increasing bit-cell unreliability. “[Further techniques include] introducing bit flipping/interleaving in the array design and analyzing memory access patterns, using compiler-based methods to optimize read or write timing at different partitions of the memory array,” he said.

Ultimately, memory vendors and customers have different agendas, and a negotiation effectively plays out as new standards are set.

“Users want things that will improve the performance, and vendors want things that will reduce the price,” observed David. “The vendors will push back on each thing that adds cost to die. And the users have to justify why that thing is necessary.”

Find another solution
If the industry runs out of ideas on our current technologies, then it may need to move to something else. Flash has been the darling of the NVM world for a long time, but as its scaling limits loomed, work started on other NVM technologies like PCRAM, MRAM, and RRAM (or ReRAM).

“They start looking at things different than transistors for the bit cell,” said Still. “They look at resistive elements. They start looking at magnetic switching devices and spin torque and FRAM. They start looking at hysteresis and ferro-materials.”

PCRAM already has been commercialized as Intel’s Optane, but cost has been an issue. MRAM also is becoming available, although the big win for all of these memories is in embedded memory more than stand-alone memory.

“New materials, integration schemes, and system designs have been and will continue to be critically important,” said Hoover.

If one technology starts to approach the end of its life, researchers will look to new approaches to replace the old. It’s a gamble, because incumbent technologies have a way of hanging on far longer than originally expected — with 3D NAND being the poster child for that.

That puts new technologies at a severe disadvantage, because they’re at the start of their manufacturing learning, putting them at a cost disadvantage against the incumbent. If the incumbent can establish a new, far-distant limit, then the new technologies may have to be put on the shelf for a while — perhaps forever.

Despite any concerns, history and a long pipeline of new ideas appear to be pushing such a reckoning far out into the future. No one at this point foresees any time when we’ll have to stop, look around, and say, “Well, I guess we’re done with memory scaling!”

Related
Taming Novel NVM Non-Determinism
The race is on to find an easier-to-use alternative to flash that is also non-volatile.
What Designers Need To Know About Error Correction Code (ECC) In DDR Memories
How side-band, inline, on-die, and link error correcting schemes work and the applications to which they are best suited.
Enterprise-Class DRAM Reliability
A drill-down into how data moves and is repaired in memory.


