Soft Errors Create Tough Problems

Single event upsets become more common as processes shrink and density increases; memory and logic both affected.

By Ed Sperling

Single event upsets used to be as rare as some elements on the Periodic Table, with the damage they could cause relegated more to theory than reality. Not anymore.

At 90nm, what was theory became reality. And at 45nm, the events are becoming far more common, often affecting multiple bits in increasingly dense arrays of memory and, now, in the logic as well. Also known as soft errors, and measured as a soft error rate (SER), these upsets must now be accounted for in designing SoCs, FPGAs, embedded IP and memory chips, adding to the cost and complexity of these devices and straining power budgets with error correction technology.

“About two years ago most of the system companies, when they handed down the spec, there were a few lines of code in there called SER, or soft error rate,” said Tom Quan, senior director of EDA and design service marketing at TSMC. “It used to affect the RAM more, and you had to put in error correcting. The issue now is the logic. It’s so dense already, and it’s going to get denser as we go to 28nm and 22nm.”

Already, the problems have moved beyond a single-bit error. Olivier Lauzeral, president and general manager of iRoC Technologies, an independent testing firm, said the level of single-bit errors has remained stable as manufacturing processes moved from 130nm down to 45nm, but the number of multi-bit upsets has risen dramatically. That creates an even bigger problem. The error-correcting codes commonly used in memory can correct a single-bit error and detect a double-bit error in the same word, but they cannot correct more than one flipped bit or reliably detect more than two.
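The limit described above can be illustrated with a small single-error-correct, double-error-detect (SECDED) code. What follows is a minimal sketch, assuming an extended Hamming code over 4 data bits; the function names `encode`, `decode` and `flip` are made up for this illustration, and real memories protect much wider words with the same principle.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0 = overall parity, indices 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = d
    # The parity bit at position p (1, 2, 4) covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    odd_parity = sum(code) % 2
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: assume exactly one, located by the syndrome
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions upset."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out
```

With `word = encode(0b1011)`, a single upset such as `flip(word, 6)` decodes back to the original value, and a double upset such as `flip(word, 2, 6)` is flagged as uncorrectable. A triple upset such as `flip(word, 1, 2, 3)` is the dangerous case: the decoder reports a successful correction but returns the wrong data, which is why multi-bit upsets are harder to handle than single-bit ones.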

“The mechanism we are dealing with is that charged particles travel through silicon for a certain range before losing their charge,” Lauzeral said. “In the memory, a zero or a one is held by a small charge, which is the critical charge. If a particle deposits its own charge, it can flip the one to a zero or a zero to a one. At 65nm, the charge is 1.1 volts. At 45nm, it is 0.9 volts.”

Sources of the problem

There are two known sources of soft errors. One is alpha particles emitted by decaying radioactive elements. The other is stray neutrons, which are present in great abundance. As voltage and capacitance have been reduced in conjunction with the finer geometries at each process node, the charge needed to flip a stored bit has fallen, and the destructive power of these particles has increased accordingly.
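The link between lower voltage, lower capacitance and higher sensitivity can be written down with a common first-order approximation. This is a sketch rather than anything stated by the sources quoted here: the symbols $C_{\text{node}}$ (capacitance of the storage node) and $V_{DD}$ (supply voltage) are assumptions introduced for illustration, and real critical-charge models add further terms.

```latex
Q_{\text{crit}} \approx C_{\text{node}} \, V_{DD}
\qquad\Longrightarrow\qquad
\frac{Q_{\text{crit}}'}{Q_{\text{crit}}}
\approx \frac{C_{\text{node}}'}{C_{\text{node}}} \cdot \frac{V_{DD}'}{V_{DD}}
```

If the 1.1-volt and 0.9-volt figures quoted above are read as supply voltages, the voltage term alone is about 0.9 / 1.1 ≈ 0.82 from 65nm to 45nm, and any reduction in node capacitance lowers the critical charge further.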

“Flip-flops were no problem before 130nm,” said Lauzeral. “As we go to 90, 65 and 45nm, the failure rates are increasing.”

SRAM is particularly sensitive to soft errors because of the way the charge is stored. In fact, that is the primary reason why Actel has shifted away from SRAM to flash memory. In FPGAs, programmability has to reside somewhere, and historically that’s been in SRAM.

“With flash, you can’t get enough charge on a floating gate to cause an error,” said Mike Brogley, product marketing manager at Actel. “SRAM is more sensitive. You can do some things to mitigate it, but it’s difficult to protect from the SRAM effect without sacrificing area and performance.”

Xilinx, which has been working on the problem for nearly two decades, has developed an epitaxial layer on a heavily doped substrate for parts used in outer space, where highly charged alpha particles can wreak havoc on electronics. Gary Swift, senior staff engineer at Xilinx, said the real concern in space is latch-ups or gate ruptures, which can destroy a device.

“In space, you should monitor your configuration continuously,” said Swift. “For commercial applications, Xilinx provides reference designs to do the same thing. Most of our commercial customers are happy with detection of single event upsets. In the Virtex-5 devices, you also can correct a bit error autonomously. But even as we scale to future nodes, these events are relatively infrequent at sea level.”
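The detect-and-correct loop Swift describes is often called configuration scrubbing. The sketch below shows the general idea only, under stated assumptions: it is not Xilinx's reference design, the configuration is modeled as a plain Python list of byte strings ("frames"), and `golden_checksums` and `scrub` are hypothetical names. Only `zlib.crc32` is a real library call.

```python
import zlib


def golden_checksums(golden_frames):
    """Compute a CRC-32 for each known-good configuration frame."""
    return [zlib.crc32(frame) for frame in golden_frames]


def scrub(live_frames, golden_frames, checksums):
    """Run one scrub pass over the live configuration.

    Any frame whose CRC no longer matches is rewritten from the golden copy.
    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for i, frame in enumerate(live_frames):
        if zlib.crc32(frame) != checksums[i]:   # upset detected
            live_frames[i] = golden_frames[i]   # restore known-good contents
            repaired.append(i)
    return repaired
```

In a real system the pass would run continuously or on a timer, and the golden copy would sit in storage that is itself protected against upsets. Detection alone, which Swift says satisfies most commercial customers, corresponds to reporting the mismatching indices without rewriting the frames.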

Swift said that most people are comfortable rebooting their computers or cell phones a few times a year if there is an upset, but their tolerance would be severely tested if they had to do it once a day.

Particle physics

Studies of alpha particles began with atomic bombs in the 1940s. The field of study shifted to cosmic radiation in the 1970s because there was concern that spacecraft and airplanes could be affected by the more highly charged alpha particles in the upper atmosphere. The study was expanded even further to include terrestrial neutrons (as well as protons) in the 1980s by computer systems vendors, which were concerned about the reliability of their systems.

By themselves, neutrons cannot upset a circuit directly because they carry no charge. But they can collide with atomic nuclei in the silicon, and those collisions produce charged secondary particles that do deposit charge. Error correction in DRAM was one of the first attempts at addressing this problem in the semiconductor world. In addition to stray neutrons, some DRAM packaging contained trace levels of decaying radioactive material from elements such as thorium, polonium, radium and uranium, which in turn produced alpha particles.

All of those effects proved manageable at 1 micron and above. But as Mentor Graphics chairman and CEO Wally Rhines has said publicly, at deep submicron geometries the laws of physics don’t change, but they are more rigorously enforced.

Where the problems strike

“We’ve got several customers in the telecomm space who were seeing system-level failures from these effects,” said Actel’s Brogley. “In an FPGA, these can change the circuitry. They can disconnect a customer or completely change the system. It could be benign or it could be a basic change.”

That threat is less severe in ASICs, where everything is hard-wired, and it rarely happens in software. In fact, Lauzeral said software can be used to mitigate the problem. “You can use an algorithm to protect data. In network communication, for example, when a packet is corrupt you can recall it. But if you get corruption in the addressing of where that packet needs to go, you have a problem. If data is corrupted it’s less of a problem than if it’s the base station for cell phones or the braking or driving system of a car.”

In most cases, there are workarounds. Even in packaging, impurities can be removed, although at significant cost. But increased density, with more bits packed into a smaller space even as the bits themselves shrink, is elevating the problem out of pure research and into mainstream designs. What remains unclear is just how much these workarounds will add to cost, particularly in price-sensitive consumer markets, and what the overall effects will be as we progress from one process node to the next.