Reliability Concerns Grow

New process nodes and packaging options are creating unexpected problems for chips, and longevity is a looming concern.


By Ed Sperling
Knowing when to sign off on an IC design has always been as much art as science, matching engineering experience with managed risk. As ICs become more complex, however, even the most advanced chip companies are getting things wrong.

Some of this can be fixed through software and some of it can be tweaked with programmable firmware. But some of it may have to be fixed in the next cycle of chips.

“We’re seeing three different scenarios unfolding as we push to lower power and higher performance,” said Arvind Shanmugavel, director of application engineering at Apache Design. “One is operational, the second is time-based, and the third is event-based.”

The operational problems arise when areas of the chip are switched on and off, because the transitional current can be high enough to cause serious issues. The time-based concerns involve electromigration. Over time, electrons damage metal interfaces, and that damage must be minimized by controlling the amount of current flowing to interconnects and by creating an accurate model of the temperature distribution across the die. The event-based issues stem from voltage and frequency scaling. Sufficient feedback about temperature is required to prevent wires from burning out.
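As a rough illustration of why both current and temperature matter for electromigration, the sketch below uses Black's equation, a widely used empirical lifetime model that the article itself does not cite. The activation energy (0.9 eV), the current-density exponent (2) and the function name are assumptions chosen for illustration, not values from any of the companies quoted here.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_em_lifetime(j_ratio, temp_c, ref_temp_c,
                         activation_ev=0.9, n=2.0):
    """Electromigration lifetime relative to a reference operating point.

    Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T)).
    Taking a ratio against the reference point cancels the constant A.

    j_ratio       -- current density divided by the reference current density
    temp_c        -- wire temperature in degrees Celsius
    ref_temp_c    -- reference wire temperature in degrees Celsius
    activation_ev -- assumed activation energy in eV (illustrative value)
    n             -- assumed current-density exponent (illustrative value)
    """
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    thermal = math.exp(
        activation_ev / BOLTZMANN_EV_PER_K * (1.0 / t - 1.0 / t_ref)
    )
    return j_ratio ** (-n) * thermal


if __name__ == "__main__":
    # Same temperature, twice the current density.
    print(relative_em_lifetime(2.0, 105.0, 105.0))
    # Same current density, wire running 20 degrees C hotter.
    print(relative_em_lifetime(1.0, 125.0, 105.0))
```

Under these assumed parameters, doubling the current density cuts the estimated lifetime to a quarter, and a wire running 20 degrees hotter at the same current fares about as badly, which is why an accurate temperature map of the die matters as much as the current limits.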

“The event-based liability is ESD, which occurs regularly in high-performance ICs. But it generally is not discussed in EDA,” Shanmugavel said. “You need to know if there is a hit, how is the IC hit and whether you can reliably discharge an ample amount of current. We’ve had a 2 kilovolt standard for decades, but devices are getting smaller and they’re discharging the same amount of current. Designers need to do true simulation and characterization to avoid wire burnout. The wires are thinner, too, so even the same amount of current can cause problems.”
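The point about thinner wires carrying the same discharge can be put into rough numbers. The sketch below assumes the 2 kilovolt figure refers to the Human Body Model, in which a 100 pF capacitor discharges through a 1.5 kΩ resistor. The article does not name the model, and the wire dimensions and function names are hypothetical values chosen only to show the trend.

```python
HBM_RESISTANCE_OHMS = 1500.0  # series resistor in the Human Body Model
HBM_CAPACITANCE_F = 100e-12   # 100 pF storage capacitor


def hbm_peak_current(voltage_v):
    """Approximate peak discharge current, ignoring the device's own resistance."""
    return voltage_v / HBM_RESISTANCE_OHMS


def hbm_time_constant_s():
    """RC time constant of the discharge, in seconds."""
    return HBM_RESISTANCE_OHMS * HBM_CAPACITANCE_F


def current_density_a_per_cm2(current_a, width_um, thickness_um):
    """Current density in a rectangular wire; dimensions given in micrometers."""
    area_cm2 = (width_um * 1e-4) * (thickness_um * 1e-4)
    return current_a / area_cm2


if __name__ == "__main__":
    peak = hbm_peak_current(2000.0)
    print(f"peak current: {peak:.2f} A")
    print(f"time constant: {hbm_time_constant_s() * 1e9:.0f} ns")
    # Hypothetical wire cross-sections: the same pulse through a wire half
    # as wide and half as thick sees four times the current density.
    wide = current_density_a_per_cm2(peak, width_um=1.0, thickness_um=0.5)
    narrow = current_density_a_per_cm2(peak, width_um=0.5, thickness_um=0.25)
    print(f"density ratio: {narrow / wide:.1f}x")
```

The discharge current is set by the test voltage and the model's resistor, not by the chip, so it stays at roughly 1.3 A however small the wires get. Only the cross-section shrinks.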

The laws of physics
Shrinkage brings other problems, as well. Metal spacing rules traditionally were driven by lithography, but physics has taken over.

“If the metal spacing is too small, the electrical field may become too intense,” said Ting-Sheng Ku, director of engineering at Nvidia. “The spacing rules increasingly are dominated by the electrical properties of the dielectric, not the lithography. At 28nm the industry saw some of that. At 20nm the spacing can become a real problem if the industry does not address it.”

Nvidia typically runs at the front edge of Moore’s Law, so the types of effects it is witnessing are brand new to the design industry. Ku said many of these had been predicted, so there is little surprise even though solving these new problems remains extremely difficult. But one emerging area of concern is the longevity of parts.

“Transistors get weaker as time goes on,” he noted. “You get impurities trapped in the system and they do physical damage. This is all physics-related, but as features get smaller this gets worse.”

He said more modeling is required for these kinds of effects over time to avoid future problems.

New types of errors in stacks
As if things weren’t hard enough, there are new types of errors being introduced into chips that didn’t exist in the past, particularly in stacked die.

“Just having extra processing steps can generate errors,” said Marc Greenberg, director of marketing for SoC Realization at Cadence. “There also are mechanical issues when you put die together, and you have additional annealing steps from TSVs.”

This is particularly troublesome for memory because DRAMs are somewhat delicate, he said.

And in extreme environments, single-event upsets caused by radioactive particles have always been a risk. They become an even greater risk as features shrink and more gets packed together, in either planar or stacked configurations. At advanced nodes, margins can affect performance and power. But cutting margin means that if anything goes wrong, there isn’t a failover mechanism.

Verification and other challenges
Even without unexpected outside interference, just having confidence that issues have been fully addressed is a big challenge. Full verification coverage of a complex SoC is almost impossible to achieve within a reasonable market window.

“Verification is going non-linear in complexity,” said Aart de Geus, chairman and CEO of Synopsys. “Hardware and software interactions will continue to become more complex because the functionality is the hardware plus the software. In addition to that, we have to keep physics in check.”

The good news is that tool makers are at least aware of the problems.

“We are getting this under control,” said Apache’s Shanmugavel. “But reliability has always been the stepchild of simulation. It has not been treated as a functional problem. That will have to change. We will need to use models to create probabilities and rely more on event-driven simulation—and then we will have to string it all together to see if we’re getting a voltage drop over time or a high peak current.”

And if none of these problems is insurmountable or unexpected at 20nm, there surely will be others that arise as companies begin testing chips at 14nm.