Why Chips Die

Semiconductor devices face many hazards before and after manufacturing that can cause them to fail prematurely.


Semiconductor devices contain hundreds of millions of transistors operating at extreme temperatures and in hostile environments, so it should come as no surprise that many of these devices fail to operate as expected or have a finite lifetime. Some devices never make it out of the lab and many others die in the fab. It is hoped that most devices released into products will survive until they become obsolete, but many things can happen that cause them not to make it that far. Even devices that are operating correctly can be compromised to the point where they no longer provide correct results.

There is a lengthy list of common hazards and what causes them. They typically fall into a handful of categories, which are detailed below.

Death by design
Only 26% of ASICs achieved first silicon success in 2018, according to the Mentor/Wilson functional verification study, down from previous study results. Part of the reason for the low success rate is that new technology nodes add challenges that are not fully understood. Issues that have been around for a while are incorporated into tools and flows, making those known issues less of a threat. Yet for 2018, mixed-signal interfaces, crosstalk, timing and IR drop (all known issues) saw upticks in the rate at which they caused respins.

Fig. 1: Type of ASIC flaws contributing to re-spin. Source: Wilson Research Group and Mentor, A Siemens Business, “2018 Functional Verification Study.”

“Some customer chips fail because their design process has been more ad-hoc,” says Kenneth Chang, product manager at Synopsys. “One customer did block-level power analysis and then integration. They thought they could fix problems at that stage. It was unfixable, and the chip was dead. Chips die because old methods are no longer working in the new aggressive advanced technologies.”

It doesn’t have to be non-functional to fail. “It could fail because it did not meet performance targets,” says Jerry Zhao, product management director in the Digital and Signoff Group at Cadence. “If the silicon came back running 10% slower than anticipated, it may not be competitive in the market.”

Power delivery is becoming a challenge, especially on-chip. “The power delivery network (PDN) is a distributed RLC network that can be partitioned into three parts: on-chip, package and board,” says Lisa Minwell, senior solutions marketing manager for Arm’s Physical Design Group. “On-chip there is a demand for faster clock frequency, lower voltage operation and increasing transistor density. While advanced finFET technology has enabled a continued performance push, the increased power density makes IR drop closure a challenge. Accurately modeling and minimizing voltage margins is critical to balancing energy efficiency and robustness.”

But margining can be pessimistic, thereby limiting competitiveness. Some companies take a risk and move ahead, despite discovering issues. “A big memory company tapes out with known large IR drop issues,” says Chang. “As long as it doesn’t look too bad, they tape out because schedule is more important to them. Customers are learning and, in this case, their chips have not failed. If they don’t fail, they just keep doing what they are doing. As they get to more aggressive nodes, they will need to become more metric driven and perform EMIR analysis.”

An increasing number of issues have become coupled, too. For example, power, IR drop, thermal, timing and electromigration are all linked, yet analysis for most of these is performed separately. “Power noise is a problem,” says Zhao. “The voltage supply is dropping, and at the same time users want more performance. You do not have much driving power from the battery, maybe 850 mV, but you still want 3 GHz performance. Power noise can have a major impact, especially if there are variations across the die, and this [noise] can vary with time and with location. So the same cell at different locations may fail based on voltage drop, and thus timing delay. You have to analyze cells in the context of voltage drop and do static voltage-aware timing analysis. Some paths can be very sensitive to voltage variation.”
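To see why the same cell can pass at one location and fail at another, it helps to put rough numbers on how delay grows as the local supply droops. The sketch below uses the alpha-power-law delay approximation, which is a common textbook model and not something the quoted engineers specify. The threshold voltage, the exponent and the 310ps path delay are assumed values chosen only for illustration; the 850 mV supply and 3 GHz clock come from the quote.

```python
def relative_delay(vdd, vth=0.30, alpha=1.3):
    """Gate delay up to a constant factor, using the alpha-power law.

    vth and alpha are assumed illustrative values, not process data.
    """
    if vdd <= vth:
        raise ValueError("supply must exceed the threshold voltage")
    return vdd / (vdd - vth) ** alpha


def delay_scale(vdd_nominal, droop_fraction, vth=0.30, alpha=1.3):
    """Factor by which delay grows when the local supply droops."""
    vdd_local = vdd_nominal * (1.0 - droop_fraction)
    return relative_delay(vdd_local, vth, alpha) / relative_delay(vdd_nominal, vth, alpha)


def path_meets_timing(nominal_delay_ps, clock_period_ps, vdd_nominal, droop_fraction):
    """True if the path still fits in the clock period at the drooped supply."""
    scaled = nominal_delay_ps * delay_scale(vdd_nominal, droop_fraction)
    return scaled <= clock_period_ps


if __name__ == "__main__":
    period_ps = 1e12 / 3e9  # 3 GHz clock
    for droop in (0.0, 0.05, 0.10):
        ok = path_meets_timing(310.0, period_ps, 0.85, droop)
        print(f"droop {droop:.0%}: scale {delay_scale(0.85, droop):.3f}, meets timing: {ok}")
```

Under these assumptions, a path with comfortable slack at the nominal supply can miss timing once the local voltage sags by about 10%, which is the location-dependent failure the quote describes.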

As the problems become better understood, tools perform better analysis and design methodologies can be used to circumvent the problems. “Complexity leads to greater power density, and this in turn creates localized heating (hot spots) within the chip,” explains Ramsay Allen, vice president of marketing at Moortec. “Increased gate density also leads to greater drops in the supply voltage feeding the circuits. High accuracy temperature sensors and voltage supply monitors throughout the design allow the system to manage and adapt to such conditions, improving device reliability and optimizing performance by providing solutions for thermal management and the detection of supply anomalies. This is particularly relevant in datacenter and AI designs, where the increased performance requirements put a huge amount of strain on the design in terms of temperature and voltage.”

Death by manufacturing
The manufacturing of semiconductor devices involves structures that measure just a few nanometers. To put this into perspective, a strand of human DNA is 2.5nm in diameter, while a human hair comes in at 80,000 to 100,000nm. A single particle of dust can destroy several die on a wafer. If the size of the die gets larger, the chance of random failures increases. For a mature process node, yields of 80% to 90% are possible. For newer nodes, however, yields may be significantly below 50%, although the actual numbers are closely guarded secrets.
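The link between die size and random failures can be illustrated with the simple Poisson yield model, Y = exp(-A × D0), where A is die area and D0 is the density of killer defects. This is a standard first-order approximation, not a model taken from the article, and the defect densities below are hypothetical, since real figures are not published.

```python
import math


def poisson_yield(die_area_cm2, defect_density_per_cm2):
    """Fraction of die expected to have zero killer defects (Poisson model)."""
    if die_area_cm2 < 0 or defect_density_per_cm2 < 0:
        raise ValueError("area and defect density must be non-negative")
    return math.exp(-die_area_cm2 * defect_density_per_cm2)


if __name__ == "__main__":
    # Hypothetical defect densities for a mature and a new process.
    for label, d0 in (("mature", 0.1), ("new", 0.5)):
        for area in (0.5, 1.0, 2.0):
            y = poisson_yield(area, d0)
            print(f"{label} process, {area} cm^2 die: yield {y:.1%}")
```

The model captures only random defects. Systematic and parametric losses, such as the process variation discussed below, come on top of it.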

Fig. 2: Wafer defect patterns. Source: Marvell Semiconductor, ITC 2015.

Even die that are not affected in a catastrophic manner may not be considered operational. The manufacturing steps are not perfect and process variation of just one atom can make a significant difference. While that may not have an impact on some parts of the design, if process variations happen to coincide with a critical timing path, it may put the device out of specification.

“As designs evolve into deep sub-micron technologies with advanced packaging, the variability and its impact on reliability is not very well captured in the existing simulation tools and the design methodology,” explains Karthik Srinivasan, product manager for ESD/Thermal/Reliability at ANSYS. “That causes gaps in the design flow, which cause some failures.”

The design flow increasingly enables variation to be taken into account early in development to minimize its impact, and design techniques such as redundancy can reduce the number of “almost working” chips that need to be discarded. “Almost working” chips are very common for large memory arrays. Binning is another practice often used for processors where the best devices that run at higher frequencies can be sold for higher prices, while those that only work successfully when the frequency is lowered are sold at a discount.
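Binning amounts to sorting tested die by the highest frequency at which each one works. A minimal sketch follows. The bin names and frequency thresholds are hypothetical, and a real flow would also bin on power, voltage and the number of working cores or memory blocks.

```python
from collections import Counter

# Hypothetical bins, ordered from fastest to slowest: (minimum GHz, bin name).
SPEED_BINS = ((3.0, "premium"), (2.6, "standard"), (2.2, "value"))


def assign_bin(fmax_ghz, bins=SPEED_BINS):
    """Return the first (fastest) bin whose threshold the die meets."""
    for threshold_ghz, name in bins:
        if fmax_ghz >= threshold_ghz:
            return name
    return "discard"


def bin_lot(fmax_values, bins=SPEED_BINS):
    """Count how many die from a lot land in each bin."""
    return Counter(assign_bin(f, bins) for f in fmax_values)


if __name__ == "__main__":
    measured_fmax = [3.2, 2.7, 2.65, 2.3, 1.9]  # made-up test results in GHz
    print(bin_lot(measured_fmax))
```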

It is the role of testing to find out which die are fully functional. Those die that are marginal often get put in the discard pile, but some non-functional die do escape and end up in products.

Death by handling
There are multiple ways to kill a chip. Consider that 0.5V applied to the outside of a chip creates an electric field of 500MV/m when applied across a 1nm dielectric. That is enough to make high-tension wires arc. Now consider what happens when you touch the pins of a chip.
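The field across a thin dielectric is simply the voltage divided by the thickness, assuming a uniform field. A quick check with the 0.5V and 1nm values, using a helper name chosen here for illustration:

```python
def field_v_per_m(voltage_v, thickness_m):
    """Uniform electric field across a dielectric, in volts per meter."""
    if thickness_m <= 0:
        raise ValueError("thickness must be positive")
    return voltage_v / thickness_m


if __name__ == "__main__":
    field = field_v_per_m(0.5, 1e-9)  # 0.5 V across 1 nm
    print(f"{field / 1e6:.0f} MV/m")
```

Dividing 0.5V by 1nm gives 5 × 10^8 V/m, or 500MV/m.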

“Typically, it is a much higher voltage, and based on the way in which the pins are touched, you have different models, such as a human body model or charged device model (CDM),” explains Zhao. “Those models define how the current is supplied to the pin. This is a waveform that is dynamic over time.”
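As a rough picture of such a waveform, the human body model is commonly represented as a 100pF capacitor discharging through a 1.5kΩ resistor into the pin. The sketch below treats that as an ideal RC discharge into a short circuit and ignores the finite rise time of a real tester, so it is an approximation and not a compliance waveform. The 2kV precharge in the example is an assumed test level.

```python
import math

HBM_CAPACITANCE_F = 100e-12   # 100 pF body capacitance
HBM_RESISTANCE_OHM = 1500.0   # 1.5 kOhm series resistance


def hbm_current(precharge_v, t_s):
    """Discharge current in amperes at time t_s for an ideal RC human body model.

    Simplification: the pin is treated as a short and the rise time is zero.
    """
    if t_s < 0:
        return 0.0
    tau = HBM_RESISTANCE_OHM * HBM_CAPACITANCE_F  # about 150 ns
    return (precharge_v / HBM_RESISTANCE_OHM) * math.exp(-t_s / tau)


if __name__ == "__main__":
    for t_ns in (0, 50, 150, 500):
        amps = hbm_current(2000.0, t_ns * 1e-9)
        print(f"t = {t_ns:3d} ns: {amps:.3f} A")
```

Even in this idealized form, a 2kV event pushes more than an ampere into the pin for a few hundred nanoseconds, which is what the on-chip protection has to absorb.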

Normally, the chip will contain electrostatic discharge (ESD) protection. “For a single die on a package, they target the standards like 2kV,” points out ANSYS’ Srinivasan. “Multichip solutions, such as HBM go for slightly lower standards. One reason to go to 2.5D or 3D IC is for performance, and ESD is a roadblock for performance. You try to minimize the ESD or even get rid of it on these Wide I/O interfaces or any kind of multi-die interface channels, which means that you cannot really test each die for the same standards that you target for single die. They have to go through a more specialized way of testing because they will have minimal ESD protection, or possibly no ESD protection.”

Even during operation, ESD events can cause problems. “ESD can cause many types of soft-errors in portable electronic products,” says Arm’s Minwell. “During an ESD event, noise can be induced on the power distribution network (PDN) either because of the sensitivity of some of the ICs (oscillator IC, CPU and other ICs) or due to field coupling to the PDN traces.”

Death by association
“Soft errors can happen in many ways, and if those are systemic to the design, it can make it appear as if the chip is not working. 3D IC is increasing the need for an electromagnetic-aware design methodology,” says Magdy Abadir, vice president of marketing at Helic. “This is because of the higher power densities created and the increase in the number of layers in the stack, which create higher risk of the creation of antennas that amplify the magnetic fields created throughout the design.”

Weak power supplies also can be problematic. “The chip’s function depends upon the transistor making the transition,” says Zhao. “That is dependent on the supply voltage. If it is operational at 1V, it can probably go down another 10% or 20% and still be functional. But the timing will be different, and thus the maximum clock frequency may need to be lower.”

As voltage is lowered, the circuit becomes more susceptible to noise. “Electromagnetic interference (EMI) is the noise coming out of a chip to the environment,” says Norman Chang, chief technologist for ANSYS’ Semiconductor Business Unit. “The source of the noise comes from the active circuitry, which will generate current on the power ground wires and on the signal wires. The power/ground wires will go through the package to the PC board, and if it sees an antenna structure on the package or board, will cause through-the-air radiation, and then through the antenna structure will radiate as interference to the environment.”

But what goes out also goes in. “Electromagnetic susceptibility (EMS) is a new problem that people have to worry about,” Chang notes. “A power injection test is the injection of 1W from 150kHz all the way to 1GHz. At each frequency you will inject energy of 1W into the system. If you do not have enough protection you will destroy the circuits along the path into the chip. The goal is not to destroy the chip, but to test if this noise will affect the circuit. Or the voltage at the pin may be too high, and if the voltage is too high you get electrical overstress.”

Death by operation
At this point, a chip has reached the field and has been deemed to be operational. “Reliability is a big concern,” says Fionn Sheerin, principal product marketing engineer for Microchip’s Analog Power and Interface Division. “In a lot of cases, poor thermal design does not result in instantaneous catastrophic failure or even mediocre products. It is products with short device lifetimes. Watching for hotspots in the layout or best layout practice and good floor planning can make a difference. It is also where your verification and reliability testing really matter. This is also an issue for functional safety with automotive applications.”

Joe Davis, product marketing director for Mentor, a Siemens Business, agrees. “Heat causes more problems than just that your phone gets hot in your pocket. It causes degradation in the transistors and in the connections between them. This can affect both performance and reliability.”

Heat gets generated from two sources. “First in the routing layers,” says Zhao. “That is the heat related to the current flow in the wires. Analog circuits have larger currents than digital. So analog designers have to worry about if temperatures go too high such that it will melt the wires. The second source is the transistors. When we went to finFET, one of the new phenomena is self-heating. Heat follows the weak resistance path, which is vertically escaping from the fins of the transistor. This adds to the heat in the wires.”

When high currents and heat come together, electromigration can slowly damage the wires. Similarly, physical effects such as negative bias temperature instability (NBTI), where you have large charges, can stress devices and if held for long enough can lead to permanent damage.
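The combined effect of current and heat on wire lifetime is often estimated with Black's equation, MTTF = A × J^(-n) × exp(Ea / kT), where J is current density and T is absolute temperature. The article does not cite this model, so the sketch below is an illustration under assumed parameters: a current exponent of 2 and an activation energy of 0.9 eV are placeholder values, and real numbers depend on the metal and the process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5


def em_lifetime_ratio(j, temp_k, j_ref, temp_ref_k, n=2.0, ea_ev=0.9):
    """Electromigration lifetime relative to a reference condition (Black's equation).

    The prefactor A cancels in the ratio. n and ea_ev are assumed values.
    Current densities may be in any unit as long as both use the same one.
    """
    if min(j, j_ref, temp_k, temp_ref_k) <= 0:
        raise ValueError("current densities and temperatures must be positive")
    current_term = (j_ref / j) ** n
    thermal_term = math.exp(
        (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / temp_k - 1.0 / temp_ref_k)
    )
    return current_term * thermal_term


if __name__ == "__main__":
    ref_temp_k = 378.15  # 105 C reference junction temperature (assumed)
    for extra_c in (0, 10, 20):
        ratio = em_lifetime_ratio(1.0, ref_temp_k + extra_c, 1.0, ref_temp_k)
        print(f"+{extra_c} C at the same current: lifetime x {ratio:.2f}")
    print(f"2x current, same temperature: lifetime x {em_lifetime_ratio(2.0, ref_temp_k, 1.0, ref_temp_k):.2f}")
```

Because temperature sits in the exponent, a local hot spot can shorten wire lifetime far more than the same percentage increase in current.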

This article contains just a few of the challenges that chips face in getting from the drawing board to products and then surviving for the lifetime of the product.

Chips operate in a hostile environment, and the semiconductor industry has learned how to deal with these challenges. But as fabrication dimensions get smaller or new packaging technologies are employed, new problems emerge. At times, these new effects will cause devices to fail. But historically, the industry has learned quickly to either circumvent new problems or find ways to minimize them.





Barbara Kalkis says:

Come January I will have been in the semiconductor industry 39 years and have spent most of that time with an ASIC focus, including the ASIC engineering classes I took. Based on that experience, here is my view:

Brian’s article espouses the view of EDA and packaging & assembly companies. These folks are trying to fix the problems described by selling software. Will it solve the problems completely? I have doubts, for the simple reason that it adds another layer to the design communications process.

From its inception to the late 1980s, we were a vertical industry. ASICs and other chips were designed by the same company that manufactured, tested and (sometimes) packaged the chips. There was a very tight interface during the design process between the ASIC company and the customer. About 1985 (give or take a few years), when I was at AMI Semiconductor, an established ASIC company, we handled the design, manufacturing, packaging and test for the first cochlear implants. They were and still are a medical success.

Skip forward to the early 1990s. The foundry model had taken off, but companies like VLSI Technology still worked closely with companies in the same ASIC design-through-manufacturing/packaging/test model. However, the fabless model also mushroomed and the integrated device manufacturers (IDMs) went through vertical disintegration and became a group of silo specialties.

It’s true we see many first-time-manufacturing failures because of ‘basic’ design issues. But I think the main reason is that we have become an industry of specialists with disparate voices and views. We lack cohesiveness between teams at each stage of development because each team represents a different company and a different philosophy. It’s like going to 10 medical specialists about a stomachache and getting 10 different opinions, instead of just going to a general practitioner who takes a holistic view.

By keeping the tasks under ‘one roof’, the vertical-integration model utilized managers who oversaw the entire ASIC development process and could solve problems internally. Is it any wonder why the cost of ASIC development is so high, when there are multiple teams from multiple companies trying to stay in step with each other?

Every company has its own design philosophy. Ditto for software developers, foundries, and packaging, assembly, and test service providers. Designers working without a tight link to their external teams create the recipe for problems and, when failures occur, a search for cause amongst the participating companies.

Military programs followed the old ‘mil-standard’. The semiconductor industry would do well to have more standards. SEMI has committees for this, and I believe they should be supported by the industry.

I would also propose that moving toward vertical integration is the way to go for ASICs. Apple, Google and others now have internal ASIC designers to handle chip development tasks. In doing so, they have begun the drift toward verticalization to control quality as well as to protect their IP.

The EDA and packaging industries may be able to sell more software, but I question whether the specialist-model can significantly improve first-time-success rate of increasingly complex designs.

Two thoughts come to mind: (1) Too many cooks spoil the broth. (2) The entire idea of an ASIC is as it always has been: one design by one company for one customer. Period.

Gavin Rider says:

I have been involved in the semiconductor industry’s Standardisation activity (through SEMI) for over twenty years. The problem I find tends to be that many semiconductor companies don’t bother to adhere to the Standards! SEMI Standards tend to be used by some chipmakers as clubs with which to beat up their suppliers – but then if a problem subsequently is found to be due to the chipmaker itself not following the Standards properly, they generally won’t bother to correct anything.

It would be nice if the semiconductor industry actually adopted the Standards that have been developed for them!

rcgorton says:

A number of years ago, I went for a job interview at Unisys (Sperry). If I recall correctly, their processors had last been produced a number of years prior (ASICs?), and they were ‘banked’ in a vault. One of the interesting tidbits from the interview was that there was a non-trivial failure rate of the remaining chips simply due to radiation. That is, these processors passed qual/test/burn-in prior to being ‘banked’, but would fail upon being installed in a machine.
