Quality Issues Widen

Rising complexity, diverging market needs and time-to-market pressures are forcing companies to rethink how they deal with defects.


As the amount of semiconductor content in cars, medical and industrial applications increases, so does the concern about how long these devices will function properly—and what exactly that means.

Quality is frequently a fuzzy concept. In mobile phones, problems have ranged from bad antenna placement, which resulted in batteries draining too quickly, to features that take too long to load. These are considered inconveniences, and not all that surprising for first-generation devices. If a phone doesn’t work after five years, that isn’t surprising.

It’s far less ambiguous when a battery overheats. The Galaxy Note 7 has received the lion’s share of press due to the size of its recall, but this is a recurring nightmare for many vendors. HP recalled 101,000 laptops in January due to battery problems. Even Boeing experienced fires caused by batteries overheating when its 787 Dreamliner was introduced. So did multiple vendors’ hoverboards last year. In fact, the problems are so widespread that the University of Texas engineering school this year added a course entitled “Introduction to Lithium-ion Battery Thermal Safety and Fire Hazards.”

Fig. 1: Overheating lithium-ion battery. Source: University of Texas

Many other issues go unreported, however, because they are either fixed in software or they cannot be replicated.

“If you look at some of the big failures, it can be three or four things together that cause the problem,” said Kiki Ohayon, vice president of business development at Optimal+. “It can be an issue with design, with end usage, or with manufacturing. Even if you have a design defect, that implies the marginality to fail is higher, but it does not mean all of the devices will fail.”

That makes it much harder to figure out when a part is defective because it frequently isn’t defective all the time.

“It’s no longer just pass/fail,” said Ohayon. “If it’s okay for one type of usage, it may not be okay for another.”

Nor is it always confined to the hardware or the software. Sometimes it’s a unique combination of factors that causes a problem.

“The challenge when you look at quality is that it’s not just the quality of the chip,” said Bill Neifert, director of models technology at ARM. “It’s the overall system, and that creates a segmentation problem. So there are things you care about in each design, but for automotive, industrial and enterprise computing those are all different. Some involve different hardware. Others involve different applications of the same hardware. And if you’re dealing with safety-related issues that require documentation, a solution in that space is about software processes as well as the underlying hardware. Quality is only as good as you can prove it to be.”

At advanced nodes, the proof is confined to test chips and simulations, which are not a guarantee that something will work as planned. Reliability is a function of quality over time, and the only way to guarantee a device will work is by collecting and analyzing data during its lifespan.

“Companies increasingly are looking at whether a device is silicon-proven,” said Ranjit Adhikary, vice president of marketing at ClioSoft. “It wasn’t that way two or three years ago. Especially at the lowest process nodes, they’re looking for IP that is proven in silicon or product-proven. We’re seeing a lot more companies doing test chips than in the past. With digital IP, the process, frequency and qualification are more important. With RF and analog, they define the spec and they want to know about any deviation that can affect performance.”

Segmenting quality
Quality can vary greatly by application. Even definitions of quality can vary from one application to the next.

“You’ve got an array of solutions, and each one has a particular purpose and a particular tolerance,” said Christopher Lawless, director of external customer acceleration at Intel. “You can validate until the cows come home, but you would miss the window and never make any money if you couldn’t get it out the door. So there is a tradeoff that has to happen. Before you could have been cocooned between the walls. Now we’re reaching out to customers to understand a particular usage and trying to overachieve. It has to be good enough and appropriate for every application out there. But those vary depending upon the different tolerances and market requirements.”

Tolerances vary by what’s important to a particular application in a particular market.

“A lot of this is about risk tolerance,” said David Lacey, design and verification technologist for servers and advanced development at Hewlett Packard Enterprise. “We might take additional risk on pre-silicon validation, so hopefully we can get things into the lab and maybe have the chance to get a product out earlier. We don’t ever want to look at a product as ‘good enough,’ but it also comes down to the different features. If we have a node controller chip where the coherency doesn’t work, we cannot ship that. But there may be logging features for debug to help us understand what has gone wrong in a huge system. If that doesn’t work exactly like we wanted it to, we may be okay with sending that out as long as we understand how that’s going to impact the ability to debug it.”

In the automotive market, industry standards such as ISO 26262 set up a chain of events that need to be documented. “There are certification requirements and steps that you need to take,” said Neifert. “To comply it has to be certified in a suite of requirements.”

These kinds of requirements have been in place in the mil/aero and medical markets for years. But car companies have little experience with semiconductor technology for safety-critical applications.

“Five years ago, ISO 26262 was not a semiconductor company’s problem,” said George Zafiropoulos, vice president for solutions marketing at National Instruments. “It’s only recently that it got pushed down to the semiconductor guys. In this market, everyone has to deal with traceability for functional safety. That will move to the consumer market, and ultimately it could be a way to improve quality everywhere. With safety-critical systems, devices need to fail gracefully.”

Safety-critical markets also are beginning to include security in the mix of requirements because that can affect safety, as well.

“People are trying to address safety through standards, although it’s interesting that this group does worse in achieving first-silicon success and quality,” said Harry Foster, chief verification scientist at Mentor Graphics. “Security is a new area. People are starting to recognize that it is not easy to tackle. It’s not just about registers not being hacked. It’s also about how you know someone has not injected extra circuitry in a design.”

Fig. 2: First-silicon success is significantly lower for safety-critical designs. Source: Mentor Graphics/Wilson Research Group

What can go wrong

One of the biggest challenges with quality is zeroing in on exactly what went wrong. Battery fires in mobile phones and laptops, for example, only occurred after a certain sequence of steps. Otherwise all of those devices would have overheated, and the problems would have been identified before shipping.

“Companies need to modify data in simulation based upon how something is really performing,” said Optimal Plus’ Ohayon. “And you need to address that with all of the key components. You’re collecting the device DNA for every chip and every board. That helps you in other ways, too. You create an index that goes beyond pass/fail to smart pairing. If you pair this device with this device, what is the best combination?”
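The pairing idea in that quote can be sketched in a few lines. Everything here is illustrative: the field names (`idd_ma`, `regulator_limit_ma`, and so on) and the scoring rule are assumptions for the sake of example, not Optimal+'s actual method.

```python
# Illustrative sketch of "smart pairing" from per-unit test data ("device DNA").
# Field names and the scoring rule are hypothetical.

def pairing_score(die, board):
    """Return the worst-case margin of a die/board pair; higher is safer."""
    current_margin = board["regulator_limit_ma"] - die["idd_ma"]
    timing_margin = board["clock_budget_ns"] - die["path_delay_ns"]
    return min(current_margin, timing_margin)

def smart_pair(dies, boards):
    """Greedily give each die the remaining board it matches best."""
    pairs, available = [], list(boards)
    for die in dies:
        best = max(available, key=lambda b: pairing_score(die, b))
        available.remove(best)
        pairs.append((die["id"], best["id"]))
    return pairs
```

Rather than treating every passing die as interchangeable, the score turns parametric test results into an index, which is the "beyond pass/fail" idea in the quote.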

That requires a lot of test data, which is used to establish correlations between devices that work consistently and those that fail only intermittently. But as complexity rises, so does the amount of data, which makes it harder to find the problems.

“As you collect more data, everything gets more complex,” said Radek Nawrot, software product manager at Aldec. “You’ve got assertion coverage, toggle coverage, code coverage and functional coverage. But at the same time, everyone wants more speed, so as complexity grows it’s very important to provide solutions in a different way. One improvement is to cut some parts of the testing that are not necessary.”

Done right, this approach actually can improve reliability, because it frees up time and resources so companies can focus on the tests that really matter for a specific application. Like everything else, test is becoming more complicated as devices grow more complex. With more to verify, validate and test, and no change in the time allotted for testing, ensuring that everything in a design is properly covered becomes more difficult. Drilling down into how a design will be used, and what will be required of it, can yield significant gains in time to market.

“Maybe you can reduce the number of regression tests,” said Nawrot. “You can decide to look at a certain number of test scenarios or time of test, and you can prioritize according to market. So for commercial and aerospace markets, functionality is given high priority. But you also can take the same tools and the same flow and change the order for some things.”
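One way to read the prioritization Nawrot describes is as a budgeted ranking problem: weight each test category by what matters in the target market, then keep the highest-value tests that fit the time budget. The weights, categories and value-per-minute heuristic below are assumptions for illustration, not a real flow.

```python
# Hypothetical sketch: rank regression tests by market-weighted value per
# minute of runtime, then keep only enough tests to fit the time budget.

MARKET_WEIGHTS = {
    "automotive": {"safety": 3.0, "functional": 2.0, "performance": 1.0},
    "consumer":   {"safety": 1.0, "functional": 2.0, "performance": 2.0},
}

def select_tests(tests, market, budget_minutes):
    weights = MARKET_WEIGHTS[market]
    ranked = sorted(tests,
                    key=lambda t: weights[t["category"]] / t["minutes"],
                    reverse=True)
    chosen, used = [], 0
    for t in ranked:
        if used + t["minutes"] <= budget_minutes:
            chosen.append(t["name"])
            used += t["minutes"]
    return chosen
```

Changing the market changes the weights, which reorders the same suite — the "same tools and the same flow" with a different priority order, as the quote puts it.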

Fig. 3: Typical reasons for re-spins. Source: Mentor Graphics/Wilson Research Group

At the most advanced nodes, physics plays a growing role in defining quality.

“Die size is increasing while feature structure size and voltage levels are decreasing,” said Raik Brinkmann, CEO of OneSpin Solutions. “So you need less energy to create an issue. That requires more error correction and TMR (triple modular redundancy). But it also makes it harder for design and verification.”
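The TMR technique Brinkmann mentions is straightforward to illustrate: the same value is computed three times, and a bitwise majority vote masks a corrupted copy. A minimal sketch:

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant copies of the same value.

    Each output bit agrees with at least two of the three inputs, so a
    single-event upset in any one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)
```

If one of three copies of 0b1010 is upset to 0b0010, the vote still returns 0b1010. The price is roughly triple the area and power, which is why TMR is reserved for the most critical logic.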

Brinkmann noted that machine learning relieves the situation somewhat, because noise is part of the algorithm and robustness is built in, so certain types of faults register simply as noise and cause no harm.

Cramming more features into a device exacerbates these problems.

“Every generation of cell phone technology is more complex than the last,” said NI’s Zafiropoulos. “The number of radios that are in simultaneous use, the number of state spaces in the digital part, and the analog circuitry complexity all increase the potential for something going wrong. There are soft failures in such things as power management. But the really hard problems, particularly with a 5G phone, involve more potential for simultaneous transmitters. That makes test more complicated. You need to test combinations of things.”

In the past, a cell phone modem operated in a single channel. Now while a phone is communicating with a base station, it might be connected to another device such as a car radio or wireless headphones over Bluetooth. And in the future, that data is expected to be more fragmented, relayed over different bands and then re-aggregated at the end.

Shift left, extend right
Verification plays a critical role in all of this, and the amount of verification required for complex designs has been growing steadily. Two years ago the buzzword was “shift left.” That effort now is being extended to the right, as well, beyond manufacturing. In consumer markets, for example, it’s not unusual to fix unexpected design flaws with a software patch after the device is already in use.

“Good enough is something that can be made workable with software workarounds,” said Frank Schirrmeister, group director for product marketing of the System Development Suite at Cadence. “With software and a working set of errata, you can make a chip work. In the verification plan, you have a confidence level. Have you covered all the items you need to get the product out the door? That includes things like the power needs to work. It needs to boot the software. And it needs to run the basic tests.”
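Schirrmeister's "confidence level" can be pictured as a simple sign-off gate: a few items are hard requirements, and the rest of the verification plan rolls up into a coverage number. The gate names and threshold below are assumptions for illustration, not a description of any real flow.

```python
# Hypothetical tape-out readiness check: hard gates plus a coverage threshold.
HARD_GATES = {"power_on", "boots_software", "basic_tests"}

def ready_to_ship(results, coverage, threshold=0.95):
    """results: item -> bool; coverage: fraction of verification plan closed."""
    if not all(results.get(item, False) for item in HARD_GATES):
        return False
    return coverage >= threshold
```

Items outside the hard gates can slip — that is where a known-errata list and software workarounds come in — but the gates themselves are non-negotiable.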

That doesn’t mean there is less effort devoted to verification, of course. Sales of expensive emulation and FPGA prototyping hardware have been growing steadily, and not just for application processors for mobile devices.

“We’re seeing new verticals with emulators,” said Jean-Marie Brunet, director of marketing for Mentor Graphics’ Emulation Division. “We’re also seeing an increase in co-location. Today a lot of very large semiconductor companies are using emulators in co-location or co-hosting situations.”

The speed of the hardware-accelerated tools is increasing, as well, meaning more can be done more quickly. But that doesn’t necessarily mean faster time to market. In many cases it’s required because there is more to verify, validate and debug.

The definition of quality is changing. It is becoming more granular, more time-dependent, and more market-focused, forcing companies to take a harder look at what they verify and test and how much time they need to spend on what’s important to a particular application.

Even big tech companies don’t always get this right, but given the high price of failures and recalls, they are being forced to do a much more extensive review of what can possibly go wrong.
