Reliability Definition Is Changing

Complexity and vulnerabilities in systems are raising questions about what constitutes a fully functional design.

Since the invention of the integrated circuit, reliability has been defined by how long a chip continues to work. It either turned on and did what it was designed to do, or it didn’t. But that definition is no longer so black-and-white. Parts of an SoC, or even an IP or memory block, can continue to function while other parts do not. Some may work intermittently, or at lower speeds. Others may increase in temperature to the point where they either shut down or slow down.

This raises some interesting issues across a wide swath of the electronics industry, ranging from legal liability, to design goals for power and performance, to what differentiates any functioning system—especially critical systems—from those that are non-functional or marginally functional. Just as systems are getting more complex, so are the metrics surrounding them.

“This is no longer just about aging of a system,” said Frank Schirrmeister, group director for product marketing for the System Development Suite at Cadence. “We’re starting to see more generalized questions like whether the system is meeting the expectations set at inception. If you have thermal issues, the system may be doing what it’s supposed to do. There is complex logic to turn it on and off and you’ve done performance validation to make sure it all works. But if you’re running a heavy compute load for six minutes, with thermal effects it may slow down. So the processor is actually going down in performance over time. How long can you run that before the performance is not what you expect?”
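The slowdown Schirrmeister describes can be sketched as a toy feedback loop: sustained load heats the die, and a governor steps the clock down when temperature crosses a limit. All constants below are illustrative assumptions, not figures from any real SoC or vendor's thermal governor.

```python
# Toy model of thermal throttling: a core heats up under sustained load and a
# simple governor steps the clock down whenever die temperature crosses a limit.
# Every constant here is an illustrative assumption, not real silicon data.

def run_workload(minutes, freq_ghz=2.0, temp_c=40.0, limit_c=85.0):
    """Return (final_freq_ghz, final_temp_c) after `minutes` of heavy compute."""
    for _ in range(minutes):
        temp_c += 12.0 * freq_ghz        # heating scales with clock speed
        temp_c -= 0.1 * (temp_c - 40.0)  # passive cooling toward ambient
        if temp_c > limit_c and freq_ghz > 0.5:
            freq_ghz -= 0.25             # governor throttles the clock
    return freq_ghz, temp_c

freq_after_1_min, _ = run_workload(1)
freq_after_6_min, _ = run_workload(6)
```

Running the model for one minute leaves the clock untouched; running it for six shows the gradual performance loss under a heavy compute load that the quote describes.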

The bigger picture
Reliability measurements don’t stop with a single device, either. Increasingly they involve one device connected to one or more other devices, and reliability may depend as much on those other devices as on the design of the original one. Consider smart cars, for example, which can communicate with other smart cars to prevent collisions around blind curves. But what happens if one of the other cars fails to communicate and alert the oncoming car? That can happen even if both communication systems are working but one car is newer than the other and uses a different communications protocol.

“Embedded devices are changing in the way they are put together,” said Serge Leef, general manager for the system-level engineering division at Mentor Graphics. “In the past, it was all in one. You had hardware, storage, and software, which includes the real-time operating system or other operating system, middleware and the application. But it’s becoming clear that embedded devices of today and tomorrow will be different. EDA has been focused on just the box, but that’s not plausible anymore. You have to solve the big picture.”

That big picture extends well beyond the device being designed and tested, even for something as simple as a smart garage door opener, which can be controlled by a smart phone over the Internet. “The device now has three elements—the edge node, which is under mechanical or local control, the central node, which interacts with the edge node, and the application that runs on the mobile client and interacts with the device over a hub.”

A problem in any one of those areas can affect reliability in the other two. And when problems do occur, it can be difficult to determine where the fault is. It might be the hardware, it might be the software that controls the hardware, or it might be in the communication infrastructure that is out of the control of everyone involved in creating the device. And it may be temporary or permanent.

Another factor that can affect performance—and therefore device reliability, as in the case of two cars communicating through the cloud—is data access. Steven Woo, vice president of solutions technology at Rambus, cites a report by IDC—the most conservative one he could find—which predicts that digital data will grow 44 times between 2011 and 2021.
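As a sanity check on that projection, a 44X increase over the decade from 2011 to 2021 works out to roughly 46% compound annual growth:

```python
# Compound annual growth rate implied by an N-fold increase over a span of years.

def implied_cagr(growth_factor, years):
    """Annual growth rate (as a fraction) implied by `growth_factor` over `years`."""
    return growth_factor ** (1.0 / years) - 1.0

cagr = implied_cagr(44, 10)  # IDC's projected 44x between 2011 and 2021
```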

“You have to search through more and more data, and memory has to improve,” Woo said. “The link performance is the limiter in a data center. Compute and I/O need to improve, as well.”

This may not affect reliability of the data center, per se. It’s hard to call a performance reduction caused by a jump in data a reliability issue, unless there are response-time guarantees in place for customers of that data center. But it certainly can make it harder for devices that depend on fast response as part of the design to meet their performance goals. Consider, again, two cars coming around a blind curve at high speed.
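A bit of arithmetic shows why response time is a reliability concern in that scenario. The numbers below are hypothetical assumptions for illustration, not values from any V2V standard:

```python
# Rough latency budget for a blind-curve collision alert. All numbers are
# hypothetical assumptions for illustration, not from any V2V specification.

def max_message_latency(closing_speed_mps, warning_distance_m,
                        driver_reaction_s, braking_time_s):
    """Seconds the network has to deliver the alert and still leave time to stop."""
    time_to_impact_s = warning_distance_m / closing_speed_mps
    return time_to_impact_s - driver_reaction_s - braking_time_s

# Two cars closing at a combined 40 m/s, with the alert raised 200 m out:
budget_s = max_message_latency(40.0, 200.0,
                               driver_reaction_s=1.5, braking_time_s=2.5)
# Roughly one second remains for the entire communication path.
```

If a spike in data-center load pushes end-to-end response past that budget, the device has failed its purpose even though every component is nominally working.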

Where tools fail…sort of
Much of this fits into an as-yet undefined gray area. What’s not so obvious to the outside world, and sometimes even in the design space, is what happens when tools for developing chips evolve into their own gray area.

“Less well understood is the result of incomplete functional verification,” said Bernard Murphy, chief technology officer at Atrenta. “Clock synchronization is never a perfect solution, which can decrease the mean time between failures. It is rarely mapped out in terms of consequences for the whole SoC. There are timing exceptions. And we never know just how incomplete functional verification is. But it’s almost impossible to get 99% coverage. We are starting to see an interest in quantifying the incompleteness for large designs, though.”

Murphy noted there are also issues when devices are used across markets. A big application processor may not need to last more than a few years, but the same technology needs to last 10 or 20 years if it is included in an automotive infotainment system, and it has to work in much harsher environments.

At the leading edge of design, things get even more confusing. Consider IP development at 10nm, for example.

“Due to aggressive time to market schedules, customers are expecting guarantees that these IP blocks work on the first instantiation,” said Navraj Nandra, senior director of marketing for Synopsys’ DesignWare analog and mixed-signal IP. “This necessitates correlation between SPICE simulations and silicon characterization data of the fundamental IP building blocks such as transistors, capacitors and resistors of various aspect ratios. A statistically meaningful number of devices must be chosen to ensure the simulation to silicon correlation, with different layouts and density dependencies providing data for resistor/transistor matching and metal mismatch due to double patterning and triple patterning of 10nm.”
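The correlation Nandra describes can be quantified with a standard statistic: measure a population of test structures on silicon, compare against the SPICE predictions for the same structures, and compute a correlation coefficient. The sketch below uses synthetic data as a stand-in for real wafer measurements:

```python
# Sketch of a simulation-to-silicon correlation check: compare SPICE-predicted
# values against silicon measurements for a population of test structures.
# The data here is synthetic; a real flow would use measured wafer data.
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
spice = [100.0 + 5.0 * i for i in range(50)]            # predicted resistances (ohms)
silicon = [r * random.gauss(1.0, 0.02) for r in spice]  # measured, ~2% mismatch

correlation = pearson(spice, silicon)
```

A statistically meaningful sample size matters here because a handful of devices can show high correlation by chance, while mismatch effects such as double patterning only show up across many layouts.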

It’s possible to get some insights into all of this. Tools such as ring oscillators and operational amplifiers can provide an early indication of gate delay and analog performance of the 10nm process. And current best practices include overstressing devices to evaluate the impact on reliability due to negative and positive bias temperature instability, hot carrier injection and electrostatic discharge. But Nandra noted there also are a couple of persistent technology challenges—ensuring working IP silicon developed with early versions of foundry design kits, which are almost constantly in flux, and meeting PPA requirements. Both are part of the reliability equation.
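The ring-oscillator measurement works because an N-stage ring (N odd) oscillates at f = 1 / (2 · N · t_pd), so a measured frequency translates directly into average gate delay. The stage count and frequency below are illustrative, not real 10nm data:

```python
# Extracting average gate delay from a ring-oscillator measurement.
# An n-stage ring (n odd) oscillates at f = 1 / (2 * n * t_pd),
# so t_pd = period / (2 * n). Numbers below are illustrative, not 10nm data.

def gate_delay_ps(n_stages, freq_mhz):
    """Average per-stage gate delay in picoseconds from a measured ring frequency."""
    period_s = 1.0 / (freq_mhz * 1e6)
    return period_s / (2 * n_stages) * 1e12

# A hypothetical 101-stage ring measured at ~495 MHz implies ~10 ps per gate.
delay_ps = gate_delay_ps(n_stages=101, freq_mhz=495.0)
```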

Security
And finally, even if all of the technology works as planned, there are gaping security holes in designs at every process node and in almost every IoT design—even if the weakness is only in what a well-designed piece of hardware or software is connected to. It’s obvious that a compromised device is no longer reliable. But a device that can be compromised isn’t reliable, either.

There is a frenzy of activity in the security world these days, ranging from breaches at banks and retailers to government countermeasures against cybercrime and cyberwarfare. It has steadily been spilling over into the M&A world, as well, the most recent example being ARM’s purchase of an Internet of Things software security company for its embedded microcontrollers.

All of the processor companies have been active in securing their cores. ARM already had its TrustZone technology for compartmentalizing memory and processes. Intel has taken a similar tack with its processor architecture, restricting access to the core architecture. And both Synopsys and Imagination Technologies, which make the other popular processors, have taken steps to seal off the processors from intrusions. But the real challenge is that the IoT opens communications through many channels—from I/O for connectivity to multiple networks to multiple layers of software to on- and off-chip memory and storage.

Even antennas are subject to security breaches, and there are an increasing number of them inside any connected electronics. “There are three things to consider here,” said Aveek Sarkar, vice president of product engineering and support at Ansys. “One is interference. The second is coupling. The third is susceptibility. All of this can do damage on the chip.”

Conclusion
While engineers can only control their own piece of the ecosystem, reliability may be harder to define within those bounds in the future. Each piece needs to be engineered as well as it possibly can be, but that doesn’t mean it will be reliable in the real world, because there are so many other factors that can affect it.

“You can have a part that is 99.9999999% reliable, but the failure may come from something that is 99.9% reliable,” said Drew Wingard, chief technology officer at Sonics. “The question is which ones are most likely to affect you. Even with ISO 26262 (Road Vehicles – Functional Safety), there is very little focus on repair or extension of life. There is a report before something bad happens, but there is nothing about recovering from errors or correcting them.”
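Wingard’s point can be put in numbers. For independent components that must all work—a series system—overall reliability is the product of the component reliabilities, so the weakest part dominates:

```python
# Series-system reliability: every component must work, so the individual
# reliabilities multiply, and the least reliable part dominates the result.

def series_reliability(parts):
    """Overall reliability of independent components in series."""
    r = 1.0
    for p in parts:
        r *= p
    return r

# A nine-nines part chained with a three-nines part:
system = series_reliability([0.999999999, 0.999])
# The combined failure probability (~1e-3) comes almost entirely from the
# 99.9% component; the nine-nines part contributes only ~1e-9 to it.
```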

Wingard said reliability needs to be looked at in terms of how something could fail, the quality of suppliers, the reliability of transistors and wires, the oxides used and how they break down, logical layers, soft errors, and what happens with software bugs.

But even with all of that, it still may not be enough to keep things up and running. In the new world order, where everything is connected to something else, reliability increasingly may be seen as a relative term, one that depends on the setting, what it’s connected to, and who’s using it. And depending on use models, it may vary greatly from one person to the next, and one company to the next.