Defining Reliability In Low-Power Designs

There is little agreement about what constitutes reliability, and perhaps even less understanding about what makes designs unreliable.


By Ann Steffora Mutschler
Having a clear understanding of what reliability means for a particular low-power application can make a significant difference when it comes to communicating with engineering team members and customers. Is reliability simply a question of how long a device can run without errors? And what happens to reliability when power modeling, verification and other design techniques are utilized?

As Massimo Sivilotti, chief scientist at Tanner EDA pointed out, “These questions are complex, and there is no universally accepted answer to any of them.”

In general though, low-power designs involve both architectural and circuit design components and issues such as sub-threshold leakage currents, upsets due to substrate- and power-supply-coupled noise. Device parameter variations due to statistical process factors for deep-submicron devices become more acute as power levels fall. As such, state-of-the-art device models, up-to-date model parameters from foundries, and data-driven noise calculations become essential.

From Intel Corp.’s perspective reliability is more an attribute of the nature (or use model) of an application – whether it is low power or not. “For example, a low power smart phone application would define ‘reliability,’ both from device and user perspective, very differently than an equally low-power battery-powered medical device that administers medicines to critically ill patients,” said Pranav Mehta, chief technologist for Intel’s Embedded Communications Group. “Having said that, low-power designs do offer special challenges to designers. Balancing the need to lower the operating voltage to reduce power while trying to achieve competitive performance provides significant challenges in terms of process technology recipe, architectural tradeoffs, as well as design tool chain and methodology selections.”

The core of the problem
Diving down, technically speaking, Srikanth Jadcherla, group director of R&D for Synopsys Inc.’s Verification Group, noted that reliability in low-power design goes back to the fundamentals – avoiding permanent or temporary dysfunction of the device due to physical effects such as electromigration, self heating and rail/signal integrity failures. While these might have been overlooked before, the causes of the failures or in some cases the magnitude of certain phenomena can no longer be ignored.

“Some of these cause IC designers to adopt a certain power mitigation (or current mitigation) technique,” Jadcherla said. “Some of these are caused by what is done for power reduction. So, it cuts both ways. Specifically, as the industry heads into nanometer designs, current magnitudes are rising while wire cross sections are shrinking – increasing current density dramatically. This puts a lot more stress on the wires from an electromigration point of view and also from a heating standpoint. Ditto for leakage, which increases the average amount of current flowing through the wires irrespective of activity. This issue didn’t exist before. To combat these issues, IC designers have adopted aggressive techniques such as power gating and voltage scaling to opportunistically reduce the current draw.”

Docea Power, based in Moirans, France, looks at reliability in low-power design from the system perspective. CEO and co-founder Ghislain Kaiser said high power consumption affects reliability of electronic systems due to thermal dissipation and electrical issues induced by high-density currents.

There are multiple reliability issues related to high temperature including physical stress on the package, especially on die-attached material; transistor and interconnect deterioration; alteration of transistor switching time, hence timing hazards; thermal runaway risk when leakage current becomes significant; and high temperature that may require cooling systems such as a fan, which increase the risk of reliability if a failure occurs in the cooling system.

But, Kaiser noted, high-density currents alter electrical properties by causing such issues as electromigration of metals atoms along conductors; crosstalk, which degrades signal integrity; or a voltage drop along resistive wires. “This last point is particularly important when a low-power approach like voltage scaling is used. Lowering voltage allows you to reduce power consumption, but it increases the risk of going below the working point of transistors. The design work involves correctly sizing the voltage margin regarding the use cases,” he said.

Jameel Hussein, Technical Marketing Manager for Xilinx Inc.’s Power and Configuration Solutions reiterated that consideration must be given to thermal management at both the component and system levels to ensure that all devices are operating within their specified temperature range and to maximize overall system reliability.

“The device’s operating (junction) temperature is a function of the device power, its ability to transfer the resultant heat to the surrounding environment via the component packaging, and the ambient temperature of the system,” Hussein said. “Reducing the device power consumption, therefore, has two significant benefits. First, it lowers the system cost by enabling the use of less expensive thermal solutions to keep the device in its intended operating range. Second, reduced power means lower operating temperatures, which directly translates into improved component and system reliability.”

Added Hussein: “The temperature is a function of the power so if you can lower the power, you can lower the temperature of the actual device and its surrounding parts. Equation 2 is based on the acceleration factors between the two different devices in this example. If it is a difference of 10 degrees, in junction temperature, the equation shows that a device that runs 10 degrees less on a junction temperature will last twice as long as one running 10 degrees hotter,” Hussein explained.

Actel, which has been the low-power leader in the FPGA space, has focused part of its reliability argument around on-chip memory. Unlike other FPGAs, Actel’s use flash memory, which is less susceptible to single-event upsets caused by either terrestrial or cosmic radiation. And while that’s of obvious importance in aerospace applications, it’s also considered important in critical functions such as automobile powertrains because upsets often affect multiple bits at increased densities. That may be enough to shut down a chip permanently.

There are workarounds in circuitry and software for these kinds of problems, but they add more area to the circuitry and raise the overall power consumption to make sure there are no problems.

New techniques impact low-power design
With designs today utilizing techniques such as power modeling and complete coverage verification there are pros and cons as to the impact on the design.

“Power modeling and advanced verification techniques have definitely improved the ability to hit the projected performance/power curve for a specific design. However, at the end of the day, it still comes down to understanding the target application usage model and using the modeling techniques to tune the design appropriately. Without it, one may still come up with an impressive looking data sheet that really doesn’t cut muster in real application,” said Intel’s Mehta.

In addition, Synopsys’ Jadcherla explained, some of the techniques adopted such as power gating and voltage scaling themselves cause new problems. “First, IC designers really need to now analyze each physical region (island) by itself independently, unlike the entirety of the chip. And they need to do this across all the temporal situations (aka states and transitions) that are likely to occur. Second, the very act of moving voltages adds new irritants into the integrity of rails and signals – the collapse of either can cause temporary failures or permanent device breakdown.”

Another consideration of using advanced techniques is that the architecture team has to model and evaluate the benefits of various low power techniques regarding the use cases targeted by the final application. This leads to defining the various voltage and clock domains, Docea’s Kaiser said.

Finally, a new entrant into this drama has been temperature, Jadcherla said. “Cross die variations are exacerbated by low power designs. Perhaps one part of the chip is mostly off (cool) and another is mostly on (hot). There is very little data on die-level effects, though my suspicion is that field failures haven’t been studied enough. People just can’t wait to get rid of their older model consumer device. At the system level, however, temperature or rather failure to manage temperature of SoCs has caused enough embarrassing failures – devices exploding, devices locking up thermal runaway, and laptops hot enough to boil water.”