Two methods for preventing failures in critical systems.
Functional safety first became a major issue for the semiconductor industry in 2011 with the introduction of the ISO 26262 standard for implementing functional safety in the automotive industry. Before that, functional safety had already been standardized in a general manner for all industries since the end of the 1990s in IEC 61508. However, in the field of industrial automation, where IEC 61508 is mainly applied, safety systems have tended to be, and still are, built from discrete components.
It is only in recent years that integrated circuits have started to emerge here, a trend that is clearly being driven by experience in the automotive sector. This is evidenced not least by how many developers in other industries, including those whose development work is not carried out in accordance with ISO 26262, reference Part 11 of the standard. Part 11 was published in 2018 and explicitly addresses the application of ISO 26262 to semiconductors.
Another way in which the automotive industry has had an outsize influence on developments in the field of functional safety is strong cost pressure as well as weight and size limitations, combined with high unit volumes. These factors are not present to the same extent for other safety systems, and they push the traditional approach to system safety, which uses redundancy, to its limits. This is particularly critical for systems that cannot assume the switched-off mode as a safe state, such as a fast-moving car in heavy traffic.
Reconciling the competing factors of safety on the one hand, and size, weight and price requirements on the other, requires new approaches to the safety mechanisms used. As things stand, a safety mechanism usually works by detecting a fault in an element of the safety-critical system as soon as it occurs. This is sometimes relatively simple, e.g., a voltage measurement for the fault “open connection.” In a more complex element, detection may rely on redundant execution and comparison, such as two processor cores running in lockstep: their outputs are compared, and any deviation indicates that a fault is present. In the first example, there is no need to include redundancy to operate a fail-safe system that shuts down in the event of a fault. In the second example, the redundancy is inherent. In both cases, if the system is to continue to function in the event of a fault (fail-operational), further redundancy must be provided. In the first case, it is needed to switch over to the redundant system; in the second, it is needed to determine by majority vote which result is erroneous and to shut down the corresponding core.
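The difference between detecting a fault by comparison and tolerating it by majority vote can be illustrated with a minimal Python sketch. The function names and the use of plain integers as core outputs are assumptions made for illustration only; real lockstep comparators and voters are implemented in hardware.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def lockstep_check(out_a: int, out_b: int) -> bool:
    """Return True if the two redundant outputs agree (no fault detected).

    Two channels can only detect a deviation; they cannot tell
    which of the two channels is the faulty one.
    """
    return out_a == out_b


def majority_vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Return (voted value, indices of channels that disagree with it).

    If no value has a strict majority, the voted value is None and
    every channel is reported as suspect.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Hypothetical usage: the second of three cores delivers a wrong result.
voted, faulty_cores = majority_vote([5, 7, 5])
print(voted, faulty_cores)
```

With two channels, a mismatch only permits a shutdown (fail-safe). The third channel is what allows the erroneous core to be identified and isolated while operation continues.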
One option for novel safety mechanisms to avoid this problem is to make use of predictive health monitoring. This is the collective term for various methods that have been established in the field of reliability for a long time. Two of these methods will be considered below and evaluated for their suitability for use as a safety mechanism.
In the first method, an electronic component’s remaining lifetime is estimated from its load history. This involves recording various stress variables such as temperature or current. These are either compared with available statistics on field failures that have already occurred, or the loads are used to simulate failure mechanisms (physics of failure). This allows a prediction to be made a very long time before failure. However, the prediction is relatively imprecise, and its confidence is also relatively low. In any case, it is based on models and statistical assumptions and does not consider the component’s individual properties. This makes it less suitable as a safety mechanism. However, it could still find application in the field of functional safety — namely for better estimates of failure rates.
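As a rough illustration of a load-history estimate, the sketch below converts a recorded temperature history into equivalent hours at a reference temperature and relates them to a rated lifetime. It assumes an Arrhenius temperature model with linear damage accumulation, which is only one of many possible physics-of-failure models and is not prescribed by the text. The activation energy of 0.7 eV, the 55 °C reference temperature, and the rated lifetime in the usage example are placeholder assumptions, not values from any datasheet.

```python
import math
from typing import Iterable, Tuple

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(temp_c: float,
                        ref_temp_c: float = 55.0,
                        activation_energy_ev: float = 0.7) -> float:
    """Arrhenius acceleration factor relative to the reference temperature.

    ref_temp_c and activation_energy_ev are assumed placeholder values.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / ref_k - 1.0 / temp_k)
    return math.exp(exponent)


def consumed_life_fraction(load_history: Iterable[Tuple[float, float]],
                           rated_life_hours: float) -> float:
    """Fraction of the rated lifetime consumed by the load history.

    load_history holds (hours, temperature in deg C) pairs. Damage is
    accumulated linearly, which is a simplifying assumption.
    """
    equivalent_hours = sum(hours * acceleration_factor(temp_c)
                           for hours, temp_c in load_history)
    return equivalent_hours / rated_life_hours


# Hypothetical usage: 2,000 h at 55 deg C plus 500 h at 105 deg C,
# against an assumed rated lifetime of 50,000 h at the reference temperature.
history = [(2000.0, 55.0), (500.0, 105.0)]
print(f"{consumed_life_fraction(history, 50_000.0):.3f}")
```

The result depends entirely on the model parameters and not on the condition of the individual part, which mirrors the limitation described above.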
The second method is damage detection. This technique examines components to detect existing damage that is getting progressively worse before it can lead to a functional failure. There are various ways to implement such damage detection: mechanical damage in IC packaging can be identified using thermal impedance measurements; breaks in conductive paths can be found using time domain reflectometry; and growing damage in transistors can be identified via changes in their threshold voltage. This method is much more suitable for use as a safety mechanism. Although it predicts failures only in the relatively short term, its confidence can be on a par with that of conventional safety mechanisms, depending on the exact implementation and use case. It can predict more than 90% of failures with a negligible false-positive rate.
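A threshold-voltage monitor of this kind could, in its simplest form, compare each measurement against a baseline recorded for the individual part and flag damage only after several consecutive exceedances. The Python sketch below shows that idea. The function name, the 50 mV drift limit, and the confirmation count of three are hypothetical values chosen for illustration, not figures from the text or from any standard.

```python
from typing import Optional, Sequence


def detect_damage(vth_samples: Sequence[float],
                  baseline_v: float,
                  drift_limit_v: float = 0.05,
                  confirm_count: int = 3) -> Optional[int]:
    """Return the index of the sample at which damage is confirmed, or None.

    Damage is confirmed once the threshold voltage has deviated from the
    part's own baseline by more than drift_limit_v for confirm_count
    consecutive samples. Requiring consecutive exceedances is a simple
    way to keep isolated measurement outliers from raising a false alarm.
    """
    consecutive = 0
    for index, vth in enumerate(vth_samples):
        if abs(vth - baseline_v) > drift_limit_v:
            consecutive += 1
            if consecutive >= confirm_count:
                return index
        else:
            consecutive = 0
    return None


# Hypothetical usage: one outlier at index 1, then a sustained drift.
samples = [0.70, 0.76, 0.70, 0.76, 0.77, 0.78]
print(detect_damage(samples, baseline_v=0.70))
```

Because the baseline belongs to the individual component, this approach reacts to the actual state of that part, in contrast to the purely model-based lifetime estimate.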
So from a technical perspective, there is little standing in the way of implementing predictive health monitoring as a safety mechanism. As is usual in functional safety, however, standards will need to be set before this approach becomes accepted in industry. Working groups at various standardization organizations have already started work on this. Consequently, the next versions of ISO 26262 or IEC 61508 are expected to contain corresponding approaches.