Safety mechanisms designed to handle rare events can become unreliable under sustained or intense fault conditions.
Fault injection is usually discussed in the context of security, where adversaries deliberately induce faults to bypass protections or extract sensitive information. In safety engineering, by contrast, faults are often treated as rare, random events driven by natural or environmental factors. The recent Airbus A320 recall is a good example of how a problem that is primarily about safety can still benefit from security-grade fault testing techniques.
Airbus recently implemented an urgent software update on approximately 6,000 A320 aircraft after an earlier incident in which flight control data was corrupted by solar radiation. Because the update was unplanned, it caused significant operational disruption, but it was deemed necessary to maintain safety.
While serious accidents were prevented, the resulting air-traffic disruption and logistical overhead were almost certainly very expensive.
The underlying issue was linked to a software version that appeared insufficiently hardened against radiation-induced faults.
Flight-control systems typically rely on triple modular redundancy (TMR). In such a scheme, every computation is performed three times, and a voting mechanism compares the results. If a single fault occurs, the two correct results outvote the faulty one, and the system continues to operate safely.
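To make the voting step concrete, here is a minimal sketch of a TMR voter in Python. It is an illustration only, not how any real flight-control computer is implemented. The names `majority_vote` and `run_tmr`, and the idea of modeling a fault as an XOR bit-flip mask on a replica's result, are assumptions made for this example.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(results: Sequence[int]) -> Optional[int]:
    """Return the value reported by a strict majority of replicas, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None


def run_tmr(
    compute: Callable[[int], int],
    x: int,
    fault_masks: Sequence[int] = (0, 0, 0),
) -> Optional[int]:
    """Run `compute` three times and vote on the results.

    Each entry of `fault_masks` is a hypothetical bit-flip pattern that is
    XORed into one replica's result (0 means the replica is fault-free).
    """
    results = [compute(x) ^ mask for mask in fault_masks]
    return majority_vote(results)


def double(v: int) -> int:
    return v * 2


print(run_tmr(double, 21))              # no faults: correct result
print(run_tmr(double, 21, (0, 4, 0)))   # one fault: outvoted, still correct
print(run_tmr(double, 21, (1, 4, 0)))   # two different faults: no majority
print(run_tmr(double, 21, (4, 4, 0)))   # two identical faults: wrong value wins
```

The last two calls show the two ways this toy voter can break down once more than one replica is hit: the replicas disagree and no result can be selected, or two replicas agree on the same wrong value and outvote the correct one.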
This approach is very effective for handling isolated, incidental faults. However, when faults become frequent, triple redundancy can begin to fail:
In other words, a mechanism designed to handle rare events can become unreliable under sustained or intense fault conditions.
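A rough calculation illustrates how quickly this happens. The sketch below assumes that each of the three replicas is hit independently with the same probability p during one voting cycle, and that the voter is defeated whenever two or more replicas are faulty. Both are simplifying assumptions chosen for illustration (real faults may be correlated), and `tmr_failure_probability` is a hypothetical helper name.

```python
def tmr_failure_probability(p: float) -> float:
    """Probability that at least two of three replicas are faulty.

    Assumes independent faults with per-replica probability p per voting
    cycle: exactly two faulty (3 * p^2 * (1 - p)) plus all three faulty (p^3).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return 3 * p**2 * (1 - p) + p**3


for p in (1e-6, 1e-3, 0.1, 0.5):
    print(f"p = {p:g}: voter defeated with probability {tmr_failure_probability(p):.3g}")
```

Under these assumptions the failure probability grows roughly with the square of p while p is small, so rare faults are suppressed very effectively. As p approaches 0.5 the benefit disappears, and above 0.5 the voted result is less reliable than a single replica.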
Both hardware and software in aerospace systems are typically tested under fault conditions. Faults can be injected by exposing the system to a radiation source that simulates solar radiation, or by other fault injection methods. Such testing is important for preventing in-flight malfunctions that could have serious consequences. So why might this software still fail?
Investigations concluded that a specific combination of inputs escaped the expected handling in the latest software release. This suggests that the test strategy, while likely extensive, was not watertight. If testing is conducted randomly, some failure scenarios may be missed. Similarly, if testing is conducted at low fault intensity, more complex multi-fault scenarios are likely to go unidentified.
Traditionally, safety and security risks are distinct:
Security evaluations push systems with the intent to cause faults. They use high-intensity, precisely targeted fault injection campaigns and aim for high test coverage, because attackers are assumed to be persistent and adaptive.
However, when accidental faults occur at a high rate, such as during solar storms, the safety problem begins to resemble a security problem:
Therefore, it would make sense to treat high-frequency, high-impact safety problems in the same way as security problems and conduct rigorous testing. In our 2017 FDTC paper [https://ieeexplore.ieee.org/document/8167705], we also show that functional safety mechanisms are not sufficient when under attack.
It’s important to recognize the fundamental difference between faults caused by nature and those introduced by adversaries: intent and adaptability. Natural events can often be modeled statistically, where the chance of a fault is represented by a probability p. In contrast, a knowledgeable attacker can target vulnerabilities, potentially raising the likelihood of a fault and, by extension, the risk of compounded failures.
This distinction should have direct implications for testing strategies. Random fault injection helps estimate the average risk, but intentional, targeted testing is needed to determine the worst-case scenario, that is, the upper limit of vulnerability for critical systems. Modern fault injection platforms allow for precise targeting and timing of fault injections, covering both simple and complex scenarios efficiently and reproducibly.
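The difference between the two strategies can be sketched with a toy model. Everything in the example below is hypothetical: the device under test has 40 fault-injection sites (`N_SITES`) and fails only when two particular sites (`VULNERABLE`) are faulted in the same run. It does not model any real fault injection platform or any real system.

```python
import itertools
import random
from typing import FrozenSet, Set

N_SITES = 40                     # hypothetical number of injection sites
VULNERABLE = frozenset({7, 31})  # hypothetical hidden double-fault weakness


def system_fails(faulted_sites: FrozenSet[int]) -> bool:
    """Toy device under test: fails only if both vulnerable sites are hit."""
    return VULNERABLE <= faulted_sites


def random_campaign(trials: int, seed: int = 0) -> Set[FrozenSet[int]]:
    """Inject two faults at randomly chosen sites, `trials` times."""
    rng = random.Random(seed)
    found = set()
    for _ in range(trials):
        sites = frozenset(rng.sample(range(N_SITES), 2))
        if system_fails(sites):
            found.add(sites)
    return found


def targeted_campaign() -> Set[FrozenSet[int]]:
    """Systematically inject every possible pair of simultaneous faults."""
    return {
        frozenset(pair)
        for pair in itertools.combinations(range(N_SITES), 2)
        if system_fails(frozenset(pair))
    }


print("random, 100 trials:", random_campaign(100))
print("targeted, all pairs:", targeted_campaign())
```

With 40 sites there are 780 possible fault pairs, so in this toy model a random campaign of 100 double-fault trials misses the single failing pair about 88% of the time, while the systematic campaign is guaranteed to find it. Real campaigns also have to cover timing and fault type, which makes the search space larger and precise, reproducible targeting more valuable.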
For operators of aerospace systems and other critical infrastructure, embedding this level of systematic, security-grade fault testing strengthens product resilience. We see this approach as a promising research and development direction, and we look forward to engaging with practitioners who want to explore how this can push the state of the art in safety assurance.