Fail-safe operations and boot load memory are essential for ensuring pricey resources aren’t lost in space.
Aerospace safety requirements and standards vary depending on whether a spacecraft is manned or unmanned, and how crucial the mission is. The defense contractors designing these spacecraft take various approaches to functional safety based on how critical a component is for the mission to succeed.
While losing a few images during an Earth-bound observation may not matter, losing a satellite can be a major setback. And absolutely nobody wants to lose a jet or a rocket.
“I always tell my customers, ‘There are only two people who can afford to lose satellites — Elon Musk and Jeff Bezos,’” said Helmut Puchner, vice president and aerospace and defense fellow at Infineon Technologies. “Why? Because they have the launch capability right at their disposal. If they lose a satellite, three days later, the next ones go up. If you’re a smaller company and you work on this satellite for a year or two, it throws you back tremendously. So you want to make sure your spacecraft is reliable and delivers the outcome you want. If you look back in history, there have always been examples where companies got burned.”
For example, Astranis lost a geosynchronous bird (a.k.a. satellite) because the antenna steering array did not point to the sun 100%, Puchner said. “As a result, the solar panel arrays couldn’t power up. If you don’t have 100% power, your communication cannot perform at the level that you assigned it, so they had to give up on that satellite.”
Fig. 1: Space station in Earth orbit. Source: Infineon
Because spacecraft cannot be easily accessed for a repair, components must be fail-operational.
“This means, in spite of failures, the system should produce the correct values and also inform about the error,” said Varadan Veeravalli, principal functional safety engineer at Imagination Technologies. “So you go for N-modular redundancy — not just rely on one or two redundancies, but go for three or four — then trust that if one of them fails, the rest of them will still deliver the data. That helps us understand there has been a failure, but the system is still providing us the data. Based on that, we should send it to maintenance at some point in time. These are the kinds of parameters that should be measured, and this goes for security, as well.”
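The voting that underpins N-modular redundancy can be sketched in a few lines of C. This is a minimal illustration under assumed conditions — each replica producing a 32-bit result — rather than any vendor's implementation. The majority value is passed on as the trusted output, and a fault mask records which replicas disagreed so the failure can be reported for maintenance, exactly the fail-operational behavior described above.

```c
#include <stdint.h>
#include <stdio.h>

#define N_REPLICAS 3

// Vote across replica outputs; *fault_mask gets one bit set per disagreeing replica.
static uint32_t nmr_vote(const uint32_t out[N_REPLICAS], uint32_t *fault_mask)
{
    uint32_t best = out[0];
    int best_count = 0;
    for (int i = 0; i < N_REPLICAS; i++) {
        int count = 0;
        for (int j = 0; j < N_REPLICAS; j++)
            if (out[j] == out[i])
                count++;
        if (count > best_count) {
            best_count = count;
            best = out[i];
        }
    }
    *fault_mask = 0;
    for (int i = 0; i < N_REPLICAS; i++)
        if (out[i] != best)
            *fault_mask |= 1u << i;
    return best;
}

int main(void)
{
    uint32_t outputs[N_REPLICAS] = { 0xCAFE, 0xCAFE, 0xBEEF };  // replica 2 suffered an upset
    uint32_t faults;
    uint32_t value = nmr_vote(outputs, &faults);
    printf("voted value 0x%X, fault mask 0x%X\n", (unsigned)value, (unsigned)faults);
    return 0;
}
```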
Imagination uses a distributed safety mechanism in its approach, with a separate safety mechanism for different modules and different subsystems. “There is not a single mechanism that does everything,” said Veeravalli. “Rather, we identify the failures as parts of functions, and we protect those functions and everything surrounding it.”
The company duplicates critical hardware, instead of duplicating the entire system. “We try to isolate the failures as modules within the modules, and then we protect each and every module via a safety mechanism,” he said. “This has been more efficient than doing full-fledged redundancy or using an approach in which we try to calculate the data path.”
Aerospace/automotive overlap
While classical system mitigation techniques, such as triple modular redundancy, are occasionally used in the automotive sector, they are more common in the aerospace sector. “You can already imagine it — three times the power, three times the resources, three times the cost,” said Puchner. “So you do this selectively on super-critical circuit functions.”
For instance, some space systems are activated via a preset counter once they reach a certain orbit. Until then, they are in sleep mode. “This particular chip should not fail at any point, because if it doesn’t say, ‘Wake up,’ the rest of the components will always be sleeping,” said Veeravalli. “Most satellites are sent to orbit based on such protocols. This is where vacuum comes into the discussion, which is something we don’t have to worry about in automotive because we don’t have that problem. But in space they do. Then comes the temperature dissipation. How are they going to handle this? They have to have some conductive material.”
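A hedged sketch of such a wake-up counter is shown below, assuming a periodic tick and a hypothetical wake_payload() hook (both invented for illustration). The count is held in triplicate and majority-voted on every tick, so a single upset in one copy cannot keep the payload asleep forever or wake it early.

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical hook that powers up the rest of the spacecraft.
static void wake_payload(void) { puts("payload waking up"); }

// Three copies of the countdown, voted each tick so one upset cannot stall the wake-up.
static uint32_t countdown[3];

static void wakeup_counter_init(uint32_t ticks_until_wake)
{
    for (int i = 0; i < 3; i++)
        countdown[i] = ticks_until_wake;
}

// Called from the periodic tick interrupt (simulated in main below).
static void wakeup_counter_tick(void)
{
    // Majority vote: with three copies, at least two agree unless two are hit at once.
    uint32_t voted = (countdown[0] == countdown[1] || countdown[0] == countdown[2])
                         ? countdown[0] : countdown[1];
    if (voted == 0) {
        wake_payload();
        return;
    }
    voted--;
    for (int i = 0; i < 3; i++)      // scrub: rewrite all copies with the voted value
        countdown[i] = voted;
}

int main(void)
{
    wakeup_counter_init(5);
    countdown[1] ^= 0x40;            // simulate an upset in one copy
    for (int t = 0; t < 6; t++)
        wakeup_counter_tick();       // wake_payload() fires on the final tick
    return 0;
}
```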
To account for the effects of thermal dissipation, radiation, and other space-environment challenges, a designer can perform fault injection with a digital twin or virtual prototype.
“You can simulate the effects on the SoC,” said James Chew, senior global group director, aerospace and defense at Cadence. “What are some defects and effects that you expect from aging, from being in space, both low dosage and long duration? Or, in the case of nuclear, what are the defects and effects from a high impulse? This followed from the automotive world for autonomous systems. Can you simulate a blockage, or can you simulate what happens at this point? When you have a hardware-accurate digital twin, you’re able to do testing that would probably be dangerous to do. It would either do harm to the device or the system or may be dangerous for humans.”
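Fault injection of this kind can be prototyped in plain software long before a full hardware-accurate twin exists. The sketch below is a simplified stand-in for the flow described above, not Cadence's tooling: it flips one bit at a time in a modeled register file and checks whether a duplicate-and-compare safety mechanism catches the corruption.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_REGS 8

// Toy model of a register file with a shadow copy acting as the safety mechanism.
typedef struct {
    uint32_t regs[NUM_REGS];
    uint32_t shadow[NUM_REGS];
} model_t;

static void model_init(model_t *m)
{
    for (int i = 0; i < NUM_REGS; i++)
        m->regs[i] = m->shadow[i] = 0xA5A50000u + (uint32_t)i;
}

// Safety mechanism: duplicate-and-compare detects any divergence from the shadow copy.
static int safety_check(const model_t *m)
{
    return memcmp(m->regs, m->shadow, sizeof m->regs) != 0;   // 1 = fault detected
}

int main(void)
{
    int detected = 0, injected = 0;

    // Exhaustively inject single-bit flips into every register of the model.
    for (int r = 0; r < NUM_REGS; r++) {
        for (int b = 0; b < 32; b++) {
            model_t m;
            model_init(&m);
            m.regs[r] ^= 1u << b;      // inject the fault
            injected++;
            if (safety_check(&m))
                detected++;
        }
    }
    printf("detected %d of %d injected faults\n", detected, injected);
    return 0;
}
```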
Memory protection
In aerospace design, designers assign criticality, norms, and measures to each component, determining which ones must always work to achieve the mission. Memory is one area that needs multiple fail-safe measures.
“You need two things in a system to work correctly,” said Infineon’s Puchner. “You need a power switch that always works, and you need a boot load memory, because when you lose the power — which can happen anytime the battery runs out — you’re charging up again, and coming back with your payload. At that point, you want to make sure you’re coming back correctly so your configuration memory is not corrupted. If that happens, all bets are off. You might be able to recover modern systems nowadays, because they have an RF uplink in order to recover, but it’s not easy to design in. It’s possible, but if the boot code is corrupted, you have to do a lot of work to bring back the satellite and save it.”
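One common way to guard the boot path Puchner describes is to keep two copies of the boot image along with a checksum recorded at build time. The following is a hedged sketch under those assumptions — the image contents and CRC routine are illustrative, not a flight implementation: if the primary copy fails its integrity check, the loader falls back to the golden copy instead of executing corrupted code.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320) used as the integrity check.
static uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

// Pick a boot image: primary if intact, otherwise the golden fallback copy.
static const uint8_t *select_boot_image(const uint8_t *primary, const uint8_t *golden,
                                         size_t len, uint32_t expected_crc)
{
    if (crc32(primary, len) == expected_crc)
        return primary;                 // normal path
    if (crc32(golden, len) == expected_crc)
        return golden;                  // primary corrupted, fall back
    return NULL;                        // both corrupted: all bets are off
}

int main(void)
{
    uint8_t primary[16] = "BOOTCODE v1.0  ";
    uint8_t golden[16]  = "BOOTCODE v1.0  ";
    uint32_t expected = crc32(golden, sizeof golden);

    primary[3] ^= 0x08;                 // simulate a radiation-induced bit flip
    const uint8_t *img = select_boot_image(primary, golden, sizeof primary, expected);
    printf("booting from %s copy\n", img == golden ? "golden" : "primary");
    return 0;
}
```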
The most common memory in aerospace and defense applications is DDR DRAM, mainly because of cost and uniformity, said Scott Best, senior director, silicon security products at Rambus. “If you buy a DDR DRAM from any of the top three DRAM makers, the specification is controlled by JEDEC, and they are all 100% necessarily conformant to that specification. It eliminates variation by vendor. It eliminates that risk, and it anonymizes whose DRAMs those are. They do perform slightly differently in terms of size, weight, area and power — but only slightly, and that is what the market wants.”
Another option is NAND flash. “There’s no high-density, non-volatile memory available that’s intrinsically radiation-hardened, so we’re using NAND flash,” said Infineon’s Puchner. “You get a little thumb drive, and it’s a terabyte. People fly these things and they deal with the radiation effects, then just throw the memory away when it’s consumed. It will keep cycling and searching for sectors and keep backing up sectors, but that’s still what we have to do. There’s no other choice if you want the density. This is where aging and redundancy are important considerations — also called wear leveling. When you wear out the sector, you go to the next one. All those techniques are used to fulfill the mission profile and requirements.”
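The wear-leveling and sector-retirement behavior Puchner describes follows a simple policy that can be sketched as below. This assumes a hypothetical flash abstraction with per-block erase counters and bad-block flags; a real NAND controller layers ECC, mapping tables, and background scrubbing on top.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 8

// Per-block bookkeeping a flash translation layer might keep.
typedef struct {
    uint32_t erase_count;   // how many times the block has been erased
    int      bad;           // set once the block wears out or fails verification
} block_t;

static block_t blocks[NUM_BLOCKS];

// Wear leveling: always write to the healthy block with the fewest erases.
static int pick_block(void)
{
    int best = -1;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blocks[i].bad)
            continue;
        if (best < 0 || blocks[i].erase_count < blocks[best].erase_count)
            best = i;
    }
    return best;            // -1 means every block is worn out
}

// Retire a block that failed a post-write verification.
static void retire_block(int i) { blocks[i].bad = 1; }

int main(void)
{
    retire_block(2);                         // pretend block 2 already wore out
    for (int write = 0; write < 20; write++) {
        int b = pick_block();
        if (b < 0) { puts("flash exhausted"); break; }
        blocks[b].erase_count++;             // erase-before-write consumes the block's life
    }
    for (int i = 0; i < NUM_BLOCKS; i++)
        printf("block %d: erases=%u bad=%d\n", i, (unsigned)blocks[i].erase_count, blocks[i].bad);
    return 0;
}
```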
Protecting memory
Memories for aerospace applications can be protected with error correction codes (ECCs) and parities, but that’s not enough. Within the memories there can be address-level faults and data faults, said Veeravalli. “Even if I use an ECC, I would make it redundant, as well, because if there are three failures at the same time, then the ECC might not detect it. If I have a redundancy, then it’s highly improbable for me that both the memories are corrupted at the same exact location, which would help me identify that there has been a fault. And by having TMR [triple modular redundancy], I can remove this error, and soft reset the part of the memory that is affected. Then I can keep the performance up and running, so I haven’t compromised the data.”
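To make the ECC idea concrete, here is a minimal Hamming(7,4) single-error-correcting code in C. It illustrates the principle rather than the ECC actually used in these memories — flight parts typically use wider SEC-DED or stronger codes. One nibble of data carries three parity bits, so a single flipped bit can be located and repaired, and the non-zero syndrome doubles as the fault report.

```c
#include <stdint.h>
#include <stdio.h>

// Encode a 4-bit value into a 7-bit codeword (bit 0 = position 1).
static uint8_t hamming74_encode(uint8_t d)
{
    uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;
    uint8_t p2 = d1 ^ d3 ^ d4;
    uint8_t p3 = d2 ^ d3 ^ d4;
    // Positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

// Decode, correcting a single-bit error; *corrected reports whether a repair happened.
static uint8_t hamming74_decode(uint8_t cw, int *corrected)
{
    uint8_t s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1; // covers p1 d1 d2 d4
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1; // covers p2 d1 d3 d4
    uint8_t s3 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1; // covers p3 d2 d3 d4
    uint8_t syndrome = s1 | (s2 << 1) | (s3 << 2);   // 1..7 = position of the flipped bit
    *corrected = 0;
    if (syndrome) {
        cw ^= 1u << (syndrome - 1);                  // repair the flipped bit
        *corrected = 1;
    }
    return ((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) | (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3);
}

int main(void)
{
    uint8_t cw = hamming74_encode(0xB);
    cw ^= 1u << 4;                                   // simulate a single-event upset
    int fixed;
    printf("decoded 0x%X (corrected=%d)\n", hamming74_decode(cw, &fixed), fixed);
    return 0;
}
```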
Compared with space applications, in automotive design it can be enough to detect a fault and know that a particular row’s bits can’t be used because they have been corrupted. “In that case, I would stop using it or overwrite that, and check if it is a permanent fault or a transient fault,” Veeravalli explained. “If it’s an SEU [single event upset] that is manifested by overwriting it, I can also recover the data part. However, if it’s a permanent fault, then it won’t be recoverable, so we have a soft reset, where we can reset that memory part which is corrupted, and then work with it. I can go for a degraded performance if I don’t use that memory row. Then I can use the other rows and still keep on working on it. All I have to do is inform the system that it cannot use this particular address. They can make that enablement, and then it will work.”
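The overwrite-and-recheck test Veeravalli describes can be sketched as follows, assuming a hypothetical memory row that can be rewritten from a known-good source. If rewriting makes the row read back correctly, the upset was transient and the row stays in service; if the error persists, the row is treated as permanently damaged and the system is told not to use that address.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROW_WORDS 4

typedef enum { FAULT_NONE, FAULT_TRANSIENT, FAULT_PERMANENT } fault_class_t;

// Classify a corrupted row by rewriting the known-good data and re-reading it.
static fault_class_t classify_row(uint32_t *row, const uint32_t *good)
{
    if (memcmp(row, good, ROW_WORDS * sizeof *good) == 0)
        return FAULT_NONE;

    // Overwrite with the known-good copy (e.g. from a redundant image or recomputed data).
    for (int i = 0; i < ROW_WORDS; i++)
        row[i] = good[i];

    // If the rewrite sticks, it was a transient SEU; if not, the cells are permanently damaged.
    return memcmp(row, good, ROW_WORDS * sizeof *good) == 0
               ? FAULT_TRANSIENT : FAULT_PERMANENT;
}

int main(void)
{
    uint32_t good[ROW_WORDS] = { 1, 2, 3, 4 };
    uint32_t row[ROW_WORDS]  = { 1, 2, 3, 4 };

    row[2] ^= 0x10;                           // simulate a single-event upset
    fault_class_t c = classify_row(row, good);
    printf("row fault class: %d (1 = transient, row stays in service)\n", (int)c);
    return 0;
}
```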
Also, many hazards in the memory can be avoided by creating separate protocols. “Redundant data parts can be created. Then the system can detect anything that is in the entry from the end-to-end protocol,” he added. “There are many variations that we can do, and all of the mechanisms we develop for automotive could also be used for space-based hardware. When we are making a function safe, it can be used for space applications, as well. There is a broader deviation once we go to the next level.”
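One common shape for such an end-to-end protocol, similar in spirit to the end-to-end protection profiles used in automotive, appends a sequence counter and a checksum to every message so the consumer can detect corruption, loss, or repetition anywhere along the path. The framing below is a hedged illustration rather than any specific standard’s profile.

```c
#include <stdint.h>
#include <stdio.h>

// Message with redundant protection fields carried end to end with the payload.
typedef struct {
    uint8_t  counter;     // increments per message; detects lost or repeated frames
    uint32_t payload;
    uint16_t checksum;    // detects corruption anywhere between producer and consumer
} e2e_msg_t;

static uint16_t e2e_checksum(const e2e_msg_t *m)
{
    // Simple illustrative checksum over counter and payload (a real profile would use a CRC).
    uint32_t sum = m->counter + (m->payload & 0xFFFF) + (m->payload >> 16);
    return (uint16_t)(sum ^ (sum >> 16));
}

static e2e_msg_t e2e_send(uint8_t *counter, uint32_t payload)
{
    e2e_msg_t m = { .counter = (*counter)++, .payload = payload };
    m.checksum = e2e_checksum(&m);
    return m;
}

// Returns 0 if the message is trustworthy, non-zero if corrupted or out of sequence.
static int e2e_receive(const e2e_msg_t *m, uint8_t *expected_counter)
{
    if (m->checksum != e2e_checksum(m))
        return 1;                             // corrupted in transit or in memory
    if (m->counter != (*expected_counter)++)
        return 2;                             // lost, repeated, or reordered message
    return 0;
}

int main(void)
{
    uint8_t tx = 0, rx = 0;
    e2e_msg_t m = e2e_send(&tx, 0xDEADBEEF);
    m.payload ^= 0x4;                         // corrupt the payload on the way
    printf("receive status: %d\n", e2e_receive(&m, &rx));
    return 0;
}
```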
Standards and norms
When it comes to the standards in play for the aerospace industry, there is a range in use rather than a single universal framework.
“Everyone has their own standards, and the reason is basically that everyone has their own interpretation of the system, and what they think would be the most sufficient,” Veeravalli noted.
NASA uses DLA-certified QML (Qualified Manufacturers List) components, but the paradigm of space companies using QML components was broken when SpaceX said it was flying automotive components with humans, Puchner said. “I can hardly predict the future. It could be a mix of both auto and space components, but it became symbolic to break with that rule because of cost pressure, because of architecture limitations that you have nowadays with ceramic components, and which components are available for some projects.”
Automotive solutions are also a candidate for low-Earth-orbit mesh networks, such as Starlink, where the required quality is not at the same level as for geosynchronous satellites that have to last 25-plus years, said Puchner.
When IP goes into a space-bound SoC, designers need to think about power. “How do they take care of thermal variations, performance variations, and then in a vacuum, how do they have the thermal associations? All of these factors come in once the IP goes into a space or aerospace fabrication plant,” said Veeravalli. “But when we are developing an IP, we need to ensure, by FMEA [failure mode and effects analysis] alone, that each and every functional failure could be mitigated by a safety mechanism or an architectural level change, as well. It’s not necessary that we have to build a safety mechanism every time. If in the architecture we can isolate some of the safety failures, then this could also help, for example, ensure that the same instructions are not processed at the same time, but there is sufficient delay between them.”
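The architectural measure Veeravalli mentions — separating identical operations in time rather than duplicating hardware — is a form of temporal redundancy. Below is a hedged sketch with a hypothetical critical computation and a stand-in delay: the operation runs twice with a gap between the runs, so a transient upset during one run shows up as a mismatch instead of a silently wrong answer.

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical critical computation whose result must be trusted.
static uint32_t critical_compute(uint32_t input)
{
    return input * 2654435761u + 12345u;      // placeholder for the real function
}

// Stand-in for a real delay (a timer wait, or unrelated work scheduled in between).
static void separation_delay(void)
{
    for (volatile int i = 0; i < 1000; i++)
        ;
}

// Temporal redundancy: run the same computation twice, separated in time, and compare.
static int compute_with_temporal_check(uint32_t input, uint32_t *result)
{
    uint32_t first = critical_compute(input);
    separation_delay();                       // a transient upset now affects only one run
    uint32_t second = critical_compute(input);
    if (first != second)
        return -1;                            // mismatch: report the fault, do not use the value
    *result = first;
    return 0;
}

int main(void)
{
    uint32_t value;
    if (compute_with_temporal_check(42, &value) == 0)
        printf("trusted result: %u\n", (unsigned)value);
    else
        puts("temporal check failed");
    return 0;
}
```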
Once a product enters prototyping or fabrication, there is a lot of testing — temperature cycling, corrosion testing, and shielding — that is dependent on where it is going. Stress testing and verification simulation happen on both auto and space chips, but at different levels. At the IP level, there is not much difference between auto and space, but in terms of the SoC, the rigor is different. Space will need testing at lower temperatures, for example.
Another difference between the sectors is that some automotive standards allow you to do some of the testing in different levels, but in space or aerospace, even the low-level sub-modules and components need full-fledged testing, said Veeravalli. “Whatever they are going to put in cannot fail. In order for them to understand that the functions are properly implemented, they need to verify everything.”
This often includes many suppliers and distributors. “For example, if Boeing is building an airplane, it’s not going to do everything from scratch,” he said. “It is going to give someone else the contracts, so it’s subcontracted, and there are layers of protocols that need to be followed for each and every one of them.”
Conclusion
Aerospace and automotive verification both rely on the V model, which maps each phase of the development life cycle to its corresponding phase of testing. Whether the application is safety-critical or mission-critical, a certain level of rigor must be maintained.
“You start from the requirements, then you go all the way to the verification,” Veeravalli said. “What most of the standards are trying to cite is, if you don’t have the requirements properly identified, then the end-product will not be the same. You need to verify those requirements, and you need to provide a bi-directional tracing. This is one common thing you can see in all of those standards. You have the requirements, you have the specifications, and then you have the verification test cases.”
Related Reading
Mission-Critical Devices Drive System-Level Test Expansion
SLT walks a fine line between preventing more failures and rising test costs.
ISO 26262’s Importance Widens Beyond Automotive
The international standard has been proven effective in automotive functional safety and has begun to spread to other markets.
Why Chips Fail, And What To Do About It
Improving reliability in semiconductors is critical for automotive, data centers, and AI systems.