Planning For Failures In Automotive

Is it better to build expensive parts that are highly reliable, or redundant cheaper parts?


The automotive industry is undergoing fundamental shifts: it is backing away from its traditional siloed approach in favor of designing for graceful failure, slowing the evolution toward full autonomy, and rethinking how to achieve its goals at a reasonable cost.

For traditional automakers, this means borrowing some proven strategies from the electronics world rather than trying to evolve traditional automotive strategies. So while carmakers have pushed for zero failures over 18 years, companies like Waymo, which started out as Google’s self-driving car project, opted for inexpensive off-the-shelf parts with the understanding that all parts will fail, and that the best way around that problem is to add redundancy.

“It sounds a bit misleading that everybody should now take the cheapest parts and bring them into the design of their systems, but what it means is that we have to design with fail-safe in mind,” said Burkhard Huhnke, vice president for automotive at Synopsys. “There has to be a system continuously observing what’s happening in the function of the running system, like self-testing, monitoring, correction, and a fail-safe. In case of a failure, let’s say there is a computer calculating the path of the car, and with its double redundancy it comes to a decision that Computer One says, ‘yes,’ and Computer Two says, ‘no.’ The system then needs to be able to make a fail-safe decision, and that means stop the car at the shoulder of the road. You have maybe a couple of seconds — maybe 10 or 15 seconds, depending on your speed — to make that decision.”
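The two-computer scenario Huhnke describes can be sketched as a simple comparator. This is only an illustration under stated assumptions, not any vendor's implementation: the names `Action` and `arbitrate`, and the use of `None` for a missing answer, are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue on planned path"
    PULL_OVER = "stop at the shoulder of the road"


def arbitrate(path_ok_a, path_ok_b):
    """Compare the verdicts of two redundant path computers.

    Each argument is True ("yes"), False ("no"), or None if that
    computer produced no answer (hypothetical convention). With only
    two channels there is no majority to break a tie, so anything
    short of a joint "yes" falls back to the fail-safe action.
    """
    if path_ok_a is True and path_ok_b is True:
        return Action.CONTINUE
    return Action.PULL_OVER


# Computer One says "yes", Computer Two says "no": take the fail-safe action.
print(arbitrate(True, False).value)
```

A third channel would allow majority voting instead of an immediate stop, at the cost of more hardware.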

The auto industry has been building warning systems into its designs for decades. “In case of a failure, in case of an error light, you need to be able to stop your car safely, and for many years that has been the design rule within automotive,” Huhnke said, noting this has hindered the pace of innovation. “You usually think you should not take too much risk by designing a fail-safe system that might cause a delay in the innovation process. The current industry is very aware of that, which is why they have created regulations and standards around these, namely, ASIL and ISO 26262, which have to be adjusted to new requirements. But that’s the basis of the regulation and the standardization, which is very important in this context.”

The challenge is in blending these different approaches for a reasonable cost, and this is where the computer industry has been very effective. The basic concept behind RAID (redundant array of inexpensive, later independent, disks) was to use multiple inexpensive disk drives to achieve the same kind of reliability as more expensive disk drives, particularly those sold by IBM for its mainframe computers.

“The whole idea was that you can use cheap stuff, and just by having enough of it, basically achieve high reliability,” said Dean Drako, CEO of Drako Motors and Eagle Eye Networks. “You had to have some complicated software to do it. That was one of the first places where the use of redundancy and cheap stuff to achieve reliability was done in the computer industry.”

Google was famous for doing the same thing for computers, he said. “Basically, they designed things such that they didn’t really care if one of the computers failed, and that approach was very different in many ways. They weren’t the only ones doing it at this time, but there was this change going on of, ‘Let’s not go and buy the $50,000 server and make sure it’s super reliable. Let’s go buy 50 $1,000 servers, and if 5 of them fail, so what? The other 45 are still going and everything’s fine, because the software has reconfigured everything, so it’s all good because we have redundancy built into the system.’”
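The arithmetic behind the 50-server example can be sketched with a binomial model. The 5% per-server failure probability and the assumption that servers fail independently are illustrative assumptions, not figures from Drako; correlated failures (shared power, shared network) would make the real picture worse.

```python
from math import comb


def prob_at_least(n, k, p_fail):
    """Probability that at least k of n independent servers stay up."""
    p_up = 1.0 - p_fail
    return sum(
        comb(n, up) * p_up**up * p_fail ** (n - up)
        for up in range(k, n + 1)
    )


# Assumed: each $1,000 server has a 5% chance of failing in the period.
cheap_fleet = prob_at_least(n=50, k=45, p_fail=0.05)
print(f"P(at least 45 of 50 cheap servers up) = {cheap_fleet:.4f}")
```

The point of the sketch is that the fleet's availability depends on how many failures the software can absorb, not on any single machine being reliable.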

Redundancy is still used today by most cloud services, including Amazon S3, Amazon EC2, and Eagle Eye Networks. The working assumption is that any computer could fail at any moment.

“We’ve got to have redundancy built into the system so that everything keeps running, even if one of these cheap computers dies. The database is redundant, the servers are redundant, the file stores are redundant,” Drako said. “That’s gone a step further in the design of things, such that if you take a system like an Eagle Eye Networks system, there are likely 300 to 3,000 functions that are implemented on different computers, and they’re all talking to each other. Because the network, computers and storage are not reliable, the software has started to be developed basically through trial and error. And because you have no choice but to assume that it’s going to get bad stuff from other players, it’s not going to work properly, the network’s going to go down, everything is going to break, but I’ve got to keep doing my job, companies like Waymo are translating this way of thinking into the automobile self-driving world. They are trying to take that into the automotive industry and do everything redundantly, so if this camera fails, we’re still okay because we’ve got enough other cameras.”

For example, suppose a self-driving car has four cameras mounted on the front and can still operate with only two of them. If one camera fails, the system isn’t as good, but it’s still safe. But if two more fail, leaving only one, the car has to shut things off and fail in a graceful manner.
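The four-camera example maps onto a small degradation table. The sketch below is a hypothetical model of that logic only; the mode names and the function `camera_mode` are assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded but safe"
    SAFE_STOP = "graceful shutdown"


TOTAL_CAMERAS = 4  # front cameras in the example
MIN_CAMERAS = 2    # fewest cameras the car can still operate with


def camera_mode(working):
    """Map the number of working front cameras to an operating mode."""
    if not 0 <= working <= TOTAL_CAMERAS:
        raise ValueError(f"working must be between 0 and {TOTAL_CAMERAS}")
    if working == TOTAL_CAMERAS:
        return Mode.NORMAL
    if working >= MIN_CAMERAS:
        return Mode.DEGRADED
    return Mode.SAFE_STOP


for n in range(TOTAL_CAMERAS, -1, -1):
    print(n, "working ->", camera_mode(n).value)
```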

The concept of graceful failure assumes everything is going to break at some point, but the vehicle still has to function safely. The aerospace industry adopted this concept a couple of decades ago when it shifted from a 0.000001% failure rate to 0.01%, with the understanding that it could design 10 rockets or rovers for the same amount of money. And while one or two would fail, it would get eight successes instead of one.

“If you want to design something that will never fail, it gets very expensive,” said Drako. “If we get enough redundancy in the system, we can use cheap components. And if they fail, it doesn’t matter.”

Serial development challenges
This is only part of the mindshift required for autonomous and assisted driving. The entire development cycle for carmakers needs to change.

“At a high level, the productivity of the traditional serial development of just going via a written type of requirement is on the decline because you cannot simply capture all requirements in a document,” said Marc Serughetti, senior director of business development and product marketing for automotive and virtual solutions at Synopsys. “Automotive OEMs have to think more about the expertise they need to have in terms of electronics itself, and the complexity of those systems. The semiconductor company on the other side, especially in the context of functional safety, cannot be thinking in isolation. You have to think about how something is applied. There’s a very strong evolution right now to rethink how those collaborations are being done between semiconductor companies and the OEMs, and obviously, the role of the Tier 1 in all this. There are a lot of changes happening here that are impacting how engineers work, but more importantly, how they have to collaborate across companies, as well.”

Some OEMs have caught on, Huhnke noted. “They understand that the Shift Left into the competence of hardware and software is required for an automotive OEM and the high complexity of those systems, which needs to be designed, starting in a comprehensive design way from the SoC up to the system up to the vehicle. This requires a direct conversation with the EDA industry. Due to the fact that cost pressure leads to this extreme need for the highest level of integration, you have to talk as early as possible with EDA companies, providing automotive building blocks that are functional safety-ready and also ready for the fail-safe mode.”

The need to partner closely in the automotive ecosystem includes every part of a vehicle, even battery and charging options, noted Puneet Sinha, director of new mobility solutions for the Mechanical Analysis Division of Mentor, a Siemens Business. “In the past, the OEMs didn’t have to worry about what kind of gas station was available, but now it matters when it comes to the charging infrastructure, because at the end of the day, customer experience will come into play. For instance, one could argue that Tesla’s advantage is not just the vehicle. Many companies are making or can make EVs. But what is the charging infrastructure? That is Tesla’s biggest advantage over a lot of other OEMs, and this is what the new entrants on the startup side of the players — or even the long established OEMs who are entering when trying to establish themselves in the EV industry — have to grapple with.”

Stuff happens
Still, unexpected corner cases always crop up.

“No matter how much testing, no matter how much validation you do, there’s still likely to be a scenario you haven’t uncovered yet,” said Bob Siller, director, product marketing at Achronix. “Waymo always talked about how many millions of miles they had of driving experience. At a certain point it becomes infeasible to get every single scenario. A Tesla crashed into a semi truck that was white, and against the background the car couldn’t discern it. How are you going to know every single color of every single vehicle out there and the specific optics where the sun will be setting?”

This is why hardware programmability is gaining traction in the automotive world. “Redundancy can be designed to know when a failure is a true failure versus a one-time thing,” Siller said. “With embedded FPGA, you could design dual paths of logic that would be validated against one another, so you could basically have a custom algorithm that could validate a data stream and then have a parallel path that takes that same data and validates it, as well. You could either do this in the same device, or physically have two separate devices with the same algorithm running across it, and that would just be running in the embedded FPGA. You could customize it according to your needs.”
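The dual-path idea, including telling a one-time glitch from a true failure, can be modeled in software as follows. This is a behavioral sketch only: in the scenario Siller describes the two paths would be logic in an embedded FPGA, and the class name, the threshold of three consecutive mismatches, and the toy checksum are all assumptions.

```python
class DualPathChecker:
    """Run the same data through two paths and compare the results."""

    def __init__(self, path_a, path_b, fail_threshold=3):
        self.path_a = path_a
        self.path_b = path_b
        self.fail_threshold = fail_threshold  # assumed tuning parameter
        self.consecutive_mismatches = 0

    def check(self, sample):
        """Return 'ok', 'transient', or 'fault' for one data sample."""
        if self.path_a(sample) == self.path_b(sample):
            self.consecutive_mismatches = 0  # agreement clears history
            return "ok"
        self.consecutive_mismatches += 1
        if self.consecutive_mismatches >= self.fail_threshold:
            return "fault"  # persistent disagreement: treat as real
        return "transient"  # isolated disagreement: keep watching


def checksum(data):
    """Toy validation algorithm standing in for the custom logic."""
    return sum(data) % 256


# One healthy path and one path stuck at zero (simulated hard failure).
checker = DualPathChecker(checksum, lambda data: 0, fail_threshold=3)
print([checker.check(b"\x01\x02") for _ in range(3)])
```

Requiring several consecutive mismatches trades detection latency for fewer false alarms; a safety analysis, not this sketch, would set the actual threshold.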

Another way to deal with unexpected scenarios is over-the-air updates, which Tesla employs. “But if you recognize a failure of one vehicle, how do you deploy that across your entire fleet?” Siller asked. “And for the most part, it’s been done in a software sense, right? Tesla will introduce a new CPU algorithm or firmware update. But with embedded FPGA, you can change the hardware configuration, and that is a different paradigm. You could change the data path, how data is being processed based upon algorithmic learnings, and reconfigure the FPGA and do that without having to recall the device. Obviously, that’s a huge cost savings for automotive manufacturers when they don’t have to take a recall, and they could introduce an over-the-air update.”

Other new approaches to monitor the system for functional safety, or redundancy, could include on-chip sensors or monitors to provide yet another level of observation of the system. The goal is to be able to detect changes in a system if they are moving towards failure, and to ensure the sensors are working properly to provide that kind of data.

“The sensors themselves do some self-checking to make sure that they’re giving robust readings, so you don’t miss a failure of a system as it approaches,” said Stephen Crosher, CEO of Moortec. “One of the interesting things is that there is so much consumable technology that’s now being brought into the vehicle. It’s been happening for years with entertainment, and now we’re seeing it with AI being pulled in. That’s upping the standard and the levels that we are needing to design for, not only at the system level, but also at the IP block level. Essentially, we have to design for both data center and automotive, and maybe mobile phone and automotive, because the lines are being blurred a lot more.”

Bigger chips
With more consolidation of functions within the ECUs in vehicles, the chips are getting bigger. In fact, they’re much larger and more sophisticated than any chip in a cell phone, and have many more brains on them, noted Kurt Shuler, vice president of marketing at Arteris IP. “They’re more like something you would find in a data center, but it’s in your car. It’s got to sip power from a battery and it can’t have too much heat, so they’ve got all these different challenges. Then, if you look at the design teams that do this stuff, as design approaches change to anticipate failures, this is the reason why the traditional semiconductor companies are having trouble adapting, companies that have been incumbents and have done automotive chips for years.”

Most of these companies have done small, 100,000-gate designs in the past.

“It’s small enough where you can do all the analysis, all the functional safety documentation down to the bit level,” said Shuler. “You really can make sure that this thing is safe. The problem is that approach is useless for dealing with the 2 billion-gate chip that is looking at everything coming in from these cheaper cameras, and cheaper LiDARs, and cheaper radars that have to take this imperfect information and turn it into some vision of reality that actually makes sense.”

This is one of the reasons companies are looking at building platforms. There could be one chassis and multiple chips, each with a specialized function.

“You want to try to re-use as much of the architecture, because from a functional safety standpoint there’s the analysis of all of that and you get to re-use that,” Shuler said. “When you start looking at SOTIF (Safety of the Intended Functionality) and all the system-level AI things, that makes it easier to characterize. You’re removing some of the variability, so you can focus in on a known good chassis, a known good architecture for these things. Communications, memory, some of the CPU clusters and generic processing: all of that becomes one common chassis, or one common architecture that the other design-specific stuff plugs into. That’s one way of being able to do this better, because you’re re-using that. You can’t look at this whole 2 billion-gate chip, and look at every bit in it, every transaction, like you could with the MCUs of old.”

Put simply, a bottom-up approach doesn’t work well by itself, and neither does a top-down approach. “The approach that has come out of this is more of a shared platform, shared chassis type method where you get reuse of not only the hardware capabilities, but you get reuse of the functional safety mechanisms and the analysis of those functional safety mechanisms,” he said. “That’s been a big change for everybody, and it’s very different. The guys who have been in automotive for 30 years, they’re having to change because, for example, within the interconnect we do fault injection of our interconnect. That is an ingredient in the analysis of the overall SoC. Our customers configure our interconnect. We sometimes have no idea what that configuration is, or is going to be, but our analysis is something that they use internally in that analysis. If you were to take one of these big chips and do fault injection at the netlist level, at the gate level, it becomes a process that takes many millions of dollars of fault injection simulator licenses running in parallel, and takes many pots of coffee to the point of doing a fault injection campaign for years.”

The ISO 26262 spec has been adapted to accommodate this: fault injection can be done at a higher level than post-synthesis, and can be run at the RTL functional level. “Still, getting some of the automotive guys to accept that this is acceptable is a challenge, but it’s progressing,” he added.

Looking ahead
The next 10 years are sure to be a thrill ride as the automotive industry changes and winners emerge, Mentor’s Sinha said. “When we look at the kind of activities happening in the world, it is clear that the companies that can understand, master and give the right importance to the connectivity among different disciplines (electrical, electronics and mechanical) together from the beginning, not continuing with the siloed way of doing things that the automotive industry has largely done in the last hundred years, those are the companies that have a better chance to win, whether it’s a human-driven electric vehicle, or a Level 4 or Level 5 electric vehicle. Connectivity among these different domains in terms of how we’re designing these vehicles, how you are producing these vehicles and how you are going to consume the experience of these vehicles, all of this has to be part of the discussion in a connected fashion, not in a siloed way.”
