Designing Resilient Electronics

Eliminating downtime in safety- and mission-critical applications.


Electronic systems in automobiles, airplanes and other industrial applications are becoming increasingly sophisticated and complex, required to perform an expanding list of functions while also becoming smaller and lighter. As a result, pressure is growing to design extremely high-performance chips with lower energy consumption and less sensitivity to harsh environmental conditions.

If this sounds difficult, it gets even harder from here. In the past, many of these systems relied on chips developed at older process nodes or, in the case of cars and airplanes, mechanical systems. But as more data is generated and processed under a wider range of operating conditions, particularly for mission-critical and safety-critical applications, the entire semiconductor ecosystem is being driven to develop IC designs that are more resilient to everything from extreme heat and cold to longer lifecycles and higher utilization rates within those lifetimes.

“Semiconductors for automotive electronics must meet even stricter requirements including extended temperature range, device robustness and functional safety certifications (ISO 26262),” said Frank Ferro, senior director of product marketing for IP cores at Rambus. “From an IP design perspective, reliability is built in from the start using device models for extended temperature and aging. If needed, in applications like automotive, potential points of failure are identified and remediated with circuits that are fault tolerant, and in some cases redundant functions.”

The idea of resiliency has been around for some time, particularly in error-correcting memory. But developing chips that can gracefully fail over to other chips requires an entire ecosystem, from the semiconductor foundry (qualified process nodes), device packaging and underlying semiconductor IP to the certification bodies that ensure compliance.

System complexities are increasing across all verticals including industrial automation, automotive and aviation, said Neil Stroud, senior director of technology strategy for Arm’s Automotive and IoT Line of Business. “Historically, many of the required elements have been ‘single function’ which means they consume more physical space and weight, as well as consume more power. To help manage physical space, weight, and power consumption, we are beginning to see a trend where these functions are consolidated.”

For example, in the automotive segment, ECUs are being consolidated into domain controllers. Avionics is making the transition from single-core to multicore SoCs. And manufacturing is combining multiple automation functions such as programmable logic controllers, human-machine interfaces and safety functions into a single box.

“This naturally drives a need for increased compute coupled with high safety integrity levels, while resulting in smaller footprint developments and requiring lower comparative power consumption and thermal designs,” said Stroud. “Designing a lock-step feature in application CPUs is a great example of a solution to this challenge, and Arm is continuing to work closely with the ecosystem to solve these challenges. Harsh environmental conditions add an extra vector to the design, and these can be augmented by additional measures added in at the silicon development stage.”

Traditionally, reliability and risk mitigation meant adding guardbands to ensure requirements were met. But as an increasing number of these chips move to advanced geometries, this is no longer a viable solution.

“You don’t want to pay out tens of millions of dollars to develop a chip on one of the latest semiconductor nodes and then lose a significant chunk of the performance and power advantages through traditional guardbands,” said Richard McPartland, technical marketing manager at Moortec. “Operating in harsh environments with long lifetimes adds further dimensions to this problem.”

McPartland noted that one of the key design techniques used to address these issues is to embed a fabric of in-chip monitors to give visibility into on-chip conditions. “This is an essential step, and it enables optimization of power, performance and/or reliability at bring-up of new silicon and later in mission mode. Gone are the days of including a single temperature sensor and assuming everything will be fine. The latest finFET designs typically include tens of temperature and voltage sensors plus process speed detectors, monitoring conditions at critical circuits across the die. Of course, designers simulate and sign off performance, but with so much circuitry being software-driven, worst-case conditions can be difficult to predict and significantly different in reality to those simulated. Embedding a fabric of in-chip monitors is increasingly seen as standard design practice on advanced semiconductor nodes, especially finFET nodes, and should be considered early in the design flow. It’s an essential step for minimizing guard bands and optimization.”
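
As a rough illustration of why a fabric of monitors gives more visibility than a single sensor, the sketch below aggregates readings from several monitor sites and reports the worst-case temperature and supply voltage across the die. The site names, the limits and the `MonitorReading` type are all assumptions made for this example; they do not describe any vendor's monitoring IP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorReading:
    """One sample from a hypothetical in-chip monitor site."""
    site: str      # illustrative location name on the die
    temp_c: float  # junction temperature in degrees Celsius
    vdd_v: float   # local supply voltage in volts


def worst_case(readings):
    """Return (hottest reading, lowest-voltage reading) across all sites."""
    if not readings:
        raise ValueError("at least one reading is required")
    hottest = max(readings, key=lambda r: r.temp_c)
    lowest_v = min(readings, key=lambda r: r.vdd_v)
    return hottest, lowest_v


def sites_out_of_limits(readings, max_temp_c=125.0, min_vdd_v=0.72):
    """List sites that violate the assumed (illustrative) limits."""
    return sorted(
        r.site for r in readings
        if r.temp_c > max_temp_c or r.vdd_v < min_vdd_v
    )


# Made-up sample data for four monitor sites.
readings = [
    MonitorReading("cpu_cluster", 118.0, 0.74),
    MonitorReading("gpu", 127.5, 0.73),
    MonitorReading("ddr_phy", 96.0, 0.70),
    MonitorReading("serdes", 101.0, 0.76),
]
hottest, lowest_v = worst_case(readings)
flagged = sites_out_of_limits(readings)
```

With this sample data, a lone sensor placed at the `serdes` site would read 101 °C and 0.76 V and miss both the hot spot and the voltage droop that the other sites reveal.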

Industrial concerns
Resiliency includes far more than the functioning of a particular gate or IP block, however. A functioning chip is of little value if the data flow is interrupted anywhere in a system.

“While communication techniques have greatly evolved and improved in the last few decades, the focus has generally been on transferring increasing amounts of data over either designated wires, such as Ethernet or fiber optics, or wirelessly such as WiFi or LTE,” explained Zeev Collin, vice president of communications products at Adesto Technologies. “In many cases, industrial communication must be carried out in ‘unfriendly’ environments such as underground, or in shielded structures. There may not be any control over what other equipment is connected to the same medium, as in the case of power lines, a popular medium for industrial connectivity.”

While industrial communication speed requirements are in general less demanding than in the consumer world, the typical environment is less than ideal for reliable communications. Moreover, existing systems were not designed with communications needs in mind. There is often no dedicated high-performance wiring available.

Further, the considerations differ from those for consumer devices because of the harsh environments and the nature of the communications patterns, such as a steady flow of traffic versus bursts of information, a variety of protocols being used, and more diverse sources of interference. And there is a need to manage tonal interference and impulse noise by adding notching and other filtering and redundancy techniques.

“Improvements in performance can be realized by adjusting the modem techniques to handle the noise typical to the medium,” Collin said. “Unlike many wireless communications implementations that are defined by their sensitivity, powerline is more prone to sources of tonal noise, such as switching power supplies connected to the medium, as well as impulse noise introduced by turning on and off industrial equipment. XXR (Adesto’s solution for noisy environments) effectively deals with such noise profiles by placing up to 4 independent channels (carriers) anywhere on the operating spectrum. Placement of the channels can be individually selected based on noise and impedance of the environment, and the channels can be used for redundancy by sending the same data and combining it on the receiving end. Correct data reception only requires one good channel at a time to maintain communications. By combining redundancy, frequency agility and error correction coding, we can avoid both known and variable frequency-dependent disturbers in the powerline to offer one of the most robust communication protocols available, achieving extended ranges up to several kilometers.”
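
The point that reception needs only one good channel can be sketched in a few lines. The example below is not Adesto's method: it swaps in a simpler scheme in which the same CRC-protected frame is sent on every channel and the receiver keeps the first copy whose checksum verifies, rather than combining copies. The frame layout, the function names and the single-bit noise model are assumptions made for illustration.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    if len(framed) < 4:
        return None
    payload, crc = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == crc:
        return payload
    return None


def receive(copies):
    """Return the payload from the first copy that passes its CRC check."""
    for framed in copies:
        payload = check(framed)
        if payload is not None:
            return payload
    return None


def corrupt(framed: bytes, index: int) -> bytes:
    """Flip one bit to imitate noise on a channel."""
    damaged = bytearray(framed)
    damaged[index] ^= 0x01
    return bytes(damaged)


sent = frame(b"valve_17:open")
# Four channels carry the same frame; three are hit by simulated noise.
channels = [corrupt(sent, 0), corrupt(sent, 5), sent, corrupt(sent, 9)]
result = receive(channels)
# If every channel is corrupted, nothing is delivered.
all_bad = receive([corrupt(sent, i) for i in range(4)])
```

Here `result` is the original payload even though three of the four copies were damaged, while `all_bad` is `None`. A real modem would also apply error correction coding and choose carrier frequencies based on measured noise, which this sketch leaves out.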

Other areas such as industrial control also have very strict resiliency requirements. “Industrial control systems for nuclear power plants, for example, have to continue operating even if there is a failure in the cooling system,” said Ron DiGiuseppe, senior strategic marketing manager at Synopsys. “So they need to be fully resilient, such that if one part of the cooling system fails, there’s a backup system.”

Resiliency in automotive
With so much focus on electrification within the automotive ecosystem, a tremendous amount of attention is being paid to resilience in vehicles and in the design infrastructure of automotive systems. Safety, reliability and quality are the primary goals of resilience here.

“Resilience applies to the capability of the system to continue operations in the face of some sort of disruption, and that’s a little different from the automotive goal of safe operation. That distinction must be made,” DiGiuseppe said. “This means full system resilience is full operational maintenance after some sort of a system disruption. That’s a separate goal from continued safe operation. Within the context of the automotive IP segment, there is a difference between that full system resilience and safe operation.”

For example, in automotive, a failure can be met with one of several responses: full continued operation, partial operation, or a transition to a safe state. Full operation after a failure is not the sole goal of automotive safety, he pointed out.

“Safety for automotive is more like minimizing risks due to hazards if there is some sort of malfunction. The official definition is the absence of unreasonable risk due to hazards, so that’s what we want to manage in safety — how the car or the system responds to hazards, and whether or not they cause unreasonable risk,” DiGiuseppe said. “And then the response to that could be continued operation, like a fully resilient system. It could be controlling that system to go into a safe state, and that’s a little different from full system resilience. Going into a safe state could be like an autonomous vehicle pulling over to the side of the road and turning itself off. That’s responding to the hazard by going into a safe state, and that’s a little different from system resilience, and full continued operation in case of that. So there’s a difference between resilience and managing safety. Our customers are 100% focused on the products that are developed for automotive, especially in the safety-critical systems. Some systems in the car are not safety-critical, but for those that are compliant to the functional safety standard ISO 26262, that’s a big focus, and it’s a requirement from our customers that they want products that we develop that go into the automotive supply chain to meet ISO 26262 standards.”
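
The three responses described here (continued operation, partial operation, or a safe state) can be pictured as a simple decision. The sketch below is a hypothetical policy written only to make the distinction concrete; the names and the two boolean inputs are assumptions, and a real system would derive its response from its safety requirements.

```python
from enum import Enum


class Response(Enum):
    FULL_OPERATION = "full operation"
    PARTIAL_OPERATION = "partial operation"
    SAFE_STATE = "safe state"


def choose_response(backup_available: bool, degraded_mode_available: bool) -> Response:
    """Pick a response to a detected failure (illustrative policy only)."""
    if backup_available:
        # A redundant unit takes over, so the function continues unchanged.
        return Response.FULL_OPERATION
    if degraded_mode_available:
        # Reduced functionality is kept running without unreasonable risk.
        return Response.PARTIAL_OPERATION
    # Nothing can safely continue, e.g. pull over and shut down.
    return Response.SAFE_STATE


resilient = choose_response(backup_available=True, degraded_mode_available=False)
safe_only = choose_response(backup_available=False, degraded_mode_available=False)
```

In this picture a fully resilient system always reaches the first branch, while a system that is safe but not resilient may end in the last one.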

The level of resiliency needed is another consideration.

“For automotive, it’s a two-step process,” he said. “One is the early step to set the safety goal, identify if the system is safety-critical and the amount of criticality. There’s an early safety review. That’s the required ASIL level. ASIL has two functions — to set the goal of the safety, and to measure how well you have accomplished that goal. In the early stage, it’s defining the goals, and then defining the safety requirements of that system, and then executing to the safety requirements, such as designing safety mechanisms. So the first step is not so much execution of the plan, but setting a safety plan, setting the safety goals, setting those safety requirements. That series of steps is more a systematic approach.”
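
The article does not describe how the required ASIL is assigned. In ISO 26262 hazard analysis, each hazardous event is rated for severity (S1 to S3), exposure (E1 to E4) and controllability (C1 to C3), and a widely used shortcut for the standard's lookup table is to add the three class numbers. The sketch below encodes that shortcut; the function name is invented here, and the standard's own table remains the authority.

```python
def asil(severity: int, exposure: int, controllability: int) -> str:
    """Return 'QM', 'A', 'B', 'C' or 'D' for classes S1-S3, E1-E4, C1-C3.

    Uses the sum shortcut: a total of 10 maps to D, 9 to C, 8 to B,
    7 to A, and anything lower to QM (quality management only).
    """
    if not (1 <= severity <= 3):
        raise ValueError("severity must be 1..3")
    if not (1 <= exposure <= 4):
        raise ValueError("exposure must be 1..4")
    if not (1 <= controllability <= 3):
        raise ValueError("controllability must be 1..3")
    total = severity + exposure + controllability
    return {7: "A", 8: "B", 9: "C", 10: "D"}.get(total, "QM")


# Most demanding combination: S3, E4, C3.
highest = asil(3, 4, 3)
# Same severity and exposure, but easily controllable: S3, E4, C1.
easier = asil(3, 4, 1)
```

The result sets the goal mentioned in the quote: a higher ASIL implies stricter safety requirements for the mechanisms designed later.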

Additionally, safety and resiliency have two aspects. One is the systematic aspect, the development flow that is followed, DiGiuseppe continued. “First you have to define a safety development flow. Then, you have to have a system in place to ensure that the design teams follow those safety flows. Those are typically a quality management system. It is a requirement to have these flows, and it’s a systematic requirement. Connected to this, there should be some sort of monitoring to make sure the flows are followed, typically covered by a quality management system.”

After that comes the execution of the safety plan, which is the actual design of those safety mechanisms that were set in the goals and requirements earlier on in those planning stages. Part of the execution should examine the safety level. If it’s a mid-level ASIL, it’s a matter of how much risk is in the system that could cause harm. This is the other aspect of safety, namely whether there is a risk that could cause harm to the public, the driver, or anyone associated with that system.

Multi-dimensional, multi-physics problems
Ensuring all of this works as planned is a multi-dimensional challenge. Today, with Level 3 automation, electronics account for 30% of the total cost of a car. That number could rise to 50% of the total cost with Level 5 automation, according to ANSYS. Self-driving and semi-autonomous cars will rely on an increasing number of electronic sensors such as radar, LiDAR, ultrasound, cameras and fusion sensors, which are expected to provide 360° surveillance and object identification and classification to prevent crashes and ensure operational reliability.

The vast amount of data gathered by the sensors must be processed in real-time, and decisions must be made dynamically. For example, the automotive safety system needs to distinguish if there is a raccoon dashing in front of the car on a rainy night or a bit of blowing tumbleweed. And it must make this determination in milliseconds.

Fig. 1: Autonomous car concept. Source: ANSYS

These complex systems already are installed in semi-autonomous vehicles, as well as early-stage fully autonomous vehicles. But designing and verifying these immensely complex systems is possibly even more complex than the systems themselves, because the work must cover what the systems are supposed to do while also accounting for electromigration, electrostatic discharge, thermal reliability, statistical EM budgeting, electrical overstress, aging and functional safety.

With the transformation of automotive electronics systems from a chip, package and circuit board perspective, along with the increasing sophistication of avionics, industrial automation and networking applications, resilient design is only becoming more challenging. The path forward must include an understanding of the design challenges, plans and systems for implementing resiliency, and novel chip-level approaches. Add to that list new ways to leverage IP for safety and security, and the right tools to cover all scenarios.

How this looks even five years in the future remains to be seen. As systems evolve, so do the requirements for how to keep a system from breaking down and causing other problems. Even the concept of resiliency may change across various applications as those systems evolve and become increasingly autonomous. But it’s clear that all of this will become much more challenging for the entire supply chain, and the tasks that need to be solved will become significantly harder.
