Safety Plus Security: A New Challenge

First in a series: There is a price to pay for adding safety and security into a product, but how do you assess that and control it? The implications are far reaching, and not all techniques provide the same returns.


Nobody has ever integrated safety or security features into their design just because they felt like it. Usually, successive high-profile attacks are needed to even get an industry’s attention. And after that, it’s not always clear how to best implement solutions or what the tradeoffs are between cost, performance, and risk versus benefit.

Putting safety and security in the same basket is a new trend outside of mil/aero, and it adds both complexity and confusion into chip design. For one thing, these two areas are at different levels of maturity. For another, many companies believe they only have to deal with one or the other. But as the automotive industry has learned, security impacts safety and the two are tightly bound together. Interestingly, the German language uses the same word for both – Sicherheit.

Incorporating safety and security into a product is not about a tool or an added piece of IP. It requires a change in workflows, which makes it a more difficult transition than just a new spec item or workflow step caused by the latest technology node. It also requires a mindset change in how to approach the design in the first place, because safety and security both need to be built into designs from the very outset.

“It is important for people to reorient their priorities and understand the tradeoffs,” states Rob Knoth, product management director for the DSG group at Cadence. “Even design schedule has to be considered. There may be a time-to-market window, and factoring in safety and security on top of an already challenging design schedule is difficult.”

Ignoring safety and security is no longer an option for an increasing number of products. “As chips become more capable and are being used in more fully autonomous systems, it is increasingly important, and in many cases critical, that they correctly and rapidly analyze and react to the environment,” points out Tim Mace, senior manager of business development in the MIPS Business Unit of Imagination Technologies. “Failure to do this can arise from errors in the design, either at design time or during operation, or from deliberate malicious intent. Consequently it is becoming much more important and, increasingly, a requirement that they address safety and security concerns. The requirements, risks and countermeasures for those systems and those chips must be well understood and qualified.”

There are three classes of faults that have to be considered when looking at safety and security: random, systematic and malicious. Each of these requires a different approach and different kinds of analysis, and each will have a different impact on the product schedule and cost.

Random failures
Perhaps the easiest category to analyze is random failures, and there is an array of tools and techniques that can be used to guard against these in hardware. The approach is similar to the techniques used for semiconductor test before scan test techniques were adopted. A fault model is created that contains at least each node stuck at zero and stuck at one, and a fault simulation is used to find out if the functionality changes. For a safety-critical device you want every fault to either be detected or to have no impact on operation. “This provides protection from things such as an electromigration (EM) failure or a transient failure throughout the lifetime of the product,” explains Knoth.
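
The stuck-at approach described above can be illustrated with a minimal sketch. The three-input circuit, its node names and the helper functions are all hypothetical, chosen only to show the mechanics: force each node to zero or one, rerun the test vectors, and see whether the output ever differs from the fault-free result.

```python
from itertools import product

# Hypothetical 3-input circuit: out = (a AND b) OR (NOT c).
# Every node is named so that a stuck-at fault can be forced onto it.
NODES = ["a", "b", "c", "n_and", "n_not", "out"]

def simulate(a, b, c, fault=None):
    """Evaluate the circuit; fault is (node_name, stuck_value) or None."""
    def drive(name, value):
        # A stuck-at fault overrides whatever the logic would drive.
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    a = drive("a", a)
    b = drive("b", b)
    c = drive("c", c)
    n_and = drive("n_and", a & b)
    n_not = drive("n_not", 1 - c)
    return drive("out", n_and | n_not)

def fault_simulate(vectors):
    """Split all stuck-at-0/1 faults into (detected, undetected) lists."""
    faults = [(node, value) for node in NODES for value in (0, 1)]
    detected, undetected = [], []
    for fault in faults:
        # A fault is detected if any vector makes the output differ
        # from the fault-free circuit.
        if any(simulate(*vec) != simulate(*vec, fault=fault) for vec in vectors):
            detected.append(fault)
        else:
            undetected.append(fault)
    return detected, undetected

all_vectors = list(product((0, 1), repeat=3))
detected, undetected = fault_simulate(all_vectors)
print(f"faults detected: {len(detected)} of {len(detected) + len(undetected)}")
```

Running the same function with a smaller vector set, such as `fault_simulate([(1, 1, 1)])`, leaves most faults undetected, which is the gap a safety analysis has to close or justify. Production fault simulators work on real netlists and far richer fault models; this only shows the principle.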

[Figure not reproduced. Source: ArterisIP]

The question is how to accomplish this. “To create an extremely safe system, one could simply duplicate or triplicate all the chips in the system, and even all the IP systems on each chip (we’ve seen this done in practice!),” explains Kurt Shuler, vice president of marketing for Arteris. “Creating redundant systems blindly like this can guarantee ASIL D qualification, but at an extremely high price in performance loss, die area and software complexity.”

Better ways are possible and this is how companies can compete with each other. “Not all added redundancy simply duplicates the design,” points out Mace. “For communication and storage, the redundancy can be of the form of parity or ECC protection while redundancy at the functional level can be supported by using the existing hardware again to check the operation – this adds some power but need not increase area.”
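
As an illustration of the ECC-style redundancy Mace mentions, here is a minimal sketch of a Hamming(7,4) code, which adds three check bits to four data bits and can correct any single flipped bit. The function names are hypothetical, and real memory ECC typically uses wider codes with additional double-error detection.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Bit positions are numbered 1..7; positions 1, 2 and 4 hold parity bits.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Return (data_bits, error_position); error_position is 0 if none was seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[5] ^= 1                        # model a single-bit upset in storage
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

The cost here is three extra bits per four data bits rather than a full second copy, which is the kind of tradeoff between protection and area the quote points to.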

There are places where redundancy is necessary. “Redundancy is done so that you can have one check the other, and you do this when the state complexity to predict behavior is as complicated as the task itself,” explains Drew Wingard, CTO at Sonics. “You can duplicate certain sections of control logic instead of doing the whole block. The techniques used for the network on chip (NoC) are not fundamentally different than for other kinds of logic, except that you have to recognize that complex logical operations in a NoC tend to be around the control and not around the data.”

The challenge is knowing which parts of the hardware to concentrate on. “This could include redundant register files, added ECC protection in memory, redundant CPU core so that you can go lock-step,” says Ashish Darbari, director of product management for OneSpin Solutions. “While you may add a lock-step CPU, you may not need to duplicate all of the register files. There is a lot of design architecture knowledge that is applied. The challenge is measuring if it does the job.”
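
The lock-step idea can be sketched in a few lines. The toy accumulator "core" and its `stuck_bit` fault parameter are hypothetical stand-ins for a real CPU. The point is only that two copies receive the same inputs and a checker compares their outputs every cycle.

```python
class Core:
    """Toy 8-bit accumulator core; stuck_bit models a hypothetical hardware fault."""

    def __init__(self, stuck_bit=None):
        self.acc = 0
        self.stuck_bit = stuck_bit

    def step(self, operand):
        self.acc = (self.acc + operand) & 0xFF
        if self.stuck_bit is not None:
            self.acc |= 1 << self.stuck_bit   # this accumulator bit is stuck at one
        return self.acc

def run_lockstep(main, checker, operands):
    """Return the first cycle at which the two outputs diverge, or None."""
    for cycle, operand in enumerate(operands):
        if main.step(operand) != checker.step(operand):
            return cycle
    return None

print(run_lockstep(Core(), Core(), [1, 1, 1]))             # healthy pair
print(run_lockstep(Core(), Core(stuck_bit=0), [1, 1, 1]))  # faulty checker core
```

Note that the comparison only sees outputs. With operands `[1, 2, 4]` the same stuck bit never changes the result, so the checker reports nothing. That is a small instance of Darbari's point that the hard part is measuring whether the added hardware does the job.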

And there are hidden dangers. “It doesn’t take a genius to figure out that you need to be astute about what and how you use duplication,” adds Knoth. “You can quickly double the silicon area. That does not directly make for a safer product. It makes a more complex, more error-prone product. It is a delicate science.”

Shuler summarizes the techniques most commonly used. “Add redundant hardware blocks only where this has the greatest effect on functional safety diagnostic coverage, and only for IP that does not have other sufficient protection. To define safety goals and where to implement functional safety features requires thorough analysis of potential system failure modes, and then quantitatively proving whether safety mechanisms cover these faults.”

The right balance point also depends on the intended market. “For IoT and small devices, we can’t just build in lots of redundancy,” says OneSpin’s Darbari. “Half of the time you need these devices to be really small and have good power characteristics.”

In avionics, safety is assured by duplication that uses diverse designs and architectures. This adds considerable design and verification expense and cannot be justified for most markets.

“Safety and security protection comes at a cost,” Shuler says. “If a system can’t meet near real-time latency requirements and huge processing throughput requirements, then people could be injured or killed. But if the system is too expensive to be developed and fielded economically, then perhaps the whole industry loses an opportunity to save human lives.”

One implication of adding protection for random faults is that many aspects of the original semiconductor test now have to be incorporated into the product. “You now see test included in the functionality of the product,” says Knoth. “It is not just manufacturing test anymore, it is in-field test, it is power-on test. Teams need to do RTL insertion of LBIST and MBIST and all of the test capability such as clock control. If you do not do this at RTL, your verification will be missing a key piece of the system.”
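
To give a sense of what a memory self-test does at power-on, here is a minimal software sketch of the March C- algorithm, a widely used memory test sequence. The article does not say which algorithm any particular MBIST engine uses. The one-bit-wide memory model and its `stuck` fault map are hypothetical.

```python
class FaultyMemory:
    """One-bit-wide memory model; stuck maps address -> stuck value (hypothetical)."""

    def __init__(self, size, stuck=None):
        self.cells = [0] * size
        self.stuck = stuck or {}

    def write(self, addr, value):
        self.cells[addr] = self.stuck.get(addr, value)

    def read(self, addr):
        return self.stuck.get(addr, self.cells[addr])

def march_c_minus(mem, size):
    """Run March C- over the memory; return sorted addresses that failed a read."""
    failures = set()
    up = range(size)
    down = range(size - 1, -1, -1)

    def element(order, expect, write):
        # One march element: optionally read-and-check, then optionally write.
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                failures.add(addr)
            if write is not None:
                mem.write(addr, write)

    element(up, None, 0)    # write 0 everywhere
    element(up, 0, 1)       # ascending: read 0, write 1
    element(up, 1, 0)       # ascending: read 1, write 0
    element(down, 0, 1)     # descending: read 0, write 1
    element(down, 1, 0)     # descending: read 1, write 0
    element(up, 0, None)    # final read 0
    return sorted(failures)

print(march_c_minus(FaultyMemory(8), 8))
print(march_c_minus(FaultyMemory(8, stuck={3: 1, 5: 0}), 8))
```

In silicon this sequence is driven by a small hardware controller rather than software, which is why Knoth argues it has to be present in the RTL that gets verified.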

In addition to fault simulation, formal verification is starting to play a larger role. “You can use a fault simulator or you can use formal, and then you can make a precise estimate about which faults will be observable and those that aren’t,” says Darbari. “The aim is to narrow that analysis to be as precise as you can so that you are left with no unknowns.”

Not all faults are visible during design at the RT level, and fault simulation is not capable of evaluating all of the faults at the system level. This means that different techniques apply to the block and IP levels compared to both the system level and the implementation level.

There are top-down methods that assess the likelihood of system failure. “Failure Modes, Effects and Diagnostic Analysis (FMEDA) is a rigorous formal process to go through the system, understand the reliability of each component, understand what safety mechanisms you are going to insert, how much coverage do you get out of it, and what is the overall confidence that you will be able to catch a failure,” explains Knoth. “It is an effective mechanism that starts at the system architecture and works its way down to place and route, where you are worried about things like redundant vias. It is a process that helps to guide all aspects of safety.”
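
The bookkeeping behind such an analysis can be sketched as a small table calculation. The component names, failure rates (in FIT, failures per billion hours) and diagnostic coverage values below are entirely hypothetical, and the metric is a simplified form in the spirit of the single-point fault metric used in ISO 26262, not a complete FMEDA.

```python
# Hypothetical FMEDA rows: (component, failure rate in FIT, diagnostic coverage).
rows = [
    ("cpu_lockstep", 100.0, 0.99),
    ("sram_ecc", 200.0, 0.99),
    ("noc_control", 50.0, 0.90),
    ("glue_logic", 10.0, 0.00),   # no safety mechanism assumed
]

def residual_fit(rows):
    """Failure rate that escapes the safety mechanisms: sum of fit * (1 - dc)."""
    return sum(fit * (1.0 - dc) for _, fit, dc in rows)

def coverage_metric(rows):
    """Fraction of the total failure rate that is covered (simplified metric)."""
    total = sum(fit for _, fit, _ in rows)
    return 1.0 - residual_fit(rows) / total

for name, fit, dc in rows:
    print(f"{name:14s} residual = {fit * (1.0 - dc):6.2f} FIT")
print(f"overall coverage = {coverage_metric(rows):.1%}")
```

With these made-up numbers, the unprotected glue logic contributes more residual failure rate than the much larger protected blocks, which shows how the table steers where the next safety mechanism should go.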

Systemic failures
Finding systemic failures is the cornerstone challenge of modern verification, and is simply stated as how to determine that the design does everything in the specification and meets all of the requirements. While the industry has a plethora of tools to address this challenge, it remains one of the toughest problems the industry faces.

“How do we know we have done enough?” asks Darbari. “Coverage can direct you towards gaps, but you also need to consider completeness. Have all of the requirements been verified properly? Were there any over-constraints in the testbench, were there hidden bugs in the implementation?” These are the questions that keep verification engineers up at night.

This subject continues to be one of the most heavily invested areas of EDA. New tools and techniques are continually being developed that add capacity, performance and capabilities. Formal verification is one area that has seen a lot of advances in recent years.

Verification is a single, simple concept. It is the comparison of two models, each developed independently, such that the probability of a common design error in both models is small. Those two models are most commonly defined as being the design and the testbench. Tools then help to perform pieces of that comparison because we lack the ability to formally prove that the functionality of both is identical, or that one is a subset of behaviors (design) contained in the other (testbench).

The industry is getting closer to that capability. “Sequential equivalence checking allows you to take two copies of the design and prove that they are functionally equivalent,” explains Darbari. “You may have one model in C and another in RTL. The basics haven’t changed much. The methods to solve the problem haven’t changed. But we can do this earlier in the design flow, and we now have technologies, such as formal, that are a tremendous help.”
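
The idea of comparing two independently written models can be shown with a toy example. Both models below are hypothetical: a reference counter written arithmetically and an "implementation" written as two flip-flop bits. The checker walks every reachable pair of states and reports an input sequence that makes the outputs differ. Commercial formal tools do not enumerate states explicitly like this, so treat it as a sketch of the goal rather than the method.

```python
from collections import deque

# Each model is (initial_state, step), where step(state, inp) -> (next_state, output).

def ref_step(count, enable):
    """Reference model: modulo-4 counter; output is high on wrap-around."""
    if not enable:
        return count, 0
    nxt = (count + 1) % 4
    return nxt, int(nxt == 0)

def impl_step(state, enable):
    """Implementation model: the same counter as two flip-flop bits (b1, b0)."""
    b1, b0 = state
    if not enable:
        return (b1, b0), 0
    return (b1 ^ b0, b0 ^ 1), int(b1 & b0)

def equivalent(model_a, model_b, inputs):
    """Explore the product machine; return a distinguishing input trace or None."""
    (init_a, step_a), (init_b, step_b) = model_a, model_b
    seen = {(init_a, init_b)}
    queue = deque([(init_a, init_b, [])])
    while queue:
        state_a, state_b, trace = queue.popleft()
        for inp in inputs:
            next_a, out_a = step_a(state_a, inp)
            next_b, out_b = step_b(state_b, inp)
            if out_a != out_b:
                return trace + [inp]          # counterexample found
            if (next_a, next_b) not in seen:
                seen.add((next_a, next_b))
                queue.append((next_a, next_b, trace + [inp]))
    return None                               # no reachable mismatch

result = equivalent((0, ref_step), ((0, 0), impl_step), inputs=(0, 1))
print("equivalent" if result is None else f"mismatch after inputs {result}")
```

If the implementation's output were changed to `int(b1)`, the same call would return a short enable sequence that exposes the difference, which is the kind of counterexample a formal tool hands back to the designer.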

In addition, Accellera is working on the Portable Stimulus standard. “This enables a model of verification intent to be defined at a higher-level of abstraction than was possible using the current UVM methodology,” explains Adnan Hamid, CEO of Breker Verification Systems. “This model defines all of the functional paths through the system and will help ensure that system-level verification is complete. The model will also enable many other kinds of analysis in the future, such as system-level fault injection and the definition of functionality that should not be possible.”

The hardware is only one half of the system. “I am hoping that the software side is applying the same kind of rigor to prevent systematic errors,” adds Knoth. “Systemic errors are more insidious and difficult to catch.”

In addition, new hardware architectures are bringing into question some aspects of verification. “The implementation of functional safety features in highly complex multiprocessing chips performing tasks like neural network processing for autonomous driving is a particular challenge,” points out Shuler. “Even if there are no faults in the system, safety can be compromised by external influences like system performance overload or weather conditions, and this can lead to misinterpretations of incoming data.”

There is an ISO 26262 sub-working group working on ways to address this problem. “We call this ‘Safety of the Intended Functionality (SOTIF),’” adds Shuler. “The team is trying to address the issue of how to prove a system is safe even though it is so complex that it is nearly impossible to verify completely and correctly.”

Today, all approaches for eliminating systemic errors are an incremental step on top of the functional verification process. “If you have the right processes in place, the right technologies being used by the right people for capturing the requirements properly and you build the testbenches properly – then it is not much overhead,” says Darbari. “Yes, for safety there are compliance things to be met. There is more documentation for ISO 26262 compliance and for random faults there is extra work that has to be done. But for systematic failure analysis, you just need a slightly more inclusive mindset than you have for functional verification.”

Malicious faults
For random and systemic safety analysis, the industry has a track record and has built solutions that help make them tractable problems, but security is an evolving area. Security is significantly farther behind in terms of understanding the problem, building solutions, analyzing impact and measuring effectiveness.

Security is about protecting the system from malicious attacks and this goes beyond the notions of functional verification. Functional verification is about ensuring that intended functionalities work correctly. Security is about ensuring that there are no weaknesses that can be exploited to make the system perform unintended functionality. This is about handling the known unknowns and the unknown unknowns.

“There are some best practices in the industry but there is no good set of metrics for assessing how secure something is,” says Mike Borza, member of the technical staff for Synopsys’ Security IP. “People tend to make qualitative statements about it, and they also tend to use the best practices to evaluate the security of a device. You find things such as security audits that look to assess the common vulnerabilities that we know about, and what is being done to mitigate or eliminate those. Unfortunately, that is the state of the art.”

Darbari attempts to put an analytical framework in place. “With safety, we are looking at single fault effects and for that we know what full coverage means. We can analyze that every fault is detected using a number of mechanisms. But for security, attacks are not equivalent to single faults. How can we measure the effectiveness of the protections for security? If we could model and specify what the interactions need to be, what must happen, what must not happen – intentional versus unintentional, then you can come up with a good set of checks that could be specified. Security is more systematic analysis but it will always be difficult to get a handle on completeness.”

It’s also one of the big challenges the industry must grapple with as more electronic content finds its way into automotive, medical and industrial applications, and as those systems increasingly are connected to other systems that may or may not be secure. The value of the data in those systems is high and growing, and that makes it an increasingly attractive target for very sophisticated hackers.

Coming in part 2: Solutions and methodologies.
