The number of critical design metrics is expanding, but the industry still grapples with their implications.
Safety and security are emerging as key design tradeoffs as chips move into safety-critical markets, adding even more complexity to an already complicated optimization process.
In the early days of semiconductor design, performance and area were traded off against each other. Then power became important, and the main tradeoffs became power, performance and area (PPA). But as chips increasingly are used for critical functions in automotive, medical, industrial and avionics, safety and security issues have become much more of a concern, with far-reaching impacts on design.
Put in perspective, hardware vulnerabilities such as Spectre exist because of a focus on performance over security. And while side-channel attacks can be prevented, the solutions consume power and area. On top of that, there is no way to be prepared for every kind of attack when a product is conceived. That means products must be capable of being updated and improved after deployment, which affects the entire PPA equation.
“There’s a long-tail problem,” said Martin Scott, CTO at Rambus. “You expect these systems to be secure for 10 years or more.”
The chances of that happening are slim, however. David Patterson, professor emeritus at UC Berkeley and vice-chair of the RISC-V board of directors, talked about “the sorry state of security” in his keynote at the Design Automation Conference. “For those in the field, this is embarrassing. In the early days, we used to have lots of stuff for security, but they weren’t used by operating systems and they were expensive, so they disappeared.”
Security can be active or passive, and the most secure systems include both. Passive security involves things such as storing authentication keys, as with Arm‘s TrustZone, while active security uses power to continuously monitor a system for changes in behavior. Because of that power cost, active security is utilized far less often than experts say it should be, even when there are hooks built into IP, firmware or software to make it all work.
Design for safety
Today, when the subject of safety comes up, it often is related to the automotive industry and ISO 26262. Yet as more devices are connected to the Internet, safety and security overlap to varying degrees. This is certainly true in automotive applications, but it also is true for avionics, medical and industrial applications.
In the past, the surest way to address safety concerns was through duplication. But as safety and security increasingly cross paths and demand resources, that approach is becoming less popular—particularly where power and performance are considered critical.
“You can duplicate everything, and we have seen this,” said Kurt Shuler, vice president of marketing for ArterisIP. “That is the not so smart way. Within an interconnect, you could have multiple paths, so that if there is an issue along a certain path we can send things over another path. Just like with the heaviness of TCP/IP over the Internet, if you have multiple paths, you need more information in the packets. Every bit of logic in that interconnect that sees the packets also has to have more logic to deal with the additional information. That ends up being a stupid way to do it because you end up burning so much power and you don’t get any benefit over the smart way.”
For interconnects, the smart way is to protect through unit duplication only those things that affect the content of a packet. “You selectively duplicate things that either create packets or change the content of a packet, such as a firewall. But what you do in between those different blocks of logic in the interconnect is protect the paths, or links, with ECC or parity, depending upon the integrity level you want to attain,” Shuler said.
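As a rough illustration of that selective approach, the sketch below duplicates nothing and instead protects only the link between blocks with an error-detecting code; the packet format, CRC polynomial and function names are assumptions made for the example, not any vendor’s actual interconnect protocol.

```python
# Minimal sketch of link-level packet protection, assuming a hypothetical
# packet format. Blocks that create or modify packets would be duplicated
# instead; the links in between only need an error-detecting code.

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 over the payload (stronger than single-bit parity)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def send_over_link(payload: bytes) -> dict:
    # Append the check value at the transmitting end of the link.
    return {"payload": payload, "crc": crc8(payload)}

def receive_from_link(packet: dict) -> bytes:
    # Re-compute and compare at the receiving end; a mismatch means a link error.
    if crc8(packet["payload"]) != packet["crc"]:
        raise ValueError("link error detected -- retry or reroute the packet")
    return packet["payload"]

pkt = send_over_link(b"\x10\x20\x30")
assert receive_from_link(pkt) == b"\x10\x20\x30"
```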
Duplication adds overhead in terms of power and performance.
“Duplication, if used blindly, will of course lead to greatly increased power consumption, and that is why such a technique needs to be used wisely and with intelligent design,” said Antonio Priore, senior functional safety manager at Arm. “We carefully select when to apply duplication or alternative redundancy techniques. The implementation of duplication is also significant, and there is an opportunity to conserve power through selective and dynamic comparison of outputs. It should also be said that duplication is recommended primarily for the highest levels of the integrity spectrum, e.g., ASIL D in ISO 26262, whereas at lower levels other, more power-efficient techniques, such as software test libraries, will provide sufficient protection.”
There are multiple ways to tackle this. “Parity, CRC, dual-lockstep processors and software test libraries are examples of some types of safety mechanisms being implemented in ICs today,” said Bryan Ramirez, strategic marketing manager at Mentor, A Siemens Business. “But as we transition from fail-safe to fail-operational devices, a device must not only detect the problem, but also find a way to either correct it and continue operating, or at least get to some safe state (i.e., pull over to the side of the road). As a result, designs will need some level of redundancy to be able to ‘repair themselves.’ The challenge will be finding intelligent ways to do this efficiently, because redundancy increases cost and power. This will require a holistic approach to the safety architecture that considers solutions and interactions across the entire system.”
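The lockstep idea Ramirez mentions reduces to running two copies of the same computation and comparing the results every step; a mismatch triggers a transition to a safe or degraded state. The sketch below is a minimal software analogy with hypothetical function names, not a description of any particular lockstep processor.

```python
# Hedged sketch of dual-lockstep execution with a fallback to a safe state.
# "compute_torque" and "enter_safe_state" are hypothetical placeholders.
from typing import Optional

def compute_torque(sensor_value: float) -> float:
    return sensor_value * 0.8          # primary copy of the function

def compute_torque_shadow(sensor_value: float) -> float:
    return sensor_value * 0.8          # redundant copy running in lockstep

def enter_safe_state() -> None:
    print("Lockstep mismatch: degrade to a fail-operational or safe mode")

def lockstep_step(sensor_value: float) -> Optional[float]:
    primary = compute_torque(sensor_value)
    shadow = compute_torque_shadow(sensor_value)
    if primary != shadow:              # comparator: any divergence is a fault
        enter_safe_state()
        return None
    return primary
```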
One of the ways around duplication is to use other systems that are already in a device in order to avoid a serious accident. “You see this with brake by wire, which used to have a separate failover system,” said Stephan Gerth, group manager for functional modeling and verification at Fraunhofer‘s Division of Engineering and Adaptive Systems. “That approach is not used by the automotive industry anymore. It’s the same in aviation. Now, if it fails, you might find another system, such as the infotainment system, taking over.”
Designs for automotive autonomy are pushing some of the boundaries of semiconductor technology, and that is adding to some of the safety risks. “There is a set of JEDEC characterizations for soft error rates within semiconductors,” said Shuler. “We have been finding that those numbers have gone from being conservative to no longer being valid as you get to smaller geometries. Transistors are scaling differently than the wires, and you start getting quantum effects and weird aging effects. There is a lot of work going on right now to understand that at the transistor level.”
This becomes even more complicated when multiple systems fail, said Fraunhofer’s Gerth. “Does the system know how to cope with those faults or even how to detect them? If you have AI systems, do all of those systems work the same way or did a fault trigger something that it can’t cope with? There is a functional safety and security aspect to all of this, as well, that needs to be taken into consideration, because it may be easier to hack into these devices than in the past.”
Power optimizations create safety risks
Many design teams have been deploying clock gating, power gating and other techniques to reduce power. While these techniques slightly modify area and performance, they can also impact safety. “Clock or power gating can impact the power delivery network,” explains Preeti Gupta, director of product management at ANSYS. “Consider the following graph of power consumption over time (Figure 1). The bold line represents power consumption for the design when clock gating is not enabled. The dotted line is the same design when clock gating is enabled. Low-power techniques change the current when you go from a low-power consumption mode to a high-power consumption mode. While the power floor and average power consumption have gone down, the rapid change in current can couple with the package inductance, which leads to a voltage drop on the power delivery network. The di/dt events are exacerbated by the use of low-power techniques. Inrush currents can also lead to problems.”
Fig. 1: Low-power techniques affecting power integrity. Source: ANSYS
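To make the di/dt effect concrete, the droop induced across the package inductance is roughly V = L × di/dt, so a sharper wake-up edge produces a larger transient even when average power falls. The values in the sketch below are illustrative assumptions, not measurements from any design.

```python
# Illustrative L*di/dt estimate for a gating wake-up event.
# All numbers are assumed for the example, not measured data.

L_pkg = 0.1e-9     # effective package/PDN loop inductance, henries (assumed)
delta_i = 1.0      # current step when a gated block wakes up, amperes (assumed)
delta_t = 2.0e-9   # rise time of that current step, seconds (assumed)

v_droop = L_pkg * (delta_i / delta_t)               # V = L * di/dt
print(f"Transient droop = {v_droop * 1e3:.0f} mV")  # = 50 mV here

# Spreading the same wake-up over 20 ns instead of 2 ns cuts the droop by 10x,
# which is why gating sequences are often ramped rather than switched at once.
```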
Similar problems are caused by high-frequency switching noise. “This is due to electromagnetic coupling or crosstalk between power grids of neighboring blocks,” said Magdy Abadir, VP of corporate marketing for Helic. “This type of coupling can occur through the silicon substrate, on the same die, or in 3D-IC designs, through an interposer or direct coupling between stacked silicon dies. As we continue to increase the level of integration including the use of advanced 3D-IC packaging technologies, functionality, security and safety could be compromised by EM coupling effects.”
There can be other hidden problems with power gating. “If power gating occurs at a frequency comparable to the 1/sqrt(Lgrid·Cdecap) grid resonance frequency, resonance can occur with catastrophic failures,” adds Abadir. “The probability of such an event occurring is non-trivial, since the grid decap time constant is always designed to be significantly higher than the clock frequency to avoid resonance with the clock. Clock gating occurs over many clock cycles and can coincide with the grid LC resonance frequency.”
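A quick estimate shows why repetitive gating can land near that resonance. The grid inductance and decoupling capacitance form an LC tank whose resonance typically sits far below the clock frequency but within reach of events that repeat every few tens of cycles. The component values below are illustrative assumptions, not taken from any specific chip.

```python
import math

# Rough LC resonance estimate for a power grid; values are assumed.
L_grid = 1e-9     # effective grid + package inductance, henries (assumed)
C_decap = 10e-9   # total on-die decoupling capacitance, farads (assumed)

f_res = 1 / (2 * math.pi * math.sqrt(L_grid * C_decap))
print(f"Grid resonance = {f_res / 1e6:.0f} MHz")   # = 50 MHz here

# A multi-GHz clock sits well above this, but gating activity that repeats
# every few tens of clock cycles can fall near it and excite the resonance.
```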
Many of these problems need to be fixed in the back end of the development process, which can reduce the impact of those optimizations. “For a 7nm design with a 500mV supply, such fluctuations can be as high as 25% to 30% of the nominal supply,” said Scott Johnson, principal technical product manager for ANSYS. “This may be significantly greater than the specified available margin, and it makes it nearly impossible for design teams to meet the required threshold without increasing the power grid uniformly across the chip. That leads to timing congestion, routing bottlenecks and increased chip size.”
Aging and product health
Showing a design to be safe when it is first fabricated is important, but it has to remain both safe and secure during its expected lifetime. “Analog IPs, such as PLLs, DLLs and LDOs, require constant bias currents, and thus balancing aggressively lower power consumption with safety requirements can be especially challenging,” said Mo Faisal, president and CEO of Movellus. “The functionality of analog is dependent on the accuracy of these bias currents and voltages. Aging is a function of how long a circuit conducts current, which then causes physical damage to transistors over long-term usage. Always-on currents will cause damage sooner because transistors don’t get a break.”
Digital circuits do not have constant currents and have higher noise margins. This means they can tolerate more degradation without loss of proper functionality and performance. “It is also easier to utilize DFT scan chains, health monitoring and redundancy with digital logic,” adds Faisal. “First, inserting DFT scan chains enhances observability and fault coverage, which increases safety and predictability. In addition, the digital implementation allows health monitoring of IPs, alerting the system and users as the chip degrades over time, so that appropriate maintenance can occur prior to a point of failure. Finally, complete digital IPs for analog functions enhance safety because the smaller silicon area and more flexible digital configurability enable greater redundancy.”
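A toy illustration of the health-monitoring idea Faisal describes: track a degradation-sensitive metric, here a hypothetical ring-oscillator frequency, against maintenance and failure thresholds so the system can raise an alert well before the point of failure. All numbers are invented for the example.

```python
# Hedged sketch of an aging/health monitor based on a degradation-sensitive
# metric such as an on-chip ring-oscillator frequency. Thresholds are assumed.

F_MAINTENANCE = 950.0   # MHz; alert threshold: schedule maintenance (assumed)
F_FAILURE = 900.0       # MHz; below this the IP can no longer meet timing (assumed)

def check_health(measured_mhz: float) -> str:
    if measured_mhz <= F_FAILURE:
        return "FAIL: take the block out of service or switch to a redundant copy"
    if measured_mhz <= F_MAINTENANCE:
        return "WARN: degradation detected, schedule maintenance"
    return "OK"

# Example readings drifting downward as the circuit ages.
for reading in (998.0, 972.0, 948.0, 905.0):
    print(f"{reading:.0f} MHz -> {check_health(reading)}")
```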
On-chip monitoring helps in quite a number of ways. “Trivially obvious, but worth stating, is that on-chip monitoring makes it easier to ensure, during development, that the system is doing what it said it’s doing,” said Rupert Baines, CEO for UltraSoC. “Verification and validation becomes much easier. Reporting and traceability, which is a very onerous part of ISO 26262 and most of the other safety standards, becomes much easier. You actually have a record of how the system behaves and proof that it behaves as expected.”
Fig. 2: Spotting systemic and random failures. Source: UltraSoC
Once a system has been deployed, on-chip monitoring can add even more benefits. “You can potentially spot when things are moving ‘out of tolerance,’” adds Baines. “For example, there might be an unexpected but non-catastrophic behavior on the chip, or in peripheral components or I/O, that precedes a more serious failure.”
Some of the added logic can also be used to detect intrusions or assist with lockstep mechanisms.
Side-channel attacks
A side-channel attack exploits any measurable characteristic that allows information to leak. One of the primary side channels is power consumption. “Chips that employ cryptography blocks can provide information to hackers through the measurement of power,” warns ANSYS’ Gupta. “The way this works is that they monitor power consumption, and from that they can run a differential power analysis, or even regular power analyses, to figure out the frequency-domain spectrum. From that they can get to the key of the device.”
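As a highly simplified illustration of that kind of attack, a correlation-based power analysis guesses part of the key, predicts the power each guess would produce under a leakage model, and keeps the guess that best matches the measured traces. The sketch below runs on synthetic traces with a toy 4-bit S-box, key size and noise level chosen only for illustration; a real attack applies the same statistics to measured power from an actual crypto block.

```python
import numpy as np

# Minimal correlation power analysis (CPA) sketch on synthetic traces.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # toy 4-bit S-box (assumed)

def hamming_weight(x: int) -> int:
    return bin(x).count("1")

rng = np.random.default_rng(0)
SECRET_KEY = 0xB                      # the value the attacker tries to recover
N_TRACES = 2000

plaintexts = rng.integers(0, 16, N_TRACES)
# Synthetic "power": leakage proportional to the Hamming weight of the
# S-box output, plus measurement noise.
traces = np.array([hamming_weight(SBOX[p ^ SECRET_KEY]) for p in plaintexts],
                  dtype=float) + rng.normal(0.0, 1.0, N_TRACES)

# For every key guess, correlate the predicted leakage with the traces.
best_guess, best_corr = None, -1.0
for guess in range(16):
    predicted = np.array([hamming_weight(SBOX[p ^ guess]) for p in plaintexts],
                         dtype=float)
    corr = abs(np.corrcoef(predicted, traces)[0, 1])
    if corr > best_corr:
        best_guess, best_corr = guess, corr

print(f"Recovered key nibble: {best_guess:#x} (correlation {best_corr:.2f})")
# With enough traces the correct guess stands out, which is why
# countermeasures aim to flatten or randomize the power signature.
```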
Solutions come with a price, though. “Preventing the most severe side channel attacks requires some additional power, some additional area and computation, in order to ensure that even with millions upon millions of cycles of attacks you don’t leak information,” said Rambus’ Scott. “On one extreme, side channel attacks require additional area. You can’t escape that. But security solutions are a continuum. The challenge is striking a balance between the risk/reward of an adversary.”
The solution space often involves altering the design. “One technique, often used for safety, is the duplication of logic. Another is to introduce logic such that your power waveform becomes uninteresting and does not provide a direct leak of information,” adds Gupta. “Unfortunately, this may produce a verification nightmare, in that this logic produces no useful function but has to be verified for safety reasons.”
Power architectures are being used as security mechanisms, as well. “Some researchers suggest differential logic that ensures almost constant current over any input transition, and hence the current on the PDN is constant and bears no information about the processed data,” explains Abadir. “Obviously, that adds non-trivially to area and power overhead. Another idea is to add dummy currents to the network to conceal the currents from the processing units from an attacker. Again, that adds area and power overhead.”
Some attacks target bad architectural decisions, or those that may have favored performance over security. “With the Spectre attack they use speculation and figure out timing and leak information at a rate of 10kb per second,” said Patterson. “But that is not the end of it. There are a lot more attacks on the microarchitecture that are underway. Spectre is actually a bug in how we define computer architecture. We didn’t care about timing as long as we got the right answer. Timing can leak information, so we have to reinvent computer architecture.”
Tools and IP
There are precious few tools today that help to understand safety, let alone build it into a comprehensive optimization flow. “We are seeing a lot more use of SystemC modeling early in the chip design process,” said Arteris’ Shuler. “You have to make a lot of assumptions, but it does provide an idea while you are designing the architecture of where the trouble areas are going to be. The EDA industry is coming out with system-level power estimation technologies, but there is still room for improvement.”
Having stimuli with enough context about what is going on in the system is a very difficult but necessary step. The recently released Portable Stimulus Standard (PSS) from Accellera may help in this regard. “Having a set of system-level scenarios is the first step to understanding how a system is behaving,” said Adnan Hamid, chief executive officer of Breker Verification Systems. “Once this exists you are able to perform meaningful analysis. Users that have adopted graph-based verification techniques are in a stronger position to understand the implications of the optimizations they are making. While safety and security have not yet been added to the requirements list for PSS, we welcome the industry’s ideas in this area. Now is a good time, as the committee plans the features that will be added for the 1.1 release.”
IP adds another element into all of this. While big companies stand behind their products for liability reasons, it’s not clear what the goals are for open-source software.
“If you look at software, IP reuse has been massively the case,” said Aart de Geus, chairman and co-CEO of Synopsys. “In software, there’s a whole bunch of people doing open-source software that is being re-used. There’s a lot of very effective reuse. But how safe is it? Who did it? Is it secure? If there’s an issue, will anyone fix it? You get all of these issues.”
Conclusion
The industry cannot afford to treat safety and security as an unconnected layer in the design process. Architecting and optimizing systems has to consider power, performance, area and safety/security. A system that does not elevate security to be an important design consideration cannot be safe, and concentrating only on performance will lead to unforeseen consequences down the road.
—Ed Sperling contributed to this report.