Adding resiliency and robustness doesn’t have to be expensive, especially when it is done in a way that increases yield and reliability over the lifetime of a product.
Technology developed for one purpose is often applicable to other areas, but organizational silos can get in the way of capitalizing on it until there is a clear cost advantage.
Consider memory. All memories are fabricated with spare rows and columns that can be swapped in when defects are found during manufacturing test. “This is a common method to increase the yield of a device, based on how much memory is on-chip,” says Lee Harrison, automotive test solutions manager for Siemens EDA. “There tends to be a threshold where IC manufacturers require memory repair to be present. Otherwise, the projected yield will be too low. Then, during manufacturing test, when a defect is found in a repairable memory, the memory BiST will identify if it can be repaired and will carry out the repair. That repair data ultimately will be stored in an eFuse-type memory on-chip, so that from then on the device will be 100% functional, utilizing the required memories.”
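As a rough illustration of the repair scheme Harrison describes, the sketch below models a memory with spare columns: defects found at test are remapped to spares, and the remap table stands in for what would be burned into eFuses. The class and its methods are hypothetical, not any vendor’s actual BiST or repair logic.

```python
# Minimal sketch of column repair: failing columns found during manufacturing test
# are remapped to spare columns, and the remap table mimics what would be stored
# in eFuses. Hypothetical data structures, not any vendor's repair scheme.

class RepairableMemory:
    def __init__(self, num_cols, num_spares):
        self.num_cols = num_cols
        self.spares = list(range(num_cols, num_cols + num_spares))
        self.remap = {}          # failing column -> spare column (the "eFuse" record)

    def repair(self, failing_cols):
        """Return True if every failing column can be covered by a spare."""
        for col in failing_cols:
            if not self.spares:
                return False     # more defects than spares: device cannot be repaired
            self.remap[col] = self.spares.pop(0)
        return True

    def physical_col(self, logical_col):
        # Accesses to a repaired column are silently redirected to its spare.
        return self.remap.get(logical_col, logical_col)

mem = RepairableMemory(num_cols=1024, num_spares=2)
assert mem.repair([17, 403])     # two defects, two spares: device ships as fully functional
print(mem.physical_col(17))      # -> 1024, the first spare column
```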
In other cases, the amount of available memory may simply be reduced and the chip sold as a lower-priced part.
But just dealing with memory is not enough to meet the safety requirements of the mil/aero or, more recently, the automotive industry. Not only must devices be 100% functional at the time of manufacturing test, but any defects that develop over time also must be detected and corrective action taken.
“Depending on the safety requirement for an automotive device and its ASIL requirement, there will be different levels of functional safety built into the design,” adds Harrison. “This functional safety will ensure the device is functioning correctly, and that it will be able to flag any concerns or issues that arise during the lifecycle of the device. Functional safety can take many forms. Figure 1 shows a typical device, which contains a mix of structural, functional and system-level functional safety mechanisms, all working to check for anything unexpected that could arise.”
Fig. 1: Safety mechanisms in a typical device. Source: Siemens EDA
Additional application domains are getting serious about the inclusion of redundancy to either increase yield or to ensure robust operations, even in the event of failure. “All of the servers in the cloud offer redundancy,” says Mary Ann White, product marketing director in the Digital Design Group at Synopsys. “At the same time, the ICs inside those servers also have to offer redundancy. So we’re seeing more and more customers asking for reliability methods for semiconductors inside the cloud servers.”
There are many examples within the AI/ML processor space where companies are talking about increasing yield by adding redundancy into their systems. “There’s a wonderful confluence of different application segments here, and this isn’t the first time this has happened in semiconductor design,” says Rob Knoth, product management director at Cadence. “Frequently, one type of product group hits a problem earlier than others. Technology has to be created to solve that problem, and then inevitably another product segment benefits from it. We saw that with power intent. It was developed to help mobile electronics, but now you are hard-pressed to find a semiconductor that tapes out without power reduction circuitry. I see a similar sort of thing happening around functional safety requirements, as well as high-reliability applications.”
Technologies that are developed for one domain, over time, become less expensive for use in another domain.
It’s also important to understand the life expectancy of chips and systems under varying use scenarios, and what exactly needs to be redundant under what conditions.
“The automotive industry used to make sure that all the electronics components would last 15 or 20 years, with a mission profile in which 95% of the time the vehicle will not be moving,” says Roland Jancke, department head for design methodology at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “So only 5% of the time it will be functioning. But this will change in the future. If you think about electric cars, some of them will be working 24/7. So now we need to rethink that formula. There’s another issue here, too, which is that traditional automotive electronics used to be developed at 350nm or 180nm. But with all the compute power needed to process data for autonomous driving, we need to use the latest process technologies, which might be 7nm or 5nm. And if we want to use those for safety-critical applications, then we have no choice but to think about replacing them because they are not meant to last 20 years if they’re used 24 hours a day.”
Increasing yield
When the designer and manufacturer are one and the same company, as is often the case with memory companies, yield and design robustness go hand in hand.
“I would caution against using the word yield,” says Brandon Bautz, senior product management group director for the Digital and Signoff Group at Cadence. “From a sign-off perspective, I don’t talk about yield. More commonly the word used is design robustness. And you may ask, ‘Isn’t that related to yield?’ Let’s just say yield is the fab’s business, robustness is a model that EDA companies have built based on available input collateral. We have algorithms to quantify the robustness of a given circuit. And the real name of the game is to maintain a high level of robustness, while not sacrificing power, performance, and area to achieve that.”
This is a silo that needs to be broken down. What everyone cares about is how many working chips can be obtained from a wafer, given that no foundry can provide defect-free wafers. If foundries could qualify their defect rates and distributions, a design would be free to implement strategies that work around expected defects, and everyone would gain. That means test will accept some defective die, just as it does for memory arrays today, and the chip will be configured to work around the defects. The die is still not defect-free, but even a small increase in die size ultimately may reduce costs.
AI/ML processors are a great example of this. “There are some massive tile-based designs that are reticle-limited, struggling with yield,” says Cadence’s Knoth. “Defects center around very common sorts of problems, such as pieces of the circuit not working. You don’t want to throw out the whole die. Various strategies have been developed throughout the years around the concept of redundancy. What we see is that some of the techniques, traditionally only thought about for automotive SoCs, are now starting to trickle into the data center.”
But this is not just about manufacturing yield. It’s about maintaining correct operation over the lifetime of the product, and that requires an additional step. “In-system test is able to identify defects that manifest themselves during the life of the device,” says Siemens’ Harrison. “Logic BiST and memory BiST are commonly used to perform comprehensive testing of the logic and memory of the device. If we find a memory defect when running in-system test, this is a new defect that has manifested itself in operation. With the right infrastructure on board, we could carry out a soft incremental repair. Here we identify the new defect and record the location. The test is re-run, and if the memory is now fully functional, normal operation can resume. The downside of soft incremental repair is that when powered off, the new defect information is lost, and this process will be run each time the device is powered up. With a hard incremental repair, the device has spare eFuse real estate to be able to record the repair programming.”
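A minimal sketch of that power-up sequence, assuming hypothetical run_mbist and apply_soft_repair hooks (illustrative only, not any Siemens flow):

```python
# Sketch of the soft incremental repair loop Harrison outlines: in-system memory
# BiST is run at power-up, any new defect is repaired in volatile repair registers,
# and the test is re-run before normal operation resumes. Function names and return
# codes are illustrative assumptions.

def power_up_check(run_mbist, apply_soft_repair):
    failing = run_mbist()                 # in-system memory BiST
    if failing:
        if not apply_soft_repair(failing):
            return "FAIL_SAFE"            # repair resources exhausted; flag the system
        if run_mbist():                   # re-run the test after the soft repair
            return "FAIL_SAFE"
    return "OPERATIONAL"                  # soft repair data is volatile, so this
                                          # sequence repeats on every power cycle

# Toy stubs: one defective address that the soft repair can remap.
defects, remapped = {0x3F0}, set()
status = power_up_check(
    run_mbist=lambda: sorted(defects - remapped),
    apply_soft_repair=lambda addrs: remapped.update(addrs) or True,
)
print(status)   # OPERATIONAL
```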
This can improve reliability. “Dynamic redundancy could be baked into the architecture of the device,” says Cadence’s Bautz. “If these gates fail over time, let me switch over. Historically, people put margin into their designs to cover the transistor aging effects. The question is, can I have a smarter, better way of looking at my performance over time, and can I find problems pre-tapeout that may occur 10 years down the road? Can I, through better analysis, confirm that the sign-off margins I put in place are sufficient for my needs? When you’re talking to the automotive, medical, or mil/aero guys, they all have extremely long life expectancies for their parts. If they can improve the likelihood of functionality X years in the future, that’s very valuable.”
Detecting or correcting
Perhaps the biggest architectural decision is whether you want to be able to detect errors or to correct them. The answer may differ depending on whether you are concerned with transient errors or hard errors. “It is common to find error detection (parity) or error correction (ECC) in memories,” says Synopsys’ White. “You also can do that with groups of registers. At the system level, you could consider dual modular redundancy (DMR), which has two cores operating in lockstep. The difference between a triple modular redundancy (TMR) system and a dual modular redundancy system is the ability to correct. Dual redundancy enables detection, triple enables correction.”
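The distinction White draws can be reduced to a few lines. In this minimal sketch, duplication can only flag a mismatch, while a third copy lets a majority vote mask the error:

```python
# Minimal sketch of the DMR vs. TMR distinction: two copies can only detect a
# mismatch, three copies can out-vote a single faulty copy. Purely illustrative.

def dmr_check(a, b):
    """Dual modular redundancy: detect, but cannot tell which copy is wrong."""
    return ("OK", a) if a == b else ("ERROR_DETECTED", None)

def tmr_vote(a, b, c):
    """Triple modular redundancy: a single faulty copy is masked by majority vote."""
    if a == b == c:
        return ("OK", a)
    if a == b or a == c:
        return ("CORRECTED", a)
    if b == c:
        return ("CORRECTED", b)
    return ("UNCORRECTABLE", None)       # all three copies disagree

print(dmr_check(5, 7))    # ('ERROR_DETECTED', None) -- system must reach a safe state
print(tmr_vote(5, 7, 5))  # ('CORRECTED', 5) -- the faulty copy is out-voted
```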
That raises another question. “If you detect an error, you have to figure out how to get the system back into a safe state,” adds White. “That is an extra level of design that you have to consider.”
Systems that contain multiple replicated subsystems can handle this another way. “Assuming failures are going to happen, how do you safely recover from those failures without having physical redundancy of every subsystem in the entire vehicle, which would cause the cost to skyrocket?” asks David Fritz, senior director for autonomous and ADAS SoCs at Siemens EDA. “There’s a concept called dynamic redundancy. This basically means that in the event of failure you take over resources that were being used for a low-priority task, and utilize them as a replacement for the failed capability.”
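A toy sketch of that idea, with hypothetical task names and priorities: when a unit running a critical function fails, the unit running the lowest-priority task is re-purposed instead of keeping a dedicated physical spare.

```python
# Sketch of dynamic redundancy as Fritz describes it: the failed unit's critical
# task is moved onto a healthy unit running a lower-priority task, which is shed.
# Units, tasks, and priorities are hypothetical.

def reassign_on_failure(assignments, failed_unit, priority):
    """assignments: dict unit -> task. Returns the donor unit, or None if no
    lower-priority task can be preempted."""
    critical_task = assignments.pop(failed_unit)
    donor = min(assignments, key=lambda u: priority[assignments[u]])
    if priority[assignments[donor]] >= priority[critical_task]:
        return None                      # nothing lower-priority left to preempt
    assignments[donor] = critical_task   # the low-priority task is dropped
    return donor

units = {"cpu0": "braking", "cpu1": "infotainment", "cpu2": "navigation"}
prio = {"braking": 3, "navigation": 2, "infotainment": 1}
print(reassign_on_failure(units, "cpu0", prio))  # 'cpu1' takes over braking
```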
It all comes down to cost and benefit. “If you’re cloning blocks to do TMR, it is going to take up more real estate,” says Knoth. “But if you’re good about understanding placement, if you’re good about understanding porosity, if you have integrated engines that can predict what’s hittable early enough in the flow when you’re making placement decisions, or able to adjust them without further gating too much of the design, then you’re not going to have to grow the block as much, or at all. ECC on memory has become pretty standard, and that used to be domain specific. The key thing is the return on investment for this type of robustness. What’s the risk of failure? What’s the probability of failure? What’s the overhead required to manage the failure? This is something that more electronic systems have to be analyzing as they become more pervasive.”
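Knoth’s questions amount to a back-of-the-envelope comparison. With purely hypothetical numbers, redundancy pays for itself when the expected cost of unmanaged failures exceeds the overhead of managing them:

```python
# Back-of-the-envelope framing of the return-on-investment questions, using purely
# hypothetical numbers. Redundancy is worth adding when the expected cost of
# unmanaged in-field failures exceeds the cost of the added area and power.

overhead_cost = 0.8    # hypothetical per-unit cost of TMR/ECC area and power overhead
failure_prob  = 0.02   # hypothetical probability of an in-field failure
failure_cost  = 100.0  # hypothetical cost of one unmanaged failure (recall, liability)

expected_cost = failure_prob * failure_cost   # 2.0 in this made-up example
print(f"expected failure cost = {expected_cost}, overhead = {overhead_cost}")
print("add redundancy" if expected_cost > overhead_cost else "skip redundancy")
```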
Defining redundancy
For power optimization, the design is augmented with an auxiliary file that defines the power intent. Tools within the flow then utilize that and can automate many of the necessary operations. The same is happening for redundancy.
“In general, we decompose the problem, handling them separately just to make the analysis tractable,” says Knoth. “However, it’s important to understand if there are gaps when you do that. That’s where surgical fault injection approaches help, making sure that you’re not going to have any problems or gaps in coverage in the handover. There is a similar sort of problem that happens when people are doing IR drop analysis. There’s IR drop that is seen on the tester when you are running ATPG vectors, versus IR drops that can be seen when you’re driving down the road. Understanding both of those, and making sure that you’re optimizing and handling them appropriately, is very important.”
What can we expect to find in a redundancy auxiliary file? “We call it Functional Safety Intent (FuSa), and it is analogous to UPF,” says White. “Tools then automate the process. We have built in the ability to define the safety register scheme. Then we do any necessary duplication. With DMR, you have to figure out, when an error is detected, how that gets tied into whatever logic would get it back into a safe state. With TMR, it gets inserted automatically. We also are rolling out the EDC and ECC, which are applied to groups of registers.”
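To illustrate the concept only (this is not Synopsys’ FuSa format, UPF, or any real tool input), a safety-intent description boils down to declaring a protection scheme per register group and letting the tool expand it:

```python
# Purely hypothetical sketch of the idea behind a safety-intent file: the designer
# declares which registers need which scheme, and the tool performs the duplication.
# Register names and the format are invented for illustration.

safety_intent = {
    "brake_ctrl_reg": "TMR",     # triplicate; the voter is inserted automatically
    "status_reg":     "DMR",     # duplicate; designer must wire the error flag to safe-state logic
    "debug_reg":      "NONE",    # no protection required
}

def expand_registers(intent):
    """Toy 'tool': triplicate TMR registers, duplicate DMR registers, leave the rest."""
    copies = {"TMR": 3, "DMR": 2, "NONE": 1}
    netlist = []
    for reg, scheme in intent.items():
        n = copies[scheme]
        netlist += [reg] if n == 1 else [f"{reg}_copy{i}" for i in range(n)]
    return netlist

print(expand_registers(safety_intent))
```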
Standards are already in the works. “An Accellera committee has been working on this for a few years now,” says Knoth. “It is bringing together industry leaders to create a standard for electronic communication of the safety intent between tools. That kind of technology is critical for something that has to scale. In the power intent discussion, it’s the same basic principle. If you don’t have an electronic form to communicate intent for both implementation and verification, it’s very difficult for technology to scale and be adopted broadly. This is a necessary evolution, and it’s very important to do this in a standards-based open body if it’s going to be effective and broadly adopted.”
It impacts many tools in the flow. “You have to be thinking about things like dependent failures,” says White. “For example, you want to make sure that clock or reset pins are split. If you were to have the clock to all three of these TMR registers tied together, and you have an alpha particle hit that, then all three registers would go down.”
Conclusion
Technology originally developed to help memories increase yield has become increasingly important elsewhere. The needs of the automotive industry have pushed functional safety research forward, because that industry cannot afford the expense levels previously accepted for mil/aero. As techniques, tools, and flows become more refined, the cost of implementing them is coming down, making the technology of interest to a growing audience. And as an increasing number of designs hit the reticle limit, this may be the only way to ensure good yield and help bring costs down.