Improving Concurrent Chip Design, Manufacturing, And Test Flows

Realizing the benefits of digital twins is more complicated than translating data between tools.


Semiconductor design, manufacturing, and test are becoming much more tightly integrated as the chip industry seeks to optimize designs using fewer engineers, setting the stage for greater efficiencies and potentially lower chip costs without just relying on economies of scale.

The glue between these various processes is data, and the chip industry is working to weave together steps throughout these flows by utilizing that data in more places. The goals include better visibility when making tradeoffs in heterogeneous designs, faster time to market, and improved reliability in the field. That will be enabled by in-chip and in-system monitoring capabilities, and by bi-directional sharing and utilization of data across the design-through-manufacturing flow.

“There are almost 23,000 job openings for electrical engineers worldwide,” said Duane Lowenstein, solutions fellow at Keysight. “We don’t produce enough electrical engineers. Are we going to fill that gap in the next 5 or 10 years? Probably not. Part of the reason is what’s called the ‘Grey Tsunami,’ which is that 35% of the population is 65 or older, and it’s growing by 10,000 people a day. We’re not producing 10,000 new people per day to backfill some of these jobs. And even before the pandemic, the average person only spent 4.1 years at a company in the U.S. What does that mean for us as a company? We get a great engineer, we’re excited, and it takes a year to two years to train them to get an understanding of our processes. Then they work for us for about a year or two, and they’re gone. This is a big problem.”

It’s also one of the main drivers behind the increased use of digital twins and related methodologies. “Digital twins take work out of the system, and we need to be able to do things today with fewer engineers,” Lowenstein said. “We don’t have enough people, but even if we did, we would have to change the processes to get that shorter time to market, and to be able to translate that and have repeatability everywhere.”

Alongside of this labor shortage, there are concerns about chip and materials availability and the robustness of the supply chain.

“I can’t go to a single supplier and have the issue of that supplier experiencing an earthquake or a tsunami or a fire or a pandemic, and suddenly I’m shut down. This is exactly what’s happening with the chip manufacturers,” he said. “On top of that, there are capacity constraints, coupled with greater production customization. That means I have to be able to design faster. I have to design more accurately, because that impacts things like yields and predictability. If I can’t build something immediately, and slowly ramp up, that time-to-ramp is a problem because that is a capacity constraint. If I could ramp up in a week versus six months, because my digital twin is exactly what I thought it was going to be, that helps.”

Breaking down silos
In the past, design and test were separate domains. “You designed your product from a functional perspective, and you didn’t even really care what the physical implementation of the product was like,” said Rob Knoth, product management group director in the Digital & Signoff Group at Cadence. “Even that was separate. As design technology evolved, functional design of the product and the physical design of the product started to merge. There was a huge incentive to do that because it shortened time to market and helped reduce margins. In the past 5 to 10 years, test has started to become part of that same conversation.”

It is now well accepted that designers can no longer ignore test, for several key reasons.

“In safety-critical, high-reliability products, you want to make sure that you have zero defects, that you have the long lifetimes expected, and that you’re handling safety in the right ways,” Knoth said. “For these reasons, test increasingly is starting to creep into design planning. That’s coupled with advanced nodes, and making sure that you’re testing for all the new defects on these advanced nodes and keeping an eye on them when they’re in-field. Suddenly, all three parties are sitting at the table with very equal voices. It’s not so much that the designer has to prepare for test, but product design is considering test as one of the three important qualities. You have the end function of your product. You have the physical realization of your product. And you have the test aspects. These are really concurrent activities, as opposed to something that follows along.”

This means the design team needs to keep in mind the physical realities of test — it’s going to consume some area and require routing. And with advanced digital designs, it’s also essential to understand the power, performance, area, and congestion impact of test. At the same time, there are still gaps in how the various pieces intersect, so flows will need to be adjusted as the gaps are filled in.

“There are techniques out there that in some companies are very widely adopted, but are still lagging in things like RTL insertion of test. We see that gaining ground, but there are definitely still places out there where it’s not,” Knoth said.

So just adding DFT, running the test structures in the tester, and getting a pass/fail is not sufficient anymore. “We have much more complexity,” said Richard Oxland, product manager for Tessent Embedded Analytics at Siemens Digital Industries Software. “We’re down at 3nm, so things just got harder. How do we deal with that? Well, we need to throw the kitchen sink at it. But we also cannot massively increase the cost of doing the tests in the tester, along with whatever follow-on costs we get. This means we need to get smarter, as well.”

Especially for the functional verification of safety-critical systems, it’s vital to have test content present in RTL rather than at the gate level. “RTL is what’s being driven with so many of the functional verification simulation jobs being done ahead of physical design, and in parallel, to ensure that you’ve got a working product out,” Knoth noted. “So the more that test is critical to the functioning of the design, the more that test IP has to be present in the files that are being used for functional verification. There’s been a big migration of that content up from the gate level to RTL so that it’s able to be seen by the functional verification.”

Using data differently
This is where digital twins become especially important. A digital twin is a digital representation of a physical thing, which is enabled by doing some kind of sensing and monitoring. “If you don’t have sensing and monitoring capabilities, you can’t have a digital twin,” Oxland said. “You need to have something in there that is reporting important metrics on a regular basis. That’s what helps drive business value.”

Oxland noted that two different types of digital twin applications can be used to drive business value. “One is a closed-loop application, in which you might collect data through the design flow and feed data into a database in the design phase, in the emulation phase, and maybe in the manufacturing phase. Then, in silicon, you’re also feeding data in, and you can correlate how well these correspond to each other as you move down the flows.”

The obvious application for this closed-loop digital twin is to improve performance, but it may have a significant impact on yield and reliability, as well. “It means you can start to say, ‘I made these design decisions, I tested them in emulation, but in silicon it didn’t work out like I was expecting.’ But you have all the data to be able to close the loop and see the design decision was wrong because, for example, you should have put in an offset of 5nm. This approach is like an extension of Shift Left.”
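To make the closed-loop idea concrete, the following sketch (in Python, with invented metric names, values, and tolerances) compares figures predicted during design and emulation against measurements reported from silicon, and flags the decisions that did not work out as expected:

```python
# Hypothetical sketch of a closed-loop digital-twin check: compare metrics
# predicted during design/emulation with values measured on silicon and
# flag the ones that diverged.  Metric names and tolerances are invented.

predicted = {  # values recorded earlier in the flow (design + emulation)
    "ring_osc_freq_mhz": 1250.0,
    "leakage_ua": 42.0,
    "path_slack_ps": 85.0,
}

measured = {   # the same metrics reported by on-chip monitors in silicon
    "ring_osc_freq_mhz": 1180.0,
    "leakage_ua": 44.5,
    "path_slack_ps": 31.0,
}

TOLERANCE = 0.10  # flag anything more than 10% off its prediction

def closed_loop_report(predicted, measured, tolerance=TOLERANCE):
    """Return the metrics whose silicon values diverge from prediction."""
    outliers = {}
    for name, expected in predicted.items():
        actual = measured.get(name)
        if actual is None:
            continue
        deviation = abs(actual - expected) / abs(expected)
        if deviation > tolerance:
            outliers[name] = (expected, actual, deviation)
    return outliers

for name, (exp, act, dev) in closed_loop_report(predicted, measured).items():
    print(f"{name}: predicted {exp}, measured {act} ({dev:.0%} off) -> revisit design decision")
```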

There are also workflow-style applications that contain the digital twin of the real silicon, for which alerts can be set up. So if an interconnect latency is greater than 500 milliseconds, for example, it can trigger an alarm indicating there is a problem somewhere in the software stack that needs to be fixed.
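A minimal sketch of that kind of workflow rule appears below, assuming a stream of latency samples from an on-chip monitor and a placeholder alert hook; the threshold, data values, and functions are illustrative, not a real monitoring API:

```python
# Minimal sketch of a workflow-style alert rule: raise a flag when an
# interconnect latency sample exceeds a threshold.  The data source and
# the alert hook are placeholders, not a real monitoring API.

LATENCY_THRESHOLD_MS = 500.0

def check_latency(samples_ms):
    """Yield (index, value) for every sample that crosses the threshold."""
    for i, latency in enumerate(samples_ms):
        if latency > LATENCY_THRESHOLD_MS:
            yield i, latency

def raise_alert(index, latency):
    # In a real deployment this would notify the software team's tooling.
    print(f"ALERT: sample {index} saw {latency:.1f} ms interconnect latency "
          f"(threshold {LATENCY_THRESHOLD_MS} ms) -- investigate the software stack")

samples = [12.3, 48.0, 7.9, 512.6, 30.2]   # invented telemetry values
for idx, value in check_latency(samples):
    raise_alert(idx, value)
```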

What gets monitored on the chip can be classified physically or structurally, such as PVT, and parametrically, using on-chip Agents from proteanTecs. The company leverages what it calls “deep data analytics” based on chip telemetry, using multi-dimensional Agents that operate both at test and in mission mode. Those Agents can monitor performance in real time and send alerts about degrading performance due to aging and latent defects that were not caught during manufacturing. In addition, the technology can be used for operational, environmental, and application monitoring, which measures workload and software stress on the hardware, and for monitoring interconnects in advanced packages.

“At this level, you may want to detect bus latency,” Oxland said. “If you’re monitoring those things at different levels, you can use them for different purposes. ProteanTecs has a great story about the parametric sensor for aging, which plays into reliability, which then allows you to create the business value in predictive maintenance. So instead of having downtime, you can say, ‘I’m just going to send an engineer out now because it looks like in two weeks this chip is going to fail.’ That may save you from violating your SLAs or creating another kind of emergency.”
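That predictive-maintenance logic can be illustrated with a simple trend extrapolation on an aging metric. In the sketch below, the margin readings, the failure threshold, and the linear-trend assumption are all invented for illustration and are not proteanTecs' actual analytics:

```python
# Illustrative predictive-maintenance sketch: extrapolate a degrading
# timing-margin reading and estimate when it will cross a failure
# threshold.  The metric, threshold, and linear-trend assumption are
# purely illustrative.

FAILURE_MARGIN_PS = 20.0          # assumed margin below which the chip is at risk
SAMPLE_INTERVAL_DAYS = 1.0        # one reading per day

def days_until_failure(margin_history_ps):
    """Fit a straight-line trend and return estimated days to the threshold."""
    n = len(margin_history_ps)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(margin_history_ps) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, margin_history_ps)) \
            / sum((x - mean_x) ** 2 for x in xs)
    if slope >= 0:                # margin is not degrading
        return None
    latest = margin_history_ps[-1]
    return (FAILURE_MARGIN_PS - latest) / slope * SAMPLE_INTERVAL_DAYS

history = [42.0, 40.5, 39.2, 37.8, 36.1]   # invented daily margin readings, in ps
eta = days_until_failure(history)
if eta is not None and eta < 14:
    print(f"Schedule maintenance: estimated {eta:.0f} days until margin falls below threshold")
```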

Embedded analytics are particularly important for examining the interaction between a particular version of software and the design. “We have the ability to see at a more fine-grained level what’s happening, and what the interaction between hardware and software is,” he noted. “Maybe all the tests you run of the software look good. You put it out into the wild, and some end user does something really wacky that causes an issue, but it only occurs once every 100 billion cycles. How are you ever going to detect that? If you have an automatic way of sending an alarm when you’ve got that very long latency detected on a chip, you can fix it. You can send that to the software guys and say, ‘Hey, go check it out.’ In our IP, there are circular buffers and cross-triggering mechanisms that allow you to say, ‘This is weird. Next time I see this, I want to be capturing the last 100 cycles, both in this part and in this part.’ That gives you the forensic data.”
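A rough software analogue of that circular-buffer capture is sketched below: it keeps the most recent 100 records and freezes a snapshot when a trigger condition fires. The record format and trigger rule are invented:

```python
from collections import deque

# Software analogue of an on-chip circular trace buffer with a trigger:
# keep the last N records, and when an anomalous event is seen, freeze a
# snapshot for forensic analysis.  The record format and trigger rule are
# invented for illustration.

BUFFER_DEPTH = 100          # "last 100 cycles"
LATENCY_TRIGGER_MS = 500.0  # trigger on a very long latency

trace = deque(maxlen=BUFFER_DEPTH)
snapshots = []

def record_cycle(cycle, latency_ms, payload):
    """Append one cycle of trace data; snapshot the buffer on a trigger."""
    trace.append((cycle, latency_ms, payload))
    if latency_ms > LATENCY_TRIGGER_MS:
        # Cross-trigger: capture this buffer (and, on real silicon, buffers
        # in other parts of the design) at the moment of the anomaly.
        snapshots.append(list(trace))

# Feed in some invented traffic; cycle 250 is the anomalous one.
for cycle in range(300):
    record_cycle(cycle, 700.0 if cycle == 250 else 5.0, payload=f"txn-{cycle}")

print(f"captured {len(snapshots)} snapshot(s); "
      f"last snapshot spans cycles {snapshots[-1][0][0]}..{snapshots[-1][-1][0]}")
```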

Moving into the test space, there are outlier use cases causing problems.

“One of the things that Meta published recently concerns silent data corruption,” Oxland said. “Nobody knows why those errors are there or how to find them. However, if you have test structures on the DUT, you can trigger those based on weird events. Maybe you have some point in the day where the chip is not being used as much, and you could take it down, run the test, collect the data, analyze it on the chip, or send it up to the cloud and analyze it there. The more you can do on the chip, the better. If you’ve got monitors, you can detect issues, and if there’s a structural issue, you can root-cause it with a test — and that all can be automated.”
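The idle-window testing Oxland describes might look something like the following sketch, where the self-test trigger, the utilization threshold, and the local-versus-cloud decision are all placeholders rather than a real BIST or telemetry API:

```python
# Sketch of opportunistic in-field testing: when utilization drops below a
# threshold, run a structural self-test and decide whether to analyze the
# result locally or ship it to the cloud.  All functions and thresholds
# here are placeholders.

IDLE_THRESHOLD = 0.20        # run tests when the chip is under 20% utilized
LOCAL_ANALYSIS_LIMIT = 4096  # bytes of result data we are willing to crunch on-chip

def run_structural_test():
    """Placeholder for triggering on-die test structures (e.g. logic BIST)."""
    return b"signature-and-fail-log"          # invented result blob

def analyze_locally(result):
    print(f"on-chip analysis of {len(result)} bytes of test results")

def upload_to_cloud(result):
    print(f"uploading {len(result)} bytes of test results for off-chip root cause")

def maintenance_tick(current_utilization):
    """Called periodically; tests only run in quiet windows."""
    if current_utilization >= IDLE_THRESHOLD:
        return
    result = run_structural_test()
    if len(result) <= LOCAL_ANALYSIS_LIMIT:
        analyze_locally(result)   # "the more you can do on the chip the better"
    else:
        upload_to_cloud(result)

maintenance_tick(current_utilization=0.08)   # quiet period -> test runs
maintenance_tick(current_utilization=0.75)   # busy period  -> skipped
```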

What’s missing
The more physically aware the various aspects of test are, the better they dovetail with the design process.

“There’s always going to be room to improve on this,” said Cadence’s Knoth. “Moving test content to the RTL space sometimes makes that job harder. When you’re inserting something during implementation, it’s very natural and easy to understand, ‘This is test, I can manipulate it differently than the functional circuitry.’ But when you’re inserting stuff at the RTL level, that can be a little trickier. So there’s always going to be room to improve the implementation flow, the verification flow, and so on.”

Also, the chip needs to be outfitted with the right kind of sensors to indicate when more tests are needed, or if something needs to be re-tested. For example, maybe the temperature in one part of a chip is high, or a transaction is taking too long.

“We’re still feeling our way into exactly what those kinds of triggers are going to be,” Oxland said. “We need ways to understand all this complexity, and how the tests can be better directed ultimately. We sometimes say, ‘Design for more tests, and design for more than test.’ You have to do a bit of both, somehow — smarter, more comprehensive, cheaper test — but also augmenting test with other types of data, such as PVT, parametric, and functional.”

Some of this has been done for years in markets where reliability is considered critical. “We wanted to know exactly what we were delivering,” said Simon Davidmann, CEO of Imperas. “If there were bits missing, we needed to know. If there were bits that were not the quality we expected, we needed to know. This methodology gave us the ability to choose when we were ready to deliver the product.”

Imperas devised a test-driven design strategy in which tests are implemented concurrently with the design work itself. “When we’re coming up with a project and writing a specification, we spend a lot of time planning the testing of it so that we know when we’re done, which is like hardware design,” Davidmann explained. “We write a test plan as we’re evolving the [processor] model. The person writing the test plan tends to be a different person than the person implementing the model, so there are two people reading the specification. One is implementing it in a simulator in the model, and the other one is implementing it in a bunch of tests. Sometimes in smaller projects it can be one person, but often it’s two people. In some projects we use three, with someone doing the coverage to determine that what we need covered actually is covered. All members of the team take it from the spec. One implements it. One writes the test. One determines how to measure it, because it’s not just code coverage, it’s functional coverage. We use this test-driven methodology so that as we’re evolving the product, we know where we are with it in terms of its quality, and we work at a very detailed level and do white box point tests for every capability and feature.”

To bring these concepts to life requires modeling, and then contrasting that against what’s happening in the real world. “You want the data from real life and want to reflect that back into the digital twin,” said Robert Ruiz, director of product marketing at Synopsys.

Ruiz noted that ATPG tools typically deal with abstractions of design elements by generating some stimulus. “‘Let me check what the output is, and apply that on silicon on the tester.’ That has been okay in years prior, but improving the model requires a deeper dive into this. What’s fairly new as far as actual usage we’re seeing in production is to say, ‘Let’s take a look at the SPICE-level netlist, the transistors, and closer to the digital twin concept let’s inject defects. Let’s break open the wires, let’s introduce shorts in some cases. Then, let’s not run the ATPG. Let’s run something closer to the real world, which would be more like a SPICE simulation, and see how it responds.’ And then we reflect that back.”
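A heavily reduced sketch of that defect-injection flow appears below, assuming the netlist is represented as a simple connectivity map; the data structure, the defect list, and the simulation step are stand-ins for illustration only:

```python
import copy, itertools, random

# Highly simplified sketch of transistor-level defect injection: take a
# netlist represented as net -> connected terminals, create faulty copies
# with opens (broken wires) and shorts (bridged nets), then hand each copy
# to an analog-style simulation.  The data structure and the simulate()
# placeholder are invented for illustration.

netlist = {
    "net_a": ["M1.drain", "M2.gate"],
    "net_b": ["M2.drain", "M3.gate", "OUT"],
    "net_c": ["M1.gate", "IN"],
}

def inject_open(design, net, terminal):
    """Break the wire between a net and one of its terminals."""
    faulty = copy.deepcopy(design)
    faulty[net].remove(terminal)
    return faulty

def inject_short(design, net1, net2):
    """Bridge two nets together into a single node."""
    faulty = copy.deepcopy(design)
    faulty[net1] = faulty[net1] + faulty.pop(net2)
    return faulty

def simulate(design):
    """Placeholder for a SPICE-level simulation of the faulty netlist."""
    return random.choice(["detected", "undetected"])

faulty_designs = []
for net, terms in netlist.items():
    for term in terms:
        faulty_designs.append(("open", net, term, inject_open(netlist, net, term)))
for net1, net2 in itertools.combinations(netlist, 2):
    faulty_designs.append(("short", net1, net2, inject_short(netlist, net1, net2)))

detected = sum(simulate(d) == "detected" for *_, d in faulty_designs)
print(f"{detected}/{len(faulty_designs)} injected defects detected by the test content")
```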

In other cases, at-speed test is one of the primary ways to get a high-quality test, and most advanced designs use it. If a design is supposed to run at 3 GHz, ideally you want the design to run at 3 GHz internally. The traditional ATPG approach is to assume it’s going to run at 3 GHz, then create some type of test. However, the ATPG tool doesn’t really know how to do this, due to the lack of modeling or a connection to a digital twin.

“In recent years, we’ve taken information from a static timing tool, injected that in, and now the ATPG tool says, ‘Based on this information, I know that Path A is a longer path. So I’m going to try to make the test go along this path rather than a shorter path. By doing so, I’m more likely to capture a type of defect, because the longer path is 3 GHz and the shorter path is 2 GHz.’ Timing models can be improved by looking at the silicon, and path margin monitors can measure the timing of the paths from actual silicon. That data could come back to the timing models, and then that information feeds this, and the loop is completed.”
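That timing-aware path selection, plus the feedback from path margin monitors, might be sketched as follows, with invented path names, delays, and calibration rule:

```python
# Sketch of timing-aware path selection: given modeled delays for candidate
# paths through the same fault site, steer the at-speed pattern down the
# longest (least-slack) path, and tighten the timing model with readings
# from path margin monitors in silicon.  Names, delays, and the blending
# rule are invented for illustration.

candidate_paths = {
    "path_A": {"modeled_delay_ps": 333.0},
    "path_B": {"modeled_delay_ps": 290.0},
}

# Delays reported by on-silicon path margin monitors (invented values).
silicon_measurements = {"path_A": 341.0, "path_B": 286.0}

def calibrate(paths, measurements, weight=0.5):
    """Blend modeled and measured delays to close the loop on the timing model."""
    for name, data in paths.items():
        measured = measurements.get(name)
        if measured is None:
            data["calibrated_delay_ps"] = data["modeled_delay_ps"]
        else:
            data["calibrated_delay_ps"] = ((1 - weight) * data["modeled_delay_ps"]
                                           + weight * measured)

def pick_at_speed_path(paths):
    """Choose the longest path, so the at-speed pattern exercises the least margin."""
    return max(paths, key=lambda name: paths[name]["calibrated_delay_ps"])

calibrate(candidate_paths, silicon_measurements)
print("route the at-speed test through:", pick_at_speed_path(candidate_paths))
```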

While a number of approaches are available today, they may not be connected with other pieces that could be combined to create a digital twin-type simulation model.

“A lot of this evolves over time,” Davidmann noted. “We developed a methodology for our products. We don’t call it digital twin, but we use simulation, we build other models outside what we do. For example, if we’re working on something quite complicated to do with, for example, cryptography, or DSPs, we will find some encryption algorithm such as in C, use that as our reference, then implement it in the language we’re dealing with. Then, we’ve got a golden reference, which is effectively a simulation digital twin type of context, but very micro-level. It’s exactly the same concept as a digital twin because when we deal with our customers in the RISC-V world, what are they doing? They’re using us as their digital twin. They’ve got the RTL, and they want to know if the RTL does the right thing. It’s a whole verification strategy. Our verification product for RISC-V is like a plug that sits behind their products. When they’ve got a whole simulation in their RTL, they can plug our technology behind it, which has its fingers around it and watches what’s going on. It’s a twin of the complete functionality of a core, configured to be exactly what they’ve got. It sits there monitoring every event, and if it finds things it doesn’t like, it reports it.”
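The lock-step comparison Davidmann describes might be reduced to something like the sketch below, where both the RTL and the reference model are represented as simple instruction-by-instruction traces; the trace format, register names, and mismatch handling are invented:

```python
# Reduced sketch of a golden-reference comparison: step a reference model
# alongside a trace from the RTL and report the first divergence.  The
# trace format and values are invented; a real flow would drive this from
# the simulator rather than from hard-coded lists.

rtl_trace = [
    {"pc": 0x80000000, "rd": "x1", "value": 0x10},
    {"pc": 0x80000004, "rd": "x2", "value": 0x20},
    {"pc": 0x80000008, "rd": "x3", "value": 0x31},   # injected mismatch
]

reference_trace = [
    {"pc": 0x80000000, "rd": "x1", "value": 0x10},
    {"pc": 0x80000004, "rd": "x2", "value": 0x20},
    {"pc": 0x80000008, "rd": "x3", "value": 0x30},
]

def compare(rtl, reference):
    """Return the index and details of the first divergence, or None."""
    for i, (dut_step, ref_step) in enumerate(zip(rtl, reference)):
        if dut_step != ref_step:
            return i, dut_step, ref_step
    return None

mismatch = compare(rtl_trace, reference_trace)
if mismatch is None:
    print("RTL matches the reference model for all compared steps")
else:
    i, dut_step, ref_step = mismatch
    print(f"divergence at step {i}: RTL {dut_step} vs reference {ref_step}")
```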

Synopsys’ Ruiz expects that as monitors evolve, they will grab data and improve models, not just for ATPG, but for design itself. “Better timing models, of course, don’t just benefit ATPG,” he said. “They improve the ability to put out a design that will meet the performance requirements, along with other improvements for the EDA flow.”

Conclusion
The greatest benefits are seen when these various approaches are tied in with the system deployment, where data is gathered, analyzed, and put to work. There are many moving pieces to this puzzle, and the evolution is complex.

“Digital twin is more a philosophical connection to the solution than anything else,” said Keysight’s Lowenstein. “How do I put it together? How do I change my philosophy toward accepting this, because it’s going to happen?”

He predicts that in 10 years, this will be the primary way chips and systems are developed and tested. “The company that comes up with a very simple way of doing it, with all the connections, is going to make it. It will be very much like an SAP or Oracle implementation. When the idea of these systems came up originally, everyone said it was so complicated, and no one’s ever going to accept this. Now, everybody has an MRP system. It’s going to be the same thing with digital twin-like systems.”


