Bridging IC Design, Manufacturing, And In-Field Reliability

What goes wrong in complex chips, what can be done to fix them, and how to avoid issues in the future.

Experts at the Table: Semiconductor Engineering sat down to talk about silicon lifecycle management and how that can potentially glue together design, manufacturing, and devices in the field, with Prashant Goteti, principal engineer at Intel; Rob Aitken, R&D fellow at Arm; Zoe Conroy, principal hardware engineer at Cisco; Subhasish Mitra, professor of electrical engineering and computer science at Stanford University; and Mehdi Tahoori, dependable nano computing chair at Karlsruhe Institute of Technology. What follows are excerpts of that conversation, which was held live (virtually) at the recent Synopsys User Group conference. To view part one of this discussion, click here. Part two is here.

SE: How do we move silicon lifecycle management forward?

Goteti: We need standards. The whole industry needs to jump in, and so does academia. So far we only have a vertically integrated model. The company that puts the sensors in is the same one that looks at the data coming out of them. And then, maybe with the help of EDA vendors, they analyze that data and make sense of it. But we’re not going to be living in that kind of world for very long. We’re going to have pieces coming together from very different providers. We’re going to have different types of sensors instantiated, with different data formats being used. And we’re going to need some standardization so that data can be shared. If an integration company can take IPs, put them together, and still make use of the data appropriately, that would be useful. There’s a gap today. Each company has its own set of solutions, and then they leverage the EDA industry to help out. But there’s really not much standardization in any of this.

Mitra: I am a skeptic about all these standardization activities. Oftentimes what happens is that standardization becomes the main focus, and we forget why we were standardizing in the first place. The real question is, what kind of data do we want to collect?

Aitken: You also need to know what you’re going to do with it. You can gather all this data, but you then have to do something with it. You can say, ‘Well, let’s funnel this through and we’ll have people look at it, and they will make some sort of decision.’ But whatever decision they make has a significant timeline before it actually takes effect. If it’s a software turn, that’s going to take a while to redistribute the software to all these devices. If it requires a completely new architecture, then it’s going to be several years minimum before anything responds to it. And the original thing that reported the data is never going to benefit from it.

SE: Is the goal really just to find where the problems are and where there will be problems, or is it to add some level of resiliency — almost like what we have with ECC memory, where it repairs itself?

Tahoori: That should be the goal, and it should be an added value rather than a headache about who’s going to standardize it and who owns the data. Systems are becoming very complex, and the next generation of designs is extremely difficult to do. SLM can help. Special corner cases that we cannot figure out at timing closure could be taken care of automatically if you have the right knobs to turn.

Mitra: But at the same time, unless you can localize and diagnose, it cannot be resilient. In general, even for ECC, there is actually some circuit that’s determining, ‘Over there, that bit is wrong. It’s flipping from one to zero, or zero to one.’ The user just sees that everything is working.
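
As a rough illustration of the point Mitra makes — correction circuitry that localizes the flipped bit while the consumer of the data only ever sees good values — here is a textbook Hamming(7,4) sketch in Python. It is a toy, not how any production ECC or SLM block is actually implemented.

```python
# Toy single-error-correcting Hamming(7,4) code: the syndrome pinpoints which
# bit flipped, the decoder silently repairs it, and the user sees intact data.

def hamming74_encode(d):
    """Encode 4 data bits d[0..3] into a 7-bit codeword (positions 1..7)."""
    c = [0] * 8                      # index 0 unused; positions 1, 2, 4 hold parity
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions with bit 2 set
    return c[1:]

def hamming74_decode(word):
    """Return (corrected data bits, position of the flipped bit, or 0 if none)."""
    c = [0] + list(word)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4  # points at the erroneous position
    if syndrome:
        c[syndrome] ^= 1             # the repair the user never notices
    return [c[3], c[5], c[6], c[7]], syndrome

codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1                     # inject a single-bit upset at position 5
data, flipped = hamming74_decode(codeword)
print(data, "corrected bit at position", flipped)   # data comes back intact: [1, 0, 1, 1]
```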

Goteti: This requires real-time action. If the turnaround time is too long, it’s useless.

Mitra: I disagree. Even if the turnaround time is too long, if that influences your future architecture, it’s still very useful.

Goteti: We’re talking about two different things here.

Aitken: There are actually three different things. There’s ECC, which does an immediate correction on an immediate problem. It’s essentially surviving a disaster of some sort, such as power droops and clock jitter. Then there’s adapting to slower-moving changes, such as aging, which causes the device to slow down over time, but in a predictable and gradual way. That’s a second set of lifecycle capabilities. And then the third set is more analytical and predictive — observe the unknown unknowns as they happen, report back, and let somebody fix it in the future.
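
The three tiers Aitken distinguishes can be pictured as a simple dispatch from observed event to response. The sketch below is purely illustrative; the event names and the mapping are invented, and a real design would make these choices per product.

```python
# Illustrative only: mapping observed events to the three lifecycle response tiers.

IMMEDIATE = "correct on the spot (ECC-style)"
ADAPTIVE = "adjust operating point gradually (aging-style)"
PREDICTIVE = "log and report for future analysis (unknown unknowns)"

TIER_BY_EVENT = {
    "single_bit_upset": IMMEDIATE,
    "voltage_droop": IMMEDIATE,
    "clock_jitter_excursion": IMMEDIATE,
    "path_delay_drift": ADAPTIVE,        # slow, predictable slowdown over time
    "unclassified_anomaly": PREDICTIVE,  # nothing to fix now, everything to learn
}

def respond(event):
    """Pick the response tier for an observed event; default to reporting it."""
    return TIER_BY_EVENT.get(event, PREDICTIVE)

print(respond("path_delay_drift"))
```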

SE: So you’re going from predictive analytics to fixing this automatically?

Aitken: Yes, and the goal is to eventually fix it automatically. But if you can’t, ideally you can narrow it down to whatever went wrong, and then hand that off to smart people to go figure it out.

SE: In that context, SLM acts more like glue, right? ‘Okay, we understand that this piece is broken over here. Now, what do we do about it? And by the way, we’re watching the impact on everything else that’s tied into this system.’

Aitken: That’s a good way to put it.

SE: So how much effort and cost is this going to add to the design?

Conroy: I don’t think it adds a lot to the design. It’s more on the back end, making use of what you have in the design. It’s a lot more work to use these sensors and get the most out of them across the silicon and across the silicon lifecycle. And it’s created a whole new set of jobs post-silicon that we didn’t have before. As test engineers, we’re not well equipped to do SLM. It becomes a much bigger job than traditional test engineering is used to.

Goteti: It also depends on how much intelligence you want to build into designs, and on the local implementation or instantiation of the sensors. So it does add cost to the design. There’s no doubt about it. But that cost isn’t seen as unnecessary, and people are realizing they have to do this. So just like you would add something that enhances performance, you also will add features that enhance your SLM capabilities.

Mitra: A good target would be about 5% additional cost. And maybe a lot of the design for test infrastructure can be used if we do this correctly.

Tahoori: SLM is one part of it. The other part is basically the unification of all the different support infrastructure that we currently have on the chip. SLM could basically be a superset of that infrastructure if we add some intelligent aggregation. But we cannot simply offload tons of data to the cloud. So there is a need for some aggregation, and this has to be done as part of the hardware design.
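
To make the aggregation idea concrete, here is a minimal sketch of reducing a window of raw sensor readings to a compact summary before anything leaves the device. The field names, sensor ID, and threshold are invented for illustration and do not reflect any particular vendor’s telemetry format.

```python
# Hypothetical on-device aggregation: ship a summary record, not raw samples.
from dataclasses import dataclass
from statistics import mean

@dataclass
class SensorSummary:
    sensor_id: str
    samples: int
    min_value: float
    max_value: float
    mean_value: float
    threshold_crossings: int     # e.g. how often a droop limit was violated

def aggregate_window(sensor_id, readings, threshold):
    """Collapse one window of raw readings into a single summary record."""
    return SensorSummary(
        sensor_id=sensor_id,
        samples=len(readings),
        min_value=min(readings),
        max_value=max(readings),
        mean_value=mean(readings),
        threshold_crossings=sum(1 for r in readings if r < threshold),
    )

# A window of made-up supply-voltage samples; only the summary would go upstream.
window = [0.98, 0.97, 0.99, 0.91, 0.98, 0.96, 0.89, 0.98]
print(aggregate_window("vdd_monitor_0", window, threshold=0.93))
```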

Goteti: You have to be careful, though. We can re-use a DFT (design for test) infrastructure, and we can use DFR (design for reliability) and DFD (design for diagnosis) infrastructures. But we have to make sure those are usable in-field. That’s not a trivial thing to enable. We have to be careful about saying we can completely re-use what we already have.

Aitken: And we have to recognize that pretty much any of these objects can be used as an attack vector for any software running on the system. So you have to make sure that even if they’re accessible in the field, they’re privileged in a way that makes it harder to mount such an attack.

SE: That’s an interesting point. Security plays a significant role in heterogeneous designs. So once you’ve added resilience into the system, can you use SLM to automatically close off some part of the design, securely reboot it, and basically start over again?

Aitken: That depends a lot on the application. In a data center you can bring a server down, pull the unit it’s in, and potentially throw it away. That’s an expense, but it’s not insurmountable. But if you’re driving down the freeway and the engine computer says, ‘Something bad is happening here, I’m just going to shut myself off for a few minutes, don’t worry,’ that’s not a good thing. So there’s an application dependence to how you mitigate these things when you observe them. Building the infrastructure so they can be observed is clearly a key part of being able to do something about it.

Tahoori: All these layers of sensors and monitors also can help to identify whether the system is compromised. There are new attack vectors, but the extra amount of data and side-channel information can help to determine whether the system is compromised or not. So basically, SLM also can be used to improve or enhance the security of the system.

Aitken: It can also be used as an attack vector. If somebody figures out how to make all the lifecycle flags on every chip go off simultaneously, then bad things could happen, whether that’s in a data center or in automated or semi-automated cars.

Conroy: But it can help a lot with component fraud, where other people’s components are being used. You can do a lot more traceability.

SE: There are a lot more vendors providing components in heterogeneous packages. How do you manage the relationship between all these vendors so that all the pieces behave predictably over time?

Goteti: It’s definitely something we have to be on top of, because these components age differently and evolve differently over time. That’s complicated by the fact that we’re getting them from different sources. There is no easy way to do it yet.

SE: Is there a feedback loop in place that allows organizations to share data for SLM?

Mitra: It’s not good enough. Is the organization going to give access to the RTL to be able to really tell what was causing the problem? That depends on what you’re trying to do with it. If you’re trying to figure out the cause of silent errors, you need a lot more information. So that infrastructure has to be put in place.

Aitken: In the IP world, very often if you’re called in to diagnose some problem in something that contains your IP, you get into a discussion, followed by rapid meetings, and people talk and talk and talk. And then at some point, nobody talks to you anymore. You don’t know if that’s because the last thing that you suggested fixed the problem, or if something else turned out to be the problem and they solved it that way. Closing the feedback loop is a really important part of getting this ecosystem to work, and it’s incumbent on everybody in the chain to do their best to make that happen. Sometimes there are legal reasons or IP reasons or practical reasons why information can’t be exchanged. The general idea of agreeing to work together and share data is a first step, but the mechanics of making that work can be really difficult.

Mitra: The real question is, for the hardest problems in the industry, do we have the capabilities to do the analysis and provide the feedback? That’s the question we should ask ourselves as technologists.

Goteti: The question this opens up is what IP are you willing to share that goes along with that data? You can’t ask data center users to provide information about what specific workloads they’re running. That’s their secret sauce. And you can’t ask IP designers to reveal their RTL to you. That’s their secret sauce. We’ve got to figure out how to work with each other.

Tahoori: And there is a privacy issue for all the data that is gathered. You basically need to sanitize the data and still pass through the useful information so that something can be done with it. That’s a real challenge.

SE: So how do you actually test silicon lifecycle management to make sure that an electronic device is going to work reliably throughout its projected lifetime?

Conroy: At Cisco, we always talk about design for testability, but this involves more than just running a test. When you’re doing your design, you have to architect test into that design. And then you have to ask, ‘Okay, what do I need to log? What information do I actually need, from what bits of my circuit, and at what point in time? What would I log during production test?’ It could also be at system test. Then, if I’m going to log data in the field, what do I need there? How is it going to be accessed? Is it safe? Those are the fundamental questions that need to be thought out at the beginning. And if you don’t, you’re always going to be retrofitting the solution.

SLM is more than test. It involves the whole infrastructure, including all the companies that are involved, the OSATs, how all of that gets architected in terms of the data pathways, and then the data analytics. Manufacturing the product is one thing. Getting to the field is a whole different case, and every customer is different. It doesn’t matter if it’s somebody who bought a car or a company that bought a network. The use cases are completely different around the data that gets logged, the data that gets shared, and what actions need to take place. In a car, it’s life-critical. In a cell phone, it’s just annoying. So there are many, many different scenarios here, and they all need to fall into place.
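
One way to picture the up-front planning Conroy describes is a per-phase logging plan: what gets captured at production test, at system test, and in the field, and who may read it. The sketch below is purely hypothetical; the phase names, signals, and access rules are placeholders, not a real product’s policy.

```python
# Hypothetical "what to log, when, and who can read it" plan for an SLM flow.
LOGGING_PLAN = {
    "production_test": {
        "signals": ["scan_fail_map", "vmin_per_core", "ring_osc_freqs"],
        "retention": "per-die, kept by the manufacturer",
        "access": "factory test network only",
    },
    "system_test": {
        "signals": ["thermal_profile", "link_error_counters"],
        "retention": "per-board, kept by the system integrator",
        "access": "authenticated service tools",
    },
    "in_field": {
        "signals": ["aging_monitor_delta", "corrected_ecc_events"],
        "retention": "aggregated summaries only, no raw workload data",
        "access": "privileged, authenticated telemetry channel",
    },
}

def plan_for(phase):
    """Look up what is logged and who may read it for a given lifecycle phase."""
    return LOGGING_PLAN[phase]

print(plan_for("in_field")["access"])
```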


