Silicon Lifecycle Management’s Growing Impact On IC Reliability

SLM is being integrated into chip design as a way of improving reliability in heterogeneous chips and complex systems.


Experts at the Table: Semiconductor Engineering sat down to talk about silicon lifecycle management, how it’s expanding and changing, and where the problems are, with Prashant Goteti, principal engineer at Intel; Rob Aitken, R&D fellow at Arm; Zoe Conroy, principal hardware engineer at Cisco; Subhasish Mitra, professor of electrical engineering and computer science at Stanford University; and Mehdi Tahoori, Chair of Dependable Nano Computing at the Karlsruhe Institute of Technology. What follows are excerpts of that conversation, which was held live (virtually) at the recent Synopsys User Group conference.

SE: As semiconductors are used for safety- and mission-critical applications, and as complexity increases with heterogeneous designs, there’s a lot more focus on silicon lifecycle management. Chips need to last longer in automotive, industrial and data center applications, and the cost of designs is driving up the need to extend the lifetimes of semiconductors even in cell phones.

Goteti: Traditionally, it’s been about extending lifespan and getting data that you feed back in for yield and manufacturing purposes. But the scope has changed significantly now, and silicon lifecycle management will have to change accordingly. We’re going to see a flood of data coming in from chiplets — multiple chiplets inside a system-in-package. That data will have to be used for all kinds of things in data centers, from workload balancing and dynamic performance management to the traditional telemetry-type applications. So it’s definitely a nascent field, and there’s a lot of work needed. But it’s not new. It’s been going on for a while.

Conroy: From a data center and networking products point of view, it’s a combination of hardware and software. The two have to work together continuously, with no bugs. On the hardware side, you’re looking at some type of heterogeneous integration, with a lot of different components on there from different suppliers. The first challenge is really getting your head around that and asking, ‘Okay, what are the components? What does each do? What’s the risk on each if I’m going to go into SLM? What are the critical components that I want to be monitoring in my product that could adversely affect my network?’ Number one is really understanding your product, how that product is tested, and what sort of functions it will perform over the lifecycle. And then you’re going to say, ‘Okay, if I want to monitor SLM end-to-end, I’m going to go from wafer sort right to the field. So if I’m going to be testing and monitoring my chips, what exactly do I want to be monitoring? And how am I going to monitor that? What data do I need to grab? How am I going to transport that data — from the source, from the test, or from the field — across a network and into an area where I can do real-time analytics?’ There are many components to SLM. And now we have things like cloud solutions, where we’re able to do end-to-end analytics. But it’s very complicated, and we’re just at the tip of the iceberg for what’s going to happen in the future.
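To make that end-to-end flow concrete, here is a minimal sketch of what one telemetry record and its trip to a cloud analytics service might look like. The schema, field names, and ingest endpoint are illustrative assumptions, not any particular vendor’s interface.

    # A minimal sketch, assuming a hypothetical JSON-over-HTTPS ingest service.
    import json
    import time
    import urllib.request

    def build_record(device_id, stage, sensors):
        # Bundle one measurement with enough context to trace it later.
        return {
            "device_id": device_id,   # which die or chiplet produced the reading
            "stage": stage,           # e.g. "wafer_sort", "final_test", "field"
            "timestamp": time.time(),
            "sensors": sensors,       # e.g. {"vmin_mv": 512, "temp_c": 71.3}
        }

    def ship(record, endpoint="https://analytics.example.com/ingest"):
        # Forward one record to the (hypothetical) real-time analytics service.
        req = urllib.request.Request(
            endpoint,
            data=json.dumps(record).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    # Example (endpoint is a placeholder):
    # ship(build_record("wafer17_die042", "field", {"vmin_mv": 518, "temp_c": 66.0}))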

Aitken: It doesn’t just start at wafer testing. We have to think about what actually needs to be present in a CPU, in the surrounding logic, in the I/Os, and so on — what actually has to be there in order to provide the data. What can you do with the data? What we ran into a lot, even in the IoT space, was that if you’re going to do device management of some kind as part of your silicon lifecycle management, how do you do upgrades? How does software get updated? How does a device trust the software provider? How does the cloud service know to trust the device? There are a lot of problems and challenges all throughout this process, and there’s a lot of work to be done. But there’s a lot of progress already.

Mitra: It’s interesting to hear my industry colleagues talk about how they’re already doing it. We are in the Dark Ages, very far from where we want to be. So if the network is down, we know we are in trouble. But what’s happening in the real world today isn’t that things are going down. It’s that they’re producing incorrect results, and nobody knows those results are incorrect. They’re called silent errors, and the industry doesn’t seem to have a solution for this.

Aitken: It’s possible to be in the Dark Ages and still be making progress. There’s general agreement that there’s a lot of work to be done, but that doesn’t mean nothing has happened.

Mitra: But progress is happening at a glacial pace.

Tahoori: On the positive side, there are a lot of opportunities. As we move forward, systems are getting more complex. We are dealing with many issues beyond the quality of the chips and the system, including trust. SLM can be a solution. A lot of progress still has to be made, but SLM holds the promise of solving some of the challenges with design, verification, and trust in very complex hardware and software systems. If it is done right, we can deal with the challenges of increasing complexity.

SE: Is the solution better design, including more verification and simulation, combined with in-circuit monitoring when a chip is in the field?

Goteti: It depends on what you want to achieve. Silent data corruption, or silent data errors, could be due to things like manufacturing defects. That’s where better design, verification, and test content might help. But if you’re looking at things like dynamic workload balancing or performance-per-watt adjustment, better verification is not going to help you in those kinds of situations. So you can address some things with better design, better verification, and better test content, but not everything. You have to choose your battles, and the strategies will be different.

Mitra: I agree and disagree. Quite a few of these things are dynamic in nature. You cannot just do it statically at time zero and hope everything is working. You have to be adaptive in the system. But when you have adaptivity, it has to be verified, and you have to make sure things do not go awry in the field. So adaptivity will impose more verification and more testing at the same time.

Aitken: It also involves security. You mentioned silent data corruption as a challenge. But your device being hacked, or being used as the beginning of a botnet, is also a challenge, and you need to make sure that whatever monitoring capability you have on the device is capable of identifying when the device is under attack and doing something about it. That’s yet another vector that you could potentially pursue in this area.

Tahoori: In moving forward with the requirements of the system, adaptivity is something we have to deal with, but it is not necessarily SLM. They have some overlap, but they are not the same thing. SLM covers a wider scope and allows us to gather data on a whole population of systems and chips. From that kind of data, we can infer much more useful information than would be possible by just doing the adaptation on a single system or device. That provides the ability, across a population of devices and systems, to do anomaly detection, whether it be defective behavior, silent data corruption, or some sort of security breach.
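As a rough sketch of that population-level inference, the Python below flags devices whose monitor readings sit far from the fleet median, using a robust z-score. The sensor, threshold, and device names are assumptions for illustration, not a method described by the panel.

    import statistics

    def flag_outliers(readings, threshold=3.5):
        # readings: {device_id: monitor value}, e.g. ring-oscillator frequency.
        values = list(readings.values())
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values) or 1e-9
        # 1.4826 scales the median absolute deviation to a standard-deviation unit.
        return [dev for dev, v in readings.items()
                if abs(v - med) / (1.4826 * mad) > threshold]

    # The device that has drifted from the fleet gets flagged, whether the cause
    # turns out to be a defect, aging, or a security breach:
    print(flag_outliers({"dev_a": 1.02, "dev_b": 0.99, "dev_c": 1.01, "dev_d": 0.73}))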

SE: That opens a can of worms, because getting some of that data is very difficult. For 20 years we’ve been talking about who owns the data, how much is going to be shared, what are the privacy concerns associated with that data. Has that improved?

Conroy: When you do your own chip, you have your own data. If you’re buying components from other suppliers, you may or may not want that data, depending on what the component is. Usually, when you’re buying silicon from other suppliers, they really don’t want to share any data around that silicon other than that it’s a passing chip and it meets the spec. But with SLM, the whole point is that you do want data flowing down your supply chain. If a part fails, and it’s not your part, you want to know why, and you would like more data to help you diagnose it and identify the root cause. There is still a reluctance among private companies in the industry to hand out data, because managing it becomes a support burden for them.

Aitken: It’s also potentially a liability burden. When somebody owns the data, somebody else may own the problem. You need some combination of design data, foundry data, test data, production distribution data, and in-the-field data, and it’s all owned by five different companies. Each one, at some level, would like to own some aspects of the problem, and at other levels would like someone else to own the problem. The movement of who owns what, and who is going to guarantee what, is part of the challenge. Who has what incentive to gather and use what data at what time?

Mitra: This is an important point involving the reliability and the security of the data. I’ve seen many forums where we get into this discussion about who owns the data, but the problem is determining which data we are talking about. Most of the time people do not even know what data to collect, let alone who owns the data or who is responsible for it. That’s important, but the real focus should be on what data to collect, what are the mechanisms, what is the instrumentation, what needs to be put in the architecture to be able to collect the data. And how do you analyze the data? That’s where we are way, way behind.

Goteti: I agree that data volume is going to be a significant issue, and we’re going to get a flood of data. If you assume you have 50 or 60 chiplets inside a package, you’re going to get lots of telemetry data from all of them, and processing that is going to be difficult unless you have an efficient system to do it. But going back to the question of who owns the data, it’s an open question that needs quick resolution. We’re not the only trailblazers here. The aircraft industry has been doing this for a while with big data. Engine manufacturers collect data from the engines, and then decide whether or not to share that data with the airlines or the aircraft manufacturers themselves. That’s something we in the semiconductor industry need to figure out — and fairly quickly, because that data is coming. We have a lot of data already, and we’re still figuring out how to use the correct data.
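One common way to keep such a flood manageable, sketched here under assumed chiplet counts and window sizes, is to collapse each chiplet’s raw sensor stream into compact summary statistics before it ever leaves the package.

    from statistics import mean, pstdev

    def summarize(samples):
        # Collapse a window of raw sensor samples into a few numbers worth shipping.
        return {"n": len(samples), "mean": mean(samples),
                "std": pstdev(samples), "max": max(samples)}

    def summarize_package(chiplet_streams):
        # One compact summary per chiplet instead of every raw sample.
        return {cid: summarize(s) for cid, s in chiplet_streams.items() if s}

    # 60 chiplets times thousands of samples per window reduce to 60 small records.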

Mitra: Your signal-to-noise ratio is very small.

Goteti: It’s important to find a signal in the noise, but we need to resolve both these issues. We need to resolve how data gets processed and how we handle large amounts of data. And then we also need to figure out who gets to use this data and in what way, irrespective of who collects it.

Related
Lots Of Data, But Uncertainty About What To Do With It (Part 2 of the above panel).
Sensors are being added everywhere to monitor everything from aging effects to PVT, yet the industry is struggling to figure out the best ways to extract useful information.


