Lots Of Data, But Uncertainty About What To Do With It

Sensors are being added everywhere to monitor everything from aging effects to PVT, yet the industry is struggling to figure out the best ways to extract useful information.


Experts at the Table: Semiconductor Engineering sat down to talk about silicon lifecycle management in heterogeneous designs, where sensors produce a flood of data, with Prashant Goteti, principal engineer at Intel; Rob Aitken, R&D fellow at Arm; Zoe Conroy, principal hardware engineer at Cisco; Subhasish Mitra, professor of electrical engineering and computer science at Stanford University; and Mehdi Tahoori, Chair of Dependable Nano Computing at Karlsruhe Institute of Technology. What follows are excerpts of that conversation, which was held live (virtually) at the recent Synopsys User Group conference. Part one of this discussion is here.

SE: In the past, chipmakers made one processor or SoC, worked out all the kinks, and sold a billion units. Now we’re seeing much more customized design within whatever context it’s going to be used, and those are being manufactured in much smaller batches. How is that going to affect silicon lifecycle management?

Goteti: It’s definitely going to affect everything. Everything ages differently. If we take lifespan as an example, everything is different already. You need adaptive abilities to be able to do this. There’s no way to put that genie back in the bottle. We’re in a world where we’re going to have systems and packages with tens, if not hundreds, of chiplets in them. That requires those devices to be smart and adaptive.

Mitra: I don’t understand that. Why is it now all about chiplets?

Aitken: One reason is interoperability, and that means you have two parts that allegedly meet a standard from two different vendors. But what happens when they don’t communicate with each other? You have communications networks now bound inside a chip, but you also have a sign-off challenge. If you’re going to have an object that has a 10-year lifespan, you need to think about different characteristics and sign-off than you do if you’re only going to have a one-year lifespan. You need to potentially either add guard-banding to your sign-off procedure, or you need to add adaptive circuitry so that the thing changes over time to respond to its degradation. All of that has to be done at the chiplet level, working with multiple devices rather than a single one from a single vendor.

Mitra: That’s no different than hard IP blocks.

Aitken: If you have two hard IP blocks on the same piece of silicon, they were manufactured at the same time in the same fab with the same effective variability profile. But if you have two pieces of silicon that you glue into a package, those don’t share that same source, so they will potentially age and interact differently.

Conroy: Things do age differently. Some of the sensors coming out these days can monitor the aging of the device. Will it last 18 years, or will it fail beforehand? If it’s aging early, it needs to get replaced. There are mechanisms out there today to do that reliably and safely, and the industry is already using them.

Tahoori: There are different aspects to that. There are active devices that will age, including the interconnects, and so now we are dealing with issues in the signal line. As we go to more advanced technology, with much smaller wire features, you’re dealing with higher resistance due to smaller dimensions. Even the ports, which traditionally were not concerned with electromigration, have become an issue. And that basically adds a host of new problems, including how to model them and how to build margin into them. That’s a big challenge. What type of monitors do you have to put in? There is complexity associated with putting the right type of sensors and monitors and gathering the data. There are a lot of aspects related to automation of SLM. You need all this infrastructure. What kind of sensors and monitoring are you going to use, where are you going to place them, and how are you going to perform this data aggregation, which all has to be automated. So there is a lot to be done in the automation of SLM infrastructure.

Conroy: Sensors provide a lot of information, but if you see a blip in a sensor, how do you know this actually is going to cause the product to fail? If you see some type of anomaly, the product still can be functioning just fine. Sensors are good, and they give you information that we as engineers love to look at and assess, but at the end of the day somebody or something has to make a decision around whether this is an issue and how you are going to deal with it. You need a lot of data, and you need a lot of human expertise to come to that decision.

Mitra: There are situations where just having the sensor sitting there is good enough, but there are many other situations where it is not. The sensors are there, but they have to be instigated. You need to apply the right kind of test patterns so the sensors actually do catch something.

Aitken: One of the problems is there is a lot of data, and at the moment we’re not quite sure what to do with it. So it’s not somebody writing speeding tickets. It’s someone with a radar gun, aiming it at cars and saying, ‘Alright, I have data on a bunch of cars. What do you want me to do with this? Do you want to know how many cars are here? Do you want to know how fast they’re going? Do you want to know how close they are together?’ And because of that, we’re not quite sure what it is that we want to monitor. Or, one group knows what they want to monitor but doesn’t want to tell the people who are selling them the monitor. So you have this situation where, at the moment it’s, ‘Let’s just throw a bunch of data at the problem and hope somebody, somewhere, will be able to analyze it and get what they want.’ Over time, I would expect that to evolve to something that’s more like a sensor that’s able itself to process a large amount of data, then say, ‘Oh, this part is interesting,’ and then forward only that on. So instead of having petabytes of data coming out of something, you’ll have a few bits of useful, actionable information.

Goteti: Finding the helpful information in the ocean of data is the difficult problem. That’s why intelligent sensing will be a requirement as we go forward — not just for data efficiency purposes to cut down on the volume of data, but to give usable data in order for that to be processed. The challenge is that intelligent sensing will always be local. You will still need some aggregation of data collected from intelligent sensors, and then you can do things like causal analysis or other operations to figure out how to make things better. But intelligent sensing, by itself, is a good thing.
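The division of labor described here, with local filtering at the sensor followed by aggregation across sensors, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the panel: the sensor names, the sample values, the 20-sample trailing window, and the three-sigma threshold are all hypothetical, and a simple rolling z-score stands in for whatever on-chip logic a real intelligent sensor would use.

```python
from statistics import mean, stdev


def local_filter(readings, window=20, threshold=3.0):
    """Sensor-side step: keep only readings that deviate from the
    trailing-window mean by more than `threshold` standard deviations.
    The window size and threshold are illustrative assumptions."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            flagged.append((i, readings[i]))
    return flagged


def aggregate(per_sensor_flags):
    """System-side step: count forwarded events per sensor."""
    return {name: len(flags) for name, flags in per_sensor_flags.items()}


# Hypothetical temperature traces: one quiet sensor, one with a single spike.
quiet = [70.0 if i % 2 == 0 else 70.2 for i in range(40)]
spiky = list(quiet)
spiky[30] = 75.0

forwarded = {
    "chiplet0_temp": local_filter(spiky),
    "chiplet1_temp": local_filter(quiet),
}
summary = aggregate(forwarded)
print(forwarded["chiplet0_temp"])  # [(30, 75.0)]
print(summary)                     # {'chiplet0_temp': 1, 'chiplet1_temp': 0}
```

Out of 80 raw samples, only one reading leaves the sensors. The aggregation step then sees a per-sensor event count instead of the full trace, which is the kind of reduced, usable input that later causal analysis would work from.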

SE: The issue here is to collect and utilize data across the silicon lifecycle, from design through verification. Do we have a structure to pull all that together and make adjustments as needed?

Mitra: The challenge is coming up with a scalable way of capturing information. Number one, and very important, is analyzing the information to pinpoint the cause of your problems, and unless we have a systematic and scalable way of doing that, this piecemeal approach will be problematic. There are some interesting solutions, but they will not necessarily be cheap. And often the industry is short-sighted about hardware support for diagnosis but not detection.

Aitken: There are several varieties of things going on here. If you look at automotive, can chips last 18 years in cars? Yes, they can. But are anomalous things happening, and what do you do about those? I have objects in my house that have declared themselves bad, but which actually still seem to be functioning. Yet they refuse to work anymore because they’ve declared themselves bad. I don’t know if that’s just a marketing ploy on the behalf of whoever sold it, or if there actually is something that’s wrong with this thing. So there’s a data management issue, and there’s a challenge of consumer devices versus a data center. You can argue that we’re all in the dark ages, but it’s equally valid to argue that the problem is so vast that it’s hard to declare that anything in particular is a solution.

Tahoori: Part of the issue is that most of the focus has been on individual sources and monitors. There needs to be some systematic way of aggregating data, which also can be somehow adapted to the use case. You want to aggregate the data in a smart way. You don’t want to offload everything to the cloud, because that would be tons of data, and it wouldn’t be efficient. You want to extract useful knowledge from data on the chip and in the system, but still customize it for the use case. This is the hardest problem. And if we manage to get a good grasp on it, we would be able to make some good progress in SLM.

Mitra: That’s very important. You cannot just have a bunch of sensors and monitors and expect to find something. You have to aggressively look for stuff, and then those internal monitors and sensors will provide you the observability that’s needed. But if your system is not stimulated the right way, then it’s data that doesn’t have any useful information.

Conroy: There are many different types of sensors. You can measure voltage, temperature, you can measure margin on chips. It really depends on what you’re trying to achieve. And you really need to figure out up front, ‘What am I doing with my sensors? Do I need to put them in my chips? Am I concerned about hotspots? Am I concerned about silent errors? Once I put that chip in my system, am I going to see a voltage drop? Should I be able to really measure voltages tightly across my chip as I go from running structural tests to running traffic functional tests?’ One size doesn’t fit all. The placement and the usage of sensors really takes a lot of forethought when you’re doing the design, and then you need to work across the SLM to bring those sensors to fruition. Just because you’ve designed them, it doesn’t mean they’re going to be used in the right way. If you’re extracting data from them, how is that data going to get used? First, do I have the right people in place with the right skill set to decide which data I need to extract? Second, how am I going to extract it and get it along some type of pipeline to the place where it needs to get analyzed? And third, I then need experts to analyze that data, people who understand analytics and can bring that to fruition. This is a very complex task. Every company will sit down and come up with their own solutions, and that leads to smarter segregation back in the industry.
