Silicon Lifecycle Management Gains Traction, But It’s Complicated

Issues persist about how and where to add it in, and how to manage data; AI will help.


Silicon lifecycle management (SLM) is gaining ground in semiconductor design and test by leveraging specialized on-die sensors and analytics engines to improve power, performance, yield, and reliability.

Most modern SoCs reduce guesswork by leveraging design-for-test (DFT), which includes adding memory built-in self-test (BiST) or improving functional coverage, but these tests were meant for verifying connectivity and basic functionality. What happens when the next level of observability and analytics is needed to improve power, performance, yield, and reliability? These next-level analytics are driving the adoption of SLM platforms.

The targeted analytics within SLM enable design optimizations at each stage of the design cycle, from pre-silicon to in-field operation. As SoCs and multi-die assemblies grow in size, complexity, and cost, expanding visibility is important. While progress is being made, there are challenges to overcome, including data governance, interoperability, and the importance of use-case-specific ROI. SLM also can be used to address complex issues like thermal management and power delivery networks, as well as enhance reliability and performance across the lifecycle of semiconductor devices.

Still a burgeoning technology
SLM is part evolutionary, part revolutionary. It’s a way of ensuring that increasingly complex and expensive designs don’t fail under real-world workloads. These can include multi-die assemblies in advanced packages, or SoCs developed at the latest process nodes. But regardless of the application or the architecture, SLM provides a way of monitoring a device’s behavior and marshaling whatever resources are needed to ensure it continues operating within spec throughout its projected lifetime, which may be up to a decade or more.

“The idea of SLM is actually quite new,” observed Stephen Pateras, vice president of marketing and business development at Synopsys. “Does anybody know how old SLM is? Is it 10 years or more? Both Google Gemini and ChatGPT said the first reference to silicon lifecycle management in literature is in October 2020.”

Since its inception, SLM has evolved into a multi-faceted collection of tools and methodologies for extending the expected lifetime of chips. “We hear about SLM in the context of silent data errors, these mysterious defects that are hard to detect,” noted Geir Eide, senior director, product management, Tessent Silicon Lifecycle Solutions at Siemens EDA. “We hear about SLM in the context of RAS (reliability, availability, serviceability), power and performance analysis, and extending the lifetime of designs, then doing all of this throughout the entire lifecycle, from design to deployment and every stage in between. Typically, we do this by collecting a lot of data from a lot of different instruments, sensors, and monitors, then processing it through the magic of analytics and AI, to produce actionable results for a lot of different stakeholders. It’s very ambitious, so it’s natural to ask questions like, ‘Are we trying to boil the ocean here, where some organizations might just need a glass of hot water?’ The way to look at SLM is that it’s not just one thing. It is a collection of opportunities and solutions that depend on your type of design, the market you’re in, and so forth. The right thing for you is probably a subset of sorts, not necessarily everything. So SLM, as a concept, is relevant for pretty much every single chip in every single market. But we’re talking about slightly different subsets.”

Put in perspective, SLM provides insights into what’s happening inside an increasingly complex chip or multi-die assembly, where everything from process variation to workload-dependent thermal gradients can affect performance or signal integrity. “From a test and reliability perspective, life has gotten a lot more challenging,” said Vikram Karvat, chief operating officer at Movellus. “You have a lot more spatial variation. You have increasing current densities. I remember when an add-in card for PCIe was spec’d to 25 watts, and it was an act of God to get 27 watts provided by a server vendor. Now we’re talking about 1,300 watts, going to 2,000 watts probably this year. We are now starting to have some fundamental physics challenges around things like thermal management. And when you start thinking about things like thermal management, this is no longer a function of just doing production tests. This is now starting to creep into what happens in the field, and that actually gets worse from another perspective, which is workload.”

Previously, processors were characterized with certain benchmark workloads, and they had certain thermal/electrical profiles. “We developed thermal design max power based on those things,” Karvat noted. “You’re now entering an environment where you could start designing a chip today, it gets deployed maybe in two years, and stays in the field for three years. You’re talking about five or six years, and over that time, something like a complex AI model may have evolved a dozen times over that same period. So now the representative workload that you tested, depending on when you manufacture the chip, may have no bearing on the current reality.”

Multiple chiplets in an advanced package only add to the complexity. “Multi-die systems have been around for a while as SIPs or MCMs, with chiplets a much more complex version of these,” he said. “But MCMs are fundamentally difficult to test. The impact of possible yield fallout is tremendous. This means you need to have predictive mechanisms to see where something might happen, because after you assemble this device, you might have a cost of goods on that device that might be substantial, and you cannot afford to throw any of those dies away. You cannot afford to have failures in the field. We now have a different set of problems that we’re dealing with as an industry, and we’re going to have to evolve, whether it’s SLM or the new DFT. Fundamentally, our view of how to manage and test silicon has to change from ‘I’m responsible for this silo, or this silo,’ to ‘I need to start thinking about this across the entire lifecycle.’”

Distinguishing between PLM, SLM, and ELM
It doesn’t help that the industry is filled with acronyms, some of which are used more than once for very different things. But for purposes of chip reliability, the key approaches are SLM, product lifecycle management (PLM), and engineering lifecycle management (ELM).

For IP, the focus is on PLM. “Customer silicon products based on IP may last long past the PLM lifecycle of the IP itself,” said Nandan Nayampally, chief commercial officer at Baya Systems. For example, the ARM7TDMI from 1995 still ships in very high volumes, decades after it reached its end of life. “Digital IP is generally process and foundry-agnostic. Hence, it’s defined by feature set, capability, performance, etc. The emerging idea of SLM (DFT+BiST+more) is an added and somewhat orthogonal lifecycle for devices in the field, albeit it needs IP to be cognizant of the monitors/data that inform SLM. Similarly, a chip product, such as BRCM’s Tomahawk 5, may be sold for a decade. However, for each Tomahawk 5 chip, there may be a specific lifespan, based on the silicon conditions and usage, that is less than the traditional product lifecycle. Simply put, PLM manages a particular product line, while SLM manages a particular instance of that product.”

Others agree. “We see SLM as a distinct discipline from broader product lifecycle management,” said Andy Nightingale, vice president of product management and marketing at Arteris. “SLM focuses on the in-life behavior of silicon — its reliability, power, performance, and data integrity over time. Our IP helps close the loop between design intent and real-world operation. This is critical because logical correctness, proving a design meets its functional spec, is insufficient. Without design-time physical awareness, the implementation, performance, and reliability can all suffer post-silicon. We address this by ensuring that SoC interconnects and hardware/software interfaces are designed with physical and timing awareness early in the design cycle, enabling SLM to use timing and performance data downstream while reducing risk early in the design cycle.”

ELM adds another nuance into this field. “We build scopes, lab equipment, etc., and therefore ELM is a very similar context, to some degree, but a bit different depending on the scale of what it is,” said Simon Rance, general manager and business unit leader, Process and Data Management at Keysight Technologies. “ELM for silicon, for us, is SLM. It’s not PLM. It really is the same thing, regardless of whether it’s silicon as an application or whether it’s a system-based application, and so on, but we do clearly distinguish between PLM, ELM, and SLM.”

The promise of SLM
The overarching goal of SLM is to provide tools, techniques, and monitoring embedded in silicon to realize and de-risk silicon throughout its expected lifetime. “SLM can enable better, faster design implementation, and de-risk silicon success (like DFT/DFY before it),” Baya Systems’ Nayampally said. “It can make it robust and adaptable in the field (BiST + redundancy), while enhancing it with other sensors — monitors that provide indications on silicon health. It takes the fundamentals of RAS and extends them.”

Put another way, SLM aims to improve post-silicon visibility, predictability, and trustworthiness by instrumenting, analyzing, and responding to how chips behave in real-world conditions. “However, the value of SLM starts at design time,” Nightingale noted. “This can be enabled by embedding traceability and optimization into the interconnect and integration layers through interconnect topologies that are not just correct-by-construction, but also physically aware and optimized. So when field data is captured, it reflects real design intent and is actionable. Again, logical correctness doesn’t guarantee lifecycle success. SLM relies on deeper optimization.”

SLM also opens up opportunities for novel approaches. “Our platform currently focuses on pre-silicon flows — helping teams get RTL, verification, and constraints right before tape‑out — but we view SLM as a natural extension of our roadmap,” said William Wang, CEO of ChipAgents. “Much of what SLM tries to accomplish in the field, such as adaptive guard‑band trimming, path‑delay analytics, and wear-out prediction, starts as human‑language requests from test or reliability engineers. ChipAgents already turns plain‑English specs into RTL fixes and verification goals, so we can ingest those late-stage insights and push changes upstream early enough to boost first‑pass yield.”

The goal across all of these approaches is increased visibility. “The ultimate promise of SLM is finding things that would probably never be found, possibly, and trying to shorten that lifecycle,” said Keysight’s Rance. “One of the ultimate goals is improving yield, reliability, security, time-to-market, or operational costs. Is it a challenge to put those components and blocks inside the design to support the telemetry, the test, and debug capabilities? Yes, it is. In fact, we’ve found that typically it takes experts with quite a lot of expertise to do that efficiently. So the downside is the upfront effort, but it does reap a lot of reward downstream, because you’re taking that shift-left type of approach, where you’re finding things almost in real-time. You’re finding them way in advance. You can iterate, refine a lot earlier, a lot quicker. Otherwise, they’re found in the field way after you can do something on the current version. So that tradeoff is there, and that’s why we see that some customers have adopted it.”

An extension of DFT
When developing this technology five years ago, technologists discovered that SLM and test are synergistic. “You can essentially extend DFT in many ways, conceptually and in practice, to create an SLM solution,” Synopsys’ Pateras said. “The first baseline of SLM is the idea of managing all aspects of silicon. We refer to silicon health through all lifecycle stages, which can be broken down into four components. First is monitoring, such as instrumentation, which obtains data from the silicon. To do that, you need to add things that allow you to extract information, such as monitors. And since DFT is a monitor, all forms of DFT that measure things about the silicon fall into that category. Second is transport. Once you have things to measure, you want to be able to grab that data and bring it somewhere where you can analyze it. Third is analysis of that data, which could be done locally within a chip, at the top level of the chip, or off-chip in the system or the cloud, depending on what you want to achieve and what the latency is of what you’re trying to achieve. Fourth is acting on that analysis. You figure something out, then ask what to do with that information. You want to do something if the device is aging. You may want the device out of circulation. You may want to invoke some redundancy. There are different things you can do, depending on what you’re trying to achieve.”


Fig. 1: The components of SLM merge with test functions as they break into different vertical segments. Source: Synopsys
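
The four components Pateras describes (monitor, transport, analyze, act) can be pictured as a simple data pipeline. The Python sketch below is only an illustration of that flow, not any vendor's implementation. The monitor names (`path_delay_ps`, `temp_c`), the limit values, and the `invoke_redundancy` action label are all hypothetical, and the "analysis" is reduced to comparing a mean reading against a fixed limit.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    monitor: str   # hypothetical monitor name, e.g. "path_delay_ps"
    value: float   # measured value in that monitor's unit


def monitor(raw_samples):
    """Stage 1 (monitoring): tag raw on-die samples as readings."""
    return [Reading(name, value) for name, value in raw_samples]


def transport(readings):
    """Stage 2 (transport): deliver readings to the analysis point.

    Modeled here as grouping values per monitor; a real system might
    move the data on-chip, to system firmware, or to the cloud.
    """
    grouped = {}
    for r in readings:
        grouped.setdefault(r.monitor, []).append(r.value)
    return grouped


def analyze(grouped, limits):
    """Stage 3 (analysis): flag monitors whose mean exceeds its limit."""
    return {
        name: mean(values) > limits[name]
        for name, values in grouped.items()
        if name in limits
    }


def act(flags):
    """Stage 4 (action): pick a response for each flagged monitor."""
    return {
        name: "invoke_redundancy" if exceeded else "no_action"
        for name, exceeded in flags.items()
    }


def run_pipeline(raw_samples, limits):
    return act(analyze(transport(monitor(raw_samples)), limits))


# Illustrative values only; not taken from any real device.
samples = [
    ("path_delay_ps", 410.0),
    ("path_delay_ps", 430.0),
    ("temp_c", 71.0),
    ("temp_c", 69.0),
]
limits = {"path_delay_ps": 400.0, "temp_c": 95.0}

result = run_pipeline(samples, limits)
print(result)
```

With these made-up numbers, the mean path delay exceeds its limit while the temperature does not, so only the path-delay monitor triggers an action. Where each stage runs (locally, at chip level, or off-chip) depends on the latency the use case can tolerate, as the quote notes.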

Pateras pointed to similarities between what historically has been done for test and DFT, and what now is being suggested for SLM. “At the very beginning, it’s all about adding capabilities into the chip so you can do something and understand it better,” he said. “DFT has been around for decades. We all know that as scan chains and memory, maybe memory BiST, logic BiST, or BiST of other components. It’s still using that, but is now looking for other aspects of the silicon, such as environmental values, like PVT monitors, or understanding structural issues like delays on paths, or even understanding what data patterns are being transmitted across busses with the idea that you want to add these things to the chip during implementation, so it’s a natural extension to what you can do.”

To achieve all this, especially to fully manage the silicon in the field, whether it be a phone, data center, CPU, or an ECU in a car, the process must be fully managed. “You need a full hardware/software stack to be able to do everything mentioned, generate the data, extract it, port it, analyze it, and react to it,” he said. “And at some point, we defined nine layers of that hardware/software stack, starting with the actual monitors, going to the ability to extract the data hierarchically throughout the chip, having on-chip firmware to be able to do local analysis, look at local metadata transformations, be able to then use telemetry to go off-chip to the cloud to look at large amounts of data over time for a large number of devices. Then, having the cloud, you’ll have the ability to store data and target purpose-built analytics to do different things, different optimizations, whether it be improving the performance or improving reliability over time. And then, ultimately, you take action. For that, you need to be able to see what’s going on, understand it, and react to it. This is all to say that SLM is an extension to test and requires an infrastructure, which is happening today.”

Applying SLM approaches
When engineering teams are ready to start integrating SLM into their design approaches, Arteris’ Nightingale said they should ensure their design data is structured, traceable, and lifecycle-ready. “This means adopting platforms for IP and register consistency and for automated NoC generation that embed timing closure and layout awareness. Instrumentation for SLM is only as good as the accuracy and stability of the infrastructure beneath it. If teams treat SLM as an afterthought, they’ll struggle with fragmented or unreliable telemetry. Logical correctness might tell you the chip should work, but lifecycle readiness requires infrastructure that explains why it didn’t work in the field.”

Nayampally noted there also should be clarity on goals, levels of robustness, redundancy and PPA requirements, and expected silicon lifecycle, which varies per vertical and sub-vertical, such as data center, network edge, automotive, etc. “This should inform the level of telemetry and sensing involved. Most importantly, the flexibility of the IP, or design components needed to support this SLM, and the software/firmware or built-in silicon logic to address in-field management.”

Here, Rance notes that when it comes to silicon, managing the IP is paramount. “What’s key is having those debug and trace IP blocks that you can insert into the right places, integrate them with the design blocks, so that you can have that insight and that visibility into the design. Where you hook them in, and how you do it, is a little bit of a black art. At the moment, we see a lot of system architects get involved in that. We hear from engineering managers that they’d rather not have their system architects do all of this type of work, so this is where AI and machine learning are going to help a lot, leveraging the SLM from a successful lifecycle all the way through, having that data from all of the pieces, and being able to create models and learn from them. I see this upfront challenge where a lot of companies are hesitating to embrace SLM, but they’ll leverage it because they’ll use AI to essentially do what the system architects are doing to get it right the first time.”

Conclusion
SLM is not yet broadly adopted in the chip industry, but that’s about to change as AI/ML is added into the process, as SLM increasingly is integrated with DFT, and as demand for reliability over longer lifetimes continues to increase.

“Looking at this from a DFT point of view, when we compare SLM with DFT, we see this as expanding DFT in two dimensions,” said Siemens EDA’s Eide. “We’re going from the concept of pass/fail to collect a much richer data set describing the health of a chip, etc., so it’s about measuring more things. The other dimension is doing this across the lifecycle of a design. It’s about lifetime deployment, and how much of this applies to your design in the organization depends on what type of market you’re in. Especially for those of us coming more from the test side, it’s no longer about interacting with the chip. It has to do with system-level actions and communicating from the system level, not just from a test server or from the I/Os, but from a system level to the instrumentation, and performing the actions.”
