In-Chip Monitoring Becoming Essential Below 10nm

Complex interactions and power-related effects require understanding of how chips behave in context of real-world use cases.

Rising systemic complexity and more potential interactions in heterogeneous designs are making it much more difficult to ensure a chip, or even a block within a chip, will function properly without actually monitoring that behavior in real time.

Continuous and sporadic monitoring have been creeping into designs for the past couple of decades. But it hasn’t always been clear how effective these approaches are, how much they will cost in terms of resources, and how far afield of a block or a chip these kinds of techniques should extend. This is especially true for safety- and mission-critical systems, where a design needs to be fully functional for extended periods of time, and in systems of systems, where the context extends well beyond the individual chips being designed.

This becomes more difficult at each new process node, as well. For example, at 40/28nm, when designers first began to view heat as a first-order issue, software developers started to look at deploying sensors to monitor power usage. But at those nodes extensive monitoring was still considered optional. At 16/14nm, some sort of monitoring and self-test became essential. The introduction of finFETs created a spike in dynamic power density, as well as on-chip hotspots caused by heat trapped between the fins.

The problem is getting worse at 10/7/5nm, and with that so is the need for more consistent monitoring. The number of causes for thermal issues is rising by the node, and often by the application. Thinner copper wires and gate oxides, new architectures that emphasize throughput and always-on circuitry, and various types of switching and power noise all add up to the need for a better understanding of what is happening inside a chip. In the past, much of this was dealt with by adding margin, but that is no longer possible at these nodes. So the key now is how to identify problems as they arise, and particularly before they impact signal integrity. This is why in/on-chip sensing, and much more extensive simulation, are getting much more attention these days.

“There are more on-chip sensors to track process, voltage, and temperature across the die,” said David Stratman, senior principal product manager at Cadence. “This is increasingly important for high-reliability markets like automotive, battery-driven markets like mobile, and TCO/ROI-driven networking and HPC markets, including the growing AI accelerator and training sectors.”

The primary issue is rising complexity and a spectrum of possible interactions.

“It’s systemic throughout the industry,” said Gadge Panesar, CTO of UltraSoC. “Even so-called simple systems have one to four cores or more, peripherals, memory, and things don’t work. There’s hardware acceleration, there’s heterogeneous architectures, along with resources that are shared across different competing tasks. But it’s not about the core. It’s about the system. The whole system has to work. You could have an Arm, a RISC-V core, MIPS, or your own homegrown stuff. They’re not useful unless you make the whole system work. Such is the case that today, verifying something in one little bit is not enough.”

Consider a server, for example. A single Google search will touch around 1,000 CPUs, so verifying blocks in isolation is no longer sufficient to guarantee functionality.

“You need to try and verify whole SoCs, followed by the whole system. In order to optimize and understand the behavior, you need visibility of that system to understand what your performance is going to be,” Panesar said. “If you do a search and it is 99% complete within 1 millisecond, and 1% of those searches take more than 1 millisecond, you’ll get a third in the long tail—that’s one in 100. If you have one in 10,000 searches do that, you’ll still get 20% that will take more than one millisecond. That equates to performance loss, money, power. This is where the cost is, in searches.”
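
The arithmetic behind long tails can be illustrated with a simple fan-out model. The sketch below is not Panesar's calculation. It assumes each server independently misses the latency target with probability p, and that a search has to wait for every server it touches. The function name and the fan-out values are assumptions chosen for illustration; only the figure of roughly 1,000 CPUs per search comes from the text above.

```python
def slow_search_fraction(p_slow: float, fanout: int) -> float:
    """Fraction of searches delayed by at least one slow server.

    Assumes servers are slow independently of one another, which real
    fleets only approximate.
    """
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be a probability")
    if fanout < 0:
        raise ValueError("fanout must be non-negative")
    # A search is fast only if every server it touches is fast.
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # (per-server slow probability, servers touched) pairs are hypothetical.
    for p_slow, fanout in [(0.01, 100), (0.0001, 1000), (0.0001, 2000)]:
        frac = slow_search_fraction(p_slow, fanout)
        print(f"p={p_slow}, fanout={fanout}: {frac:.1%} of searches are slow")
```

Under these assumptions, a slow response on just one in 10,000 requests per server still delays roughly one search in ten at a fan-out of 1,000, which is why rare events inside one chip matter at system scale.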

Case in point: Google reportedly found a bug in its fleet of servers that had been there for years. A consultant spent a number of months analyzing the system and concluded it was down to long tails. The fix paid for his salary for the next 10 years.

This is why in-chip monitoring is becoming so critical. It provides insight into the actual operation of a system in the context of how it is being used. In-chip monitoring is being used to provide real-time information on temperature, delay, and dynamic voltage drop, all of which can impact the operational correctness, reliability and aging of the chip or 3D-IC, said Norman Chang, chief technologist for the semiconductor business unit at ANSYS.

In particular, Chang said the thermal gradient for complicated 3D-ICs may be affected by many combinations of use cases involving multiple chips at different process nodes. The interface material between chips will be vulnerable to overheating, as will the heavily active areas of individual chips.

System-wide analytics can help significantly. So rather than just debugging a core, that core can be viewed in the context of the whole SoC with run-time analytics.

UltraSoC’s Panesar said this kind of IP needs to be highly parameterizable so the engineering team can decide how to allocate resources based on cost, die area and power. By adding in configurability at runtime, the same hardware can be used to provide different data for different scenarios.

From optional to essential
For in-chip monitoring at advanced nodes, PVT monitoring IP now should be considered foundation IP, said Oliver King, CTO at Moortec. “It is as fundamental to an advanced node SoC as I/O cells, PLLs and standard cells. In-chip monitoring started out as an insurance policy some years ago, but is now very much part of the SoC architecture and it enables substantial power savings. That said, this now places a new level of importance on the IP from a functionality point of view. The robustness of the IP is a key requirement.”

On-chip monitoring can be broken into two pieces, according to Steve Pateras, senior director of marketing for test automation at Synopsys. “The monitoring can be done at different points in the development and test lifecycle. You certainly can do monitoring or measurements during manufacturing test, or during system-level test where you use the whole system, but you still want to run different tests within the system to understand its operation. Generally, on-chip monitoring is used for reliability purposes, and that’s usually during the functional operation, during the field portion in the life cycle. That means that you’re not connected to something. You’re not connected to a tester or from a bench top. The chip is in the system, the system is in the field. How do you get that information? That’s where telemetry comes in.”

Ways to gather data and communicate in two directions are required, he explained. With some technologies, software runs within the system and directly links into the various on-chip capabilities. “This could be things like logic BiST and memory BiST—various forms of built in self-test where you’re running this either as a key-on or key-off operation, or even periodically in which case the telemetry must be sent somewhere, not only off-chip but off-system. Then, there are different levels of communication. For one thing, there must be a chip linkage into the system.”

Here, a number of tools providers are working on solutions at this time to go from the chip level to the board-level bus, system bus, or off-chip to a centralized processor, whether it be a service processor or a safety processor. “That safety processor then needs to connect to the outside world. That’s a bit more of an open problem as there is no standardization. It’s up to the system manufacturers to figure that out,” Pateras said.

Tesla already communicates with the car’s operating system, he said, which means that telemetry already occurs. “When you are driving your Tesla and it’s sending data to Tesla constantly over the cloud, they have an LTE network so the car is always connected, sending telemetry back to Tesla. In their case, it’s things like positioning and other operational parameters get sent back to Tesla, and they can obviously send data back to each of the cars. They can broadcast data. They do that for software updates on a periodic basis so that infrastructure is already there. The issue then is what additional telemetry can we send?”

At the moment, many engineering groups would like much more detailed test, diagnostic and predictive data, which can then be sent over the airwaves. “Currently, there are things that are test related—more like traditional tests such as BiST to be able to test the logic, the memories, the I/Os,” Pateras said. “BiST tends to be more of a periodic kind of test, where it happens at certain moments in time, at power on, or increasingly at various intervals. But it’s not continuous monitoring. To do that requires a potential to break up the BiST test so that it runs every few milliseconds in order to monitor lower-level things like power supplies and clock networks.”
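
Breaking a test into slices that run every few milliseconds can be pictured as a simple schedule. The sketch below is only an illustration of that idea, not how any BiST controller is actually implemented. The function names, the integer stand-ins for test patterns, and the 5 ms period are all hypothetical.

```python
from itertools import cycle
from typing import Iterator, List, Sequence, Tuple


def split_into_slices(patterns: Sequence[int], slice_size: int) -> List[List[int]]:
    """Split a full pattern set into consecutive slices of at most slice_size."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [list(patterns[i:i + slice_size])
            for i in range(0, len(patterns), slice_size)]


def slice_schedule(patterns: Sequence[int], slice_size: int,
                   period_ms: int) -> Iterator[Tuple[int, List[int]]]:
    """Yield (start_time_ms, slice) pairs, one slice per period.

    The full pattern set repeats indefinitely, so every pattern is
    revisited once per len(slices) * period_ms milliseconds.
    """
    slices = split_into_slices(patterns, slice_size)
    if not slices:
        return
    for tick, chunk in enumerate(cycle(slices)):
        yield tick * period_ms, chunk
```

With five patterns, a slice size of two and a 5 ms period, the schedule runs patterns 0 and 1 at 0 ms, 2 and 3 at 5 ms, pattern 4 at 10 ms, and then starts over at 15 ms. The trade-off in a model like this is between slice size, which bounds how long functional operation is interrupted, and the time needed to cover the whole pattern set.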

All of this data then needs to be analyzed together, using data analytics and machine learning, to see if there are trends, he pointed out. “You’re not necessarily looking at just failures,” he noted. “You’re looking at performance data to see if it’s just trending toward some kind of failure in the future. You want to be able to predict that so a lot of analytics are involved that include the IP on the chip together with this data.”
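
One simple way to picture this kind of trend analysis is a straight-line fit to a monitored parameter, extrapolated to a failure limit. The sketch below is a minimal illustration under stated assumptions, not the analytics Pateras describes. The function name, the idea of an upper limit on a drifting reading, and any sample values are hypothetical, and production systems would likely use richer models.

```python
from typing import Optional, Sequence


def time_to_limit(times: Sequence[float], values: Sequence[float],
                  limit: float) -> Optional[float]:
    """Estimate when an upward-drifting reading will reach an upper limit.

    Fits values = slope * t + intercept by ordinary least squares and
    returns the time at which the fitted line crosses `limit`, or None
    if the fitted trend is flat or falling.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("sample times must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v)
                for t, v in zip(times, values)) / var_t
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None  # not trending toward the upper limit
    return (limit - intercept) / slope
```

For example, a path delay that reads 100, 101, 102 and 103 ps at 0, 100, 200 and 300 hours fits a slope of 0.01 ps per hour, so a hypothetical 110 ps limit would be reached at about 1,000 hours. That estimate is what lets a system flag a part before it actually fails.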

Security

Beyond reliability, another area gaining ground is in-chip monitoring for security.

“You want to be able to monitor for any kind of attacks, which is even more critical the more we rely on automated driving and automated functionality in the car,” Pateras said. “We must ensure there’s no hacking occurring, so it needs to be monitored. Activity coming into the chip must be monitored to protect against attempts at external accesses to certain buses on the chip. The chip also needs to be monitored to avoid certain behaviors, or to at least flag activities as unsanctioned activity. This area is not as far along as on the reliability side, but it’s something we need to focus on,” Pateras stressed.

From a reliability and security point of view, data centers are looking at almost exactly the same requirements as functional safety, given that they have to be up and running 24/7. If anything, security is even more important in data centers, he added.

Panesar agreed. “If we are observing what’s happening in the target system, we can look for things that should happen but haven’t happened, or things that have happened that shouldn’t happen. That provides a layer of security. This doesn’t replace the security that already exists, but it provides a monitoring for non-determined behavior. From that, alarms can be raised, which are reacted upon by the target system depending on what they want to do with it.”
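
The two checks Panesar describes, expected events that never occur and events that should never occur, can be sketched in software. The example below is a toy model, not UltraSoC's implementation. The class name, event names, and deadline values are all hypothetical, and in real silicon this logic would live in monitoring hardware rather than in Python.

```python
from typing import Dict, List, Set


class BehaviorMonitor:
    """Flags missing expected events and forbidden observed events."""

    def __init__(self, expected: Dict[str, float], forbidden: Set[str]):
        # expected maps an event name to the maximum allowed gap (seconds)
        # between occurrences; forbidden lists events that must never occur.
        self.expected = dict(expected)
        self.forbidden = set(forbidden)
        # Monitoring is assumed to start at time 0.0.
        self.last_seen: Dict[str, float] = {name: 0.0 for name in expected}

    def observe(self, event: str, now: float) -> List[str]:
        """Record one event; return an alarm if it is forbidden."""
        if event in self.forbidden:
            return [f"forbidden event: {event}"]
        if event in self.expected:
            self.last_seen[event] = now
        return []

    def check_deadlines(self, now: float) -> List[str]:
        """Return an alarm for each expected event that is overdue."""
        return [
            f"missing event: {name}"
            for name, max_gap in sorted(self.expected.items())
            if now - self.last_seen[name] > max_gap
        ]
```

A monitor built as `BehaviorMonitor({"heartbeat": 1.0}, {"debug_port_access"})` would raise one alarm if a `debug_port_access` event is observed and another if no `heartbeat` arrives for more than a second. What the target system does with those alarms is left open here, as it is in the quote above.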

Conclusion
Given the complexity of advanced nodes, the challenges of heterogeneous systems, and systemic complexity, visibility into a system for reliability and security is an absolute must for SoCs today. While it is not clear which methods for collecting data and monitoring systems will stick around over the long term, customers are likely to adopt more than one in order to gain the best insights into their designs.

There may be a penalty to pay in power, performance and area (PPA). But failure is not an option, so in-chip monitoring technologies are coming to bear as the competitive nature of the semiconductor ecosystem continues to churn.


