Understanding chip behavior, performance, and aging from the inside.
Monitoring has become an increasingly important way to optimize yield, performance, and uptime in systems that use complex integrated circuits, but not all monitoring is the same.
In fact, there are multiple levels of monitors. In many cases, they can be used together to help solve problems when something is amiss. They also can be used to help identify who in the supply chain owns the fix.
“If the system is not working, how do I find the root cause and what production stage the problem was introduced in so I can avoid such a problem in the future?” asked Noam Brousard, vice president, systems at proteanTecs. “This is a big issue for data centers, for instance, because they have such a long supply chain. The ability to pinpoint the problem will tell you where the responsibility lies. Today, many of the investigations end with ‘No Trouble Found’ results.”
Monitors can operate at the lowest levels, measuring basic physical properties of a feature of a chip. They also can operate at higher levels of abstraction, focusing on information flow and transactions. The real promise comes from being able to run analytics that span the layers. Analytics and drill-down can help to identify the low-level cause of a high-level issue.
ProteanTecs and Siemens illustrated the point in a joint webinar showing how high-level and low-level monitoring can work together to reveal root-cause information about high-level problems. “Our work with [Siemens subsidiary] UltraSoC is very complementary,” said Brousard.
“We work at the functional level,” said Gajinder Panesar, fellow at Siemens EDA. “We’ll have other people who provide analog monitors, and they will also be interfacing to our modules.”
Monitoring basics
“Monitoring” is a generic term for a variety of technologies used to keep track of the performance of chips. Many of them target chips while they’re in service. Others augment manufacturing inspection processes. The intent is to generate data about the chips that can be used to optimize performance and yield.
Many of these sensors are quite small, allowing them to be placed in various “whitespace” (gaps between transistors) and “grayspace” (space between IP blocks) areas. Monitor size will determine whether there is a die-size impact and, consequently, how many monitors to place.
“We see strategic placement of thermal and voltage supply sense points, whereas there is often a desire to place process monitoring circuits in regularly patterned arrangements,” said Stephen Crosher, director of SLM hardware strategy at Synopsys.
Where the monitor data goes can vary. Some can be used for instant on-die decisions. An example would be a temperature sensor that informs a controller that the clock frequency should be dialed down if the chip gets too hot. Performance might get worse, but the chip will survive the event.
Others send data to the cloud or some other data center or server. Such data can’t be used for real-time decisions because the data transit time is too long, but it can be used to track behaviors and to correlate them with their processing history.
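As a rough illustration of that on-die versus off-die split, the sketch below shows a local throttle decision made immediately, while the raw sample is only queued for later upload and analysis. The threshold values, the sensor helper, and the record format are hypothetical and not tied to any vendor's interface.

```python
import time

TEMP_LIMIT_C = 95          # hypothetical junction-temperature ceiling
THROTTLE_STEP_MHZ = 100    # hypothetical frequency back-off per violation

def read_temp_sensor(sensor_id: int) -> float:
    """Placeholder for an on-die temperature read, in degrees C."""
    return 72.0  # stub value for illustration

def control_loop(freq_mhz: int, telemetry: list) -> int:
    """One pass: an immediate on-die decision plus a queued telemetry record."""
    temp = read_temp_sensor(sensor_id=0)
    if temp > TEMP_LIMIT_C:
        freq_mhz -= THROTTLE_STEP_MHZ   # real-time, local decision
    # The raw sample goes to the cloud/analytics path; it is not acted on here.
    telemetry.append({"t": time.time(), "sensor": 0,
                      "temp_c": temp, "freq_mhz": freq_mhz})
    return freq_mhz
```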
While “monitor” is often used as a generic term, two other terms are common. “Sensor” is how many companies describe their monitors, while “Agent” is the name used by proteanTecs. The distinction has to do with the output of the monitor. Sensors deliver human-understandable measurements and units, and that on-die conversion, according to proteanTecs, has a strong impact on monitor die area.
Agents, by contrast, capture raw data and send it untransformed to the cloud, where it is further interpreted and turned into meaningful data. ProteanTecs claims these agents can be much smaller in die area because they do minimal processing of their measurements.
“The Agent measurements are communicated to our software platform, which resides either in the cloud or on the customer’s premises, where we fuse the data together and apply machine learning algorithms and advanced analytics to create new and deep data that wasn’t available before,” said Brousard.
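The distinction can be pictured with a minimal sketch, shown below. The calibration constant, IDs, and record format are invented for illustration; the point is only where the interpretation happens.

```python
RO_COUNT_TO_MHZ = 0.125   # hypothetical calibration constant

def sensor_reading(raw_ring_osc_count: int) -> float:
    """'Sensor' style: convert on-die to a human-readable unit (MHz).
    The on-die conversion logic is what costs extra area in real silicon."""
    return raw_ring_osc_count * RO_COUNT_TO_MHZ

def agent_reading(raw_ring_osc_count: int) -> dict:
    """'Agent' style: ship the raw count; interpretation happens off-chip."""
    return {"agent_id": 42, "raw_count": raw_ring_osc_count}

print(sensor_reading(8000))   # 1000.0 MHz, interpreted on-die
print(agent_reading(8000))    # raw payload, interpreted by off-chip analytics
```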
Monitor size may be traded off against accuracy, especially if analog circuits are involved. “We do see that analog-based sensors will always be required because of the accuracy they deliver,” said Crosher. “Ring oscillators or logic-based solutions just don’t cut it in terms of accuracy.”
While different monitors share these high-level characteristics, they differ widely in what they measure and how they affect performance. They can be loosely organized into layers, with each layer being more abstract than the layer below it.
Monitors that help with die inspection
At the very lowest level are sensors that assist with process inspection. Many of these last only as long as the layer being inspected remains uncovered. Because the die isn’t complete yet, no actual circuits can be used.
Before the next layer is laid down, an e-beam probe reads the structures in place to confirm that critical metrics are met. “Because of our flexible and precise capability for e-beam placement, our system is uniquely suited to stimulate and measure passive voltage contrast vs. conventional raster-scanning e-beam tools,” said Dennis Ciplickas, vice president of advanced solutions at PDF Solutions. Once the next layer is in place, the prior layer’s sensors are no longer accessible.
Using an e-beam allows inspection of tiny, targeted features. If they are measured using more conventional aggregate current techniques, those features would get completely lost. “You can’t measure it with anything else,” said Ciplickas. “It’s like looking for the nanoamp leakage in hundreds of milliamps of standby current to see that there’s this little tiny guy in there that might eventually pop.”
Some such monitors survive to completion, and they can be read by test equipment at wafer sort or final test. Some of those will be placed within the die, while others are placed in the streets or scribe lines between dies. Those can be measured as long as the wafer remains intact. Singulation destroys them, leaving only structures within the die.
The on-die structures can show both how the raw die emerged from processing as well as any changes during packaging and system assembly. “Having these structures available for comparison testing, both in wafer form and in package form, allowed the customer to go back and properly guard-band the upstream operations to eliminate downstream losses,” said Mike McIntyre, director of software product management at Onto Innovation.
While the focus with such monitors tends to be on process control, it also can be helpful for device characterization. “In today’s foundry-dominated manufacturing environment, a secondary reason for having more on-die test structures has come about because the scribe-line structures in place today are there for fabrication process monitoring and not device characterization,” said McIntyre. “If they would like to comprehend baseline characteristics important to the performance and yield of their part, these designers are required to include on-die structures for just such characterization.”
McIntyre also pointed out their utility for traceability. “In some cases, these on-die monitors can be used as a digital watermark to help trace material as it flows through the back-end, in and out of the various form factors used when other physical forms of traceability are not present or accurate,” he noted.
Monitors that assess the results of processing
Next up are monitors that provide information about the essential nature of a chip. They might be called process monitors or simply physical (or electro-physical) monitors. Due to processing variations, each chip will be slightly different, with an almost unique “DNA.” The monitors allow collection of basic data that will stay with all of the other data being collected by monitors from then on.
“Based on these Agents’ measurements, our platform can tell you, for example, if a particular chip is of the fastest process variation, the slowest variation, or one of hundreds of variations in between, in addition to telling you about the variation within the chip,” said Brousard.
For instance, if a device happens to end up in the slow corner of the process, then that fact (and the specific numbers) can be used to correlate other observations throughout the life of the chip. Trends and issues with multiple chips that have a similar DNA then can be identified. “It’s a new thing to collect data from the field across a fleet of chips and use it to optimize the fleet,” noted Ciplickas.
“We’ll look at chips of similar characteristics in a fleet of cars and compare their performance and behavior to identify maintenance and service candidates, based on actual monitoring and not preventive maintenance schedules,” explained Brousard. “Agents can gauge actual stress (e.g. workload, temperature, etc.) being applied to the electronics so that we can compare apples-to-apples the actual electric performance based on another type of Agents. The beauty of this deep data is that we are identifying these developments at a physical layer. This allows the software to indicate a need to pull in for a checkup soon, or even better, we know how to predict ahead of time when maintenance is needed so that the customer and fleet manager can better manage their time and operations, respectively.”
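The fleet comparison described above can be sketched crudely as grouping chips by their process “DNA” and flagging members that drift away from their cohort. The records, corner cutoffs, and outlier rule below are all illustrative, not any vendor's actual analytics.

```python
from statistics import mean, stdev

# Hypothetical fleet records: a per-chip process metric and a field measurement.
fleet = [
    {"chip": "A1", "ro_speed": 1.02, "path_margin_ps": 48},
    {"chip": "A2", "ro_speed": 1.01, "path_margin_ps": 46},
    {"chip": "A3", "ro_speed": 1.00, "path_margin_ps": 31},   # drifting
    {"chip": "B1", "ro_speed": 0.91, "path_margin_ps": 62},
    {"chip": "B2", "ro_speed": 0.90, "path_margin_ps": 60},
]

def corner(ro_speed: float) -> str:
    """Coarse process-corner bin from a normalized speed metric (illustrative cutoffs)."""
    if ro_speed >= 0.98:
        return "fast"
    if ro_speed <= 0.93:
        return "slow"
    return "typical"

# Group chips of similar DNA, then flag ones whose margin departs from the cohort.
cohorts = {}
for rec in fleet:
    cohorts.setdefault(corner(rec["ro_speed"]), []).append(rec)

for name, chips in cohorts.items():
    margins = [c["path_margin_ps"] for c in chips]
    if len(margins) < 3:
        continue
    mu, sigma = mean(margins), stdev(margins)
    for c in chips:
        if sigma and abs(c["path_margin_ps"] - mu) > sigma:   # loose, illustrative cutoff
            print(f"{c['chip']} ({name} corner): margin {c['path_margin_ps']} ps "
                  f"vs cohort mean {mu:.1f} ps -> maintenance candidate")
```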
Measurements at this level are very specific, and they focus on a few critical parameters. “For example, gate-to-drain capacitance matters a lot,” said Ciplickas. “Or looking at things like oxide degradation, or understanding the PMOS and NMOS devices and why they’re different, or mechanical stresses.”
It might seem like this would be a “measure once and done” thing, because after manufacturing is complete the process has been set. But these monitors also track aging effects. The process details as identified on a fresh die may be different from the same measurements on the same die five years later.
“Aging is certainly a factor that people are concerned about, particularly for markets where chips are in the field for a long time in safety-critical or security-critical applications like automotive or avionics,” said Panesar.
Some of these measurements also may change as the die is singulated, packaged, and mounted on a circuit board. In each of those steps, the physical stresses on the die may change, and those changes could be reflected in various parametrics that the monitors can measure.
“In-die sensing with an e-beam measurement in the line is one level,” explained Ciplickas as he described the steps. “Then there’s in-die sensing, measuring, and wafer sort test while you’ve got the wafer in the bowl shape; in-die sensing once you’ve diced it and put it into a package (but now it’s in a handler); and then sensing when you glue it to the board, and now the solder is holding it down and the borders are flexing and moving, and then you put the system in a rack.”
Performance monitors
Next up is another kind of physical monitor — one that measures performance. This is more than just looking at speed. It also may look at critical-path margins and other power- and speed-related parameters. It can include voltages and die temperatures, as well.
“You can determine what’s going on in a more localized sense within the die,” said Crosher. “If you’ve got a multicore architecture, you can work out the supply conditions around a CPU core, or a cluster of CPUs, or the temperature around that cluster of CPUs. You can then more tightly manage that particular area or that cluster.”
For example, if the local die temperature by one core gets too hot, then the frequency or voltage can be reduced to lighten the load. At a higher level, some of the load from that core could be moved to another core to help balance the load better.
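A minimal sketch of that kind of policy might look like the following, assuming per-core temperature reads and a fixed local limit (both invented here): throttle or offload any hot core, and place new work on the coolest one.

```python
TEMP_LIMIT_C = 90.0   # hypothetical local limit

def read_core_temps() -> dict:
    """Placeholder for per-core temperature reads, in degrees C."""
    return {"core0": 96.0, "core1": 71.0, "core2": 68.0, "core3": 83.0}

def place_next_task() -> str:
    """Flag cores over their local limit, then pick the coolest core for new work."""
    temps = read_core_temps()
    for core, t in temps.items():
        if t > TEMP_LIMIT_C:
            print(f"{core}: {t:.0f} C over limit, reduce its V/F or migrate its load")
    return min(temps, key=temps.get)

print("next task ->", place_next_task())
```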
Margins also can help to determine the lowest possible supply voltage for a specific chip. If that chip is from a slow corner of the process, then margins will be lower and there’s less room to lower the supply voltage without violating timing. On a faster die, the voltage could be dropped further for more power savings.
“If you have critical-path monitors that mimic critical paths elsewhere on the chip, you can look for the minimum supply voltage,” said Crosher.
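In principle, that search can be as simple as stepping the supply down while the critical-path monitor still reports comfortable slack. The sketch below is a toy model: the linear margin response, guard band, and step size are assumptions, not a real adaptive-voltage-scaling implementation.

```python
MARGIN_GUARD_PS = 15.0   # hypothetical guard band on critical-path slack
STEP_MV = 10

def read_path_margin_ps(vdd_mv: int) -> float:
    """Placeholder for a critical-path monitor read at a given supply.
    Modeled here as slack shrinking linearly as VDD drops (0 ps at 640 mV)."""
    return 0.25 * (vdd_mv - 640)

def find_vmin(vdd_mv: int = 900) -> int:
    """Step the supply down until the monitored margin reaches the guard band."""
    while read_path_margin_ps(vdd_mv - STEP_MV) > MARGIN_GUARD_PS:
        vdd_mv -= STEP_MV
    return vdd_mv

print("Vmin for this die:", find_vmin(), "mV")
```

A slower die would hit the guard band at a higher voltage, which is exactly the per-chip difference the margin monitors are meant to expose.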
As the chip ages, changes can help to inform process tweaks or even test changes. There’s usually a focus on keeping test costs low, so mission-mode monitoring can show which additional tests may be worth adding. “You can feed the data back to test saying that you should be running more tests on these kinds of die,” said Ciplickas. “In the field, they’re shifting and moving, and you could pre-figure that out with an additional test that you otherwise would have thought was too expensive.”
Environmental monitors
These monitors look outside the die, focusing on the package and system level. This is intended to measure whether there are any problems relating to die assembly or the soldering process or any other external manufacturing issues. In addition, they may look at ambient temperature or other environmental factors. Chip externalities are measured by comparing on-die monitor data in different chips on the board, where key differences may indicate an issue.
“Is it the chip that’s at fault?” said Brousard. “Or maybe there’s an issue with the application stress, or you have a bad voltage supply, or you have a bad clock supply, or 100 other reasons that have to do with the environment of the chip.”
In this manner, it’s possible to study two different die that are otherwise similar (with similar DNA, other parametrics, and test results) that appear to operate differently. It may be that they’re in very different environments and that those environments are affecting the parametrics.
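A rough way to picture that comparison is shown below: if every chip on a board reports the same deviation, the environment is suspect; if only one does, that chip is. The readings, tolerances, and decision rule are invented for illustration.

```python
from statistics import median

# Hypothetical on-die supply readings (mV) from several chips on one board.
board_readings = {"u1": 742, "u2": 744, "u3": 716, "u4": 743}
NOMINAL_MV = 750
BOARD_TOL_MV = 15      # illustrative tolerances
CHIP_TOL_MV = 20

def classify(readings: dict) -> str:
    """Crude separation of 'environment' problems from 'one bad chip' problems."""
    board_level = median(readings.values())
    if abs(board_level - NOMINAL_MV) > BOARD_TOL_MV:
        return "shared deviation -> suspect the board supply or environment"
    outliers = [u for u, mv in readings.items()
                if abs(mv - board_level) > CHIP_TOL_MV]
    if outliers:
        return f"localized deviation at {outliers} -> suspect those chips or their mounting"
    return "all chips consistent with nominal"

print(classify(board_readings))
```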
Logical, protocol, or transactional monitors
At this point, we make a big jump in abstraction. Instead of measuring low-level parameters that relate to individual transistors or signal paths, one can look at more abstract notions relating to how the die performs its functions.
For example, in the case of cores, critical events like cache hits and misses can be measured. For networks-on-chip – or even packets coming in from an external network through a communications port – statistics regarding the processing and routing of those packets can be tracked and reported.
“If I have a multicore situation, and I have a lot of traffic, or the division of labor between the cores is not equal, then the logical monitors will notice because they’ll be counting the number of transactions,” said Brousard.
This provides data the system builder is more likely to relate to because it’s closer to the actual application being performed. So when something is going wrong, this type of monitor is likely to provide an error report that will make more sense to the system designer. On its own, however, it may not give an indication of the root cause of a problem. It doesn’t necessarily identify who should be responsible for solving the problem.
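At its simplest, this kind of logical monitoring reduces to counting. The sketch below flags an uneven division of labor from per-core transaction counts; the counts, ratio, and core names are hypothetical.

```python
# Hypothetical per-core transaction counts reported over one sampling window.
tx_counts = {"core0": 182_000, "core1": 179_500, "core2": 12_300, "core3": 185_200}

IMBALANCE_RATIO = 0.25   # illustrative: flag cores doing under 25% of the mean work

def check_balance(counts: dict) -> list:
    """Flag cores whose transaction count falls far below the mean."""
    avg = sum(counts.values()) / len(counts)
    return [core for core, n in counts.items() if n < IMBALANCE_RATIO * avg]

starved = check_balance(tx_counts)
if starved:
    print("uneven division of labor, investigate:", starved)
```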
Fig. 1: Different monitor levels provide different functions. Working down through the layers with a unified analytics solution will help to identify the root causes of issues encountered at the highest levels. Source: Bryon Moyer/Semiconductor Engineering
Transcending monitoring levels
Where these different monitors can be particularly powerful is when they are brought together. Conceptually, when an issue is identified at the highest logical level, there is a need to drill down to identify root cause. That could mean poking through monitor data all the way down to the lowest levels of the monitors that remain functional after manufacturing.
While the inspection-level monitors that don’t survive processing may no longer be available for measurement, their data can be. And that data can help with low-level correlations if it ultimately is determined an adjustment is needed to inspection and processing settings.
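Conceptually, the drill-down is a join between layers, keyed by something like a die ID that survives assembly. The records and field names below are invented; the sketch only shows the shape of such a cross-layer correlation.

```python
# Hypothetical records from two layers, keyed by a die ID that survives assembly.
field_issues = [
    {"die_id": "W12-D045", "symptom": "packet retries high"},
    {"die_id": "W12-D051", "symptom": "packet retries high"},
]
process_data = {
    "W12-D045": {"corner": "slow", "inline_ebeam_flag": "via-chain marginal"},
    "W12-D051": {"corner": "slow", "inline_ebeam_flag": "via-chain marginal"},
    "W12-D060": {"corner": "typical", "inline_ebeam_flag": "clean"},
}

# Drill down: for dies sharing a high-level symptom, look for a shared
# low-level signature captured back at inspection or wafer sort.
signatures = [process_data[i["die_id"]]["inline_ebeam_flag"] for i in field_issues]
if signatures and len(set(signatures)) == 1:
    print("shared low-level signature:", signatures[0])
```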
It’s also possible to connect monitors from different companies onto the internal infrastructure from one company. For instance, both proteanTecs and UltraSoC have an interconnect scheme to route monitor data to a communication port. They can host each other’s monitors – or monitors from other companies.
Synopsys is in a similar situation. “We’ve deliberately designed the monitoring subsystems so that you can hook up and connect other sensors into it,” said Crosher.
Which infrastructure is best for a given application can vary. Some can handle higher data bandwidth than others, so the specific needs of the overall monitoring strategy for a chip will determine the best way to transport the data on-chip.
In addition, it may not be any-monitors-on-any-infrastructure. Siemens and proteanTecs appear to be casting their net more broadly, while Synopsys is being more selective about the partners it integrates with. “There does need to be some partnering, because the scope of opportunity is so broad,” noted Crosher.
Conclusion
Today, most tools focus on the monitors from a single company. While proteanTecs and Siemens, for example, know how their tools can be used together to solve problems, at present one has to move back and forth between the tools – with the exception of some select cross-correlations.
That’s a temporary walk-before-running approach. “Once we get enough traction, I assume we’ll be working on how our tool and their tools fit into each other as a plug-in or as an API between them,” said Brousard. “But right now, it’s just a concept.”
PDF Solutions’ Exensio tool will accept data from any source, allowing the many levels to be traversed in one place. Onto Innovation is unique in that it doesn’t make any of the monitors itself. But its software tools can accept data from a wide variety of monitors, giving another option for analytics and fault detection and classification (FDC).
As this market comes into its own, we may see greater cooperation between providers of different levels of monitors even as other companies compete with monitors at the same level. Which tools and infrastructure will embrace the most monitors, crossing between layers in an attempt to provide top-to-bottom continuity of data, remains to be seen.