Data is being generated everywhere. Will that make chips more reliable?
Strategies for building reliability into chips and systems are beginning to shift as more sensors are added to those devices and machine learning is applied to the data they generate.
In the past, system monitoring relied heavily on MEMS devices for things like acceleration, temperature, and positioning (gyroscopes). While those devices are still important, over the past couple of years there has been an explosion of sensors embedded much more deeply in systems. So instead of just determining when a device is running too hot, they can be used to sense minor temperature variations in one tiny area of a chip.
That allows some circuits to be powered down to avoid problems in a densely packed system of processors and accelerators. So in an array of contiguous processing elements, some may be on while others are off, and that pattern can shift as the areas in use heat up. The result is increased uptime, lower power, and reduced circuit aging and parasitics.
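As a rough illustration of what that kind of thermal rotation might look like in monitoring firmware, here is a minimal Python sketch. The sensor and power-gating interfaces (read_temp, set_power_state, migrate_work) and the temperature thresholds are hypothetical placeholders, not any vendor's actual API.

```python
# Minimal sketch of sensor-driven power gating across an array of
# processing elements. All interfaces below are hypothetical
# placeholders, not a real driver API.

HOT_C = 85.0    # assumed threshold for gating a tile off, in degrees C
COOL_C = 70.0   # assumed threshold for bringing an idle tile online

def read_temp(tile_id: int) -> float:
    """Placeholder for an embedded per-tile thermal sensor read."""
    raise NotImplementedError

def set_power_state(tile_id: int, on: bool) -> None:
    """Placeholder for a per-tile power-gating control."""
    raise NotImplementedError

def migrate_work(src: int, dst: int) -> None:
    """Placeholder for shifting a workload between tiles."""
    raise NotImplementedError

def balance(tiles: list[int], active: set[int]) -> None:
    """One pass of the rotation loop: move work off hot tiles onto cool idle ones."""
    temps = {t: read_temp(t) for t in tiles}
    for t in sorted(active, key=temps.get, reverse=True):
        if temps[t] < HOT_C:
            break  # hottest active tile is within limits; nothing to do
        idle = [u for u in tiles if u not in active]
        if not idle:
            break  # nowhere to shift work; fall back to throttling
        dst = min(idle, key=temps.get)  # coolest idle tile
        if temps[dst] < COOL_C:
            migrate_work(t, dst)
            set_power_state(dst, True)
            set_power_state(t, False)
            active.discard(t)
            active.add(dst)
```

In a real chip this logic would live in on-die firmware or a power-management controller rather than software, but the principle is the same: local sensor readings decide which elements run and which rest.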
This is a different way of looking at reliability, and not every device is worth this level of engineering. It adds cost, time and complexity to the design and testing processes. But for chips used in mission- and safety-critical applications at the most advanced nodes, this is just the start of what will likely prove to be a radical change for semiconductor and system design.
So rather than relying on the circuits to hold up for years under all possible stresses, the focus is shifting to a dynamic, data-driven system model. The reality is that at 5nm and below, the number of possible corner cases is growing too large to cover in a reasonable amount of time. There are too many possible interactions, use cases, and things that can go wrong in the supply chain and in manufacturing. It’s virtually impossible to test for all of them.
On the other hand, a system that can adapt to any corner case by adjusting performance and power and prioritizing utilization of shared resources is very possible. This is monitoring, provisioning, and partitioning all rolled into one, and making it work requires sensors built into every block or critical data path in a chip. The data from those sensors then needs to be sequenced so that if a problem shows up in one area, it triggers a set of protocols for how to react to it.
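One way to picture that sequencing, purely as an illustrative sketch with invented alert types and handlers, is a small dispatcher that maps each block's alerts to a pre-defined response protocol:

```python
# Illustrative sketch of mapping on-chip sensor alerts to ordered
# response protocols. Alert kinds, handlers, and the table itself are
# invented for illustration, not drawn from any monitoring IP.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Alert:
    block: str       # e.g. "npu0", "ddr_phy"
    kind: str        # e.g. "overtemp", "voltage_droop", "timing_slack"
    value: float

def throttle_block(alert: Alert) -> None:
    print(f"[{alert.block}] reducing clock: {alert.kind}={alert.value}")

def reroute_traffic(alert: Alert) -> None:
    print(f"[{alert.block}] rerouting traffic away from block")

def log_for_ml(alert: Alert) -> None:
    print(f"[{alert.block}] logging {alert.kind} sample for trend models")

# Protocol table: which responses fire, and in what order, per alert kind.
PROTOCOLS: dict[str, list[Callable[[Alert], None]]] = {
    "overtemp":      [log_for_ml, throttle_block],
    "voltage_droop": [log_for_ml, throttle_block],
    "timing_slack":  [log_for_ml, reroute_traffic],
}

def dispatch(alert: Alert) -> None:
    for handler in PROTOCOLS.get(alert.kind, [log_for_ml]):
        handler(alert)

dispatch(Alert(block="npu0", kind="overtemp", value=92.5))
```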
This is essentially the kind of failover ISO 26262 requires in automobiles, only much more granular. If a circuit fails due to a latent defect or a stray alpha particle, sensors can send alerts that no signal is passing through that area. Traffic then can be rerouted through a pre-planned routing scheme.
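A toy version of that pre-planned rerouting, again with made-up block names and a trivially small topology, might look like the following. Real network-on-chip failover would be handled in hardware or low-level firmware, not application code.

```python
# Toy sketch of pre-planned failover routing between on-chip blocks.
# The topology and path tables are invented for illustration.

# Primary and backup paths from a source block to a destination block.
ROUTES = {
    ("cpu0", "mem_ctrl"): [
        ["cpu0", "noc_a", "mem_ctrl"],            # primary path
        ["cpu0", "noc_b", "noc_c", "mem_ctrl"],   # pre-planned fallback
    ],
}

failed_blocks: set = set()

def report_failure(block: str) -> None:
    """Called when sensors flag that no signal is passing through a block."""
    failed_blocks.add(block)

def route(src: str, dst: str):
    """Return the first pre-planned path that avoids failed blocks."""
    for path in ROUTES.get((src, dst), []):
        if not any(hop in failed_blocks for hop in path):
            return path
    return None  # no healthy path; escalate to a system-level response

report_failure("noc_a")
print(route("cpu0", "mem_ctrl"))  # -> ['cpu0', 'noc_b', 'noc_c', 'mem_ctrl']
```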
In effect, this approach uses the entire device for guardbanding, a strategy memory makers have employed for some time. It’s not without challenges, of course. It requires more sensors architected into a design, a cohesive strategy for how to use the data those sensors produce, new machine learning algorithms, and a system-level approach to managing a chip. But in a complex heterogeneous design, this is a brand new way of looking at reliability, and it ultimately may prove more effective than trying to find every possible defect and corner case that can cause a failure.