Reliability is no longer just about one chip, or even one device.
Connected devices are everywhere, and their numbers are growing by orders of magnitude. There are roughly 7 billion people on the planet, but many billions more connected devices are expected. Each person may carry dozens of devices containing multiple chips, and those devices will be connected through infrastructure filled with thousands of additional chips.
The problem is that as everything gets connected and begins to interact, the number of dependencies and unexpected interactions can grow exponentially. Machine-to-machine communication is an interesting concept in theory, but making sure those machines don't overload servers or communications infrastructure to the point where safety-critical messages can't get through fast enough is a huge issue.
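As a minimal sketch of one way to keep safety-critical traffic moving when machine-to-machine chatter piles up, the snippet below tags each message with a priority class and drains the queue highest-priority first. The message classes and payloads are hypothetical, chosen only to illustrate the idea.

```python
# Priority-queue sketch: safety-critical messages go out ahead of bulk telemetry,
# even if the telemetry arrived first. Classes and payloads are invented examples.

import heapq
import itertools

SAFETY_CRITICAL, CONTROL, TELEMETRY = 0, 1, 2   # lower number = drained first
_counter = itertools.count()                     # tie-breaker keeps FIFO order within a class

queue: list[tuple[int, int, str]] = []

def enqueue(priority: int, payload: str) -> None:
    heapq.heappush(queue, (priority, next(_counter), payload))

def drain() -> None:
    while queue:
        priority, _, payload = heapq.heappop(queue)
        print(f"send[{priority}]: {payload}")

# Bulk telemetry arrives first, but the brake warning still goes out ahead of it.
enqueue(TELEMETRY, "sensor batch #1042")
enqueue(TELEMETRY, "sensor batch #1043")
enqueue(SAFETY_CRITICAL, "brake fault warning")
drain()
```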
One concept that seems to be gaining traction, at least in conversations among chip companies these days, is resilience. The idea of chips, and systems of chips, that can recover from failures by automatically rerouting signals is hardly new. Fault-tolerant computing and off-site backup plans for data centers have been around for decades. But that is an expensive approach, and simply adding another CPU or GPU as a failover option in case something goes wrong would push costs well beyond what consumers are willing to pay.
An alternative approach is to monitor the behavior of circuits, identify when that behavior becomes aberrant, and then reroute traffic to another core or processor that is already built into the system. That rerouting may not be optimal, but it may work well enough until something can be repaired or replaced.
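As a rough sketch of what that monitor-and-reroute loop might look like, the snippet below samples per-core telemetry, flags aberrant behavior, and remaps work onto a healthy core that already exists in the system. The data structures, thresholds, and core IDs are all hypothetical; a real implementation would use the device's own sensors and scheduler.

```python
# Monitor-and-reroute sketch: flag a misbehaving core and pick an existing healthy
# core as the fallback, rather than adding dedicated failover hardware.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CoreTelemetry:
    core_id: int
    temperature_c: float   # junction temperature reading
    error_count: int       # corrected-error events since the last sample

# Illustrative thresholds; real limits would come from the device's specification.
TEMP_LIMIT_C = 105.0
ERROR_LIMIT = 10

def is_aberrant(sample: CoreTelemetry) -> bool:
    """A core is suspect if it runs too hot or accumulates too many errors."""
    return sample.temperature_c > TEMP_LIMIT_C or sample.error_count > ERROR_LIMIT

def pick_fallback(samples: list[CoreTelemetry], failed_id: int) -> Optional[int]:
    """Choose the healthiest remaining core; slower, perhaps, but it keeps the system running."""
    healthy = [s for s in samples if s.core_id != failed_id and not is_aberrant(s)]
    if not healthy:
        return None  # nothing left to reroute to; escalate instead
    return min(healthy, key=lambda s: (s.error_count, s.temperature_c)).core_id

# Example: core 2 is overheating, so its work is remapped to another core.
samples = [
    CoreTelemetry(0, 72.0, 1),
    CoreTelemetry(1, 80.5, 3),
    CoreTelemetry(2, 112.0, 25),
]
for s in samples:
    if is_aberrant(s):
        target = pick_fallback(samples, s.core_id)
        print(f"core {s.core_id} aberrant -> reroute to core {target}")
```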
This is a relatively straightforward engineering challenge, and one that is implemented in many devices today. ECC memory can correct flipped bits. Smartphones can shut down features when logic blocks get too hot and turn them back on once temperatures drop. But it gets significantly more complicated when multiple connected devices begin interacting. A failure in any part of the communications chain can have a debilitating impact on multiple devices upstream and downstream of that failure, whether it happens inside a single complex device or across multiple devices developed by completely different vendors.
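To make that ripple effect concrete, the sketch below models a communications chain as a directed dependency graph and walks it to see which devices lose their data path when one node fails. The topology and device names are invented for illustration; reversing the edges would give the upstream view, where devices back up because the failed node can no longer accept their output.

```python
# Failure-propagation sketch: a breadth-first walk over an invented dependency graph
# shows every device downstream of a single failure.

from collections import deque

# Each key feeds data to the devices listed as its value (hypothetical topology).
feeds = {
    "sensor": ["gateway"],
    "gateway": ["edge_server"],
    "edge_server": ["cloud", "dashboard"],
    "cloud": ["analytics"],
    "dashboard": [],
    "analytics": [],
}

def downstream_of(failed: str) -> set[str]:
    """Everything downstream of the failed node loses its data path."""
    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for nxt in feeds.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# A single gateway failure knocks out the edge server, the cloud, the dashboard, and analytics.
print(downstream_of("gateway"))
```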
The underlying concept here is cross-layer resilience, and at its core it is essentially matrix multiplication of the possible interactions among different devices under different scenarios. It's the kind of problem that typically would be run on very powerful computers working in parallel to produce a distribution of likely scenarios. It's also the kind of problem that EDA needs to begin thinking about, first to automate it and then to develop standards, so these issues can be addressed at the conceptual level before chips and systems of chips are even designed.
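As a toy illustration of that matrix framing, the sketch below treats the rows and columns as layers in a system, with each entry a coupling weight describing how strongly a fault in one layer disturbs another under a single scenario. Multiplying the matrix by itself composes multi-hop interactions, which is one way to see why the scenario space explodes so quickly. The layers and numbers are made up purely for illustration.

```python
# Cross-layer interaction sketch: A[i, j] is a coupling weight for "a fault in layer i
# disturbs layer j in one step" under one scenario; A @ A composes two-step paths.

import numpy as np

layers = ["chip", "package", "board", "network"]

# One-step coupling weights under a single scenario (hypothetical values).
A = np.array([
    [0.0, 0.3, 0.1, 0.0],   # chip faults disturb the package, occasionally the board
    [0.0, 0.0, 0.4, 0.1],   # package faults disturb the board, sometimes the network
    [0.0, 0.0, 0.0, 0.5],   # board faults disturb the network
    [0.2, 0.0, 0.0, 0.0],   # network faults feed back to the chip (retries, resets)
])

# Two-step interactions: how a fault can reach a layer via one intermediate layer.
two_step = A @ A

for i, src in enumerate(layers):
    for j, dst in enumerate(layers):
        if two_step[i, j] > 0:
            print(f"{src} -> {dst} via one intermediate layer: {two_step[i, j]:.2f}")
```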
The opportunity here is huge, because it essentially moves design automation from the chip to the system and systems-of-systems level, where budgets for essential tools also are much larger. But it also will require concerted effort on the part of EDA companies, which in the past have shown a willingness to cooperate only after being dragged into standards organizations by their customers. The next phase of technology is fascinating, expansive, and multi-faceted, but making it work will require input from a broad array of experts who understand how all of the pieces best fit together, as well as how they potentially can fail.
The future of EDA is no longer just about making sure that signals can be routed and verified on a single piece of silicon, or even within a package. It's about making sure those signals can traverse a minefield of possible interruptions reliably over a specified period of time, with a reasonable number of options in case something goes awry at any point in the communications chain. And that will require thinking well outside the chip, and cooperation between companies that have never worked together before, as well as those that are deeply suspicious of each other.