The True Cost Of Software Changes

Changes to software can impact the hardware. Concluding that hardware updates are too expensive and all problems can be fixed in software is the wrong approach.

Safety and security are considered to be important in a growing number of markets and applications. Guidelines are put in place for the processes used to develop either the hardware or the software, but what they seem to ignore is that neither exists in a vacuum. They form a system when put together.

Back when I was developing tools for hardware-software co-verification, there were fairly consistent comments about the value of the tools from the early adopters. They concluded that no matter how well they tested the hardware — using techniques like constrained random test pattern generation to achieve exceptionally high functional coverage — they would always discover more problems when they ran the software. The reason for that is quite simple. Functional verification only tests the easily accessible state space.

Most hardware systems are constructed from small state machines that are assembled in a hierarchical fashion. Hardware verification concentrates on making sure that each of those state machines has all the right transitions, that each state has been entered, and that each transition has been taken correctly. As you move up the hierarchy, integrating larger pieces of the hardware together, you start to look at combinations of states in those lower-level machines. But this is where the sheer number of states and transitions becomes overwhelming. You cannot test them all, so you reason that because you have thoroughly tested everything those individual machines can do, the combination of them is most likely correct.
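
To put a rough number on that, here is a back-of-the-envelope sketch with assumed figures (20 machines, 8 states each, a billion state visits per second). The numbers are illustrative only, but they show why the combined state space can never be enumerated by simulation.

```python
# Back-of-the-envelope state-explosion estimate (illustrative numbers only).
num_machines = 20      # small state machines in one subsystem (assumed)
states_each = 8        # states per machine (assumed)

combined_states = states_each ** num_machines
print(f"Combined state space: {combined_states:.2e} states")   # ~1.15e18

# Even visiting a billion distinct combined states every second...
visits_per_second = 1e9
years = combined_states / visits_per_second / (3600 * 24 * 365)
print(f"Time to visit each combined state once: {years:.0f} years")  # ~37 years
```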

When running real software, tests often last for millions or billions of cycles and get those state machines into configurations that just aren't reachable with short tests. That is when the software enters the much larger, deeper state space and starts to find problems. Quite often they are situations that were never considered, a term that was missing from one of those transition conditions, or an exclusion between two events that prevents something from happening.
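
To see why run length matters, consider the toy model below. It is a hypothetical sketch, not any real design: the bug only exists after a 24-bit counter wraps, so a short random test never reaches it, while a workload that runs for tens of millions of cycles does.

```python
import random

class ToyDevice:
    """Hypothetical block with a bug hidden deep in its state space."""
    def __init__(self):
        self.counter = 0          # free-running 24-bit cycle counter
        self.ever_busy = False    # sticky flag: has any request ever been seen?

    def clock(self, req):
        if req:
            self.ever_busy = True
        self.counter = (self.counter + 1) & 0xFFFFFF
        # Bug: a corner case that only exists the first time the counter
        # wraps back to zero after any request has been accepted.
        if self.counter == 0 and self.ever_busy:
            raise RuntimeError("deep-state bug reached after counter wrap")

def run(cycles):
    dut = ToyDevice()
    for _ in range(cycles):
        dut.clock(req=random.random() < 0.3)

run(10_000)         # short constrained-random style test: passes
# run(20_000_000)   # long, software-like workload: trips the bug
```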

Long story short, hardware verification using constrained random is just the tip of the iceberg when it comes to state-space exploration, and only software can reach many of the remaining states. And yet people want to say that hardware is verified for all software. That is simply not possible given the verification processes and tools that exist today.

At the same time, we are seeing a trend toward domain-specific hardware design, where actual software is profiled and the hardware is then custom-designed to optimize its execution. This is great, and it means that an actual scenario is being thoroughly tested. But what happens if that software later changes? What impact will it have? Most people will expect that performance may be affected. Optimizations may no longer apply to the same degree, or the software may have been optimized further to exploit those features. But what about the impact those changes have on power and thermal behavior? Is that also considered? Do you run regressions on timing and power when software changes are made? Probably not.
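
Nothing prevents a team from treating power and thermal numbers the way functional regressions are already treated. The sketch below is a hypothetical illustration of such a gate; the scenario names, metrics, and thresholds are invented for this example and are not taken from any real flow.

```python
# Hypothetical power/thermal regression gate: compare per-scenario numbers
# for a new software build against the baseline the hardware was signed off on.
BASELINE = {
    "video_decode": {"avg_power_mw": 310.0, "peak_temp_c": 78.0},
    "inference":    {"avg_power_mw": 450.0, "peak_temp_c": 85.0},
}

MARGIN = 0.05   # flag anything more than 5% above the baseline (assumed)

def check_build(measured: dict) -> list[str]:
    """Return the scenario/metric pairs that regressed beyond MARGIN."""
    failures = []
    for scenario, metrics in BASELINE.items():
        for name, baseline_value in metrics.items():
            new_value = measured.get(scenario, {}).get(name)
            if new_value is not None and new_value > baseline_value * (1 + MARGIN):
                failures.append(f"{scenario}.{name}: {new_value} vs {baseline_value}")
    return failures

# Example: a software update that pushes the inference workload harder
new_build = {
    "video_decode": {"avg_power_mw": 305.0, "peak_temp_c": 77.0},
    "inference":    {"avg_power_mw": 492.0, "peak_temp_c": 88.0},
}
for failure in check_build(new_build):
    print("POWER/THERMAL REGRESSION:", failure)
```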

We are also seeing an increasing number of EDA flows that are scenario-driven. This can push almost every part of the development process toward more efficient implementations. In many cases, it allows margins to be reduced or better thermal-aware layouts to be found. These optimizations will also be affected by later software changes.

When we start to think about things like aging, we know it is highly dependent on temperature, and temperature is determined both by the thermal characteristics of the chip, package, board, and system, and by the amount of heat generated by the device itself. That heat, in turn, is driven by the software.
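
Temperature-driven wear-out mechanisms are commonly modeled with an Arrhenius-style acceleration factor. The sketch below uses an assumed activation energy and assumed junction temperatures purely for illustration; it shows how a software change that raises average junction temperature by 15°C can speed up wear-out by roughly two to three times.

```python
import math

# Arrhenius acceleration factor: a common first-order model for how much
# faster temperature-driven wear-out progresses at a higher junction
# temperature. The activation energy and temperatures are assumed values.
K_BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K
EA_EV = 0.7                 # activation energy (illustrative)

def acceleration_factor(t_base_c: float, t_new_c: float) -> float:
    t_base = t_base_c + 273.15
    t_new = t_new_c + 273.15
    return math.exp((EA_EV / K_BOLTZMANN_EV) * (1.0 / t_base - 1.0 / t_new))

# A software update that raises average junction temperature from 70C to 85C
af = acceleration_factor(70.0, 85.0)
print(f"Aging acceleration: {af:.1f}x")   # roughly 2-3x faster wear-out
```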

Software can age a chip. It can push parts of the hardware closer to the design margins, and it can even cause hardware to fail. Still, nobody seems willing to say that software needs to be considered when doing safety certification, or that hardware has to be re-certified based on changes made in the software. And this has to go deeper than just running functional simulations.

If the hardware has been optimized for a given set of scenarios that define the workloads expected at the beginning of development, then nobody has analyzed how that system will behave when the software is modified. And almost by definition it will be modified, whether to add features, patch security vulnerabilities, or just fix bugs in either the hardware or the software. That is all on top of the fact that the new software may take the hardware into parts of the state space it has never entered before.

It used to be that hardware had to be verified to a much higher degree than software because the cost of making a mistake was too high. Creating a new mask set would cost a few million dollars, but most of the time the loss of market window would cost far more. Yet if systems have to be re-certified for each software change, then the cost of a software bug quickly approaches that of a hardware bug.

The industry is aware of this, but little is happening to rectify the situation because the costs associated with addressing it are too high. So far there are no hefty legal settlements to force the issue, and until that happens it’s easier to plead ignorance.



2 comments

Ron Lavallee says:

Another great article, but it is based on two common premises: that software and hardware are not the same, and that state machines are used for sequencing. State explosion in large, complex applications is a problem, so why not go stateless? Parallel flowcharts are stateless and event-driven. The software flowcharts synthesize to substrate hardware flowcharts that are an image of the software. Change the software and the hardware changes with it. I don’t think the industry has been looking for something like this; they have just been waiting for it.

Peter Bennet says:

Really interesting article. I’ve long thought that we’ve developed a rather loose discipline in chip design and verification that’s in real contrast to much more disciplined areas like traditional hardware design or defence and automotive design. That’s partly just to get more stuff done more quickly, but it’s also the result of software changes being so quick, easy, and apparently “cheap” to make. Verification has also been more about proving something does what it was designed for than about checking what could go wrong and designing to mitigate the worst outcomes.
