Engineering’s Growing Blacklist

Disasters are now measured by system-level flaws. It’s time to reevaluate the whole supply chain.

The number of system-level design flaws is rising, and they’re not just little mistakes. These are high-profile errors that are making headlines all over the globe.

While it’s debatable whether Toyota’s problem was a hardware or software design glitch, the simple fact is there was a design flaw somewhere. That’s true for the BP Gulf of Mexico leak, regardless of who’s responsible for maintaining a drilling platform. And it’s no different for the Japanese nuclear plants, which were built too close together and designed with insufficient backup power.

Closer to home, Apple’s iPhone antenna design was flawed. So was Intel’s initial Sandy Bridge chip.

So what do all of these problems have in common? Engineers have long argued that they can engineer a solution for just about anything. They’re probably right, given the appropriate resources, an understanding of all possible scenarios and a well-defined set of conditions. But at the system level—and that includes everything from ICs to full-blown industrial facilities—rising complexity, time-to-market and cost pressures, and the inability of engineers to verify and predict everything that can go wrong in a given amount of time are causing everything from minor nuisances to full-blown disasters.

Viewed individually, these look like unrelated errors. Viewed together, however, they point to a much broader problem across the entire supply chain, from design to manufacturing. Things we have taken for granted, largely as a result of years of slow evolutionary tweaks, now require more radical reassessments. That includes everything from the tools being used to design and create systems to budgetary considerations about what can be cut and what areas need additional resources. Inside many companies these kinds of decisions are highly politicized, and something clearly isn’t working.

This kind of reassessment needs to run end to end. If there’s a problem at the back end, chances are good that the initial design was flawed and that the people doing the design didn’t have the right kind of training. In the case of Apple and Intel, the problem wasn’t so much cost cutting as unbelievable complexity and aggressive schedules. Tools do exist to solve some of these problems, but it takes time to train people to use them and to understand their benefits—something verification engineers have been complaining about for the past decade.

Unfortunately, we will probably see more of these engineering mishaps in the future. It takes a while for executives at public companies, who are used to living quarter by quarter, to recognize the full financial impact of not addressing these problems properly. This isn’t a new issue, either. The old adage ‘penny wise, pound foolish’ dates back to the mid-1600s.

But it’s also time for engineering teams to reassess what they do, how they verify it, and what’s considered good enough—and not good enough. While an antenna problem can be fixed with a case and improved in the next release, it’s a lot harder when the system is mission-critical and affects lives. Nevertheless, there are costs for both. Consumer trust takes years to build, but in the Internet era it can be destroyed in a matter of days.

Perhaps even more daunting, the cause of these problems can lie outside the company. A complex ecosystem of parts means that a company is only as good as its partners and its quality control procedures for third-party IP. If the IP doesn’t work right, it’s still the chipmaker’s problem. And in the case of fabless semiconductor companies, it might even be a glitch in the manufacturing process.

Getting all of this right starts with a great concept and a flawless design, but it doesn’t stop there. It’s now all about the system, and the definition of the system and what can go wrong inside that system are growing.

–Ed Sperling