Part 2: Short- and long-term solutions to make sure machines behave as expected.
The number of companies using machine learning is growing rapidly, but so far there are no tools to validate, verify and debug these systems.
That presents a problem for the chipmakers and systems companies that increasingly rely on machine learning to optimize their technology because, at least for now, it creates the potential for errors that are extremely difficult to trace and fix. At the same time, it opens up new opportunities for companies that have been developing static tools to expand their reach well beyond just the chip, where profits are being squeezed by system vendors.
But as shown in part one of this series, that will take years rather than months to fix. Research is just beginning on how to tackle these problems, let alone develop comprehensive tool suites.
“Across the board, machine learning is suddenly becoming very interesting,” said Sundari Mitra, CEO and co-founder of NetSpeed Systems. “If you look at the entire EDA industry in terms of machine learning, what it has gotten us is synthesis. As we get more people coming out with a background in machine learning, with degrees that specialize in this, they will get drawn into fixing some of these things. I see a movement — and when it becomes painful enough, someone will sponsor this. And they will get someone to pay attention to this and apply some of these techniques to solving some of our analog problems, too.”
So what do chipmakers do in the meantime?
Short-term solutions
There are several ways to minimize potential problems today. One is to sharply limit the scope of what machine learning is used for.
“You can come up with machine-learning algorithms that adapt, such as optimizing power based on what happened in the past,” said Mike Gianfagna, vice president of marketing at eSilicon. “But this is a laser-focus on specific problems. You have to limit the scope or you have a huge problem. If you do have a problem, either the data is bad, the adaptive learning is bad, or you need to optimize the algorithm. But there also are a lot of subtleties here like getting the processes right. That isn’t machine learning, but it affects it. As you go to 28nm and below, the rules of physics do not always apply. So you may have a temperature inversion. Is your data solid? You have to make sure the machine-learning algorithm adapts correctly.”
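As a rough illustration of the narrowly scoped, adaptive loop Gianfagna describes, the sketch below picks a power state from a smoothed history of utilization and refuses to learn from implausible sensor samples. This is a hypothetical Python example, not eSilicon's implementation; the class, thresholds and smoothing factor are invented for illustration.

```python
# Hypothetical illustration (not eSilicon's implementation): an adaptive
# power controller that chooses a power state from a smoothed history of
# utilization, and sanity-checks its input first ("is your data solid?").

class AdaptivePowerController:
    """Picks a power state from an exponential moving average of utilization."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # smoothing factor: how quickly the controller adapts
        self.ema = None      # learned estimate of recent load

    def update(self, utilization):
        # Guard against bad input data: an adaptive loop fed a stuck or
        # inverted sensor will "learn" the wrong policy.
        if not 0.0 <= utilization <= 1.0:
            raise ValueError(f"implausible utilization sample: {utilization}")
        self.ema = utilization if self.ema is None else (
            self.alpha * utilization + (1 - self.alpha) * self.ema)
        return self.power_state()

    def power_state(self):
        # Map the learned load estimate to a discrete power state.
        if self.ema < 0.2:
            return "low"
        if self.ema < 0.7:
            return "nominal"
        return "turbo"

ctrl = AdaptivePowerController()
print([ctrl.update(u) for u in (0.1, 0.1, 0.9, 0.95)])
# -> ['low', 'low', 'nominal', 'nominal']  (the estimate adapts gradually)
```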
Gianfagna said that with concurrent fault simulation, the number of faulty circuits expands and contracts. “Keeping track of all of those circuits in real time is like managing an expanding and contracting universe. That will take us in a lot of interesting directions, and we will have to do a lot more in parallel the way the human brain functions. If you look at the human brain, it’s not all that fast but it is adaptive. And if you look at machine learning problems, by definition they are not well-behaved.”
A second approach involves more of a hybrid methodology, whereby machines are “taught” through the programming of specific functions, which is the approach being used in smartphones today.
“Machines can learn right from wrong behavior without using cognitive learning or AI,” said Jim McGregor, principal analyst at Tirias Research. “If there is a more complex response required, you use the resources available on a device and leverage the cloud. Most computers are still ‘learning’ in a very basic way. Where this gets much more complex is with autonomous vehicles, where you really do need machine learning and AI to interpret the environment, leverage that environment, and permanently change algorithms. In that case, debug is a whole different world. What changes is that you’re not looking for a problem. Now it’s about instructing and giving feedback.”
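A minimal sketch of the hybrid split McGregor describes might look like the following, assuming a hypothetical on-device model and cloud service (run_local_model and query_cloud_service are placeholders, not a real API): simple cases are handled on the device, while low-confidence or complex inputs are escalated to the cloud.

```python
# Hypothetical sketch of a device/cloud hybrid: handle simple responses with
# on-device inference, and escalate only low-confidence cases to the cloud.
# The callables passed in are placeholders, not any vendor's API.

CONFIDENCE_THRESHOLD = 0.85

def classify(sample, run_local_model, query_cloud_service):
    """Return (label, source), trying the device first and the cloud as fallback."""
    label, confidence = run_local_model(sample)   # cheap, on-device inference
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "device"
    # Complex or ambiguous input: leverage the cloud, where larger models
    # (and the data to retrain them) live.
    return query_cloud_service(sample), "cloud"
```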
A third approach is to confine any changes in functionality to software. While this isn’t optimal from a power/performance perspective, in theory it’s simpler to debug. That approach already is in use by Movidius, which makes vision processing units. “You rely on a network trained in the cloud with enough of a data set,” said David Moloney, Movidius’ CTO. “It’s not a massive engine connected to a CNN (convolutional neural network). It’s done in software, so if there’s a problem it’s a software problem. There are risks with dedicated hardware. With a software platform, you’re not locked into decisions made on hardware.”
Moloney said that one of the big issues with machine learning is that every few days things change. “That includes the tooling algorithms and the network topology,” he said. “This way it doesn’t matter which platform you use. But the deeper the network, the higher the power dissipation. There is not a linear return on power in the network, so the challenge is to come up with networks that are energy-efficient.”
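To make the software-only point concrete, here is a toy convolution step written in plain Python. It is not Movidius' stack or a realistic CNN, just an illustration that when the network runs as ordinary software, a wrong result can be debugged, patched and redeployed like any other software bug.

```python
# Minimal sketch of one convolution step done purely in software.

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation) of two nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            out[y][x] = sum(image[y + i][x + j] * kernel[i][j]
                            for i in range(kh) for j in range(kw))
    return out

edge_kernel = [[-1, 0, 1]] * 3          # simple vertical-edge detector
frame = [[0, 0, 1, 1]] * 4              # toy 4x4 "image"
print(conv2d(frame, edge_kernel))       # -> [[3, 3], [3, 3]]
```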
Long-term solutions
A combination of hardware and software is a more effective way to optimize a system, but it’s also harder to control. Machine learning is based upon a huge set of data, which in turn requires big data approaches such as data mining. And debugging systems using data mining falls into the realm of predictive analytics—predicting behavior with patterns.
“What you’re really doing is looking for the outliers in behavior,” said Harry Foster, chief verification scientist at Mentor Graphics. “About 72% of designs today have one or more embedded processors. You start the validation process and often it’s not what you expect. So you do data mining with machine learning techniques and you uncover one. Data mining in this case is being used as a verification tool. With constrained random generation, you’re looking at the latency of an operation. With data mining, you can sweep through a tremendous amount of data and figure out why it’s taking five times longer for an operation. It’s the same with a robot. You can’t teach a robot every aspect of what it has to do. It has to evolve. But that requires whole new ways to validate systems. So while the algorithm is learning, an agent also needs to be learning. That allows you to work back and forth in validation.”
Foster said this will require much more statistical analysis to identify odd behaviors. “With our existing methods, you may have thousands of runs and never see a problem. Using data mining, you can uncover problems you will never find anywhere else and then debug the root cause. But that’s going to be more statistical in nature than what we do today.”
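A simple way to picture the data-mining step Foster describes is an outlier test over per-run operation latencies. The sketch below is illustrative only, not Mentor's flow; it uses a robust median/MAD score so that the rare run taking roughly five times longer stands out for root-cause debug.

```python
# Illustrative only: flag runs whose operation latency is a statistical
# outlier, using a modified z-score based on the median absolute deviation.

import statistics

def latency_outliers(latencies, threshold=3.5):
    """Return (index, latency) pairs whose modified z-score exceeds threshold."""
    median = statistics.median(latencies)
    mad = statistics.median(abs(x - median) for x in latencies)
    if mad == 0:
        return []   # no spread at all; nothing to flag
    outliers = []
    for i, x in enumerate(latencies):
        modified_z = 0.6745 * (x - median) / mad
        if abs(modified_z) > threshold:
            outliers.append((i, x))
    return outliers

# Example: many well-behaved runs plus one that took ~5x longer.
runs = [102, 98, 101, 99, 100, 97, 103, 100, 99, 510]
print(latency_outliers(runs))   # -> [(9, 510)]
```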
Resilience and fault tolerance represent another long-term approach. The concept has been around in the computing world for decades. Error-correcting code (ECC) memory can detect and correct bit errors, for example. Likewise, most systems are backed up, and most large corporations have failover systems in case something goes wrong.
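For readers unfamiliar with ECC, the sketch below shows the idea at its smallest scale: a Hamming(7,4) code that detects and corrects a single flipped bit. Real ECC memory uses wider SEC-DED codes over 64-bit words, so this is an illustration of the principle, not a production design.

```python
# Minimal single-error-correcting Hamming(7,4) code, shown only to illustrate
# what ECC memory does: detect and correct a flipped bit.

def hamming74_encode(d):
    """d: list of 4 data bits -> 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4      # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4      # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Return the 4 data bits, correcting a single flipped bit if present."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1           # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[5] ^= 1                         # simulate a soft error in memory
assert hamming74_decode(stored) == word
```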
“Duplication of IP is increasingly being adopted by the automotive industry,” said Charlie Janac, Arteris‘ chairman and CEO. “That’s one way to fight soft errors. If you use duplicate IPs that transform image processors into packets, then you can get a functional reference signal. If there is a disagreement, you get a fault error. It’s like ECC inside an interconnect. Systems that use machine learning are similar. You need enough probe points to isolate an issue. Eventually we will need fault-resistant SoCs. But we also need to understand what is the class of things that can go wrong. We don’t yet know what all of the machine learning errors are.”
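The duplication Janac describes can be pictured as a lockstep comparison: two copies of the same IP process identical input, and any disagreement raises a fault. The sketch below is a software analogy of that idea; the packetize transform is a placeholder, not Arteris hardware.

```python
# Hedged software analogy of IP duplication: run two copies of the same unit
# on identical input and raise a fault if their outputs ever disagree.

class LockstepFault(Exception):
    """Raised when the duplicated units disagree (possible soft error)."""

def lockstep(primary, shadow, payload):
    """Run primary and shadow units on the same payload and compare outputs."""
    out_a = primary(payload)
    out_b = shadow(payload)
    if out_a != out_b:
        # In hardware this comparison would assert a fault signal; here we
        # surface it to the caller so the error can be isolated.
        raise LockstepFault(f"mismatch for payload {payload!r}")
    return out_a

# Usage with two instances of the same (placeholder) transform:
packetize = lambda frame: [frame[i:i + 4] for i in range(0, len(frame), 4)]
print(lockstep(packetize, packetize, bytes(range(8))))
```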
Machine learning outlook
In fact, what’s surprising about machine learning is just how many places it is already in use, given the uncertainty about where errors will show up. While the basic concepts involved in machine learning have been discussed for years, the application of them has been more theoretical than real. For example, what happens if data is not available or a connection is not available? How does that impact machine learning, particularly when it comes to such things as object recognition?
“This is completely new,” said Marie Semeria, CEO of Leti. “First, we have to consider the usage. What do we need? It’s a completely different way of driving technology. It’s a neuromorphic technology push. First you figure out what you need and then you develop a solution.”
Semeria said that Leti is developing chips that are driven by IoT applications, whereby the research house demonstrates the functionality and requests end user feedback to make sure the development effort meets their needs. “It’s more of a system approach to be able to discuss with our customers the global solution. So we completely have changed the way we drive our research. It’s why we are at CES these days.”
This also helps put Intel’s acquisition of Altera into perspective. Intel has been pushing machine learning as a way of optimizing data center operations, but it’s much easier to do that kind of ongoing optimization using an FPGA than an ASIC due to the FPGA’s programmable fabric.
“We’ve been seeing a lot more interest in HIL (hardware-in-the-loop) firmware testing,” said Kevin Ilcisin, vice president of product marketing at National Instruments. “The idea is that you can change the hardware fabric in the FPGA, but you need to be able to abstract the measurements.”
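The abstraction Ilcisin mentions can be sketched as a measurement interface that the HIL test depends on, so the test survives a change to the FPGA fabric underneath. The class and method names below are illustrative assumptions, not National Instruments APIs.

```python
# Hypothetical sketch: HIL tests talk to an abstract measurement interface,
# so the same test keeps working when the FPGA fabric is reprogrammed.

from abc import ABC, abstractmethod

class Measurement(ABC):
    @abstractmethod
    def read_voltage(self, channel: int) -> float:
        """Return a voltage reading, in volts, for the given channel."""

class SimulatedFabric(Measurement):
    """Stand-in for one revision of the programmable fabric."""
    def read_voltage(self, channel: int) -> float:
        return 1.8 if channel == 0 else 0.0

def hil_test(meas: Measurement) -> bool:
    # The test depends only on the abstract interface, not on which
    # fabric image produced the measurement.
    return abs(meas.read_voltage(0) - 1.8) < 0.05

print(hil_test(SimulatedFabric()))   # -> True
```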
Some of these tools already exist, so not all of this needs to be developed from scratch. But how those tools ultimately will be used by companies involved in machine learning, and what else will be required to improve their effectiveness, isn’t clear yet. This is particularly true for the automotive sector, where driver assistance will be a collection of scenarios learned by machines.
“Real-world testing becomes more and more of a requirement, where you look at hardware and software,” said Achim Nohl, technical marketing manager for high-performance ASIC prototyping systems at Synopsys. “We’re going to see more and more convolutional neural network accelerators and we are going to have to expand from design verification to system validation in the real world. You need to have the highest confidence possible that the system will react correctly. Is there a need to standardize scenarios? Is there a need to standardize reference data sets so a car can react on them rather than worrying about the performance of a specific system? Today, we standardize test data, but there is nothing in regard to real-world scenarios.”
Nohl noted that this also will require test data to be generated and shared quickly enough to be useful. “This isn’t just a software or a hardware problem. We’re going to require long debug trace in almost every field.”
Conclusions
Because machine learning is just getting going, there is a dearth of hard data about what will be required to develop effective tools, when they will be ready, or how much they will cost. At this point, no one is quite sure what those tools will even look like.
“Once you have data, the bigger question is what you are going to do with it,” said Raik Brinkmann, president and CEO of OneSpin Solutions. “This is a system without a testbench or any way to generalize behavior and verify it. The way chips are being built is changing because of machine learning. You basically will have a big computer in a car, and it won’t work if there is a delay in the data. So you have sensor fusion and you push everything you can as far to the center as possible and you create SoCs that can do machine learning as close to the sensor as possible.”
Just how disruptive this approach becomes remains to be seen. “The nature of correctness is different,” said Chris Rowen, a Cadence fellow and CTO of the company’s IP Group. “You have to adopt statistical measures to govern that correctness. Some say this is an opportunity to change the nature of verification at the lower level. Some propose new approaches that are much more highly parallel architectures. But one thing we do know is that deep learning and machine learning help open new degrees of freedom in design.”
Related Stories
What’s Missing From Machine Learning Part 1
Teaching a machine how to behave is one thing. Understanding possible flaws after that is quite another.
Inside AI and Deep Learning
What’s happening in AI and can today’s hardware keep up?
What Cognitive Computing Means For Chip Design
Computers that think for themselves will be designed differently than the average SoC; ecosystem impacts will be significant.
New Architectures, Approaches To Speed Up Chips
Metrics for performance are changing at 10nm and 7nm. Speed still matters, but one size doesn’t fit all.
One On One: John Lee
Applying big data techniques and machine learning to EDA and system-level design.