What’s Missing From Machine Learning

Part 1: Teaching a machine how to behave is one thing. Understanding possible flaws after that is quite another.


Machine learning is everywhere. It’s being used to optimize complex chips, balance power and performance inside data centers, program robots, and keep expensive electronics updated and operating. What’s less obvious, though, is that there are no commercially available tools to validate, verify and debug these systems once machines evolve beyond the final specification.

The expectation is that devices will continue to work as designed, like a cell phone or a computer that has been updated with over-the-air software patches. But machine learning is different. It involves changing the interaction between the hardware and software and, in some cases, the physical world. In effect, it modifies the rules for how a device operates based upon previous interactions, as well as software updates, setting the stage for much wider and potentially unexpected deviations from that specification.

In most instances, these deviations will go unnoticed. In others, such as safety-critical systems, changing how systems perform can have far-reaching consequences. But tools have not been developed yet that reach beyond the algorithms used for teaching machines how to behave. When it comes to understanding machine learning’s impact on a system over time, this is a brave new world.

“The specification may capture requirements of the infrastructure for machine learning, as well as some hidden layers and the training data set, but it cannot predict what will happen in the future,” said Achim Nohl, technical marketing manager for high-performance ASIC prototyping systems at Synopsys. “That’s all heuristics. It cannot be proven wrong or right. It involves supervised versus unsupervised learning, and nobody has answers to signing off on this system. This is all about good enough. But what is good enough?”

Most companies that employ machine learning point to the ability to update and debug software as their safety net. But drill down further into system behavior and modifications and that safety net vanishes. There are no clear answers about how machines will function once they evolve or are modified by other machines.

“You’re stressing things that were unforeseen, which is the whole purpose of machine learning,” said Bill Neifert, director of models technology at ARM. “If you could see all of the eventualities, you wouldn’t need machine learning. But validation could be a problem because you may end up down a path where adaptive learning changes the system.”

Normally this is where the tech industry looks for tools to help automate solutions and anticipate problems. With machine learning those tools don’t exist yet.

“We definitely need to go way beyond where we are today,” said Harry Foster, chief verification scientist at Mentor Graphics. “Today, you have finite state machines and methods that are fixed. Here, we are dealing with systems that are dynamic. Everything needs to be extended or rethought. There are no commercial solutions in this space.”

Foster said some pioneering work in this area is being done by England’s University of Bristol in the area of validating systems that are constantly being updated. “With machine learning, you’re creating a predictive model and you want to make sure it stays within legal bounds. That’s fundamental. But if you have a car and it’s communicating with other cars, you need to make sure you’re not doing something harmful. That involves two machine learnings. How do you test one system against the other system?”

Today, understanding of these systems is limited to a single point in time, based upon the final system specification and whatever updates have been added via over-the-air software. But machine learning uses an evolutionary teaching approach. With cars, it can depend upon how many miles a vehicle has been driven, where it was driven, by whom, and how it was driven. With a robot, it may depend upon what that robot encounters on a daily basis, whether that includes flat terrain, steps, extreme temperatures or weather. And while some of that will be shared among other devices via the cloud, the basic concept is that the machine itself adapts and learns. So rather than programming a device with software, it is programmed to learn on its own.

Predicting how even one system will behave under this model, coupled with periodic updates, is a matter of probability distributions rather than fixed outcomes. Predicting how thousands of these systems will change, particularly if they interact with each other or with other devices, involves a series of probabilities that are in constant flux over time.
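The point can be illustrated with a toy model. Everything in the sketch below is hypothetical: each device’s behavior is reduced to a single parameter that takes a small random “drift” with every learned update. This is not any real learning algorithm, just a minimal way to see why a point-in-time validation result says less and less about a deployed population:

```python
import random
import statistics

random.seed(1)  # deterministic for illustration

def simulate_fleet(n_devices=1000, n_updates=50, drift=0.02):
    """Toy random-walk model: every device starts exactly at the spec
    value (1.0), then shifts slightly with each learned update."""
    params = [1.0] * n_devices
    spread_per_update = []
    for _ in range(n_updates):
        params = [p + random.gauss(0, drift) for p in params]
        spread_per_update.append(statistics.stdev(params))
    return spread_per_update

spread = simulate_fleet()
# The fleet's deviation from the original spec keeps widening,
# even though every device was validated identically at time zero.
print(f"stdev after 1 update:   {spread[0]:.3f}")
print(f"stdev after 50 updates: {spread[-1]:.3f}")
```

Even in this crude model, the spread only grows; no single sign-off captures where the population ends up.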

What is machine learning?
The idea that machines can be taught predates the introduction of Moore’s Law by almost two decades. Work in this area began in the late 1940s, based on early computer research into identifying patterns in data and then making predictions from that data.

Machine learning applies to a wide spectrum of applications. At the lowest level are mundane tasks such as spam filtering. But machine learning also includes more complex programming of known use cases in a variety of industrial applications, as well as highly sophisticated image recognition systems that can distinguish between one object and another.
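At the simple end of that spectrum, the spam-filtering idea can be sketched in a few lines. Everything here is made up for illustration, including the four training messages; it is a minimal naive Bayes word-count classifier, not any production filter, but it shows the basic pattern of learning from labeled data and then predicting on new data:

```python
import math
from collections import Counter

# Hypothetical labeled training data: (label, message)
TRAINING = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham",  "meeting moved to noon"),
    ("ham",  "lunch at noon today"),
]

def train(examples):
    """The 'learning' step: count word frequencies per label."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = {"spam": 0, "ham": 0}
    for label, text in examples:
        for word in text.split():
            counts[label][word] += 1
            totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Score a new message by naive per-word likelihoods
    with add-one smoothing; return the higher-scoring label."""
    vocab = len(set(counts["spam"]) | set(counts["ham"]))
    best, best_score = None, float("-inf")
    for label in counts:
        score = 0.0
        for word in text.split():
            p = (counts[label][word] + 1) / (totals[label] + vocab)
            score += math.log(p)
        if score > best_score:
            best, best_score = label, score
    return best

counts, totals = train(TRAINING)
print(classify("claim your free money", counts, totals))  # → spam
```

The same learn-then-predict structure scales, with vastly more data and far richer models, all the way up to the image-recognition systems mentioned above.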

Arthur Samuel, one of the pioneers in machine learning, began experimenting with the possibility of making machines learn from experience back in the late 1940s—creating devices that can do things beyond what they were explicitly programmed to do. His best-known work was a checkers game program, which he developed while working at IBM. It is widely credited as the first implementation of machine learning.

Fig. 1: Samuel at his checkerboard using an IBM 701 in 1956. Six years later, the program beat checkers master Robert Nealey. Source: IBM

Machine learning has advanced significantly since then. Checkers has been supplanted by more difficult games such as chess, Jeopardy, and Go.

In a presentation at the Hot Chips 2016 conference in Cupertino last month, Google engineer Daniel Rosenband cited four parameters for autonomous vehicles—knowing where a car is, understanding what’s going on around it, identifying the objects around a car, and determining the best options for how a car should proceed through all of that to its destination.

This requires more than driving across a simple grid or pattern recognition. It involves some complex reasoning about what a confusing sign means, how to read a traffic light if it is obscured by an object such as a red balloon, and what to do if sensors are blinded by the sun’s glare. It also includes an understanding of the effects of temperature, shock and vibration on sensors and other electronics.

Google uses a combination of sensors, radar and lidar to pull together a cohesive picture, which requires a massive amount of processing in a very short time frame. “We want to jam as much compute as possible into a car,” Rosenband said. “The primary objective is maximum performance, and that requires innovation in how to architect everything to get more performance than you could from general-purpose processing.”

Fig. 2: Google’s autonomous vehicle prototype. Source: Google.

Programming all of this by hand into every new car is unrealistic. Database management is difficult enough with a small data set. Adding in all of the data necessary to keep an autonomous vehicle on the road, and fully updated with new information about potential dangerous behavior, is impossible without machine learning.

“We’re seeing two applications in this space,” said Charlie Janac, chairman and CEO of Arteris. “The first is in the data center, which is a machine-learning application. The second is ADAS, where you decide on what the image is. This gets into the world of convolutional neural networking algorithms, and a really good implementation of this would include tightly coupled hardware and software. These are mission-critical systems, and they need to continually update software over the air with a capability to visualize what’s in the hardware.”

How it’s being used
Machine learning comes in many flavors, and often means different things to different people. In general, the idea is that algorithms can be used to change the functionality of a system to either improve performance, lower power, or simply to update it with new use cases. That learning can be applied to software, firmware, an IP block, a full SoC, or an integrated device with multiple SoCs.

Microsoft is using machine learning for its “mixed reality” HoloLens device, according to Nick Baker, distinguished engineer in the company’s Technology and Silicon Group. “We run changes to the algorithm and get feedback as quickly as possible, which allows us to scale as quickly as possible from as many test cases as possible,” he said.

The HoloLens is still just a prototype, but like the Google self-driving car it is processing so much information so fast, and reacting so quickly to the external world, that there is no way to program the device without machine learning.

Machine learning can be used to optimize hardware and software in everything from IP to complex systems, based upon a knowledge base of what works best for which conditions.

“We use machine learning to improve our internal algorithms,” said Anush Mohandass, vice president of marketing at NetSpeed Systems. “Without machine learning, if you don’t have an intelligent human to set it up, you get garbage back. You may start off and experiment with 15 things on the ‘x’ axis and 1,000 things on the ‘y’ axis, and set up an algorithm based on that. But there is a potential for infinite data.”

Machine learning ensures a certain level of results, no matter how many possibilities are involved. That approach also can help if there are abnormalities that do not fit into a pattern, because machine learning systems can ignore those aberrations. “This way you also can debug what you care about,” Mohandass said. “The classic case is a car on autopilot that crashes because a chip did not recognize a full spectrum of things. At some point we will need to understand every data point and why something behaves the way it does. This isn’t the 80/20 rule anymore. It’s probably closer to 99.9% and 0.1%, so the distribution becomes thinner and taller.”

eSilicon uses a version of machine learning in its online quoting tools, as well. “We have an IP marketplace where we can compile memories, try them for free, and use them until you put them into production,” said Jack Harding, eSilicon’s president and CEO. “We have a test chip capability for free, fully integrated and perfectly functional. We have a GDSII capability. We have WIP (work-in-process) tracking, manufacturing online order entry system—all fully integrated. If I can get strangers on the other side of the world to send me purchase orders after eight lines of chat and build sets of chips successfully, there is no doubt in my mind that the bottoms-up Internet of Everything crowd will be interested.”

Where it fits
In the general scheme of things, machine learning is what makes artificial intelligence possible. There is ongoing debate about which is a superset of the other, but suffice it to say that an artificially intelligent machine must utilize machine-learning algorithms to make choices based upon previous experience and data. The terms are often confusing, in part because they are blanket terms that cover a lot of ground, and in part because the terminology is evolving with technology. But no matter how those arguments progress, machine learning is critical to AI and its more recent offshoot, deep learning.

“Deep learning, as a subset of machine learning, is the most potent disruptive force we have seen because it has the ability to change what the hardware looks like,” said Chris Rowen, Cadence fellow and CTO of the company’s IP Group. “In mission-critical situations, it can have a profound effect on the hardware. Deep learning is all about making better guesses, but the nature of correctness is difficult to define. There is no way you get that right 100% of the time.”

But it is possible, at least in theory, to push closer to 100% correctness over time as more data is included in machine-learning algorithms.

“The more data you have, the better off you are,” said Microsoft’s Baker. “If you look at test images, the more tests you can provide the better.”

There is plenty of agreement on that, particularly among companies developing complex SoCs, which have quickly spiraled beyond the capabilities of engineering teams.

“I’ve never seen this fast an innovation of algorithms that are really effective at solving problems,” said Mark Papermaster, CTO of Advanced Micro Devices. “One of the things about these algorithms that is particularly exciting to us is that a lot of it is based around the pioneering work in AI, leveraging what is called a gradient-descent analysis. This algorithm is very parallel in nature, and you can take advantage of the parallelism. We’ve been doing this and opening up our GPUs, our discrete graphics, to be tremendous engines to accelerate the machine learning. But different than our competitors, we are doing it in an open source environment, looking at all the common APIs and software requirements to accelerate machine learning on our CPUs and GPUs and putting all that enablement out there in an open source world.”
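The gradient descent Papermaster refers to can be shown on a one-parameter toy problem. The data points and learning rate below are invented for illustration; real workloads apply the same update rule across millions of parameters at once, which is why the technique maps so well onto GPUs:

```python
# Minimal gradient descent on a one-variable least-squares fit.
# Toy data roughly following y = 3x, with a little noise baked in.
data = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8)]

w = 0.0    # the single parameter being learned
lr = 0.01  # learning rate (step size)

for step in range(500):
    # Gradient of the mean squared error with respect to w;
    # each term in the sum is independent, hence the parallelism.
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    w -= lr * grad

print(f"learned w = {w:.2f}")  # → learned w = 2.99
```

Each update nudges the parameter downhill on the error surface; with millions of parameters, the per-example gradient terms can all be computed in parallel before they are summed.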

Sizing up the problems
Still, algorithms are only part of the machine-learning picture. A system that can optimize hardware as well as software over time is, by definition, evolving from the original system spec. How that affects reliability is unknown, because at this point there is no way to simulate or test that.

“If you implement deep learning, you’ve got a lot of similar elements,” said Raik Brinkmann, president and CEO of OneSpin Solutions. “But the complete function of the system is unknown. So if you’re looking at machine learning error rates and conversion rates, there is no way to make sure you’ve got them right. The systems learn from experience, but it depends on what you give them. And it’s a tough problem to generalize how they’re going to work based on the data.”

Brinkmann said there are a number of approaches in EDA today that may apply, particularly with big data analytics. “That’s an additional skill set—how to deal with big data questions. It’s more computerized and IT-like. But parallelization and cloud computing will be needed in the future. A single computer is not enough. You need something to manage and break down the data.”

Brinkmann noted that North Carolina State University and the Georgia Institute of Technology will begin working on these problems this fall. “But the bigger question is, ‘Once you have that data, what do you do with it?’ It’s a system without testbenches, where you have to generalize behavior and verify it. But the way chips are built is changing because of machine learning.”

ARM’s Neifert considers this a general-purpose compute problem. “You could make the argument in first-generation designs that different hardware isn’t necessary. But as we’ve seen with the evolution of any technology, you start with a general-purpose version and then demand customized hardware. With something like advanced driver assistance systems (ADAS), you can envision a step where a computer is defining the next-generation implementation because it requires higher-level functionality.”

That quickly turns troubleshooting into an unbounded problem, however. “Debug is a whole different world,” said Jim McGregor, principal analyst at Tirias Research. “Now you need a feedback loop. If you think about medical imaging, 10 years ago 5% of the medical records were digitized. Now, 95% of the records are digitized. So you combine scans with diagnoses and information about whether it’s correct or not, and then you have feedback points. With machine learning, you can design feedback loops to modify those algorithms, but it’s so complex that no human can possibly debug that code. And that code develops over time. If you’re doing medical research about a major outbreak, humans can only run so many algorithms. So how do you debug it if it’s not correct? We’re starting to see new processes for deep learning modules that are different than in the past.”

Coming in part two: Short and long-term solutions to validation, verification and debug of machine learning systems.

Related Stories
Plugging Holes In Machine Learning (Part 2 of Series)
Short- and long-term solutions to make sure machines behave as expected.
Inside AI and Deep Learning
What’s happening in AI and can today’s hardware keep up?
What Cognitive Computing Means For Chip Design
Computers that think for themselves will be designed differently than the average SoC; ecosystem impacts will be significant.
New Architectures, Approaches To Speed Up Chips
Metrics for performance are changing at 10nm and 7nm. Speed still matters, but one size doesn’t fit all.
One On One: John Lee
Applying big data techniques and machine learning to EDA and system-level design.
