How Good Is Your Data?

As machines begin training and talking to other machines, the question takes on new meaning.


Machines can be taught by other machines. They also can talk to other machines on their own, with no human intervention, which is the great attraction of the Internet of Things.

Sensor clusters or other trucks can pass along critical data that alerts a multi-trailered truck to slow down or take a different route. And systems fed by a variety of sensor data, such as temperature or vibration, can isolate the cause of an anomaly, recommend maintenance before a problem erupts, or shut down a production line before the problem causes further damage.
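As a rough illustration of how sensor data can point to the source of a problem, here is a minimal Python sketch. It assumes each sensor has a short history of known-good readings and flags any sensor whose latest reading sits more than a few standard deviations from that history. The sensor names, the sample values, and the threshold of 3.0 are all hypothetical, and a production system would likely use something more robust than a simple z-score.

```python
from statistics import mean, stdev

def flag_sensors(baselines, latest, threshold=3.0):
    """Return the names of sensors whose latest reading is anomalous.

    baselines: dict mapping sensor name -> list of known-good readings
    latest:    dict mapping sensor name -> most recent reading
    threshold: how many standard deviations count as anomalous (assumed)
    """
    flagged = []
    for name, history in baselines.items():
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A flat history gives no scale to measure deviation against.
            continue
        z_score = (latest[name] - mu) / sigma
        if abs(z_score) > threshold:
            flagged.append(name)
    return sorted(flagged)

# Hypothetical readings from one machine on a production line.
baselines = {
    "temperature_c": [70.1, 69.8, 70.3, 70.0, 69.9],
    "vibration_mm_s": [2.0, 2.1, 1.9, 2.0, 2.2],
}
latest = {"temperature_c": 70.2, "vibration_mm_s": 3.5}

suspects = flag_sensors(baselines, latest)
print(suspects)
```

In this made-up case only the vibration sensor is flagged, which narrows the search for the cause. The result is only as good as the baseline, though. If the known-good history is itself dirty, the check inherits that error.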

And that’s just the beginning. With inferencing, those same trucks may be able to detect changes in weather or traffic patterns well before they need to slow down—possibly even before starting out on their trip. And a control system may be able to do predictive analysis so well that it shifts jobs to a different site just before that maintenance is required, saving both time and money.

These are simple examples to illustrate the usefulness of data. But what if the data is wrong? No machine learning system in place today understands whether data is dirty, clean, or somewhere in between. For the most part, these systems operate under close human supervision. As these systems become more mature, however, machines will teach machines. That’s the whole purpose of machine learning. It takes best practices for getting one or more jobs done in the context of what those machines are likely to encounter, and it produces responses that fall within an acceptable distribution.

On an industrial assembly line, this is fairly straightforward. And with roads, barring an attack of rabid wild animals or the sudden appearance of sinkholes, a truck should be able to navigate common road conditions. But in a more complex situation, such as basing market decisions on social media chatter, the lines become fuzzier. It becomes like the proverbial game of telephone, played in primary school, where the teacher whispers something to the first student, who then whispers it to the next student, and so on. By the time it reaches the last person in the room, the message no longer resembles the original one.

The same is true for data bias in machine learning, where the starting point for learning may be slightly off. That skew is carried forward and, in certain instances, magnified. Machines don’t understand nuances that come naturally to people, which is why carmakers are having so much trouble getting autonomous vehicles to make turns when pedestrians are in the crosswalk. The car will sit and wait indefinitely until no one is crossing the street. A person driving the car would slip through in a fraction of the time.

But with machines, as with people, bad data is hard to correct. And it gets compounded by other bad data, so multiple not-so-bad inputs may produce a very bad result. Consider the models of stock market growth and opportunities heading into 2001, prior to the dot-com crash, and in 2007, prior to the worst downturn in the semiconductor industry’s history. As more machines begin teaching more machines, it becomes even harder to trace the origin of the problem and to correct all of the machine learning that has been built upon that data.
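The compounding effect is easy to see in a toy model. The Python sketch below assumes a chain of machines in which each one inherits the previous machine's estimate and adds the same small multiplicative skew. That uniformity is a simplifying assumption, as are the 2% skew and the 10 hops. Real systems will drift less predictably.

```python
def compound_skew(true_value, skew_per_hop, hops):
    """Return the estimate held by each machine in a teaching chain.

    Each machine learns from the one before it and adds a small
    multiplicative skew (a simplifying assumption, not a measured effect).
    """
    estimates = []
    current = true_value
    for _ in range(hops):
        current *= 1.0 + skew_per_hop
        estimates.append(current)
    return estimates

# Hypothetical numbers: a 2% skew at each of 10 hops.
true_value = 100.0
chain = compound_skew(true_value, skew_per_hop=0.02, hops=10)
relative_error = chain[-1] / true_value - 1.0

print(f"final estimate: {chain[-1]:.2f}")
print(f"relative error: {relative_error:.1%}")
```

A skew that looks harmless at any single step ends up as an error of more than 20% by the end of the chain. Nothing in the final number reveals which hop introduced it.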

Machines can do things that people cannot, and in many cases they can do things better than people. But the data used to train them is the basis for all machine learning and modeling, and increasingly it will be shared among machines without human intervention. That data is often not as clean as it needs to be. Like all data, it is subject to bias, human input error, and anomalies that crop up as data is fused together from a variety of different sources.
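One way to picture the problem of fusing data from different sources is a validation step that runs before anything is merged. The Python sketch below is illustrative only. The record layout (a numeric "temperature" plus a "unit" field), the source names, and the plausible range are all assumptions made for the example. It catches missing values, unknown units, and readings that fall outside the expected range once converted to a common unit.

```python
def validate_record(record, valid_range_c=(-40.0, 150.0)):
    """Return a list of problems found in one temperature record."""
    problems = []
    value = record.get("temperature")
    unit = record.get("unit")
    if value is None:
        problems.append("missing temperature")
    elif unit not in ("C", "F"):
        problems.append(f"unknown unit: {unit!r}")
    else:
        # Convert to a common unit before checking the range.
        celsius = value if unit == "C" else (value - 32.0) * 5.0 / 9.0
        low, high = valid_range_c
        if not low <= celsius <= high:
            problems.append(f"out of range: {celsius:.1f} C")
    return problems

def fuse(sources):
    """Split records from all sources into clean and rejected lists."""
    clean, rejected = [], []
    for source_name, records in sources.items():
        for record in records:
            problems = validate_record(record)
            if problems:
                rejected.append((source_name, record, problems))
            else:
                clean.append(record)
    return clean, rejected

# Hypothetical records from two sites that report in different units.
sources = {
    "plant_a": [{"temperature": 21.5, "unit": "C"},
                {"temperature": None, "unit": "C"}],
    "plant_b": [{"temperature": 70.0, "unit": "F"},
                {"temperature": 7000.0, "unit": "F"}],
}
clean, rejected = fuse(sources)
for source_name, record, problems in rejected:
    print(source_name, record, problems)
```

Keeping the rejected records alongside their source name preserves a trail back to where the bad data came from. A check like this cannot catch data that is plausible but biased.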

At this point, however, there is far more effort going into getting these machine learning systems to work and far too little effort being put into making sure the initial information is correct. Bad data requires workarounds, and workarounds are like software patches on software patches. Sooner or later they produce unexpected results, which in the case of machine learning and machine-to-machine communication is an unknown built upon an unknown.
