How To Measure ML Model Accuracy

What’s good enough for one application may be insufficient for another.


Machine learning (ML) is about making predictions about new data based on old data. The quality of any machine-learning algorithm is ultimately determined by the quality of those predictions.

However, there is no one universal way to measure that quality across all ML applications, and that has broad implications for the value and usefulness of machine learning.

“Every industry, every domain, every application has different care-abouts,” said Nick Ni, director of product marketing, AI and software at Xilinx. “And you have to measure that care-about.”

Classification is the most familiar application, and “accuracy” is the measure used for it. But even so, there remain disagreements about exactly how accuracy should be measured or what it should mean. With other applications, it’s much less clear how to measure the quality of results. In some cases, it may even be subject to personal preference.

Machine learning is intended to solve real problems in the real world. As such, its success or failure depends on a host of considerations. There’s the model, there’s how the model is implemented, and there’s how the implementation integrates with the larger system hosting it. Each of these has many different components that ultimately determine the fitness of a solution for a given application.

The two most easily conflated aspects of solution quality are the model and its implementation. A perfect implementation can never improve the quality of a poor model, and the best model can be ruined by a poor implementation. Only if both the model and the implementation are good can we hope for good results overall, provided no other considerations conspire to hurt them.

Classification and segmentation models
If model quality is paramount, then how do we measure that quality? We hear the word accuracy blithely tossed around, and yet it has no meaning for many applications. Even for those where it’s appropriate, not everyone agrees on how best to measure it. “Vision has different types — classification, segmentation, instance segmentation, super-resolution, and there are different metrics for these,” said Suhas Mitra, product marketing director, Tensilica AI products at Cadence.

The most popular application is classification, and the most popular type of classification is visual. Even here, there is ambiguity. Is accuracy about picking the right answer? Many applications stop short of a single answer, instead ranking possible outcomes by probability. So do you count a result as correct only if the desired answer is the top pick? Or is it close enough if it ends up in the top 5 or top 10 results?
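That “top-k” notion of correctness is easy to make concrete. The sketch below (in Python, with entirely made-up class names and scores) checks whether the true label lands among a model’s k highest-ranked guesses:

```python
# Sketch: top-1 vs. top-k "correctness" for one classifier output.
# The class names and probability scores here are illustrative only.
def topk_correct(scores, true_label, k):
    """Return True if true_label is among the k highest-scoring classes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]

scores = {"dog": 0.48, "wolf": 0.30, "cat": 0.12, "fox": 0.07, "car": 0.03}
print(topk_correct(scores, "wolf", 1))  # False: "wolf" is not the top pick
print(topk_correct(scores, "wolf", 5))  # True: it is in the top 5
```

With k = 1 this is ordinary top-1 correctness; averaged over a test set, the same check yields the familiar top-5 or top-10 accuracy numbers.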

Once you have a definition of “correct,” you need to measure how often inferences using that model give a correct result. And for any given binary classification inference, which many people exemplify with dog vs. cat, there are four possible outcomes. Two of them result from a correct prediction: either the model said it was a dog and it was indeed a dog, or the model said it wasn’t a dog and it wasn’t. These are true positives and true negatives. The other two are the false results: a false positive, when the model says it’s a dog but it’s not, and a false negative, when the model fails to identify a dog.
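Tallying those four outcomes is the first step in almost any accuracy calculation. A minimal sketch, using invented labels and predictions:

```python
# Sketch: counting the four outcomes for a binary "is it a dog?" classifier.
# The actual/predicted lists below are made-up example data.
def confusion_counts(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))          # true positives
    tn = sum(not p and not a for p, a in zip(predicted, actual))  # true negatives
    fp = sum(p and not a for p, a in zip(predicted, actual))      # false positives
    fn = sum(not p and a for p, a in zip(predicted, actual))      # false negatives
    return tp, tn, fp, fn

actual    = [True, True, True, True, False, False, False]  # 4 images contain dogs
predicted = [True, True, True, False, True, True, False]   # 3 found, 2 false alarms
print(confusion_counts(predicted, actual))  # (3, 1, 2, 1)
```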

However, this simplicity holds only if the inference is intended to answer the question, “Which images contain dogs?” What if the question instead is, “Which of the following 20 items is contained in each picture?” Here you’re combining the results of a series of 20 questions that ask, “Is this thing in the image?”

Practical solutions often take this one step further. Not only must an object be identified, but a bounding box must be drawn around the object in the image, with extra credit for having the “box” trace the contours of the object. This is referred to as segmentation, which can itself be divided into semantic segmentation, where only types of object are identified, and instance segmentation, where different instances of the same object are also separated out.

The “quality” question is then further complicated. What if an object is accurately identified, but the bounding box is poorly positioned? If included in the score, bounding-box quality can only drag the measure down, because a good bounding box on an incorrect classification is still a failure, while a correct classification with an iffy bounding box may not be quite so serious. “In automotive, it’s not so important to have a most accurate bounding box, but rather, ‘Did I detect the person in front of me?’” Ni said.

What’s the best measure of accuracy?
This gets to a fundamental question about accuracy — should it accommodate the seriousness of errors? If you’re 95% accurate, is that okay, even if inferences that fall into the failing 5% may have catastrophic consequences? Can a measure of accuracy be constructed that takes the seriousness of failure into account?

Steve Teig, founder and CEO of Perceive, believes so. He notes that popular measures of accuracy tend to be driven by the notions of “precision” and “recall,” but that it’s just a ratio-of-numbers game. Precision is the number of true positives divided by the total number of positives (true and false). Recall measures how many of the things being sought were correctly identified: the number of true positives divided by the sum of true positives and false negatives, the latter being the ones that got missed.

So if you have a set of pictures, and 4 of them contain dogs, and the model identified 3 of those correctly while also thinking that 2 non-dog images had dogs, then the precision is 3/(3+2), while the recall is 3/(3+1).
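The same arithmetic, spelled out with the counts from that example (3 true positives, 2 false positives, 1 missed dog):

```python
# Precision and recall for the dog-picture example in the text:
# 4 dog pictures, 3 found, 2 non-dog images flagged as dogs.
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)   # 3/5 = 0.6  (how targeted the detections are)
recall    = tp / (tp + fn)   # 3/4 = 0.75 (how thorough the detections are)
print(precision, recall)
```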

Fig. 1: The notions of precision and recall for use in evaluating classification models. Source: Walber

Effectively, these measure how targeted the identification is (by having few false positives) and how thorough it is (by finding as many of the things as possible). They’re often combined into a single measure unhelpfully called “F1” (also called the “dice score”), the harmonic mean of precision and recall. Detection results are often summarized with an overall mean average precision (mAP) score. There’s also an intersection-over-union (IoU) measure for segmentation, comparing inference results to ground truth. “They look at the intersection and the union of the two,” said Cadence’s Mitra.
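For concreteness, here is a sketch of F1 and of IoU for axis-aligned boxes given as (x1, y1, x2, y2); the box coordinates are made up:

```python
# Sketch: F1 (harmonic mean of precision and recall), and IoU between
# two axis-aligned boxes. The box coordinates are illustrative only.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (clamped to zero if the boxes don't intersect).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(round(f1(0.6, 0.75), 3))              # 0.667, from the earlier example
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 25/175 ≈ 0.143
```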

But it’s strictly a numbers game. If you have a video camera watching for dangerous predators coming onto your farm, then you want to make sure your dog isn’t accidentally identified as a predator. As a start, you could infer a rule saying, “If it’s a dog, then it’s okay.”

If it mistakes an opossum for a dog, that’s not such a big mistake. If it mistakes a wolf for a dog, however, that’s a huge mistake. And yet the numbers would be the same, according to classic accuracy. The seriousness of certain errors or difficulty of identification aren’t measured. “I would get no extra credit for detecting the really hard-to-detect dog,” noted Teig. “But I get no penalty for the fact that my dog mistake was a cat versus a helicopter.”
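One way to fold that seriousness into the score, sketched here with entirely illustrative cost values, is to charge each kind of mistake its own penalty instead of counting every error equally:

```python
# Sketch: a cost-weighted error score in which each (actual, predicted)
# confusion carries a hand-assigned severity. The costs are assumptions
# chosen to match the farm example, not values from any real system.
COSTS = {
    ("wolf", "dog"): 100.0,   # predator mistaken for the pet: catastrophic
    ("opossum", "dog"): 1.0,  # harmless animal mistaken for the pet: minor
    ("dog", "wolf"): 10.0,    # pet flagged as a predator: costly false alarm
}

def weighted_error(pairs):
    """Sum severity over (actual, predicted) pairs; correct answers cost 0."""
    return sum(COSTS.get((a, p), 0.0) for a, p in pairs if a != p)

results = [("wolf", "dog"), ("dog", "dog"), ("opossum", "dog")]
print(weighted_error(results))  # 101.0 -- dominated by the wolf mistake
```

Under plain accuracy both runs with one error score identically; under this scheme, mistaking a wolf for a dog is a hundred times worse than mistaking an opossum for one.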

The numbers also can be brittle, given that in at least one research project a change in a single pixel was able to throw the inference engine off. “Augmentation of anything can significantly manipulate the outcome,” said Mitra. “You can fool the deep learning network into giving you a wrong answer.”

Teig maintains that training data sets shouldn’t focus on the number of data points, but rather the diversity. And the more bizarre the better. “You see one freaky looking dog, so how many votes should that freaky dog get to help the model to understand that dogs come in a much larger range than the model would otherwise be led to believe?” asked Teig.

He describes this through the notion of surprise. “Weird data points are some of the most interesting ones,” he said. “They’re the ones that show that the diversity of ‘dogness’ includes that freaky looking dog, even if it doesn’t look like most of the dogs you’ve seen before. It is weird data points that are telling the model that there exist surprising data points, like the dog with no hair or one ear.”

He says that the standard F1 approach reflects average precision, or average surprise. This involves Shannon’s notion of entropy, which quantifies the minimum data size needed to capture the complexity of a data set. It does so in a way that assumes an optimal encoding of the data. ML models may or may not use an optimal encoding.
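Shannon entropy itself is a one-liner: the average surprise, -log2 p, weighted by how often each outcome occurs. A quick sketch with toy distributions:

```python
# Sketch: Shannon entropy in bits -- the average surprise over a
# probability distribution. Rare outcomes carry more surprise.
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy_bits([0.99, 0.01]))  # ~0.08 bits: a near-certain outcome
```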

Selecting a specific model to train on some data fixes the “encoding” as a result of that model decision. “A rare outcome of an experiment is more surprising than a common outcome of an experiment,” said Gerald Friedland, co-founder and CTO of Brainome. “If each outcome is equally likely, all outcomes have the same surprise. The quantification in bits is explained by Shannon and used in many information-coding strategies, like assigning longer strings to rarer outcomes for compression.”

There’s a related quantity called “memory-equivalent capacity,” or “MEC,” which effectively measures the diversity of a data set for a given, possibly sub-optimal, encoding. “MEC is the amount of memory that a machine learning model needs to learn an arbitrary data set,” said Huan Le, business development at Brainome.

The idea is to have enough data of high quality to find the “rule” or “algorithm” hidden within the data. “An overfitting model learns by memorizing as opposed to extracting a rule,” said Le. Once you have enough high-quality data, more isn’t necessarily helpful. “The statement that ‘there’s no data like more data’ is actually incorrect. There’s no data like enough data to figure out the rule.”

Encoding may provide a way for unusual outcomes to be handled differently. But, critically, those outcomes need to be part of the training set. There’s no way to account for a data point that’s never been seen if it lies outside the “rule.”

“Science works only with observations,” said Friedland. “Unobserved things can’t be accounted for unless one finds a model that explains the observations and generalizes beyond the experimental data.”

Teig has posited the notion of “extropy” as an alternative to entropy. This involves the “softmax” function taken on the negative log of probability, and the idea is to minimize maximum surprise rather than average surprise. “We are not interested in average surprise,” insisted Teig. “We’re interested in maximum surprise. Don’t ever do anything stupid. Don’t mistake a dog for a helicopter. It is extreme errors, severe mistakes, that are what people actually want insight into from actual machine learning. And yet nobody who works on machine learning actually works on that.”
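Teig’s “extropy” formulation isn’t spelled out here, but the distinction he draws between average and maximum surprise is easy to illustrate. In the sketch below (with purely illustrative probabilities), one catastrophic miss barely moves the average but dominates the maximum:

```python
# Sketch only: average surprise (what cross-entropy-style losses optimize)
# vs. maximum surprise (the single worst error). This illustrates the
# distinction Teig draws; it is not his actual "extropy" formulation.
import math

# Probabilities the model assigned to the *correct* class on each sample.
p_correct = [0.9, 0.95, 0.85, 0.001]   # three easy wins, one catastrophic miss

surprises = [-math.log(p) for p in p_correct]
avg_surprise = sum(surprises) / len(surprises)
max_surprise = max(surprises)
print(f"average: {avg_surprise:.2f}, max: {max_surprise:.2f}")
```

A metric that minimizes the average can happily tolerate that one disastrous sample; a metric that minimizes the maximum cannot.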

It bears noting that, all of these competing definitions notwithstanding, laboratory models generated and tested with clean data may not fare as well when faced with the noisy data from the real world. That can make advertised accuracy – however measured – optimistic as compared to how the model will perform with less carefully curated data.

Non-classification algorithms
Classification problems include far more than object detection. Handwriting recognition or audio transcription also conform to intuitive notions of accuracy, even if their details may differ. But there are many other problems that artificial intelligence tries to solve, and accuracy isn’t necessarily a meaningful measure, no matter how the details are defined.

In the health care industry, complex problems are being solved with the aid of ML. Protein folding is one, where the question being answered is, “For this complex protein, what will its physical shape be when folded?” This has important implications for understanding how an antibody might attach to a virus.

Another example lies in genomics. Among other things, it asks the question, “Given the following symptoms, which genes are responsible?” These problems don’t have such a neat measure as accuracy. You know you’ve been successful if you find the right answer, but confirming that an answer is correct requires much more experimentation.

There’s yet another category of problems generally referred to as “generative.” Super-resolution (SR) image algorithms use AI to improve the quality of images. And yet, how do you measure how much better you made the images? How do you even quantify the before and after results? “Vendors will all promote it differently, but at the end of day it’s really the preference of the viewer,” noted Ni.

If you’re just trying to clean an image up, Mitra said that peak signal-to-noise ratio (PSNR) can be used simply to measure the reduction of noise. Other changes might not be so straightforward, however.
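PSNR is straightforward to compute from mean squared error. A sketch for 8-bit pixel values, using toy data:

```python
# Sketch: peak signal-to-noise ratio (PSNR) for 8-bit images, derived
# from mean squared error. The tiny pixel "images" here are toy data.
import math

def psnr(clean, noisy, max_val=255):
    mse = sum((c - n) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

clean = [100, 120, 130, 140]
noisy = [102, 118, 131, 139]
print(round(psnr(clean, noisy), 1))  # 44.2 (dB); higher means less noise
```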

Deep fakes are another example. Here you have to start from a clumsy job that’s not going to fool anybody, maybe cross the “uncanny valley,” and get to a level that’s credible and could fool a discerning eye. But with both this and SR algorithms, there’s an aesthetic component. The success of a solution may vary by individual. For SR, different people may have different personal preferences, so the notion of “better” isn’t hard and fast. For deep fakes, it’s partly a function of how good someone’s eye is.

AI is making its way into the EDA world, as well. But success there is less straightforward to measure. One approach is to compare results to an initial baseline result. “Some AI routines tune the setup of place-and-route tools based on previous results,” said Marc Swinnen, director, product marketing, semiconductor at Ansys. “So the first round, you go in blind, and then you learn from the results.”

Finally, there are the ML models that the data giants of the world run to make recommendations based on their vast troves of data. “Google and Facebook aren’t going to tell us how they measure accuracy,” said Ni. “That’s the secret sauce for data science.”

Fused data
The examples we’ve looked at apply a single algorithm to create an outcome. But some problems require the fusion of different data sources to answer fundamental questions. Automotive applications are the most obvious of these, with multiple cameras and lidar and radar sensors bringing data in. Those results will be combined to answer the question of whether a child is running in front of the car. It may result in autonomous decisions being made.

How should quality be measured there? Should the fused outcome be measured? Or should each individual data stream be measured on its own? Or both?

The need for multiple streams is partly a matter of coverage. On a car, a single camera can’t give a complete view of everything around the vehicle; it takes multiple cameras. Fusion then combines those multiple images into a single merged view.

But the use of lidar and radar helps for conditions that aren’t optimal for a camera — say, at night, when there’s little light, or when the sun is blinding one of the cameras. Here, the additional data streams serve partly as redundancy to cover for cases when accuracy from one model may be insufficient.

Automotive is a good example of a very complex problem being solved — that of identifying how a car should operate in order to get a passenger to a destination as quickly as legally possible, while causing no damage along the way. A simple measure of quality there is extremely hard to identify, and yet it’s clear when the system fails.

In the absence of a good number, and where lives may be at stake, multiple algorithms may be implemented in parallel as redundant checks. Even non-ML algorithms may be employed as a cross-check against failure. “In the automotive space, they’re all trying to figure out the redundancies like radar plus lidar,” said Gordon Cooper, product marketing manager for ARC EV processors at Synopsys.

Looking beyond the model
Our focus has been on the model in isolation, because implementation details cannot improve the quality of the model. So it becomes the critical starting point. Implementation steps like quantization and model pruning can reduce quality. The trick is to perform those simplifications with minimal negative impact.
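As an illustration of why such steps cost quality, here is a naive sketch of 8-bit post-training quantization. Real toolchains are far more sophisticated (per-channel scales, calibration, quantization-aware training), but the precision-for-size trade is the same:

```python
# Sketch: naive symmetric int8 quantization of a weight vector, and the
# rounding error it introduces. Weight values are made-up examples.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127   # map the largest weight to 127
    q = [round(w / scale) for w in weights]      # what would be stored as int8
    return [v * scale for v in q], scale         # dequantized approximation

weights = [0.31, -0.72, 0.05, 1.24, -0.98]
approx, scale = quantize_int8(weights)
max_err = max(abs(w - a) for w, a in zip(weights, approx))
print(f"scale={scale:.4f}, worst-case error={max_err:.4f}")
```

Each weight now needs 1 byte instead of 4, at the price of a per-weight error of up to half the scale step, and it’s that accumulated error that shows up as lost model quality.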

Flex Logix sees customer sensitivities to such changes. “They don’t want the model to be changed,” said Geoff Tate, CEO of Flex Logix. “They don’t want us to take shortcuts that further change the accuracy in an unpredictable way.”

A bigger picture of quality would include other important characteristics of a full system. “Customers have two constraints — they’ve got a money constraint, and they’ve got a power constraint,” said Tate. They’re also likely to look at the ability to scale or modify a model on a given platform, although that may be weighed in the context of cost and power. “What they want is to get the most inferences per second per dollar and per watt.”

Each of these can be measured in different ways. These considerations don’t directly impact model quality, but compromises made in their favor may come at the cost of quality.

