Easier And Faster Ways To Train AI

Simpler approaches are necessary to keep pace with constantly evolving models and new applications.


Training an AI model takes an extraordinary amount of effort and data. Leveraging existing training can save time and money, accelerating the release of new products that use the model. But there are a few ways this can be done, most notably through transfer and incremental learning, and each of them has its applications and tradeoffs.

Transfer learning and incremental learning both take pre-trained models and adapt them to new applications, although each works differently. In addition, there is reinforcement learning, which provides yet another way to improve a model on-the-fly, but it diverges from some of the more familiar deep learning notions.

“Learning styles vary from person to person, but more importantly, they vary from application to application,” said Ashutosh Pandey, lead principal ML/audio systems engineer at Infineon. “We heavily use incremental learning because it provides a balance between generalization and customization. Transfer learning is useful in many audio applications. Reinforcement learning is a different beast.”

There is enormous inequality between companies when it comes to accessing valuable training data. Companies like Amazon, Google, and Facebook acquire enormous troves of data on a daily basis. Many “challenges” or “quizzes” on Facebook, for example, are merely ploys to acquire data. “What did you look like 10 years ago?” stimulates enormous response, allowing Facebook to improve its face-aging algorithms in ways few other companies can.

That doesn’t mean all other companies are locked out. Training data sets can be acquired. But potentially more interesting are models that already have been trained to a large data set. Those models may be available to smaller companies, and they can be a useful starting place.

Breaking models in two
Sophisticated machine-learning (ML) models have many forms, but most of them can be roughly split into two parts. The first layers of the model identify basic features, such as lines or shapes or other abstract pieces in a visual application. The role of that first part of the model is to do as good a job as possible in identifying those features.

The second half takes the features and figures out something about the input being inferred. In a classification application, that means deciding that some combination of features constitutes a car or a stop sign or a cat.

This splitting of the model into two parts, while potentially over-simplistic, can help to visualize some of the critical differences between transfer learning, incremental learning, and reinforcement learning. It creates a notional distinction between the process of identifying features, and then classifying objects based on the features.

Fig. 1: A simplified depiction of the splitting of a model into the layers that implement feature extraction and the layers that apply the features to do classification. In real models, the break might not be so clean, and it may vary for non-classification models. Source: Bryon Moyer/Semiconductor Engineering

Some actually have a name for this. At Flex Logix, the first part is the “backbone,” and the second part is the “head.”

“The backbone is like the visual cortex,” said Dana McCarty, vice president of sales and marketing for inference products at Flex Logix. “It extracts all the features out of things that you’re seeing. And the head, which would be like your frontal cortex, makes sense of what you’re seeing.”

Transfer learning
Transfer learning is the only one of the three types of learning that is specifically a development approach, done before a model is deployed. The idea is to avoid starting from scratch to train a new model — especially if one doesn’t have millions of labeled samples with which to train.

“The problem is that you don’t always have data,” said Pandey. “Or you might not have the right data.”

Still, this approach is a time-saver. “Just as a designer would take their knowledge from a previous design and apply it to the next one, machine learning needs to do the same,” observed Rod Metcalfe, product management group director in the digital and signoff group at Cadence. “Otherwise, we have to do the same learning process all over again, and clearly that’s inefficient.”

Others agree. “Transfer learning is very important for generalizing machine learning,” said Suhas Mitra, product marketing director, Tensilica AI products at Cadence. “You’re taking knowledge from some domain and applying it to a similar problem in the same domain.”

Google, for example, has trained a visual model to classify 1,000 objects. That may be more objects than any one application needs, but in doing so it has refined the identification of the features required for all of those items. That makes it likely that it will do a decent job on the specific features needed for the subset of objects that a new application may need to classify.

“Let’s say I’m training on a bunch of images like a bike, car, and maybe a dog,” explained Sree Harsha Angara, product marketing manager for IoT, compute, and security at Infineon. “Now let’s say I want to recognize a cat, which it’s never been trained for before. With transfer learning, you assume that the network has learned some of the basic characteristics of how objects differ, and you don’t retrain the entire model from scratch. You retrain just the final layers.”

Far fewer samples are needed for this retraining than would have been needed for the original training, because you’re no longer training the model on how to identify features. You’re training it only on how to use those features when classifying the desired objects.

This training usually is done in the cloud, just like training from scratch. “Training frameworks give you a way to freeze which neurons you train,” noted Angara, allowing only select layers to be retrained.
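As a sketch of this freezing idea, using plain NumPy rather than a real training framework (the dimensions, data, and two-part split here are all hypothetical), retraining only the head while the backbone stays fixed might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "backbone": a pre-trained feature extractor
# whose weights are never updated during transfer learning.
W_backbone = rng.normal(size=(8, 4))

def backbone(x):
    return np.tanh(x @ W_backbone)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# A small labeled set for the new task, far fewer samples than the
# original training would have required.
X = rng.normal(size=(32, 8))
y = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[y]

feats = backbone(X)              # features computed by the frozen layers

W_head = np.zeros((4, 2))        # the new "head": the only trained weights
loss_before = -np.mean(np.log(softmax(feats @ W_head)[np.arange(32), y]))

for _ in range(200):             # retrain just the final layer
    probs = softmax(feats @ W_head)
    W_head -= 0.5 * feats.T @ (probs - Y) / len(X)

loss_after = -np.mean(np.log(probs[np.arange(32), y]))
```

In a real framework this corresponds to marking the backbone layers as frozen (so their gradients are never computed) and updating only the replacement head.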

However, transfer learning may not work if the old and new applications are too different. “I would not say that you take a cats-and-dogs model and train speech synthesis on it,” cautioned Mitra.

There’s also the concept of a “student” model. Rather than being trained from scratch on the original data, the student “queries” a pre-trained “teacher” model and learns from the teacher’s outputs.

“There is this notion of student/teacher paradigm in machine learning,” explained Mitra. “With transfer learning, you more than likely will need to retrain the network. In the case of student/teacher, you may or may not have to.”
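A minimal sketch of the student/teacher idea, with a toy linear “teacher,” hypothetical data, and no particular framework’s API, might look like this: the student never sees original labels, only the teacher’s soft outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, temp=1.0):
    z = z / temp
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical pre-trained "teacher": a fixed linear scorer.
W_teacher = rng.normal(size=(6, 3))

# The "student" starts untrained and learns to match the teacher.
W_student = np.zeros((6, 3))

X = rng.normal(size=(64, 6))
# The student queries the teacher for soft targets; no original
# training labels are needed.
targets = softmax(X @ W_teacher, temp=2.0)

for _ in range(300):
    probs = softmax(X @ W_student, temp=2.0)
    W_student -= 0.5 * X.T @ (probs - targets) / len(X)

# Fraction of inputs on which student and teacher now agree.
agreement = np.mean(
    (X @ W_student).argmax(axis=1) == (X @ W_teacher).argmax(axis=1)
)
```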

Note, however, that a transfer-learned model may not be optimal. “If we have good access to data and we want to create the most power-efficient solution or the smallest solution or the best response time, we will go with no transfer learning,” cautioned Pandey, training instead from scratch.

Incremental learning: adding classes
Incremental learning means improving existing deployed models. “While transfer learning is where we want to create good generalization, incremental learning is where we want to create field robustness,” said Pandey.

The details of incremental learning, however, can be different for different people. “Incremental learning is where you get better at solving the same kind of problem,” said Infineon’s Angara. “The other dimension is recognizing more classes.”

Both aspects involve evolving a model that already has been deployed. But while the changes affect models in use, they do not happen while the model is executing. The learning is still an offline operation.

We saw an example of adding a class through transfer learning, but this is different. “With transfer learning, you almost destroy the original classes and make new classes,” explained Angara. “With incremental learning, I have a bunch of existing classes. I still want to keep those classes, but I want to add something new.”

One approach to adding classes has been implemented by BrainChip, but it is specific to the company’s use of a spiking neural network (SNN). SNNs operate differently from more conventional artificial neural networks (ANNs) in the way classification is done.

With many ANNs, the final classification is done in the last layer via the softmax function. That function takes a vector of numbers and transforms it into a vector whose members are all between 0 and 1, and whose entries all sum to 1. In other words, they can be interpreted as probabilities.

Each entry in the resulting vector is the probability of the image being classified as a particular thing. If you’ve trained a model to recognize 1,000 things, then that vector will have 1,000 entries, and each entry says how likely it is that the item being classified is that thing.

With a softmax-based ANN, if you change the number of items being classified, you need to redo the softmax calculation with the new vector size, which changes every vector element. So there’s a dependency built in between the items being classified.
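The softmax transform, and the cross-class dependency it creates, can be sketched in a few lines (the scores here are hypothetical):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw outputs of the last layer
probs = softmax(scores)              # entries in [0, 1], summing to 1

# Adding a fourth class changes EVERY entry, not just the new one:
# the cross-class dependency described above.
probs4 = softmax(np.append(scores, [1.5]))
```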

It’s different with SNNs. “We train our neurons by repeating spike patterns,” said Anil Manker, CEO and co-founder of BrainChip. “The spikes that have gone through first layers will have a specific pattern, and a final fully connected layer latches onto the pattern. The pattern depends on when the spike happened and how many spikes came.”

There’s no vector that has to be adjusted, and changing the number of items to be classified has no effect on any items other than ones being added or deleted.

In BrainChip’s case, its last-layer nodes have a “training bit” that can be turned on and off. When off, the neuron will operate as expected during inference. Adding an item means that a new node will have the training bit on, and it will watch the spike pattern during the training.

“Once a neuron or node has learned a pattern, you label it,” explained Manker. “When you do that, you have turned off its training. So this neuron will fire only if the incoming pattern matches what it has learned.”

This requires spare neurons. If there aren’t any, and if the model classifies more objects than necessary, then one can “delete” an undesired class and then retrain it.

“If you want to forget a pattern, simply go to that neuron and turn on its learning,” said Manker. “That resets its old learned pattern, and now you can learn a new pattern.”
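As a loose analogy only, and not BrainChip’s actual implementation, a last-layer node with a “training bit” might behave like this toy class: while training is on it latches the incoming pattern, labeling turns training off, and re-enabling training forgets the old class.

```python
import numpy as np

class PatternNeuron:
    """Toy analogy of a last-layer node with a 'training bit'
    (illustrative only, not BrainChip's implementation)."""

    def __init__(self):
        self.training = True
        self.pattern = None
        self.label = None

    def observe(self, spikes):
        if self.training:
            self.pattern = np.array(spikes)   # latch the spike pattern

    def set_label(self, label):
        self.label = label
        self.training = False                 # labeling turns training off

    def fires(self, spikes, tolerance=1):
        if self.training or self.pattern is None:
            return False
        # Fire only if the incoming pattern (nearly) matches the learned one.
        return int(np.sum(self.pattern != np.array(spikes))) <= tolerance

    def forget(self):
        # Re-enabling training resets the old learned pattern,
        # freeing the neuron to learn a new class.
        self.training = True
        self.pattern = None
        self.label = None

cat = PatternNeuron()
cat.observe([1, 0, 1, 1, 0])   # hypothetical spike pattern for "cat"
cat.set_label("cat")
```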

Note that with ANNs, softmax doesn’t have to be a barrier. “For many of the classification methods, people use ‘metric learning,’ where there is no softmax,” said Pandey.

This uses clustering for classification, which can be done for a new class independently of the existing classes.
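A minimal nearest-centroid sketch (with hypothetical 2-D features) shows how a new class can be added without touching the existing ones:

```python
import numpy as np

rng = np.random.default_rng(2)

def add_class(centroids, name, samples):
    # Adding a class touches nothing but its own centroid;
    # existing classes are unaffected.
    centroids[name] = np.mean(samples, axis=0)

def classify(centroids, feature):
    # Assign the class whose centroid is nearest in feature space.
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - feature))

centroids = {}
add_class(centroids, "dog", rng.normal(loc=[0.0, 5.0], size=(20, 2)))
add_class(centroids, "car", rng.normal(loc=[5.0, 0.0], size=(20, 2)))

# Incrementally add a new class without retraining the others.
add_class(centroids, "cat", rng.normal(loc=[5.0, 5.0], size=(20, 2)))
```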

Other views of incremental learning
Others see incremental learning as generally improving the capabilities of a deployed model. “Incremental learning is more like taking a trained model and adding more information to it to make it better,” said Mitra. “The idea is that you’ve trained a network in the cloud, but you’re incrementing just a little bit here and there.”

The training algorithm determines where it’s done. There are simplified back-propagation techniques and statistical techniques that can make it possible to modify the training at the edge.

“If you are doing full-blown back-prop, it’s mostly done in cloud,” said Pandey. “If you’re doing statistical or inductive inference, you can do it on the edge.”

Incremental learning can go awry, however. “Great care needs to be taken to protect against catastrophic forgetting,” said Constantinos Xanthopoulos, senior data scientist at Advantest. “This phenomenon arises when neurons in hidden layers are repurposed for the new learning, thus ‘forgetting’ what they had previously learned.”

He provided a real-world analogy to illustrate this. “Assume that we are training a dog. At some point, the dog learns by associating the sound we make with the desired action. At a later date, we want to train the dog to some other action, but the sound of the new command is very similar to that of the first command. After some effort, and likely a lot of confusion for the dog, we manage to train the dog to perform that new action, but what we have achieved is to re-purpose the same association to the new command, thus forgetting the previous one.”

This can be avoided, but with a catch. “One technique is a rehearsal-based approach,” he said. “In rehearsal, prior training instances are repeated, along with the new ones, to help the model to find a generalized solution for both old and new learnings. This approach, though, conflicts with the premise of incremental learning, which is that there is no record of the original training data that can be re-used during the new training.”
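The mechanics of rehearsal can be sketched as mixing stored samples into each new training batch (buffer contents, batch sizes, and labels here are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

def rehearsal_batch(new_X, new_y, buffer_X, buffer_y, replay_fraction=0.5):
    """Mix stored ("rehearsed") old samples into each new batch, so that
    updates keep pulling the model toward the old tasks as well."""
    n_replay = int(len(new_X) * replay_fraction)
    idx = rng.choice(len(buffer_X), size=n_replay, replace=False)
    X = np.concatenate([new_X, buffer_X[idx]])
    y = np.concatenate([new_y, buffer_y[idx]])
    perm = rng.permutation(len(X))          # shuffle old and new together
    return X[perm], y[perm]

# Hypothetical buffer of samples retained from the original training
# (old classes 0-1), plus new samples for new classes 2-3.
buffer_X, buffer_y = rng.normal(size=(100, 4)), rng.integers(0, 2, size=100)
new_X, new_y = rng.normal(size=(16, 4)), rng.integers(2, 4, size=16)

bx, by = rehearsal_batch(new_X, new_y, buffer_X, buffer_y)
```

As the quote notes, this presumes some record of old training instances survives, which is exactly what strict incremental learning rules out.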

In some cases, improvements from many different deployments are sent back up to the “mother model” in the cloud to improve the overall model for redistribution in an update. This is an example of federated learning, where distributed points do the training instead of it happening all in one place.

“Federated learning is taking some meta-information and throwing it upstream,” said Mitra. “Google and Facebook are the biggest proponents of federated learning. It’s an extremely complex problem.”
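One common aggregation scheme for this is federated averaging. A toy sketch (with hypothetical weight deltas, and nothing resembling Google’s or Facebook’s actual systems) looks like:

```python
import numpy as np

def federated_average(global_w, client_updates):
    """FedAvg-style aggregation sketch: each deployment sends back only
    its weight delta (meta-information), never its raw local data;
    the cloud averages the deltas into the mother model."""
    mean_delta = np.mean(client_updates, axis=0)
    return global_w + mean_delta

global_w = np.zeros(4)
# Hypothetical deltas computed locally by three deployed devices.
client_updates = [np.array([0.2, 0.0, -0.1, 0.3]),
                  np.array([0.1, 0.1, -0.2, 0.1]),
                  np.array([0.3, -0.1, 0.0, 0.2])]
new_w = federated_average(global_w, client_updates)
```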

Incremental learning is still finding its footing. “Despite this being one of the most heavily researched topics in ML research in recent years, we haven’t yet begun to see real-life applications in the semiconductor space,” added Xanthopoulos, in reference to the semiconductor test industry.

Reinforcement learning
Reinforcement learning is completely different. It’s a means of training a new model when little or no training data exists. “Data and ground truth may not be available for some applications,” said Pandey.

It learns in the wild, based on feedback that acts as a reward or penalty, depending on how a particular inference turned out. Got it right? Reinforce that approach. Got it wrong? Change the approach.
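A toy multi-armed-bandit sketch (with hypothetical reward probabilities) shows that reward/penalty loop in its simplest form:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy bandit: the agent picks among 3 actions; the environment's
# feedback (reward when right, nothing when wrong) is the only signal.
true_reward = np.array([0.1, 0.8, 0.3])   # hidden from the learner
q = np.zeros(3)                           # learned value estimates
alpha, epsilon = 0.1, 0.1

for _ in range(2000):
    # Explore occasionally; otherwise exploit the current best estimate.
    a = int(rng.integers(3)) if rng.random() < epsilon else int(q.argmax())
    reward = float(rng.random() < true_reward[a])   # noisy feedback
    q[a] += alpha * (reward - q[a])                 # reinforce or correct
```

Got it right? The estimate for that action rises. Got it wrong? It falls, steering the agent elsewhere, with no labeled training set in sight.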

“Reinforcement learning has the notion of an agent, which is like a body that’s saying, ‘How well am I doing?’” explained Mitra. “This generally happens in the cloud, because reinforcement learning is a very heavy thing.”

There are some distinct differences between reinforcement learning and the prior two types of learning. First, unlike the other two, reinforcement learning happens while the deployed application is running. It is not an offline operation.

Second, it can have unexpected results. Typical ANNs establish limits to what can be identified. One might think of it as saying that if something being identified has features that aren’t in the learned feature set, then the model won’t be able to handle it. Those features must be explicitly engineered in.

With reinforcement learning, there is no such limit. The model will go wherever the reinforcement mechanism sends it, and it can take on approaches that a developer may never have considered.

“With incremental learning, your task is bounded, and your response time is pretty quick,” said Pandey. “With reinforcement learning, you are able to go to places where you have not gone before.”

Fig. 2: A summary of the differences between transfer, incremental, and reinforcement learning. Source: Bryon Moyer/Semiconductor Engineering

Mixing and matching
Some companies may use combinations of these techniques. MicroAI, for instance, does a combination of unsupervised, simple reinforcement, and incremental learning. Its application deals with monitoring data streams to look for anomalies for applications like security or preventive maintenance.

To create the model, MicroAI starts with unsupervised learning, which clusters similar feature sets according to a provided schema. The reinforcement comes through a human in the loop that will label and reinforce certain patterns for future recognition.

“Labeling takes place by subject matter experts to say, ‘This was a belt failure,’ for example, and that would be applied to that data signature,” explained Chris Catterton, head of solutions engineering at MicroAI. “The next time it sees a similar data signature, it’ll say with some level of confidence that it is a belt failure, and that would be reinforced and fed back to the model.”

The company also uses transfer learning via a set of feature-engineering modules that it uses as a starting point. This helps to set the structure of the new model, while the training sets the parameters.

“We’re starting with unlabeled data and with the feature engineering and the data schema,” said Catterton. “Once we get that into place, we’ll start training on the streaming data [for a period of time] that depends on the normal cycle for the asset.”

The equipment being monitored may drift as it ages, which is where the incremental learning comes in. Even though the machinery isn’t behaving as it did during the initial training, that may not indicate a reportable problem. Instead, the model needs to be retrained, either periodically or when some criterion is met, using the same combination of unsupervised learning and reinforcement labeling.

Update challenges
Updates create a challenge for incremental learning in some applications. Let’s say an automotive OEM implements machine learning to allow the vehicle to recognize and interpret various street signs. That model comes with every identical vehicle sold. But what if the OEM builds in the capability of refining how roads and signs appear in Boston, where roads are tight and curvy and confusing, versus Austin, where there is more space and more of a grid?

Leaving aside the details of how that might happen, this means that as the car is driven, the model slowly changes to adapt to the local environment. Taken to its limit, each model in each car – all of which started out as identical – slowly drifts away from the others, and every model is now custom.

Now let’s say the OEM has been working on the original model and has dramatically improved how features are recognized. It now wants to do an over-the-air update to its fleet. If it replaces the models in each of the cars, then it is effectively doing a “factory reset” and jettisoning the individual learning that each vehicle has done – catastrophic forgetting.

This does not appear to be a scenario that has been considered in many cases. For some lower-volume applications like inspection, the models tend to be restricted to a few machines, and any improvements are likely to be sent to all of them, which does not create a problem.

But the automotive example, where locally improved models would have those improvements replaced by a globally improved model, doesn’t seem to have a ready solution.

Ideally, one could imagine a standard interface between layers, such that one could cleave the model in two, updating the first part to improve feature identification, while leaving the customized last layers. No such standard exists.

There is also a notion of “ensemble learning,” whereby pieces of different models can be brought together. “With ensemble learning, you can have a partial model, and then you can combine the best of both worlds,” said Pandey.

The mechanics of that are within computational reach of the edge, so that notion might allow a car to take an update and combine it with its old model.
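One simple form of such a combination is averaging the two models’ predictions. A sketch with hypothetical stand-in models (ensemble learning covers many other schemes as well):

```python
import numpy as np

def ensemble_predict(models, x, weights=None):
    """Combine an old locally-customized model with a newly updated one
    by averaging their class probabilities: one simple form of ensembling."""
    preds = np.stack([m(x) for m in models])
    if weights is None:
        weights = np.ones(len(models)) / len(models)
    return np.tensordot(np.asarray(weights), preds, axes=1)

# Hypothetical stand-ins: each "model" maps an input to class probabilities.
old_custom = lambda x: np.array([0.7, 0.2, 0.1])   # tuned to local conditions
new_update = lambda x: np.array([0.5, 0.4, 0.1])   # improved general model

combined = ensemble_predict([old_custom, new_update], x=None)
```

Weighting could also be skewed toward the locally tuned model where local conditions matter most.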

It’s also possible to do the combination in the cloud, although that would pose a logistics problem, because in this example the OEM would have to upload the improved model from each car individually, combine each with the update, and then download the merged result individually to each vehicle. That’s a lot of uploading and downloading, which could prove burdensome.

Fig. 3: A comparison between two ways of using ensemble learning to combine an old customized model with general updates. On the left, the combination is done in the vehicle — as long as there is computing capacity for that. On the right, the combination happens in the cloud. But that means that the old models first must be uploaded and then combined to generate the new model, which is then downloaded. This would have to be done individually for each vehicle. Source: Bryon Moyer/Semiconductor Engineering

There’s yet one more possible approach to updates that amounts to logging the local improvements made for incremental learning. When a new model update comes, which doesn’t have those improvements, then you “replay” the logged prior improvements on the new model.

“If I had stored the steps when I did incremental modeling of voice samples, for example, I can go back and redo the same steps with the newer model and adjust it accordingly,” said Mitra.

It might not make sense to update some incrementally trained models. “One application where updates wouldn’t matter would be predictive maintenance,” said Angara. “Let’s say that you have a factory floor with different equipment having different characteristics of failure. Each model would be uniquely tuned to that equipment.”

This can apply to a number of different applications, such as semiconductor inspection. “Once we think we have a confident enough model, we update only when we feel that there are inspection escapes or there’s a large deviation in the design,” said Nabil Dawahre, product manager, X-ray inspection products at Bruker.

Modified models will be a thing
Even though there may be details still to be worked out, transfer, incremental, and reinforcement learning already are happening. Transfer learning is regularly referenced in the development of new algorithms. It brings sophisticated modeling within reach of modest budgets, and it dramatically accelerates time to market.

Incremental learning in the context of adding classes to a classifier is probably not widely practiced today, but as it becomes more available, it may find favor as an extension to transfer learning.

Incremental learning in the sense of simple ongoing improvement is definitely already happening. In some cases, it may simply be thought of as an update rather than as some special kind of learning. But the availability of methods that can retrain at the edge may increase its popularity.

Reinforcement learning has a more limited application space compared with, say, vision applications using ANNs and SNNs. Reinforcement-learning models are purpose-built for the job they need to do, and they are less common today because of the specialized knowledge needed to build them.

In one way or another, however, fewer and fewer models will stand still. It’s just a question of when they change, where they change, and whether they change as a group or individually.

