It’s mostly for data scientists, but not always.
AI training and inference are all about running data through models — typically to make some kind of decision. But the paths that the calculations take aren’t always straightforward, and as a model processes its inputs, those calculations may go astray. Normalization is a process that can keep data in bounds, improving both training and inference.
Forgoing normalization can result in at least two problems. First, activation values can saturate at the numerical extremes during AI training, which can cause the training process to fail. Second, training may produce biased results by overfitting examples that are more frequently represented in a training set.
“Normalization is a fundamental technique in deep learning to stabilize training, improve convergence, and enhance model generalization,” said Jay Pathak, senior director, research and development at Ansys. “It fixes the data distributions coming from different sources and removes bias.”
But how normalization takes place can vary dramatically. Typically, it is worked into models by data scientists.
“Normalization is almost always part of the algorithm,” said Russ Klein, program director for the High-Level Synthesis Division at Siemens EDA. “In other words, when I’m in my machine learning framework, and I’m figuring out how we get from this set of inputs to a really good prediction, I might normalize that data.”
Some normalization may be necessary when implementing embedded models as well, so system designers may also have to deal with it. It depends on where the model will execute and the tradeoff between execution hardware and accuracy.
Shifting spaces
Normalization is about transforming from one number space to another. In its simplest form, it takes a range of data that may not be uniformly spread around zero and balances it out.
“What normalization is trying to do is make sure that no one feature gets more priority,” said Nigel Drego, co-founder and chief technology officer at Quadric. “If you’re trying to create a classifier that is going to distinguish between different kinds of cats, the math could work out such that if you’re not normalizing, then the white cats come up more often or are detected more frequently because they have these larger values.”
Linear normalization, which is most common, involves shifting the number axis so the data is balanced around zero, and then scaling by a factor that puts the number in a desired range. “You’re going to get between minus one and one, typically,” said Drego.
The amount shifted is called the beta. The scale factor is called gamma. “Both gamma and beta are learnable parameters for a layer,” said Ram Tadishetti, principal software engineer at Expedera.
Fig. 1: A simplistic normalization example. Three data points are unbalanced around zero, so they are first shifted to put zero in the middle. Then they are scaled to establish the desired max and min values. The amount shifted is called beta; the scale factor is called gamma. Source: Bryon Moyer/Semiconductor Engineering
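As a concrete illustration of the shift-and-scale idea in figure 1, here is a minimal NumPy sketch (not drawn from any of the quoted companies; the data points are invented, and beta and gamma are computed from the data purely for illustration rather than learned):

```python
import numpy as np

# Hypothetical, unbalanced data points (illustrative only)
x = np.array([2.0, 5.0, 11.0])

# Shift so the data is centered on zero (the "beta" of the figure)
beta = (x.max() + x.min()) / 2.0          # midpoint of the range
centered = x - beta

# Scale so the extremes land at -1 and +1 (the "gamma" of the figure)
gamma = 1.0 / (x.max() - beta)            # reciprocal of the half-range
normalized = gamma * centered

print(normalized)                         # approximately [-1, -0.33, 1]
```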
“Generally, when we speak of normalization, we scale and shift the original data,” Tadishetti said. “The [following formula] zero-centers the data distribution and narrows the data range.”

x̂ = (x − E(x)) / √Var(x)

Here E() means the expected value or mean of the data, while Var() is the variance.

“However, these kinds of transformations might misrepresent the original data distributions,” he explained. “Therefore, after these operations we do:”

y = γ · x̂ + β
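In code, that pair of operations is only a few lines. The following is a minimal NumPy sketch, with gamma and beta treated as fixed arguments rather than the learned parameters they would be inside a real layer:

```python
import numpy as np

def normalize(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Zero-center and rescale x, then apply the scale/shift step."""
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)   # zero mean, unit variance
    return gamma * x_hat + beta                        # scale and shift

data = np.array([3.0, 7.0, 8.0, 12.0, 30.0])
print(normalize(data))    # roughly zero-centered, with a narrowed range
```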
Don’t look back
It is possible to reverse any value from the new normalized number space to the original, but for typical utilization this isn’t necessary. Inference doesn’t calculate a numeric value that must be true to the original number space. Instead, all the calculations move to establish the statistical likelihood of a particular result.
That probability isn’t tied to the original number space, so one can repeatedly reset the space and move forward without looking back at the prior spaces. “Once you normalize, that’s it. It’s normalized,” said Drego. “You don’t move back into a de-normalized regime.”
Steve Roddy, chief marketing officer of Quadric, agreed. “You go through multiple normalizations,” he added. “Where you started is long since gone.”
Normalization also doesn’t have to be linear. “Sometimes you might take the square root of the numbers, or something like that,” said Klein. “The big numbers would be more impacted than the small numbers because those big numbers are going to have a larger effect downstream through the calculations.”
Pathak provided an example. “While not very common, nonlinear normalization is necessary when there are very complex data distribution relationships, such as when using log scales or power laws,” he said. “A great example is in Ansys simulations, where stress values (order of 10⁻¹² to 10¹²) need to be mapped to displacement values (0 to 1). Regardless of the network used, learning is impossible due to crossing the boundaries of single or double precision. In such cases, nonlinear scaling — such as applying a log scale to stress values — is necessary.”
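A hedged sketch of that kind of nonlinear step (not Ansys code, just an illustration of the log-scale idea): values spanning many orders of magnitude are compressed with a logarithm before a conventional min-max rescale into [0, 1].

```python
import numpy as np

# Hypothetical stress-like values spanning 24 orders of magnitude
stress = np.array([1e-12, 1e-6, 1.0, 1e6, 1e12])

# Log scale first, so the huge dynamic range becomes a modest linear one
log_stress = np.log10(stress)            # [-12, -6, 0, 6, 12]

# Then a plain min-max normalization into [0, 1]
normalized = (log_stress - log_stress.min()) / (log_stress.max() - log_stress.min())
print(normalized)                        # [0, 0.25, 0.5, 0.75, 1]
```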
Generalizing data sets
Most normalization becomes an integral part of a model or algorithm. In this guise, it’s the domain of the data scientists developing that model. Someone implementing the model in a data center, for example, may be unaware of any normalization. Data centers typically employ hardware that can run inference on a model at full 32-bit floating-point (FP32) resolution, so users have no need to monkey with the number system.
Data scientists have at least two reasons for adding it to the model itself: data regularization and desaturating numbers. But depending on the goal, different types of normalization may be applied to achieve the desired effects. “There are four variations of normalization — batch, layer, instance, and group normalization,” said Sharad Chole, chief scientist and co-founder at Expedera.
Regularization ensures that a given model isn’t overly biased toward one sample or class of samples. “It means that you’re not just learning part of the data set,” said Chole. “You’re trying to learn across the data set.”
This has been a common requirement for convolutional neural networks (CNNs) implementing vision applications. Training proceeds over a huge number of image samples. For any given sample, the pixels will have some range of values, typically in three different channels for red, green, and blue.
However, that range will be different for each sample. If the number range is set on the basis of a single sample, then future samples may have values that fall outside that range. The idea of batch normalization is to establish a number scale that encompasses all the samples in the training set.
“During training, different batches might have different images,” Drego explained. “I will get different results based on the subset of images that I have in one batch versus another, so I need to normalize for this.”
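A minimal PyTorch sketch of batch normalization as it might appear after a convolution (the layer sizes and value ranges are invented for illustration): the statistics are computed per channel across the whole batch, so no single unusually bright or dark image sets the scale by itself.

```python
import torch
import torch.nn as nn

# A toy batch of 8 RGB images, 32x32 pixels, with very different per-channel ranges
batch = torch.randn(8, 3, 32, 32) * torch.tensor([1.0, 10.0, 100.0]).view(1, 3, 1, 1)

bn = nn.BatchNorm2d(num_features=3)   # one gamma/beta pair per channel
out = bn(batch)

# Per-channel statistics are computed over the entire batch
print(out.mean(dim=(0, 2, 3)))        # close to zero for each channel
print(out.var(dim=(0, 2, 3)))         # close to one for each channel
```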
In other situations, values at the extremes might be treated as outliers, anomalies that are better off omitted. That’s not the case with training samples, however. Excluding outliers from the datasets can hurt inference accuracy.
“There are often outliers that need to be preserved through normalization until you get to the next convolution, or the next matrix multiplication in the case of transformers,” said Drego. “Outliers capture important information about what you’re trying to infer. There has been some correlation between outliers and tokens that are not actually words — things like commas and periods and tokens that don’t necessarily map to words or phonemes.”
There is no one “right” batch normalization. A different set of samples might result in its own different normalized number space. There are theoretical max and min values — all bits 1 or 0 — but that space contains a huge number of unrealistic values. Instead, batch normalization attempts to utilize a range that’s efficiently occupied, with no huge unpopulated spaces at either end, while balancing it around the origin.
Fighting activation saturation
Another reason for normalizing is that, for some training sequences on certain types of models, activation values may start to collect at one extreme or the other. “As the algorithm moves on, it will steer off to the right or left, and you’re guiding it back into this area where you can do the calculations and continue on,” Klein said.
The effects of this can be significant. “As you increase the layers, your gradient is pushing your weights to extremes, and as they’re pushing the weights to extremes, the activations that result get very close to the numerical maximum or minimum,” Chole said. “And as the activations start going to the numerical maximum or minimum, the activation functions cannot really get anything out of them, and when they don’t get anything, that means the error propagation doesn’t happen. When the gradient starts vanishing, you cannot train any further.”
Including outliers is important here, as well, but now the values at the extreme aren’t outliers — they may be most of the values. In such a situation, the number space is very poorly used, with huge empty spaces and a lot of values clustered near the maximum (positive or negative). Compared to the dynamic range of the space, those values occupy a narrow sliver, and if they’re truly saturating, then values that ought to be different end up very close, or even max out at the same value.
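A small NumPy sketch of why that clustering matters (an illustration, not taken from any of the quoted sources): large pre-activations pushed through a saturating function such as tanh become nearly indistinguishable, while normalizing them first keeps them spread out.

```python
import numpy as np

# Pre-activation values that have drifted toward the extreme
pre_act = np.array([40.0, 45.0, 55.0, 60.0])

print(np.tanh(pre_act))     # all roughly 1.0: distinct inputs collapse together

# Normalize first: zero mean, unit variance
normed = (pre_act - pre_act.mean()) / pre_act.std()
print(np.tanh(normed))      # values stay distinct, so gradients survive
```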
Fig. 2: If the values calculated in a layer start to cluster near one end of the number space, normalizing can spread them out more evenly in a new number space. The next layer will employ that new number space. Such layer normalization can occur multiple times during inference. Source: Bryon Moyer/Semiconductor Engineering
That situation starts to eliminate the gradient that is so important to getting a model to converge during training. Unchecked, training may fail. In this situation, data scientists may introduce a normalization step periodically throughout the model. There are multiple ways to implement it, but most common today is layer normalization, where an entire layer is normalized after it has executed. This has proven particularly useful for attention-based algorithms such as large language models (LLMs).
“Layer normalization is trying to normalize all the dimensionality of the tensor into a single mean and single variance,” said Chole. “This is typically applicable for tasks that are recurrent because you don’t really know what you’re going to generate in future, and you might not even know the entire context that you are processing.”
“Transformers got their start in natural language processing, and there it makes more sense to process over, say, a sentence or a word, even to normalize over a smaller group,” said Drego. “It’s operating on features instead of blocks of data, batches of data, and so it gives you the flexibility to either do it over a word or part of a sentence or a full sentence. With batch normalization, you don’t get that.”
Batch or layer normalization can occur frequently in a model. “If you look at CNNs, a batch norm is usually done after every convolution,” explained Drego. “With transformers, layer norm is typically done in between each attention block.”
“This idea of trying different normalizations has been around, but most people don’t like to use it unless it’s necessary,” said Drego. “And people have found that layer normalization is necessary for attention-based networks.”
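A hedged PyTorch sketch of layer normalization as it is typically used around an attention block (the embedding size and sequence length are invented for illustration): each token’s feature vector is normalized on its own, so no batch statistics are needed.

```python
import torch
import torch.nn as nn

embed_dim = 64                      # hypothetical embedding size
ln = nn.LayerNorm(embed_dim)        # gamma/beta learned per feature

# One sequence of 10 tokens, each a 64-dimensional embedding
tokens = torch.randn(1, 10, embed_dim) * 5.0 + 3.0

out = ln(tokens)

# Each token is normalized independently across its features
print(out.mean(dim=-1))             # roughly 0 for every token
print(out.std(dim=-1))              # roughly 1 for every token
```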
More specialized normalization types
Less frequently employed options include group and instance normalization. Group normalization may group channels, for example, when normalizing. “Group normalization is used in CNNs and image worlds, alongside instance normalization, to handle different groups’ channel inputs,” said Pathak. “It is useful when batch normalization is less effective, particularly in CNNs where local and group learning are more relevant than global learning. This is common in applications such as object detection and medical imaging where there may be inconsistencies across groups of images coming from different sources.”
Instance normalization applies different normalizations for different samples. “Instance normalization focuses on handling specific channels in CNN inputs,” said Pathak. “It is more localized and spatial than group normalization. Typical use cases include image-to-image translation, super-resolution, and style transfer, where the focus is on single-channel inputs.”
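For completeness, a short PyTorch sketch of the group and instance variants (layer sizes are illustrative assumptions): group normalization pools statistics over groups of channels within each sample, while instance normalization computes them per channel, per sample.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16, 28, 28)                      # 4 samples, 16 channels

gn = nn.GroupNorm(num_groups=4, num_channels=16)    # stats over 4-channel groups
inorm = nn.InstanceNorm2d(num_features=16)          # stats per channel, per sample

print(gn(x).shape, inorm(x).shape)   # shapes unchanged; only the statistics differ
```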
All these types of normalization are typically done by data scientists, and the effects of the normalization are baked into the model. Batch normalization occurs during training, but when employed, it sets the number range for the complete model. Other types of normalization are performed during inference, but they are built into the kernels that execute each layer, so the fact that normalization is happening is invisible to the users.
Normalizing during inference rather than training
When implementing such a model in an embedded system of some sort, however, it’s unlikely that the platform running inference will employ FP32. The hardware would be too expensive and consume too much energy. In this form, normalization resembles quantization and may be necessary to squeeze a data range into a more manageable space.
“One of the places where we worry about normalization in the implementation is where we’re trying to use reduced precision on our multipliers and our accumulate units,” Klein said. “We might have a fixed-point multiplier that’s got 5 integer bits and 3 fractional bits, so we can store numbers from minus 16 to positive 16. If we have a layer that gives us numbers that scale from 128 to 256, we either have to increase the size of our data representation, and thus the size of our [hardware] operators, or we can scale all the numbers back into the range that hardware can handle. If we just do a linear transformation and go from 128 to 256 and scale it down to minus 16 to positive 16, you’d want to keep the absolute relationship between all those numbers so that you don’t break the algorithm.”
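A hedged sketch of the kind of rescaling Klein describes (illustrative numbers only, not Siemens code): a layer whose outputs span roughly 128 to 256 is linearly remapped into a range a small fixed-point format can hold, preserving the relative spacing of the values.

```python
import numpy as np

def rescale(values, new_min=-16.0, new_max=16.0):
    """Linearly map values into [new_min, new_max], preserving relative spacing."""
    old_min, old_max = values.min(), values.max()
    scale = (new_max - new_min) / (old_max - old_min)
    return (values - old_min) * scale + new_min

layer_out = np.array([128.0, 160.0, 192.0, 256.0])   # hypothetical layer outputs
print(rescale(layer_out))                             # [-16, -8, 0, 16]
```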
Although this scenario is unusual, the designer implementing the model in the system would then need to apply the normalization, then retrain and validate that the model still executes correctly and with sufficient accuracy. “There’s a cycle back through the machine-learning framework to extract the greatest value out of that normalization,” said Klein.
But unlike the normalizations introduced by data scientists, those added for implementation will be valid only for the one specific application. “This is going to be per neural network and data set,” cautioned Klein. “It’s going to be unique to the specific implementation. It really isn’t generalizable.”
The common AI frameworks can deal with normalization. “For frameworks such as PyTorch, we need not mention the gamma and beta explicitly when we declare a layer and use it,” said Tadishetti. “It does a default initialization (gamma to 1 and beta to 0). As part of training, when we do back-propagation, these variables are learned while minimizing the loss function. If we want a custom initialization of these values, there are hooks to change them.”
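In PyTorch terms, that looks roughly like the following minimal sketch: the layer’s weight corresponds to gamma and its bias to beta, initialized to 1 and 0 by default, and they can be overwritten before training if a custom starting point is wanted.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(8)

print(ln.weight)    # gamma, initialized to ones
print(ln.bias)      # beta, initialized to zeros

# Custom initialization, if desired, before training begins
with torch.no_grad():
    ln.weight.fill_(0.5)
    ln.bias.fill_(0.1)
```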
Proceed carefully
Others counsel caution, as well. “Applying normalization only during inference can be risky,” said Pathak. “A good rule is to use the same normalization during inference that was applied during training, always in consultation with the data scientists who built the model or those that can provide guidance. Most popular edge frameworks, such as TensorFlow Lite and PyTorch Lightning/Mobile, support normalization during inference.”
Normalization’s utility for a quantized model is also debated. Quantization may be sufficient without normalization given hardware implementing typical data types such as 8-bit integers. “Quantization by default maps floating-point real-value data to a finite integer dataset,” noted Tadishetti. “It’s a linear projection from continuous dataset to discrete dataset. Quantization involves scaling the floating-point data, offsetting a fixed delta, and clipping between limits. So having a normalization step post-quantization generally does not add much value since data distribution itself is finite and discrete.”
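The quantization step Tadishetti describes can be sketched in a few lines (a generic affine-quantization illustration, not any particular framework’s implementation): scale, shift by a fixed offset, round, and clip into the 8-bit integer range.

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    """Affine quantization: scale, offset by a fixed delta, then clip to int8 limits."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

activations = np.array([-1.5, -0.2, 0.0, 0.7, 3.2])
print(quantize_int8(activations, scale=0.025, zero_point=0))   # [-60, -8, 0, 28, 127]
```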
Work presumably will continue, given the pressure to deploy attention-based networks in smaller edge equipment. “There’s a lot of work going on to try to get, not just the best performance, because performance is one piece of it, but the highest accuracy with the lowest number of bits,” said Jason Lawley, product marketing director for AI IP at Cadence.
Who needs to know about normalization?
If you’re not a data scientist and you’re running your inference on CPUs or GPUs in the data center, you have no need to mess with normalization yourself. The data scientists who created the model will have already worked any normalizations into the algorithm, and you can simply run your model.
If you’re going to run the model on dedicated embedded hardware, however, accuracy may benefit from introducing normalization alongside quantization — particularly when reducing precision and using less standard data types. If normalization is added, retraining afterward is important to recapture accuracy, just as it is with quantization.
If someone implementing a model finds that certain layers would benefit from normalization, it may be necessary to work with the data scientists who created the model. This works best when the algorithm is owned entirely by one company, so that the data scientists and implementers can iterate easily if necessary.
Any reluctance to use normalization is fading. “Normalization is so general that it’s hard to think of situations where some form of it wouldn’t be beneficial,” said Pathak. “All models benefit from using one type of normalization or another.”
Normalization is thus an important tool to understand, even though implementing it is most commonly left to those training and refining a model that others will execute. Still, awareness of it matters if developing a reduced version of a model results in poor data behavior.
Related Reading
New AI Data Types Emerge
Several PPA considerations mean no single type of data is ideal for all AI models.