Neural Network Model Quantization On Mobile

Reducing the precision of weights, biases, and activations to enable real-time edge inference.


In general terms, quantization is the process of mapping values from a continuous, infinite set to a smaller set of discrete, finite values. In this blog, we talk about quantization in the context of neural network (NN) models, as the process of reducing the precision of the weights, biases, and activations. Moving from floating-point representations to low-precision fixed-point integer values can substantially reduce the memory footprint and latency. This is crucial for deploying models on mobile devices and edge platforms, where runtime computational resources are restricted. Quantization has also gained importance with the latest developments in generative models and Large Language Models (LLMs), and the need to bring them to the mobile space.

This blog intends to provide a picture of the current state of quantization on mobile (Android) and the opportunities it opens to bring inference of complex NN models to the edge. The first section provides an overview of existing quantization methods and classifications. The second section discusses and compares the two main quantization approaches in TensorFlow Lite (TFLite): Post-Training Quantization (PTQ) and Quantization Aware Training (QAT). Due to the increasing importance of LLMs and generative models, the last section is devoted to some of the challenges of transformer models, where mixed-precision quantization is the preferred approach.

Bigger is not always better

During the last decade we have witnessed a notable improvement in the accuracy of NNs for a wide range of use-cases. At the same time, we have seen a significant increase in the size of the models. LLMs are a vivid example of this trend. For example, each new version of GPT has increased the number of parameters by roughly 10-100 times, reaching a reported, though not officially confirmed, ~1.7 trillion in GPT-4.

Over-parameterization makes it difficult for memory- and computationally constrained devices to execute NN models with acceptable performance and power consumption. This creates a barrier for any deep learning (DL) application that requires real-time inference with low energy consumption and high accuracy in embedded and mobile devices. Such applications cover a wide range of use-cases, such as speech recognition, healthcare monitoring, and video conferencing.

This is why existing quantization techniques are so important for mobile and embedded devices. These techniques are implemented to reduce the memory footprint and the power consumption, and to improve latency, without substantially affecting the accuracy. Achieving this requires a different way of approaching the design, training, and deployment of NN models. Bringing large models, particularly generative models and LLMs, to the mobile space would mitigate current security and privacy problems associated with using these models, and reduce the data bandwidth consumed between edge and server.

The figure below illustrates different types of approaches for quantization.

Fig. 1: Different approaches for quantization of neural network models.

As we can see in the first column of the figure, quantization approaches can in general be classified depending on how the real values in the continuous domain R are mapped into discrete, lower-precision values in the quantized domain Q (for example, uniform or non-uniform).

Looking at the NN specifically, the second column lists different approaches depending on the level of quantization granularity. An important feature to consider here is how the clipping range of the weights is calculated: over all the weights in a layer, over a group of multiple channels inside a layer, with a fixed value for each channel, or over any group of parameters in a layer.
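
To make the effect of granularity concrete, here is a minimal sketch in plain Python that computes a symmetric 8-bit clipping range once for a whole layer and once per channel. The weight values and the helper name symmetric_scale are hypothetical and chosen only for illustration; real frameworks compute these ranges internally.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the clipping range [-max|v|, +max|v|] onto signed integers."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

# Hypothetical weights of a layer with two output channels.
weights = [
    [0.02, -0.05, 0.01],   # channel 0: small values
    [1.50, -2.54, 0.75],   # channel 1: large values
]

# Layer-wise granularity: one clipping range shared by all channels.
per_tensor_scale = symmetric_scale([w for channel in weights for w in channel])

# Channel-wise granularity: one clipping range per channel.
per_channel_scales = [symmetric_scale(channel) for channel in weights]
```

With a single layer-wise range, the large values of channel 1 dictate the step size, so the small weights of channel 0 end up on only a handful of integer levels. With channel-wise ranges, channel 0 gets its own, much finer step.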

In terms of training, the approaches are classified according to the third column in Fig. 1. In this blog we focus only on these training-related approaches, and look at them in more detail in the section on Quantization Aware Training vs. Post-Training Quantization.

Finally, the last column lists several sub-INT8 quantization approaches, from fake and integer-only quantization to extreme binary quantization, which limits the numeric representation to a single bit. We might investigate these approaches in other blogs. Meanwhile, I would recommend reading this excellent survey paper on quantization.

Quantization aware training vs. post-training quantization

As we have seen, quantization is a technique that reduces the precision of the numerical data used in NN models: weights, biases, and activations. By using lower-precision formats, such as 8-bit integers instead of 32-bit floats, quantization can reduce the model size, memory footprint, and computation time, which are relevant factors to consider when deploying models on mobile devices and edge platforms. Quantization can also improve energy efficiency and battery life on mobile devices, as well as enable hardware acceleration on specialized processors that support low-precision arithmetic.

The previous section shows different approaches to quantize neural network models. Depending on when and how quantization is applied in connection to training, we can classify quantization approaches into two categories. Quantization Aware Training (QAT) performs quantization by retraining the model, while Post-Training Quantization (PTQ) applies quantization without retraining the model.

Figure 2 shows a schematic comparison of these two approaches.

Fig. 2: Schematic comparison of PTQ (left) and QAT (right) training approaches.

Post-training quantization

PTQ is a quantization technique that can reduce model memory footprint while also improving inference latency on the CPU and hardware accelerators, without additional training. Nevertheless, the model degradation in terms of accuracy can be substantial in some cases. This type of quantization is applied when an already-trained float TF model is converted to TFLite format using the TensorFlow Lite Converter.

The following table compares available options in TF to apply post-training integer quantization. An additional option for Float16 quantization is not considered here. The size reduction in the middle column is relative to Float32 model representation.

Table 1: Different options in TensorFlow for PTQ (source here).

Dynamic range quantization

Dynamic range quantization has the advantage that it reduces the memory footprint and improves performance without the need for a representative dataset for calibration, so it is the recommended starting point. This option dynamically quantizes activations to 8 bits based on their range and performs computations with 8-bit weights and activations. As a result, it provides latencies close to full integer inference, but as the outputs are still stored in FP32, the speedup is smaller.

The following code snippet shows how to invoke PTQ with the TFLite converter.

import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

DEFAULT is the only optimization option available, as the others have been deprecated. Used on its own, as above, it quantizes only the weights. It is also possible to quantize variable data such as the model input/output and the intermediates between layers, but for this we need to provide a generator function known as a RepresentativeDataset. It provides a set of input data large enough to represent typical values, enabling the converter to estimate a dynamic range for all the variable data.

import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# representative_data_gen is a generator that yields lists of sample input arrays
converter.representative_dataset = representative_data_gen
tflite_model_quant = converter.convert()

After this, all weights and variable data are quantized, and the model is significantly smaller compared to the original TensorFlow Lite model.
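
As a purely conceptual illustration of what calibration with a representative dataset achieves, the sketch below tracks the minimum and maximum values observed for one tensor over a few sample batches and derives an 8-bit scale from them. This is not the converter's actual implementation, which is more elaborate; the function names and sample values are hypothetical.

```python
def calibrate_range(representative_batches):
    """Track the min/max seen over all calibration batches of one tensor."""
    observed_min, observed_max = float("inf"), float("-inf")
    for batch in representative_batches:
        observed_min = min(observed_min, min(batch))
        observed_max = max(observed_max, max(batch))
    return observed_min, observed_max

def activation_batches():
    """Hypothetical activation values recorded for three calibration inputs."""
    yield [0.1, 0.8, -0.3]
    yield [1.9, 0.0, -0.7]
    yield [0.5, 1.2, -1.1]

rmin, rmax = calibrate_range(activation_batches())
# Step size when the observed range is spread over 256 integer levels.
scale = (rmax - rmin) / 255
```

The more representative the samples are of real inputs, the closer the estimated range is to what the model sees at inference, which is why the dataset needs to cover typical values.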

Full integer quantization

The integer quantization approach converts FP32 float numbers (weights and activation outputs) to the nearest INT8 fixed-point numbers. As table 1 shows, this provides a reduction of the model size similar to dynamic range quantization, but it is possible to achieve higher speed up during inference.
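
The mapping from FP32 to INT8 can be sketched with the usual affine scheme, where a real value is approximated as scale * (q - zero_point). The following plain-Python example is a minimal illustration under that assumption; the function names and the example range [-1.0, 3.0] are hypothetical and not part of any TensorFlow API.

```python
def quant_params(rmin, rmax, qmin=-128, qmax=127):
    """Scale and zero point mapping the real range [rmin, rmax] onto [qmin, qmax]."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)   # the range must contain 0.0
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(qmin - rmin / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Round to the nearest integer level and clip to the INT8 range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Recover the approximate real value represented by the integer q."""
    return scale * (q - zero_point)

scale, zero_point = quant_params(-1.0, 3.0)
values = [-1.0, 0.0, 0.4, 3.0]
quantized = [quantize(v, scale, zero_point) for v in values]
recovered = [dequantize(q, scale, zero_point) for q in quantized]
```

Each recovered value differs from the original by at most half a quantization step (scale / 2) as long as the input lies inside the clipping range. This rounding error is the source of the accuracy loss discussed in this blog.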

This is the quantization approach to use for integer-only accelerators such as the Edge TPU.

The code snippet listed above, which sets a representative dataset, quantizes the weights and the variable data to integers, but the resulting model is not yet compatible with devices that perform only integer-based operations, as the TFLite Converter leaves the model input and output tensors in float format. To ensure an end-to-end integer-only model, we need some additional tweaking.

Let us convert the model again but this time with some different parameters, by adding a couple of new lines to the former code snippet after setting the representative data set:

converter.representative_dataset = representative_data_gen
# Throws an error if the converter can’t quantize an operation
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set input and output tensors to UINT8
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model_quant = converter.convert()

Now we have a fully integer quantized model that uses integer data for the model’s input and output tensors, so this model is compatible with integer-only hardware.

You can follow here a full tutorial on Google Colab on post-training integer quantization.

Quantization Aware Training (QAT)

QAT integrates the quantization operation into the training or fine-tuning process. This technique emulates inference-time quantization: during the forward pass, weights and activations are rounded to the low-bit representation they will have at inference (8-bit instead of 32-bit float), so the model learns parameters that are robust to quantization. This leads to better accuracy at the cost of additional training time and computation. This differs from PTQ, which converts a pre-trained model to a lower-precision integer format after training is complete. After QAT the model should be converted to TFLite format using the TFLite converter.
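
The emulation of inference-time quantization is often described as "fake quantization": a value is quantized and immediately dequantized, so it stays a float but only takes values on the integer grid. The sketch below shows the idea in plain Python with a symmetric scale; the weights, the clipping range, and the function name are hypothetical, and the gradient handling used during real training (typically a straight-through estimator) is not shown.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize then immediately dequantize: a float restricted to the INT8 grid."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

# Hypothetical float weights and clipping range [-1.27, 1.27].
weights = [0.731, -0.252, 0.049]
scale = 1.27 / 127

# Values the forward pass would see during quantization aware training.
simulated = [fake_quantize(w, scale) for w in weights]
errors = [abs(s - w) for s, w in zip(simulated, weights)]
```

Because the loss is computed on the simulated values, training can compensate for the rounding errors instead of discovering them only after conversion.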

Once the model has been made quantization aware, it is recommended to fine-tune it on the same dataset used for pre-training, or on a representative subset of it.

As with PTQ, QAT brings clear benefits in terms of model compression, reducing the model size by 4x (32-bit/8-bit). QAT additionally provides better accuracy at the expense of additional training. In terms of inference performance, we see a 1.5 – 4x improvement in the CPU backend. Currently, the GPU backend only supports floating-point operations. When TFLite runs quantized models on the GPU, in practice the GPU executes a kind of floating-point interpretation of the original model. This means that weights and biases are dequantized once in GPU memory, while inputs and outputs are dequantized and quantized again on each inference, among other operations to simulate quantized behavior.

TF and TFLite report very limited impact of QAT on accuracy. For example, for image classification with tools (see here), quantized versions of MobilenetV1, MobilenetV2 and Resnet V1 have much less than 1% difference in accuracy compared to non-quantized versions.

In TFLite, QAT is part of the model optimization package tfmot, so you need to install it:

! pip install -q tensorflow
! pip install -q tensorflow-model-optimization

Let us assume we already have a pretrained model with pretrained weights “pretrained_weights”.

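import tensorflow as tf
import numpy as np
import tensorflow_model_optimization as tfmot

# setup_model() builds the float tf.keras model (defined elsewhere)
base_model = setup_model()
# Load the pretrained weights; pretrained_weights is assumed to be the path to the weights file
base_model.load_weights(pretrained_weights)
# Wrap the model with the default quantization aware implementation
quant_aware_model = tfmot.quantization.keras.quantize_model(base_model)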

We set up our model and load the pretrained weights. The next step is to invoke the Keras API quantize_model function. This function quantizes a tf.keras model with the default quantization implementation. It builds a model which emulates quantization during training. In this way the model learns parameters which are robust to quantization loss, and its accuracy during training reflects that of a quantized model. The function doesn’t modify the weights of the original base model. At the end you can call quant_aware_model.summary() to get a table that displays the layers of the model, the output shape of each layer, and the number of parameters in each layer.

You can read here a comprehensive guide to QAT in Keras and run the shared code in Google Colab. If you run the example provided here in Google Colab, you will see that there is practically no accuracy loss when comparing the quantized model with the base model after using QAT. In principle, the accuracy achieved could be even better, because fine-tuning lets the model train for more epochs, which can lead to better accuracy than the original float model.

A known issue with current QAT implementations is that accuracy achieved during the training is not the same as accuracy for the converted model, because during conversion more optimizations are applied.

An alternative approach to QAT is Knowledge Distillation. The full-precision model can be considered as the teacher and the lower-precision model as the student. The low-precision model can be then optimized with distillation loss. For distillation we usually don’t need to use the original dataset.
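
One common formulation of the distillation loss is the KL divergence between the temperature-softened output distributions of teacher and student, scaled by the square of the temperature. The sketch below implements that formulation in plain Python; the logits and the temperature are hypothetical values, and other variants of the loss exist.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# Hypothetical logits: full-precision teacher and two low-precision students.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. Since it depends only on the outputs of the two models, labels are not required to compute it.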

QAT and PTQ can be combined with other model optimization techniques like weight pruning and clustering. Arm has actively contributed to TensorFlow’s Model Optimization Toolkit on this subject. This blog post from Arm introduces collaborative techniques for ML model optimization for edge devices. The main idea of this approach is to apply the different optimizations one after another while maintaining the balance between compression and accuracy required for deployment.

Fig. 3: PTQ and QAT decision tree.

Now that we’ve explained the two most-used approaches for quantization, PTQ and QAT, we can consider the decision tree above, which shows when to use which approach. As you can see from the picture, the best option is always to start with PTQ, as it is the fastest and easiest approach. It is a quick way of grasping the impact of quantization on the model’s memory consumption and performance. If we are happy with the accuracy achieved after quantization, we can stop here and use the quantized model as it is. If the accuracy achieved is worse than needed, then we go for QAT. In this case we need to have access to the dataset and enough computing resources, or the budget to access them, for example on AWS. After doing QAT, the resulting accuracy might still be well below that of the original model. One of the things we should check here is whether there are layers that are very sensitive to quantization. In this case, we should try excluding some of the layers from the quantization process (see here) and check the impact. If accuracy improves, we are heading in the right direction and should experiment with other sensitive layers.
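
The decision process described in this section can also be written down as a small function. This is only a sketch of the reasoning in the text; the function name, arguments, and returned strings are hypothetical.

```python
def next_step(ptq_ok, can_retrain, qat_ok=None):
    """Return the next action suggested by the PTQ/QAT decision process.

    ptq_ok:      accuracy after PTQ is good enough
    can_retrain: dataset and compute resources are available for QAT
    qat_ok:      accuracy after QAT is good enough (None if QAT not run yet)
    """
    if ptq_ok:
        return "use the PTQ model"
    if not can_retrain:
        return "PTQ is the only option"
    if qat_ok is None:
        return "run QAT"
    if qat_ok:
        return "use the QAT model"
    return "exclude sensitive layers from quantization and retry"

# Example: PTQ accuracy was too low, and QAT did not fully recover it.
action = next_step(ptq_ok=False, can_retrain=True, qat_ok=False)
```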

Quantization of transformer models: Mixed-precision quantization

Over the past few years we have witnessed major advances in AI technologies, such as GPT and other well-known LLMs. This has been possible thanks to the increased computing power available and the innovative architecture of transformer models. This powerful combination has enabled AI models to scale and find applications in a greater number of complex problems than ever before.

Well-known LLM models are based on a transformer architecture. A distinctive feature of these models is their size, given by the large number of parameters used, as can be seen in the table below.

Table 2: Number of parameters in some LLMs.

If we consider that these models use parameters with FP32 datatype, we can do some simple math to calculate approximately the memory footprint.

If 1 FP32 parameter -> 4 bytes, then 1B parameters -> 4 x 10^9 bytes = 4GB

As we can see, each billion parameters translates to 4GB of memory footprint, placing LLMs a long way from the memory range handled by embedded and mobile devices. It means that any effort to bring LLMs to the mobile space should consider quantization. For example, quantizing the smallest LLaMA version, with 7B parameters, to INT8 reduces its memory footprint by four times, from 28GB to 7GB, which is a figure in the range of mobile RAM.
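
The same arithmetic can be repeated for other data types with a few lines of Python. This is a back-of-the-envelope sketch that counts only the storage of the parameters (1 GB taken as 10^9 bytes) and ignores activations and runtime overhead; the names used are hypothetical.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def footprint_gb(num_params, dtype):
    """Approximate parameter storage in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

llama_7b = 7e9   # 7B parameters
sizes = {dtype: footprint_gb(llama_7b, dtype) for dtype in BYTES_PER_PARAM}
```

For the 7B model this gives 28GB in FP32, 14GB in FP16, 7GB in INT8, and 3.5GB in INT4.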

Of the two common quantization approaches described above, PTQ is the technique used for quantizing transformer models. Although it does not always provide the best accuracy, it is usually quite cheap to implement, as no additional training is needed. This is especially important for transformer models, where the number of parameters is in the order of billions, making training very expensive. Additionally, QAT doesn’t have full layer support for transformer models. For example, there is no QAT implementation of the MultiHeadAttention layer.

More recently, we started to see the use of more aggressive 4-bit PTQ in the effort to make more compact LLMs capable of running on mobile devices. This paper, for example, investigates the feasibility of using INT4 quantization for LLMs, and shows that using INT4 introduces no or negligible accuracy degradation for encoder-only and encoder-decoder models, while causing a significant accuracy drop for decoder-only models.

An additional factor to consider when selecting the optimal optimization strategy is the hardware kernel support. For example, GPU matrix multiplication is not supported for certain combinations of data types and as a result CPU fallback takes place, negatively impacting the performance.

Another interesting fact has been detected in this work when applying PTQ to transformer models. The impact of quantization on accuracy can be very different if quantization is extended to activations as well. The authors demonstrate for several models that while weight quantization incurs almost no error on its own, most degradation is due to activation quantization.

As the number of parameters grows, another important issue needs to be considered. This paper shows that, once the number of parameters exceeds 2.72B, regular 8-bit quantization fails to match the accuracy of the 16-bit float baseline. The paper attributes this to the emergence of extreme outliers in the feature dimensions of the hidden states during inference.

The most straightforward approach to resolve this challenge is to use different precisions for different parts of the computation. The authors implemented a mixed-precision quantization approach that performs 16-bit matrix multiplication for the outlier feature dimensions and 8-bit matrix multiplication for the other 99.9% of the dimensions. With this new approach they can perform inference in LLMs with up to 175B parameters without any degradation in predictive performance.
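
The idea behind this decomposition can be sketched on a single dot product in plain Python. In the example below, feature dimensions whose magnitude exceeds a threshold are kept in floating point (standing in for the 16-bit path), while the remaining dimensions go through symmetric INT8 quantization. The input values, the threshold of 6.0, and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
def absmax_quantize(values):
    """Symmetric INT8 quantization; returns integer codes and the scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale

def int8_dot(x, w):
    """Dot product with both operands quantized to INT8, then dequantized."""
    qx, sx = absmax_quantize(x)
    qw, sw = absmax_quantize(w)
    return sum(a * b for a, b in zip(qx, qw)) * sx * sw

def mixed_precision_dot(x, w, threshold=6.0):
    """Outlier feature dimensions stay in float; the rest goes through INT8."""
    outliers = [i for i, v in enumerate(x) if abs(v) > threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    high = sum(x[i] * w[i] for i in outliers)
    low = int8_dot([x[i] for i in regular], [w[i] for i in regular])
    return high + low

# Hypothetical hidden state with one extreme outlier (60.0) and a weight column.
x = [0.3, -0.2, 0.1, 60.0, 0.25, -0.15]
w = [0.5, -0.4, 0.3, 0.01, 0.2, -0.6]

exact = sum(a * b for a, b in zip(x, w))
err_int8 = abs(int8_dot(x, w) - exact)
err_mixed = abs(mixed_precision_dot(x, w) - exact)
```

With plain INT8, the outlier stretches the clipping range so much that most of the other features round to zero or one, and the error is large. Handling the outlier dimension separately lets the remaining values use the full 8-bit range, which brings the error down by orders of magnitude in this example.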


I hope that at this point the reader has an idea of the existing quantization approaches, and in particular the two main quantization techniques, PTQ and QAT, and how to use them on mobile within the TFLite framework. Having a clear concept of how they work will help us decide when to use them. For example, if we don’t have access to the training dataset, then our only option will be PTQ. It is also a fast solution. In case we have access to the training dataset and compute resources, then QAT is the option to choose, as it can provide virtually no accuracy loss. Nevertheless, this is not a guarantee: there are some models with layers that are very sensitive to quantization where QAT does not help.

At this point the reader will also have an idea of the challenges mobile developers face when quantizing transformer models, as an unavoidable step to deploying these models on mobile. In this case, mixed-precision quantization appears as the best approach. Nevertheless, we expect more developments in this field as major efforts continue to bring LLMs and generative models to the edge.

Finally, I would recommend to the reader interested in quantization to have a look at other blogs on this topic released in Arm Community. For example, this blog explains the use of quantization and other techniques to optimize a model for fast inference on Ethos-U microNPU. Another blog shows how to use TFLite to train and quantize a simple recurrent NN compatible with Arm’s embedded NPUs.
