Quantization for Deep Learning Models

Categories: Computers, Machine Learning

Author: Eliott Kalfon

Published: July 31, 2025

In the last five years, Deep Learning models have grown larger and larger. With numbers like the 175 billion parameters of GPT-3.5, it has become difficult to make sense of their size.

Quantization is a promising method to reduce their memory footprint and make Machine Learning models runnable on mobile phones and IoT devices. This would bring intelligence closer to the end user.

Now, how can quantization be used to reduce the size of Deep Learning models?

This post builds on the short introduction to quantization I wrote two weeks ago. If you haven’t done so already, I recommend reading it.

As a review, quantization is the process of mapping a large or continuous set of values to a smaller set of values. Common examples of quantization include image and audio compression, described in my last article. It is a lossy compression method, meaning that information is lost in the compression process.

From Media Files to Model Files

What do Deep Learning models look like? And how are they represented in memory?

Deep Learning models, also referred to as Neural Networks, are computational graphs inspired by the human brain. This article requires a working understanding of this concept, so neural networks are briefly described below.

Neural Networks as computational graphs 1

Each neuron computes a weighted sum of its inputs, passes it through an activation function, and sends the result to the next layer. In the learning process, these models adapt their weights to minimize prediction error over the training data.

Neural Networks are trained with an algorithm called backpropagation. This algorithm first generates predictions on training observations; this is called the forward pass. Model predictions will have a certain error, the difference between prediction and the actual label. Model weights are then updated to minimize the model’s prediction error. This second phase is called backpropagation, as the error gradients are passed back through the network.

In summary, the model generates predictions (the forward pass), then updates its weights to minimize its prediction error (backpropagation). These two steps are repeated many times; each pass over the training data is known as an “epoch”.

If this is too much jargon, just remember one thing: a Deep Learning model is a large collection of decimal numbers, tuned during the learning process. These decimal numbers, like \(0.3435\) or \(1.234\), are represented in memory as floating point numbers.

Floating Points in Memory

Let’s start with the following example: \(101.23\).

In scientific notation, this number can be written as:

\[ 101.23 = (-1)^0 \cdot 1.0123 \cdot 10^2 \]

When storing a floating point number in memory, this value is broken down into three parts:

  • Sign: Indicates whether the number is positive or negative (1 bit)
  • Significand (or Mantissa): Stores the significant digits of the number (for example, the “1.0123” part, typically using 23 bits in single-precision)
  • Exponent: Represents the power of ten (or two, in binary systems) that the significand is multiplied by (typically 8 bits in single-precision)

So, for our example \(101.23\), the computer would store:

  • The sign (0 for positive)
  • The significand (the digits “1.0123” in binary form)
  • The exponent (the “2” in \(10^2\))

This breakdown allows computers to represent a wide range of decimal values, but it also requires more bits than storing a simple integer. That’s why floating point numbers usually need 32 or 64 bits, compared to just 8 or 16 bits for integers.
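To make this breakdown concrete, here is a minimal Python sketch (my own illustration, not from any particular library) that pulls the three bit fields out of a 32-bit float using the standard struct module:

```python
import struct

def float32_parts(x):
    """Return the (sign, exponent, mantissa) bit fields of x as an IEEE 754 float32."""
    [bits] = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF      # 23 bits, the fractional part of the significand
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_parts(101.23)
# Reconstruct the value: (-1)^sign * (1 + mantissa/2^23) * 2^(exponent - 127)
value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
print(sign, exponent - 127)  # sign 0, unbiased exponent 6 (101.23 ≈ 1.58 · 2^6)
print(value)                 # ≈ 101.23, up to float32 precision
```

Note that in binary the significand is a base-2 fraction, so the stored bits do not literally spell out “1.0123”; the decomposition is the same idea, just in base 2.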

Encoding all 175 billion parameters of GPT-3.5 as Single Precision 32-bit Floating Points would require:

\[ 175 \cdot 10^9 \cdot 32 = 5.6 \cdot 10^{12} \text{ bits} \]

That is a lot of bits: 5.6 trillion of them, or roughly 700 gigabytes. How can quantization be used to reduce the size of this model?

Lossy Compression and Model Performance

Before moving forward, it is important to remember that in lossy compression, nothing comes for free. This is exactly what makes it interesting. When quantizing images, file size reduction comes at the expense of image quality. When quantizing audio files, white noise appears as the number of bits used to encode amplitude decreases.

In the case of Machine Learning models, quantization comes at the expense of model performance. The goal of Deep Learning model quantization is to reduce model size with only a minimal loss of performance.

From Floating Points to Integers

You are given the following weights: [-0.56372453, 1.39399302, 0.55357852, 0.54269608, -1.39940624, 1.44327844, -0.48863919, 0.22549443, -0.90172897, 0.46398788]

How could you map these to 4-bit integers? As a reminder, a signed 4-bit integer can represent \(2^4 = 16\) distinct values: \(-8, -7, \ldots, 0, \ldots, 6, 7\)

To do so, you can compute a scaling factor with the following formula:

\[ \text{scale} = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}} \]

With:

  • \(x_{\max}\) and \(x_{\min}\) the maximum and minimum of the numbers to quantize; here \(1.5\) and \(-1.5\)
  • \(q_{\max}\) and \(q_{\min}\) the maximum and minimum quantized values; here \(7\) and \(-8\)

In this case, the scale would be:

\[ \text{scale} = \frac{1.5 - (-1.5)}{7 - (-8)} = \frac{3}{15} = 0.2 \]

Weights can now be quantized using the formula:

\[ q = \text{round}\left(\frac{x}{\text{scale}}\right) \]

With:

  • \(x\) the number to be quantized
  • \(q\) the quantized number

Do you notice the \(\text{round}()\) operator? This is where information gets lost. This rounding is what makes the quantization process irreversible; it is an example of lossy compression.

Numbers can be dequantized by inverting the process:

\[ x_q = q \cdot \text{scale} \]

With:

  • \(x_q\) the dequantized number
  • \(q\) the quantized number

Let’s quantize the weight 1.39399302 as an example. Using the scale of 0.2 computed above, we would get:

\[ q = \text{round}\left(\frac{x}{\text{scale}}\right) = \text{round}\left(\frac{1.39399302}{0.2}\right) = \text{round}(6.97) = 7 \]

To dequantize this number, we invert the process:

\[ x_q = q \cdot \text{scale} = 7 \cdot 0.2 = 1.4 \]

This generates a quantization error of \(\approx 0.006\), the difference between the original value (\(1.39399302\)) and the dequantized value (\(1.4\)).
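The whole quantize/dequantize round trip can be sketched in a few lines of NumPy, reusing the scale and ranges from the example above (the variable names are my own):

```python
import numpy as np

weights = np.array([-0.56372453, 1.39399302, 0.55357852, 0.54269608,
                    -1.39940624, 1.44327844, -0.48863919, 0.22549443,
                    -0.90172897, 0.46398788])

# The scale maps the float range [-1.5, 1.5] onto the signed 4-bit range [-8, 7]
x_min, x_max = -1.5, 1.5
q_min, q_max = -8, 7
scale = (x_max - x_min) / (q_max - q_min)  # 0.2

q = np.round(weights / scale).astype(np.int8)  # quantize (lossy: rounding)
x_dq = q * scale                               # dequantize

print(q[1], x_dq[1])  # 7 and ≈ 1.4, as in the worked example
```

Storing the int8 array (or, packed tightly, 4 bits per value) instead of float64 is where the memory saving comes from; only the scale needs to be kept alongside it to dequantize.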

Applying Quantization to Neural Networks

Going back to Deep Learning models, parameters are stored as matrices of floating point numbers. To reduce their memory footprint, a naive approach is to apply the quantization process defined above to all of these parameters. This is called Post Training Quantization (PTQ). It has the advantage of being very simple, and it can be applied to any model regardless of how it was trained.

The issue with PTQ is that it introduces a discrepancy between how the model was trained and how it is evaluated. In training, predictions (the forward pass) are computed with floating-point numbers; after quantization, they are computed with quantized weights. One could therefore expect significant performance losses.

Post Training Quantization in Action

To make this more visual, I trained a simple Multi-Layer Perceptron (MLP) on the Fashion-MNIST dataset.

In the Fashion-MNIST dataset, each input image is \(28 \times 28\) pixels and represents an item of clothing. The goal of the classification model is to classify each item of clothing correctly. There are 10 classes:

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot

One example per class of the Fashion-MNIST dataset
Figure code
import torchvision
import matplotlib.pyplot as plt

# Load dataset
dataset = torchvision.datasets.FashionMNIST('./data', train=True, download=True)

# Class names
classes = dataset.classes

# Find one image per class
images = [None]*10
for img, label in dataset:
    if images[label] is None:
        images[label] = img
    if all(x is not None for x in images):
        break

# Plot
plt.figure(figsize=(10,2))
for i, (img, name) in enumerate(zip(images, classes)):
    plt.subplot(1, 10, i+1)
    plt.imshow(img, cmap='gray')
    plt.title(name, fontsize=8)
    plt.axis('off')
plt.tight_layout()
plt.show()

Python Implementation

If you are not interested in the code, you can directly jump to the results section.

I used a very basic architecture with two hidden layers:

def get_float_model(input_dim, num_classes, hidden_dims=[256, 128]):
   return nn.Sequential(
       nn.Linear(input_dim, hidden_dims[0]),
       nn.ReLU(),
       nn.Linear(hidden_dims[0], hidden_dims[1]),
       nn.ReLU(),
       nn.Linear(hidden_dims[1], num_classes)
   )

With the brevitas package, it is very straightforward to create a quantized model with a similar structure.

class BrevitasMLP(nn.Module):
   def __init__(self, input_dim, num_classes, hidden_dims=[256, 128], bit_width=4):
       super().__init__()
       self.quant_in = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
       layers = []
       prev_dim = input_dim
       for h in hidden_dims:
           layers.append(QuantLinear(
               prev_dim, h,
               weight_bit_width=bit_width,
               bias=True,
               return_quant_tensor=True))
           layers.append(QuantReLU(bit_width=bit_width, return_quant_tensor=True))
           prev_dim = h
       layers.append(QuantLinear(
           prev_dim, num_classes,
           weight_bit_width=bit_width,
           bias=True,
           return_quant_tensor=False))
       self.net = nn.Sequential(*layers)

   def forward(self, x):
       x = self.quant_in(x)
       x = self.net(x)
       return x

It replaces the linear layers with QuantLinear layers. There, you can see the bit_width parameter used to quantize the weights.

The MLP model, once trained, can be quantized using the following function:

def ptq_brevitas(float_model, input_dim, num_classes, hidden_dims, bit_width, device):
   # Create quantized model
   quant_model = BrevitasMLP(input_dim, num_classes, hidden_dims, bit_width=bit_width).to(device)
   # Copy float weights to quantized model
   float_layers = [m for m in float_model.modules() if isinstance(m, nn.Linear)]
   quant_layers = [m for m in quant_model.modules() if isinstance(m, QuantLinear)]
   for fl, ql in zip(float_layers, quant_layers):
       ql.weight.data = fl.weight.data.clone()
       if fl.bias is not None:
           ql.bias.data = fl.bias.data.clone()
   return quant_model

This function copies the weights of the trained MLP onto the weights of a newly created quantized model.

Models can then be evaluated using a standard evaluation function:

def evaluate(model, test_loader, device):
   model.eval()
   correct = 0
   total = 0
   with torch.no_grad():
       for inputs, targets in test_loader:
           inputs, targets = inputs.to(device), targets.to(device)
           outputs = model(inputs)
           _, predicted = outputs.max(1)
           total += targets.size(0)
           correct += predicted.eq(targets).sum().item()
   return correct / total

Results

As expected, there is a sharp performance drop as the bit-width used to store model parameters decreases:

32-bit Floating Point vs PTQ Accuracy

The trained model represented with floating points has an accuracy well above 80%, meaning it correctly classifies more than 80% of the test data. This accuracy drops to 60% after quantizing weights to an 8-bit representation. Is there a better way?

Quantization Aware Training

The main issue with PTQ is that it introduces a discrepancy between training and evaluation. A model is optimised for floating point inference. Yet, when it comes to inference, model weights are quantized.

Could this quantization process be simulated at training time? The title of the section may have been a bit of a hint. The short answer is yes.

Intuition

Quantization Aware Training (QAT) works by using quantization during the training process. The forward pass is done with quantized weights. On the other hand, backpropagation is done on the underlying weights.

In other words, the network will generate predictions using quantized weights. The loss will be calculated with these predictions. Backpropagation will be done on the underlying (floating point) weights.

Quantization Aware Training in Action

Let’s make this clearer with a simple example: Quantization Aware Training on a model with a single weight and a single training example:

  • Input: \(x = 1.0\)
  • Target: \(y = 1.7\)
  • Model: \(\hat{y} = x \cdot w\)
  • Initial weight: \(w_0 = 1.0\)
  • Quantization 4 bits, with scale \(0.2\)
    • Int4 range: \(-8, -7, ..., 7\)
    • Quantized values: \(-1.6, -1.4, ..., 1.4\)
  • Loss: \(L = (\hat{y} - y)^2\)
  • Loss Gradient: \(\frac{\partial L}{\partial w} = 2x(\hat{y} - y)\)
  • Learning rate: \(\alpha = 0.1\)

Epoch 1

  1. Quantize weight: \(w_q = \text{round}(1.0/0.2) = 5\)
  2. Dequantize weight: \(w_{dq} = 5 \cdot 0.2 = 1.0\)

Steps one and two are used to simulate quantization loss. Here, there is no information loss as the dequantized weight is the same as the underlying weight.

  3. Forward pass: \(\hat{y} = x \cdot w_{dq} = 1.0 \cdot 1.0 = 1.0\)
  4. Loss: \((\hat{y} - y)^2 = (1 - 1.7)^2 = (-0.7)^2 = 0.49\)
  5. Gradient: \(2x(\hat{y} - y) = 2 \cdot 1.0 \cdot (-0.7) = -1.4\)
  6. Backpropagation: \(w_1 = w_0 - \alpha \cdot \frac{\partial L}{\partial w} = 1.0 - 0.1 \cdot (-1.4) = 1.0 + 0.14 = 1.14\)

Epoch 2

  1. Quantize weight: \(w_q = \text{round}(1.14/0.2) = \text{round}(5.7) = 6\)
  2. Dequantize weight: \(w_{dq} = 6 \cdot 0.2 = 1.2\)

Note that quantization and dequantization resulted in a quantization error of \(0.06\), the difference between the underlying weight and the dequantized weight.

  3. Forward pass: \(\hat{y} = x \cdot w_{dq} = 1.0 \cdot 1.2 = 1.2\)
  4. Loss: \((1.2 - 1.7)^2 = (-0.5)^2 = 0.25\)
  5. Gradient: \(2 \cdot 1.0 \cdot (1.2 - 1.7) = 2 \cdot (-0.5) = -1.0\)
  6. Backpropagation: \(w_2 = 1.14 - 0.1 \cdot (-1.0) = 1.14 + 0.1 = 1.24\)

Epoch 3

  1. Quantize weight: \(w_q = \text{round}(1.24/0.2) = \text{round}(6.2) = 6\)
  2. Dequantize weight: \(w_{dq} = 6 \cdot 0.2 = 1.2\)
  3. Forward pass: \(\hat{y} = 1.0 \cdot 1.2 = 1.2\)
  4. Loss: \((1.2 - 1.7)^2 = 0.25\)
  5. Gradient: \(2 \cdot 1.0 \cdot (1.2 - 1.7) = -1.0\)
  6. Backpropagation: \(w_3 = 1.24 - 0.1 \cdot (-1.0) = 1.24 + 0.1 = 1.34\)

And so on…

Epoch   FP32 Weight   Dequantized Weight   Output   Loss
1       1.00          1.0                  1.0      0.49
2       1.14          1.2                  1.2      0.25
3       1.24          1.2                  1.2      0.25
4       1.34          1.4                  1.4      0.09
5       1.40          1.4                  1.4      0.09
6       1.46          1.4                  1.4      0.09
7       1.52          1.4                  1.4      0.09
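The epochs above can be reproduced with a few lines of plain Python; no autograd is needed for a single weight. The clamp to the int4 range \([-8, 7]\) is my own safeguard for when the scaled weight drifts past the representable range:

```python
x, y = 1.0, 1.7        # single training example
w = 1.0                # underlying FP32 weight
scale, lr = 0.2, 0.1   # quantization scale and learning rate

for epoch in range(1, 8):
    # Fake quantization: quantize, clamp to the int4 range, dequantize
    q = max(-8, min(7, round(w / scale)))
    w_dq = q * scale
    y_hat = x * w_dq                # forward pass uses the quantized weight
    loss = (y_hat - y) ** 2
    grad = 2 * x * (y_hat - y)      # gradient, computed on the quantized forward pass...
    print(epoch, round(w, 2), round(w_dq, 2), round(loss, 2))
    w -= lr * grad                  # ...but applied to the underlying FP32 weight
```

Each printed line matches a row of the table: the FP32 weight keeps accumulating small updates even while the quantized weight is stuck on a grid value, which is exactly what lets it eventually jump to the next grid point.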

Implementation

All of this overhead is handled seamlessly by brevitas. Once the BrevitasMLP class is created, it can be trained with the same training function as a regular PyTorch MLP.

def train(model, train_loader, criterion, optimizer, device, epochs=5):
   model.train()
   for epoch in range(epochs):
       running_loss = 0.0
       for inputs, targets in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
           inputs, targets = inputs.to(device), targets.to(device)
           optimizer.zero_grad()
           outputs = model(inputs)
           loss = criterion(outputs, targets)
           loss.backward()
           optimizer.step()
           running_loss += loss.item() * inputs.size(0)
       print(f"Epoch {epoch+1} loss: {running_loss / len(train_loader.dataset):.4f}")

The only architectural change is that the forward pass goes through the quantized linear layers; the same backpropagation algorithm is used.

Results

Quantization Aware Training results in a significant performance improvement over Post Training Quantization. While it still performs worse than the 32-bit floating point model, it maintains very reasonable performance across the different bit-widths.

All Models

When writing this article, I was impressed by the QAT 2-bit model, which maintained an accuracy above 80% despite only allowing 4 distinct values for each weight.

Visualizing Quantization

This bar chart shows the performance of the different quantization approaches. What if we could visualize the impact of quantization on model predictions? To do so, I generated a fictional dataset with two classes using the scikit-learn package:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import numpy as np

def get_2d_dataset(n_samples=1000, noise=0.25):
   X, y = make_moons(n_samples=n_samples, noise=noise, random_state=42)
   X = X.astype(np.float32)
   y = y.astype(np.int64)
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
   return X_train, X_test, y_train, y_test

Generated two-dimensional dataset
Figure code
import matplotlib.pyplot as plt
X_train, X_test, y_train, y_test = get_2d_dataset()
plt.figure(figsize=(6,4))
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap='coolwarm', s=10)
plt.title('Training Data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

Note: as this is a much simpler dataset, I significantly reduced the size of the hidden layers.

I then plotted the decision boundaries for these different models:

Decision Boundaries

As you may have noticed, the decision boundary becomes much coarser as the bit width decreases. The original decision surface is very smooth. The PTQ decision surface is messy; its coarseness does not look deliberate. The QAT model, on the other hand, looks optimized for this coarser boundary.

This is an interesting insight into the impact of quantization on model predictions.

Final Thoughts

This concludes our investigation into quantization for Deep Learning models. It explored how a model can be compressed from a 32-bit floating point representation to 4-bit integers, achieving a compression ratio of 8x!

It described the two main quantization approaches, Post Training Quantization (PTQ) and Quantization Aware Training (QAT). While PTQ is straightforward to implement, QAT delivers the strongest performance at inference.

Combined with low-bit hardware, quantization is a promising method to bring intelligence closer to devices and end-users.

PS: This exploration made me want to explore ML model compression further. Subscribe to my newsletter to receive my next posts on the topic!

Footnotes

  1. By Glosser.ca – Own work, derivative of File:Artificial neural network.svg, CC BY-SA 3.0.↩︎

Like what you read? Subscribe to my newsletter to hear about my latest posts!