As Large Language Models get larger and larger, their memory footprint has become a growing concern. Memory is not a problem for ChatGPT or other chat-based web apps that rely on a client-server architecture.
In this architecture, the server is a large cluster of computers (many computers acting as one) that receive requests and send responses. These clusters generally have large memory and compute resources.
Most of the internet works with the client-server architecture. Your web browser is the typical example of a client. When entering a URL like google.com or eliottkalfon.com, your browser will make a request to web servers, which will send code, formatting and media content in response.
Despite its success, the client-server paradigm has a few drawbacks:
- It requires a reliable connection between the client and server
- The exchanges between client and server can introduce latency (the time between request and response), which can be an issue in time-sensitive and safety-critical systems
- It involves the transfer of data over a network. Avoiding this data transfer could be more efficient and secure
With the development of mobile technologies and the Internet of Things, computation has been moving closer to the edge of the network; i.e., the last device such as a phone or a sensor.
Going back to Large Language Models, being able to run ChatGPT-like systems or Computer Vision models on smaller devices is one of the next technological frontiers. How could these models take up less space while still performing well?
Model quantization seems like a promising solution. This first article will walk you through the fundamental ideas of this memory-saving technique, looking first at numbers, images and audio.
The second article of this series will explore how the same logic can be applied to Deep Learning models.
Data Compression
Quantization is a data compression method. Data compression is the practice of encoding information using fewer bits than the original representation [2].
Storing Data with Bits
Any data can be stored with bits. To understand how this can happen, this article will first explore how numbers are represented in bits. If you are already familiar with integer and character representation in bits, you can jump to the quantization section directly.
Integers
Data on a computer is stored in bits, which can take the value 0 or 1. As an example, an integer between 0 and 255 can be stored using 8 bits.
| Bit Position | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| Power of Two | 2^7 | 2^6 | 2^5 | 2^4 | 2^3 | 2^2 | 2^1 | 2^0 |
| Decimal Value | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
| Example | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
For example, the binary number 00001001 can be converted to decimal as follows: 0×128 + 0×64 + 0×32 + 0×16 + 1×8 + 0×4 + 0×2 + 1×1 = 9
Here are some examples:
1: 00000001
2: 00000010
3: 00000011
4: 00000100
…
127: 01111111
128: 10000000
…
254: 11111110
255: 11111111
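These round trips between decimal and binary can be checked with Python's built-in `format` and `int` functions:

```python
# format(n, "08b") gives an 8-bit binary string; int(bits, 2) converts it back.
n = 9
bits = format(n, "08b")
print(bits)                # 00001001
print(int(bits, 2))        # 9
print(format(255, "08b"))  # 11111111
```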
Converting integers to binary
To convert an integer to binary, repeatedly divide it by 2, keeping track of the remainder at each step. The remainders, read from last to first, give the binary representation.
As an example, 10 can be converted to an 8-bit integer as follows:
- Bit at position 0: 10 ÷ 2 = 5, remainder 0
- Bit at position 1: 5 ÷ 2 = 2, remainder 1
- Bit at position 2: 2 ÷ 2 = 1, remainder 0
- Bit at position 3: 1 ÷ 2 = 0, remainder 1
Fill positions 4-7 with 0
We get: 00001010
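The repeated-division procedure can be sketched in Python (a minimal version, padding the result to 8 bits by default):

```python
def to_binary(n, width=8):
    """Convert a non-negative integer to binary by repeated division by 2."""
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient and remainder in one step
        bits.append(str(remainder))  # remainders collected from position 0 upward
    bits.extend("0" * (width - len(bits)))  # fill the remaining positions with 0
    return "".join(reversed(bits))          # read remainders last-to-first

print(to_binary(10))  # 00001010
```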
As a practice exercise, try converting 5 and 122 to binary.
Solution
For 5:
- 5 ÷ 2 = 2, remainder 1
- 2 ÷ 2 = 1, remainder 0
- 1 ÷ 2 = 0, remainder 1
Fill with zeros: 00000101
For 122:
- 122 ÷ 2 = 61, remainder 0
- 61 ÷ 2 = 30, remainder 1
- 30 ÷ 2 = 15, remainder 0
- 15 ÷ 2 = 7, remainder 1
- 7 ÷ 2 = 3, remainder 1
- 3 ÷ 2 = 1, remainder 1
- 1 ÷ 2 = 0, remainder 1
Fill with zeros: 01111010
Signed Integers
The above only refers to positive integers. However, many applications require using negative numbers, such as -3. How would you represent a signed integer using 8 bits?
The sign of a number can be either positive or negative (0 is an exception, refer to the note below for more on this). This means that we can store all of the information about the sign in a single bit, which is 0 if the number is positive and 1 if it is negative. By convention, this is the leftmost bit of a signed integer.
The consequence of using up one of the precious 8 bits is that only 7 bits remain to represent the magnitude of the number, leaving a magnitude range of 0 to 127 (2^7 − 1 = 127). An 8-bit sign-magnitude integer can therefore represent values from -127 to +127.
What about 0?
When dealing with signed integers, there exists both a positive and a negative 0, generally noted +0 and -0. Why is that the case?
The main reason is to maintain certain mathematical properties, such as 1 / +0 = +∞ and 1 / −0 = −∞ (a convention most visible in floating-point arithmetic).
However, most programming languages evaluate +0 == -0 as true.
Zero really is a fascinating number.
An alternative way to represent signed integers is to use what is called a bias. When using 8 bits, the bias is typically 127.
This makes sure that any number we want to store (as long as it is in the range -127 to 128) can be mapped to a 0-255 8-bit integer.
- Stored value: actual value + 127
- Actual value: stored value − 127
This means that -3 would be stored as: -3 + 127 = 124
As a practice exercise, convert 124 to binary.
- Position 0: 124 ÷ 2 = 62, remainder 0
- Position 1: 62 ÷ 2 = 31, remainder 0
- Position 2: 31 ÷ 2 = 15, remainder 1
- Position 3: 15 ÷ 2 = 7, remainder 1
- Position 4: 7 ÷ 2 = 3, remainder 1
- Position 5: 3 ÷ 2 = 1, remainder 1
- Position 6: 1 ÷ 2 = 0, remainder 1
- Filling position 7 with a 0
Result: 01111100
To recover the actual signed integer, we subtract 127 from the decoded number. Here: 124 − 127 = −3.
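The biased encoding and decoding described above can be sketched as follows (the helper names are illustrative):

```python
BIAS = 127  # the bias used for 8-bit storage in the text

def encode(value):
    """Map a signed integer in [-127, 128] to an unsigned 8-bit stored value."""
    stored = value + BIAS
    assert 0 <= stored <= 255  # must fit in 8 bits
    return stored

def decode(stored):
    """Recover the signed integer from its biased 8-bit representation."""
    return stored - BIAS

print(encode(-3))                 # 124
print(format(encode(-3), "08b"))  # 01111100
print(decode(0b01111100))         # -3
```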
Characters
Each character used to type this article is also encoded in bits. The first character encoding was the ASCII code, using 7 bits to encode the characters of the English language. Here are some examples:
| Character | ASCII Code | ASCII Bit Representation |
|---|---|---|
| A | 65 | 1000001 |
| a | 97 | 1100001 |
| B | 66 | 1000010 |
| b | 98 | 1100010 |
| Z | 90 | 1011010 |
| z | 122 | 1111010 |
| ! | 33 | 0100001 |
| @ | 64 | 1000000 |
| (space) | 32 | 0100000 |
| (DEL) | 127 | 1111111 |
As most files are encoded in bytes (sequences of 8 bits), ASCII characters are also represented with 8 bits, with the leftmost bit being set to 0 by convention.
Can you think of an issue with using ASCII? It only allows for 128 characters. This is sufficient for the English language only.
Most modern systems use Unicode, an extension of ASCII that includes characters from other languages as well as emojis. There are more than 1 million possible Unicode code points.
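Python exposes code points directly through `ord`, which makes it easy to reproduce the ASCII table above:

```python
# ord() gives the code point; format(..., "07b") gives the 7-bit ASCII pattern.
for ch in ["A", "a", "B", "z", "!"]:
    print(ch, ord(ch), format(ord(ch), "07b"))

# Unicode code points go far beyond ASCII's 7-bit range:
print(ord("€"))  # 8364
```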
Data Compression Basics
The measure of data compression is the compression ratio, calculated as follows:

Compression Ratio = Uncompressed Size / Compressed Size
As a simple example of file compression, let’s imagine I want to compress the string:
aaaabbbaababaa
This string has 14 characters, encoded with the ASCII system, with one byte per character. What is the size of this uncompressed string?
The string requires 14 bytes, or 14 × 8 = 112 bits.
How would you compress this string?
This string only contains two characters. We could decide to encode each a as 0 and each b as 1. Each character could then be represented as a single bit.
What is the compression ratio achieved by this method?
Solution
The compressed string would require 14 bits (1 bit per character).
Compression ratio: 112 / 14 = 8
This is a very high compression ratio, made possible by the fact that the original representation of the data was very wasteful.
This is only part of the picture though, as we may want to include a dictionary with the encoded string.
If we store the dictionary (e.g., a=0, b=1) as 2 bytes, the total compressed size is 14 + 16 = 30 bits.
New compression ratio: 112 / 30 ≈ 3.73
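The sizes and ratios above can be verified with a few lines of Python (assuming, as above, a 2-byte dictionary):

```python
text = "aaaabbbaababaa"
uncompressed_bits = len(text) * 8  # one ASCII byte per character
compressed_bits = len(text) * 1    # one bit per character (a -> 0, b -> 1)
dictionary_bits = 2 * 8            # the assumed 2-byte dictionary

print(uncompressed_bits)                                                  # 112
print(uncompressed_bits / compressed_bits)                                # 8.0
print(round(uncompressed_bits / (compressed_bits + dictionary_bits), 2))  # 3.73
```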
There are two types of data compressions:
- Lossy compression: allows some data loss, but generally allows larger compression ratios
- Lossless compression: requires no data loss, and generally limits the compression ratios that can be achieved
What type of compression was used in compressing the string aaaabb…?
This is lossless compression as the original string can be perfectly reconstructed from the compressed string.
Something important to remember from lossless compression is that the lower the number of distinct symbols or values, the more efficient lossless compression will be. Had there been a c or a d in the string above, the encoding would have required more than a single bit per character.
Quantization is a lossy compression algorithm, which is what the rest of this article will focus on.
Quantization
Intuition
Quantization is the process of mapping a large set of values to a smaller set of discrete values. This definition might sound abstract at first, so let’s make it more concrete.
Imagine you want to record the temperature outside your house every hour for a year. Temperatures might range from -10°C to +40°C, with precise measurements like 23.7°C or 15.2°C.
If you record these temperatures with full precision (to the nearest 0.1°C), you'd need enough bits to represent 501 possible values (-10.0, -9.9, -9.8, … 39.8, 39.9, 40.0). This would require 9 bits per measurement, since 2^9 = 512 ≥ 501 while 2^8 = 256 is not enough.
However, your team realizes that for your application, knowing only the nearest whole degree (23°C instead of 23.7°C) would be sufficient. You decide to quantize the temperatures to the nearest integer, which means you only need to represent 51 possible values (-10, -9, -8, … 38, 39, 40), requiring just 6 bits per measurement (2^6 = 64 ≥ 51).
To test your understanding:
- What would be the compression ratio achieved?
- Is this lossy or lossless compression?
Answer
Compression ratio: 9 / 6 = 1.5
This is lossy compression: the original tenth-of-a-degree readings cannot be recovered from the rounded values.
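The bit counts and compression ratio can be computed from the number of distinct values (note that -10.0 to 40.0 in 0.1°C steps gives 501 values):

```python
import math

full_values = round((40.0 - (-10.0)) / 0.1) + 1  # 501 readings at 0.1 degree precision
quant_values = 40 - (-10) + 1                    # 51 whole-degree readings

# Minimum bits needed to distinguish n values: ceil(log2(n))
bits_full = math.ceil(math.log2(full_values))    # 9
bits_quant = math.ceil(math.log2(quant_values))  # 6
print(bits_full, bits_quant, bits_full / bits_quant)  # 9 6 1.5
```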
Applying Quantization to Images
The following image will be used as an example:

Images are typically stored as grids of pixels (picture elements). Each pixel is defined by three 8-bit integer values ranging from 0 to 255, with each value indicating the intensity of light emitted in one of three colours: Red, Green and Blue (RGB). Each of these is referred to as a colour channel.
The three colour channels of the picture above are shown in the following figure.

Figure code
```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

img = Image.open('duck.jpg').convert('RGB')
img_np = np.array(img)

fig, axs = plt.subplots(1, 4, figsize=(16, 5))
for i, color in enumerate(['Red', 'Green', 'Blue']):
    channel = np.zeros_like(img_np)
    channel[..., i] = img_np[..., i]
    axs[i].imshow(channel)
    axs[i].set_title(f'{color} channel', fontsize=16)
    axs[i].tick_params(axis='both', labelsize=14)
    axs[i].axis('off')
axs[3].imshow(img_np)
axs[3].set_title('Combined (RGB)', fontsize=16)
axs[3].tick_params(axis='both', labelsize=14)
axs[3].axis('off')
fig.suptitle('RGB Channels and Combined Image', fontsize=18)
plt.tight_layout()
plt.show()
```

Looking at the memory requirements, each pixel requires 3 × 8 = 24 bits.
What if I wanted to reduce its size so that my blog pages load quicker? I could try moving from an 8-bit representation to a 7-bit representation for each pixel value.
Some questions to think about before reading on:
- How would you encode and decode the image?
- What would be the size of the encoded image?
- What would be the compression ratio?
To store an 8-bit integer in 7 bits, I need to find a way to map values in the range 0 to 255 into the range 0 to 127. How could one achieve this?
The encoding process could simply be to divide each pixel value by two and store the integer part of this value. For example:
| Original Value | 8-bit | Encoded Value | 7-bit |
|---|---|---|---|
| 14 | 00001110 | 7 | 0000111 |
| 15 | 00001111 | 7 | 0000111 |
| 233 | 11101001 | 116 | 1110100 |
| 132 | 10000100 | 66 | 1000010 |
To decode the image, reverse the process and multiply all the encoded pixel values by two. This is an example of lossy conversion as 14 and 15 would be encoded as the same number, resulting in lost information.
Did you notice something with the division by 2?
In binary representation, a division by 2 is equivalent to a shift of the bits one step to the right.
Looking at the table above, dividing 14 and 15 by two is equivalent to shifting all of their bits one step to the right. The information encoded in the last bit is lost in the process.
This is typical of quantization, a process through which a large set of values is mapped into a smaller set of values.
The compression ratio for a single pixel would be: 24 / 21 = 8 / 7 ≈ 1.14
The size of the compressed image would be: 7/8 (87.5%) of the original size.
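The encode/decode round trip can be sketched with NumPy's bit-shift operators, using the pixel values from the table above:

```python
import numpy as np

pixels = np.array([14, 15, 233, 132], dtype=np.uint8)

encoded = pixels >> 1   # right shift = integer division by two (drops the last bit)
decoded = encoded << 1  # left shift = multiplication by two

print(encoded.tolist())  # [7, 7, 116, 66]
print(decoded.tolist())  # [14, 14, 232, 132]
```

Note that 14 and 15 both decode to 14: the information carried by the dropped bit is gone, which is exactly what makes this compression lossy.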
More importantly, what would the image look like?

Figure code
```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

img = Image.open('duck.jpg').convert('RGB')
img_np = np.array(img)

def quantise(img_array, bits):
    levels = 2 ** bits
    scale = 255 / (levels - 1)
    quantised = np.round(img_array / scale) * scale
    return quantised.astype(np.uint8)

img_7bit = quantise(img_np, 7)
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].imshow(img_np)
axs[0].set_title('Original (8-bit)', fontsize=18)
axs[0].tick_params(axis='both', labelsize=14)
axs[0].axis('off')
axs[1].imshow(img_7bit)
axs[1].set_title('Quantised (7-bit)', fontsize=18)
axs[1].tick_params(axis='both', labelsize=14)
axs[1].axis('off')
plt.tight_layout()
plt.show()
```

The difference is barely perceptible (if at all).
To achieve a higher compression ratio, you could also encode the pixel values using 6-bit integers, in the range 0-63.

Figure code
```python
img_6bit = quantise(img_np, 6)
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].imshow(img_np)
axs[0].set_title('Original (8-bit)', fontsize=18)
axs[0].tick_params(axis='both', labelsize=14)
axs[0].axis('off')
axs[1].imshow(img_6bit)
axs[1].set_title('Quantised (6-bit)', fontsize=18)
axs[1].tick_params(axis='both', labelsize=14)
axs[1].axis('off')
plt.tight_layout()
plt.show()
```

Still, there is no perceptible difference.
This process can be repeated with all bit numbers from 7 to 1:

Figure code
```python
fig, axs = plt.subplots(2, 4, figsize=(20, 15))
axs = axs.flatten()
bit_depths = [8, 7, 6, 5, 4, 3, 2, 1]
images = [img_np] + [quantise(img_np, b) for b in bit_depths[1:]]
for ax, img, bits in zip(axs, images, bit_depths):
    ax.imshow(img)
    cr = 8 / bits
    label = 'Original' if bits == 8 else 'Quantised'
    ax.set_title(f'{label} ({bits}-bit)\nCompression Ratio: {cr:.2f}', fontsize=20)
    ax.tick_params(axis='both', labelsize=14)
    ax.axis('off')
fig.suptitle('Image Quantization at Different Bit Depths', fontsize=30)
plt.tight_layout()
plt.savefig("compression_all_bits.png")
plt.show()
```

The picture starts to get groovy by the bottom row, when the compression ratio exceeds 2.00.
Please bear in mind that there are much smarter ways to compress images. The best example of this is the JPEG compression algorithm [4], which would deserve its own blog post.
Applying this to audio
How is audio stored on a computer? Audio is represented as a series of amplitudes over time. This amplitude is generally represented by an integer. At the time of writing, signed 16-bit integers are the most common choice, allowing a range from -32,768 to 32,767.

Figure code
```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

sample_rate, samples = wavfile.read("recording.wav")

plt.figure(figsize=(10, 4))
plt.plot(samples)
plt.title("Audio Waveform", fontsize=18)
plt.xlabel("Sample Index", fontsize=16)
plt.ylabel("Amplitude", fontsize=16)
plt.tick_params(axis='both', labelsize=14)
plt.tight_layout()
plt.show()
```

Similar to the image quantization example, where reducing the number of bits per pixel led to a loss in detail, audio can also be compressed by reducing its bit depth, at the cost of quality. The bit depth is the number of bits used to represent the amplitude of the sound at a given point in time.
The following is a short recording I made using Audacity. The original audio is in 32-bit integer format:
32-bit integers can be converted to 16-bit integers by dividing each sample by 2^16 = 65,536:
Compression can be pushed further by converting to 8-bit integers. This reduces the range of possible values even more and achieves a 4:1 compression ratio:
As with image quantization, reducing bit depth in audio introduces quantization error. The loss of precision leads to the increasing presence of white noise, especially in the 8-bit version. This noise results from discarded detail and would dominate the signal entirely if all useful information were lost.
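The bit-depth reductions described above can be sketched with NumPy; the sample values below are made up for illustration:

```python
import numpy as np

# Illustrative 32-bit signed samples (range roughly -2**31 to 2**31 - 1)
samples_32 = np.array([1_000_000_000, -500_000_000, 0, 123_456_789], dtype=np.int32)

# Dividing by 2**16 maps the 32-bit range onto the 16-bit range
samples_16 = (samples_32 // 2**16).astype(np.int16)

# Dividing by a further 2**8 maps the 16-bit range onto the 8-bit range
samples_8 = (samples_16 // 2**8).astype(np.int8)

print(samples_16.tolist())
print(samples_8.tolist())
```

Each division collapses many distinct 32-bit amplitudes onto the same coarser value, which is the quantization error heard as noise in the compressed recordings.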
Taking a step back
Data compression is the practice of encoding information using fewer bits than the original representation. Quantization is the process of mapping values from a large set to a smaller set. This process compresses the data. It is, however, a type of lossy compression as information is lost through the process. Different numbers can be encoded with the same values, leading to information loss.
The information around us is generally continuous. Storing it into a computer using bits represents the first step in the quantization process. Continuous values are captured into discrete observations, stored into bits. As an example, sound is converted from a wave travelling through air into a series of integers, stored as bits.
To compress data even further, one can reduce the memory size of the integers (e.g., from 32 to 16 bits) used to encode the data. This reduces the range of possible values these integers can take, hence reducing the amount of information that can be stored. This can be done with many types of data including audio and images.
In the next post, we will explore how Neural Network weights, represented as matrices of floating point numbers, can also be quantized. As with images and audio, this results in memory gains and some performance loss, a trade-off at the centre of current Machine Learning research.
Footnotes
1. Gnome-fs-client.svg: David Vignoni; Gnome-fs-server.svg: David Vignoni; derivative work: Calimo, LGPL, via Wikimedia Commons ↩︎
2. Mahdi, O.A.; Mohammed, M.A.; Mohamed, A.J. (November 2012). “Implementing a Novel Approach an Convert Audio Compression to Text Coding via Hybrid Technique” (PDF). International Journal of Computer Science Issues. 9 (6, No. 3): 53–59. ↩︎
3. By Ams100272 - Own work, CC BY-SA 4.0, Link ↩︎
4. See JPEG on Wikipedia. ↩︎