As Large Language Models get larger and larger, their memory footprint has become a growing concern. Memory is not a problem for ChatGPT or other chat-based web apps that rely on a client-server architecture.
In this architecture, the server is a large cluster of computers (many computers acting as one) that receive requests and send responses. These clusters generally have large memory and compute resources.
Most of the internet works with the client-server architecture. Your web browser is the typical example of a client. When entering a URL like google.com or eliottkalfon.com, your browser will make a request to web servers, which will send code, formatting and media content in response.
Despite its success, the client-server paradigm has a few drawbacks:
- It requires a reliable connection between the client and server
- The exchanges between client and server can introduce latency (the time between request and response), which can be an issue in time-sensitive and safety-critical systems
- It involves the transfer of data over a network. Avoiding this data transfer could be more efficient and secure
With the development of mobile technologies and the Internet of Things, computation has been moving closer to the edge of the network; i.e., the last device such as a phone or a sensor.
Going back to Large Language Models, being able to run ChatGPT-like systems or Computer Vision models on smaller devices is one of the next technological frontiers. How could these models take up less space while still performing well?
Model quantization seems like a promising solution. This first article will walk you through the fundamental ideas of this memory-saving technique, looking first at numbers, images and audio.
The second article of this series will explore how the same logic can be applied to Deep Learning models.
Data Compression
Quantization is a data compression method. Data compression is the practice of encoding information using fewer bits than the original representation [2].
Storing Data with Bits
Any data can be stored with bits. To understand how this can happen, this article will first explore how numbers are represented in bits. If you are already familiar with integer and character representation in bits, you can jump to the quantization section directly.
Integers
Data on a computer is stored in bits, which can take the value 0 or 1. As an example, an integer between 0 and 255 can be stored using 8 bits.
| Bit Position | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| Power of Two | 2^7 | 2^6 | 2^5 | 2^4 | 2^3 | 2^2 | 2^1 | 2^0 |
| Decimal Value | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
| Example | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
For example, the binary number 00001001 can be converted to decimal as follows: 0×128 + 0×64 + 0×32 + 0×16 + 1×8 + 0×4 + 0×2 + 1×1 = 9
Here are some examples:
1: 00000001
2: 00000010
3: 00000011
4: 00000100
…
127: 01111111
128: 10000000
…
254: 11111110
255: 11111111
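These round trips between decimal and binary can be checked with Python's built-in `format` and `int` functions:

```python
# format(n, "08b") gives an 8-bit binary string; int(bits, 2) converts it back.
n = 9
bits = format(n, "08b")
print(bits)                # 00001001
print(int(bits, 2))        # 9
print(format(255, "08b"))  # 11111111
```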
Converting integers to binary
To convert an integer to binary, repeatedly divide it by 2, keeping track of the remainder at each step. The remainders, read from last to first, give the binary representation.
As an example, 10 can be converted to an 8-bit integer as follows:
- Bit at position 0: 10 ÷ 2 = 5, remainder 0
- Bit at position 1: 5 ÷ 2 = 2, remainder 1
- Bit at position 2: 2 ÷ 2 = 1, remainder 0
- Bit at position 3: 1 ÷ 2 = 0, remainder 1
Fill positions 4-7 with 0
We get: 00001010
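The repeated-division procedure can be sketched in Python (a minimal version, padding the result to 8 bits by default):

```python
def to_binary(n, width=8):
    """Convert a non-negative integer to binary by repeated division by 2."""
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient and remainder in one step
        bits.append(str(remainder))  # remainders collected from position 0 upward
    bits.extend("0" * (width - len(bits)))  # fill the remaining positions with 0
    return "".join(reversed(bits))          # read remainders last-to-first

print(to_binary(10))  # 00001010
```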
As a practice exercise, try converting 5 and 122 to binary.
Solution
For 5:
- 5 ÷ 2 = 2, remainder 1
- 2 ÷ 2 = 1, remainder 0
- 1 ÷ 2 = 0, remainder 1
Fill with zeros: 00000101
For 122:
- 122 ÷ 2 = 61, remainder 0
- 61 ÷ 2 = 30, remainder 1
- 30 ÷ 2 = 15, remainder 0
- 15 ÷ 2 = 7, remainder 1
- 7 ÷ 2 = 3, remainder 1
- 3 ÷ 2 = 1, remainder 1
- 1 ÷ 2 = 0, remainder 1
Fill with zeros: 01111010
Signed Integers
The above only refers to positive integers. However, many applications require using negative numbers, such as -3. How would you represent a signed integer using 8 bits?
The sign of a number can be either positive or negative (0 is an exception, refer to the note below for more on this). This means that we can store all of the information about the sign in a single bit, which is 0 if the number is positive and 1 if it is negative. By convention, this is the leftmost bit of a signed integer.
The consequence of using up one of the precious 8 bits is that only 7 bits remain to represent the magnitude of the number, leaving a magnitude range of 0 to 127 (2^7 − 1 = 127). An 8-bit sign-magnitude integer can therefore represent values from -127 to +127.
What about 0?
When dealing with signed integers, there exists both a positive and a negative 0, generally noted +0 and -0. Why is that the case?
The main reason is to maintain certain mathematical properties, such as 1 / +0 = +∞ and 1 / −0 = −∞ (a convention most visible in floating-point arithmetic).
However, most programming languages evaluate +0 == -0 as true.
Zero really is a fascinating number.
An alternative way to represent signed integers is to use what is called a bias. When using 8 bits, the bias is typically 127.
This makes sure that any number we want to store (as long as it is in the range -127 to 128) can be mapped to a 0-255 8-bit integer.
- Stored value: actual value + 127
- Actual value: stored value − 127
This means that -3 would be stored as: -3 + 127 = 124
As a practice exercise, convert 124 to binary.
- Position 0: 124 ÷ 2 = 62, remainder 0
- Position 1: 62 ÷ 2 = 31, remainder 0
- Position 2: 31 ÷ 2 = 15, remainder 1
- Position 3: 15 ÷ 2 = 7, remainder 1
- Position 4: 7 ÷ 2 = 3, remainder 1
- Position 5: 3 ÷ 2 = 1, remainder 1
- Position 6: 1 ÷ 2 = 0, remainder 1
- Filling position 7 with a 0
Result: 01111100
To recover the actual signed integer, we subtract 127 from the decoded number. Here: 124 − 127 = −3.
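The biased encoding and decoding described above can be sketched as follows (the helper names are illustrative):

```python
BIAS = 127  # the bias used for 8-bit storage in the text

def encode(value):
    """Map a signed integer in [-127, 128] to an unsigned 8-bit stored value."""
    stored = value + BIAS
    assert 0 <= stored <= 255  # must fit in 8 bits
    return stored

def decode(stored):
    """Recover the signed integer from its biased 8-bit representation."""
    return stored - BIAS

print(encode(-3))                 # 124
print(format(encode(-3), "08b"))  # 01111100
print(decode(0b01111100))         # -3
```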
Characters
Each character used to type this article is also encoded in bits. The first character encoding was the ASCII code, using 7 bits to encode the characters of the English language. Here are some examples:
| Character | ASCII Code | ASCII Bit Representation |
|---|---|---|
| A | 65 | 1000001 |
| a | 97 | 1100001 |
| B | 66 | 1000010 |
| b | 98 | 1100010 |
| Z | 90 | 1011010 |
| z | 122 | 1111010 |
| ! | 33 | 0100001 |
| @ | 64 | 1000000 |
| (space) | 32 | 0100000 |
| (DEL) | 127 | 1111111 |
As most files are encoded in bytes (sequences of 8 bits), ASCII characters are also represented with 8 bits, with the leftmost bit being set to 0 by convention.
Can you think of an issue with using ASCII? It only allows for 128 characters. This is sufficient for the English language only.
Most modern systems use Unicode, an extension of ASCII that includes characters from other languages as well as emojis. There are more than 1 million possible Unicode code points.
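Python exposes code points directly through `ord`, which makes it easy to reproduce the ASCII table above:

```python
# ord() gives the code point; format(..., "07b") gives the 7-bit ASCII pattern.
for ch in ["A", "a", "B", "z", "!"]:
    print(ch, ord(ch), format(ord(ch), "07b"))

# Unicode code points go far beyond ASCII's 7-bit range:
print(ord("€"))  # 8364
```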
Data Compression Basics
The measure of data compression is the compression ratio, calculated as follows:

Compression Ratio = Uncompressed Size / Compressed Size
As a simple example of file compression, let’s imagine I want to compress the string:
aaaabbbaababaa
This string has 14 characters, encoded with the ASCII system, with one byte per character. What is the size of this uncompressed string?
The string requires 14 bytes, or 14 × 8 = 112 bits.
How would you compress this string?
This string only contains two characters. We could decide to encode each a as 0 and each b as 1. Each character could then be represented as a single bit.
What is the compression ratio achieved by this method?
Solution
The compressed string would require 14 bits (1 bit per character).
Compression ratio: 112 / 14 = 8
This is a very high compression ratio, made possible by the fact that the original representation of the data was very wasteful.
This is only part of the picture though, as we may want to include a dictionary with the encoded string.
If we store the dictionary (e.g., a=0, b=1) as 2 bytes, the total compressed size is 14 + 16 = 30 bits.
New compression ratio: 112 / 30 ≈ 3.73
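The sizes and ratios above can be verified with a few lines of Python (assuming, as above, a 2-byte dictionary):

```python
text = "aaaabbbaababaa"
uncompressed_bits = len(text) * 8  # one ASCII byte per character
compressed_bits = len(text) * 1    # one bit per character (a -> 0, b -> 1)
dictionary_bits = 2 * 8            # the assumed 2-byte dictionary

print(uncompressed_bits)                                                  # 112
print(uncompressed_bits / compressed_bits)                                # 8.0
print(round(uncompressed_bits / (compressed_bits + dictionary_bits), 2))  # 3.73
```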
There are two types of data compressions:
- Lossy compression: allows some data loss, but generally allows larger compression ratios
- Lossless compression: requires no data loss, and generally limits the compression ratios that can be achieved
What type of compression was used in compressing the string aaaabb…?
This is lossless compression as the original string can be perfectly reconstructed from the compressed string.
Something important to remember from lossless compression is that the lower the number of distinct symbols or values, the more efficient lossless compression will be. Had there been a c or a d in the string above, the encoding would have required more than a single bit per character.
Quantization is a lossy compression algorithm, which is what the rest of this article will focus on.
Quantization
Intuition
Quantization is the process of mapping a large set of values to a smaller set of discrete values. This definition might sound abstract at first, so let’s make it more concrete.
Imagine you want to record the temperature outside your house every hour for a year. Temperatures might range from -10°C to +40°C, with precise measurements like 23.7°C or 15.2°C.
If you record these temperatures with full precision (to the nearest 0.1°C), you'd need enough bits to represent 501 possible values (-10.0, -9.9, -9.8, … 39.8, 39.9, 40.0). This would require 9 bits per measurement, since 2^9 = 512 ≥ 501 while 2^8 = 256 is not enough.
However, your team realizes that for your application, knowing only the nearest whole degree (23°C instead of 23.7°C) would be sufficient. You decide to quantize the temperatures to the nearest integer, which means you only need to represent 51 possible values (-10, -9, -8, … 38, 39, 40), requiring just 6 bits per measurement (2^6 = 64 ≥ 51).
To test your understanding:
- What would be the compression ratio achieved?
- Is this lossy or lossless compression?
Answer
Compression ratio: 9 / 6 = 1.5
This is lossy compression: the original tenth-of-a-degree readings cannot be recovered from the rounded values.
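The bit counts and compression ratio can be computed from the number of distinct values (note that -10.0 to 40.0 in 0.1°C steps gives 501 values):

```python
import math

full_values = round((40.0 - (-10.0)) / 0.1) + 1  # 501 readings at 0.1 degree precision
quant_values = 40 - (-10) + 1                    # 51 whole-degree readings

# Minimum bits needed to distinguish n values: ceil(log2(n))
bits_full = math.ceil(math.log2(full_values))    # 9
bits_quant = math.ceil(math.log2(quant_values))  # 6
print(bits_full, bits_quant, bits_full / bits_quant)  # 9 6 1.5
```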
Applying Quantization to Images
The following image will be used as an example:

Images are typically stored as grids of pixels (picture elements). Each pixel is defined by three 8-bit integer values ranging from 0 to 255, with each value indicating the intensity of light emitted in one of three colours: Red, Green and Blue (RGB). Each of these is referred to as a colour channel.
The three colour channels of the picture above are shown in the following figure.

Figure code
```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

img = Image.open('duck.jpg').convert('RGB')
img_np = np.array(img)

fig, axs = plt.subplots(1, 4, figsize=(16, 5))
for i, color in enumerate(['Red', 'Green', 'Blue']):
    channel = np.zeros_like(img_np)
    channel[..., i] = img_np[..., i]
    axs[i].imshow(channel)
    axs[i].set_title(f'{color} channel', fontsize=16)
    axs[i].tick_params(axis='both', labelsize=14)
    axs[i].axis('off')
axs[3].imshow(img_np)
axs[3].set_title('Combined (RGB)', fontsize=16)
axs[3].tick_params(axis='both', labelsize=14)
axs[3].axis('off')
fig.suptitle('RGB Channels and Combined Image', fontsize=18)
plt.tight_layout()
plt.show()
```

Looking at the memory requirements, each pixel requires 3 × 8 = 24 bits.
What if I wanted to reduce its size so that my blog pages load quicker? I could try moving from an 8-bit representation to a 7-bit representation for each pixel value.
Some questions to think about before reading on:
- How would you encode and decode the image?
- What would be the size of the encoded image?
- What would be the compression ratio?
To store an 8-bit integer in 7 bits, I need to find a way to map values in the range 0 to 255 into the range 0 to 127. How could one achieve this?
The encoding process could simply be to divide each pixel value by two and store the integer part of this value. For example:
| Original Value | 8-bit | Encoded Value | 7-bit |
|---|---|---|---|
| 14 | 00001110 | 7 | 0000111 |
| 15 | 00001111 | 7 | 0000111 |
| 233 | 11101001 | 116 | 1110100 |
| 132 | 10000100 | 66 | 1000010 |
To decode the image, reverse the process and multiply all the encoded pixel values by two. This is an example of lossy conversion as 14 and 15 would be encoded as the same number, resulting in lost information.
Did you notice something with the division by 2?
In binary representation, a division by 2 is equivalent to a shift of the bits one step to the right.
Looking at the table above, dividing 14 and 15 by two is equivalent to shifting all of their bits one step to the right. The information encoded in the last bit is lost in the process.
This is typical of quantization, a process through which a large set of values is mapped into a smaller set of values.
The compression ratio for a single pixel would be: 24 / 21 = 8 / 7 ≈ 1.14
The size of the compressed image would be: 7/8 (87.5%) of the original size.
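The encode/decode round trip can be sketched with NumPy's bit-shift operators, using the pixel values from the table above:

```python
import numpy as np

pixels = np.array([14, 15, 233, 132], dtype=np.uint8)

encoded = pixels >> 1   # right shift = integer division by two (drops the last bit)
decoded = encoded << 1  # left shift = multiplication by two

print(encoded.tolist())  # [7, 7, 116, 66]
print(decoded.tolist())  # [14, 14, 232, 132]
```

Note that 14 and 15 both decode to 14: the information carried by the dropped bit is gone, which is exactly what makes this compression lossy.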
More importantly, what would the image look like?

Figure code
```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

img = Image.open('duck.jpg').convert('RGB')
img_np = np.array(img)

def quantise(img_array, bits):
    levels = 2 ** bits
    scale = 255 / (levels - 1)
    quantised = np.round(img_array / scale) * scale
    return quantised.astype(np.uint8)

img_7bit = quantise(img_np, 7)
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].imshow(img_np)
axs[0].set_title('Original (8-bit)', fontsize=18)
axs[0].tick_params(axis='both', labelsize=14)
axs[0].axis('off')
axs[1].imshow(img_7bit)
axs[1].set_title('Quantised (7-bit)', fontsize=18)
axs[1].tick_params(axis='both', labelsize=14)
axs[1].axis('off')
plt.tight_layout()
plt.show()
```

The difference is barely perceptible (if at all).
To achieve a higher compression ratio, you could also encode the pixel values using 6-bit integers, in the range 0-63.

Figure code
```python
img_6bit = quantise(img_np, 6)
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].imshow(img_np)
axs[0].set_title('Original (8-bit)', fontsize=18)
axs[0].tick_params(axis='both', labelsize=14)
axs[0].axis('off')
axs[1].imshow(img_6bit)
axs[1].set_title('Quantised (6-bit)', fontsize=18)
axs[1].tick_params(axis='both', labelsize=14)
axs[1].axis('off')
plt.tight_layout()
plt.show()
```

Still, there is no perceptible difference.
This process can be repeated with all bit numbers from 7 to 1:

Figure code
```python
fig, axs = plt.subplots(2, 4, figsize=(20, 15))
axs = axs.flatten()
bit_depths = [8, 7, 6, 5, 4, 3, 2, 1]
images = [img_np] + [quantise(img_np, b) for b in bit_depths[1:]]
for ax, img, bits in zip(axs, images, bit_depths):
    ax.imshow(img)
    cr = 8 / bits
    label = 'Original' if bits == 8 else 'Quantised'
    ax.set_title(f'{label} ({bits}-bit)\nCompression Ratio: {cr:.2f}', fontsize=20)
    ax.tick_params(axis='both', labelsize=14)
    ax.axis('off')
fig.suptitle('Image Quantization at Different Bit Depths', fontsize=30)
plt.tight_layout()
plt.savefig("compression_all_bits.png")
plt.show()
```

The picture starts to get groovy by the bottom row, when the compression ratio exceeds 2.00.
Please bear in mind that there are much smarter ways to compress images. The best example of this is the JPEG compression algorithm [4], which would deserve its own blog post.
Applying this to audio
How is audio stored on a computer? Audio is represented as a series of amplitudes over time. This amplitude is generally represented by an integer. At the time of writing, signed 16-bit integers are the most common choice, allowing a range from -32,768 to 32,767.

Figure code
```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

sample_rate, samples = wavfile.read("recording.wav")

plt.figure(figsize=(10, 4))
plt.plot(samples)
plt.title("Audio Waveform", fontsize=18)
plt.xlabel("Sample Index", fontsize=16)
plt.ylabel("Amplitude", fontsize=16)
plt.tick_params(axis='both', labelsize=14)
plt.tight_layout()
plt.show()
```

Similar to the image quantization example, where reducing the number of bits per pixel led to a loss in detail, audio can also be compressed by reducing its bit depth, at the cost of quality. The bit depth is the number of bits used to represent the amplitude of the sound at a given point in time.
The following is a short recording I made using Audacity. The original audio is in 32-bit integer format:
32-bit integers can be converted to 16-bit integers by dividing each sample by 2^16 = 65,536:
Compression can be pushed further by converting to 8-bit integers. This reduces the range of possible values even more and achieves a 4:1 compression ratio:
As with image quantization, reducing bit depth in audio introduces quantization error. The loss of precision leads to the increasing presence of white noise, especially in the 8-bit version. This noise results from discarded detail and would dominate the signal entirely if all useful information were lost.
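The bit-depth reductions described above can be sketched with NumPy; the sample values below are made up for illustration:

```python
import numpy as np

# Illustrative 32-bit signed samples (range roughly -2**31 to 2**31 - 1)
samples_32 = np.array([1_000_000_000, -500_000_000, 0, 123_456_789], dtype=np.int32)

# Dividing by 2**16 maps the 32-bit range onto the 16-bit range
samples_16 = (samples_32 // 2**16).astype(np.int16)

# Dividing by a further 2**8 maps the 16-bit range onto the 8-bit range
samples_8 = (samples_16 // 2**8).astype(np.int8)

print(samples_16.tolist())
print(samples_8.tolist())
```

Each division collapses many distinct 32-bit amplitudes onto the same coarser value, which is the quantization error heard as noise in the compressed recordings.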
Taking a step back
Data compression is the practice of encoding information using fewer bits than the original representation. Quantization is the process of mapping values from a large set to a smaller set. This process compresses the data. It is, however, a type of lossy compression as information is lost through the process. Different numbers can be encoded with the same values, leading to information loss.
The information around us is generally continuous. Storing it into a computer using bits represents the first step in the quantization process. Continuous values are captured into discrete observations, stored into bits. As an example, sound is converted from a wave travelling through air into a series of integers, stored as bits.
To compress data even further, one can reduce the memory size of the integers (e.g., from 32 to 16 bits) used to encode the data. This reduces the range of possible values these integers can take, hence reducing the amount of information that can be stored. This can be done with many types of data including audio and images.
In the next post, we will explore how Neural Network weights, represented as matrices of floating point numbers, can also be quantized. As with images and audio, this results in memory gains and some performance loss, a trade-off at the centre of current Machine Learning research.
Footnotes
1. Gnome-fs-client.svg: David Vignoni; Gnome-fs-server.svg: David Vignoni; derivative work: Calimo, LGPL, via Wikimedia Commons ↩︎
2. Mahdi, O.A.; Mohammed, M.A.; Mohamed, A.J. (November 2012). “Implementing a Novel Approach an Convert Audio Compression to Text Coding via Hybrid Technique” (PDF). International Journal of Computer Science Issues. 9 (6, No. 3): 53–59. ↩︎
3. By Ams100272 - Own work, CC BY-SA 4.0, Link ↩︎
4. See JPEG on Wikipedia. ↩︎