Low-bit Quantization
for Deep Learning Models
PD Dr. Haojin Yang
Multimedia and Machine Learning Group
Hasso Plattner Institute
Low-bit Quantization
• Neural networks consist of floating-point operations and parameters.
• E.g., FP32 (32-bit) covers the range [−(2 − 2⁻²³) × 2¹²⁷, (2 − 2⁻²³) × 2¹²⁷]; the number of possible values is approximately 2³².
• Quantization in digital signal processing refers to approximating the continuous values of a signal by a finite number of discrete values.
• Neural network quantization refers to using low-bit values and operations in place of their full-precision counterparts.
• E.g., a fixed-point representation such as INT8 (8-bit) covers the range [−128, 127]; the number of possible values is 2⁸ = 256 (see the sketch after this list).
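A minimal NumPy sketch of this FP32-to-INT8 mapping (the function name quantize_int8 and the min-max calibration are illustrative choices, not a specific framework API):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine quantization of FP32 values to INT8 (min-max calibration)."""
    scale = (x.max() - x.min()) / 255.0            # 255 = 2^8 - 1 representable steps
    zero_point = np.round(-x.min() / scale) - 128  # offset so x.min() maps to -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

x = np.random.randn(1000).astype(np.float32)
q, scale, zero_point = quantize_int8(x)
x_hat = (q.astype(np.float32) - zero_point) * scale  # dequantize
print(np.abs(x - x_hat).max())                       # rounding error, at most ~scale/2
```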
Neural Network Quantization
• Why does quantization work for deep neural networks?
• Deep neural networks are likely overparameterized.
• Network weights typically have a narrow distribution and are concentrated close to zero.
• Advantages of neural network quantization
• Significantly reduces memory usage and improves inference speed
• Enables more applications on edge devices
• Types of quantization methods
• Post-training quantization (PTQ): quantizes an already trained model without retraining (a sketch follows this list)
• Quantization-aware training (QAT): simulates quantization during training
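A minimal PTQ sketch in NumPy (symmetric, per-tensor; the function name ptq_weights and the abs-max calibration are illustrative assumptions):

```python
import numpy as np

def ptq_weights(w: np.ndarray, bits: int = 8):
    """Symmetric per-tensor post-training quantization of one weight tensor (bits <= 8)."""
    qmax = 2 ** (bits - 1) - 1              # e.g., 127 for INT8
    scale = np.abs(w).max() / qmax          # fit the largest weight magnitude
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# PTQ operates on an already trained model; no retraining is involved.
w = np.random.randn(64, 64).astype(np.float32)  # stand-in for trained weights
w_q, scale = ptq_weights(w)
print(np.abs(w - w_q.astype(np.float32) * scale).max())  # quantization error
```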
Low-bit Model Architectures
• Binary Neural Networks (BNNs): using 1-bit weights and activations (see the sketch after this list)
• Ternary Weight Networks (TWNs): using weights restricted to {−1, 0, +1}
• Quantized Neural Networks (QNNs): 2- to 8-bit precision models
• Mixed-precision architectures: different bit-widths for different layers
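A minimal NumPy sketch of the two lowest-bit schemes above; the scaling factor mean(|w|) (as in XNOR-Net) and the ternary threshold 0.7 · mean(|w|) (as in the TWN paper) are the commonly used heuristics:

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """BNN-style 1-bit weights: sign(w) scaled by the mean magnitude (XNOR-Net style)."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

def ternarize(w: np.ndarray) -> np.ndarray:
    """TWN-style weights in {-1, 0, +1}; small weights are zeroed out."""
    delta = 0.7 * np.abs(w).mean()   # threshold heuristic from the TWN paper
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    return t
```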
Computing Engines and Optimizers
• Specialized hardware accelerators for low-bit operations (TPUs, NPUs)
• Software frameworks optimized for quantized computations
• Bit-serial computation techniques for flexible precision (see the XNOR/popcount sketch below)
• Energy efficiency gains through custom computing engines
• Training optimizers designed for low-precision gradients
• Memory bandwidth reduction through computation-in-memory approaches
[Figure: Google’s Cloud TPU and an Nvidia GPU]
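As a toy illustration of the bitwise arithmetic such engines exploit: with 1-bit values in {−1, +1} packed as bits (+1 → 1, −1 → 0), a dot product reduces to XNOR followed by a population count. This is a sketch, not any particular accelerator's instruction set:

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors packed into n-bit integers (1 bit per element)."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the two elements agree
    matches = bin(xnor).count("1")     # population count
    return 2 * matches - n             # agreements minus disagreements

# a = [+1, -1, +1, +1] -> 0b1011, b = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))   # -> 0, same as the float dot product
```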
Quantization-Aware Training
• Ultra-low-bit quantization (< 8-bit) causes a significant accuracy drop.
• Remedy: train the neural network with quantized weights and activations so it adapts to the quantization error.
• Upcoming video: we will explain how we train binary neural networks (1-bit).
Forward: $r_o = \dfrac{\operatorname{round}\left((2^k - 1) \cdot r_i\right)}{2^k - 1}$

Backward: $\dfrac{\partial c}{\partial r_i} = \dfrac{\partial c}{\partial r_o}$ (straight-through estimator: the gradient passes through the rounding unchanged)
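A minimal PyTorch sketch of this forward/backward pair (the class name QuantizeK is illustrative; the input is assumed to lie in [0, 1]):

```python
import torch

class QuantizeK(torch.autograd.Function):
    """k-bit uniform quantizer with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, r_i, k):
        n = 2 ** k - 1
        return torch.round(n * r_i) / n   # r_o = round((2^k - 1) * r_i) / (2^k - 1)

    @staticmethod
    def backward(ctx, grad_output):
        # round() has zero gradient almost everywhere, so the STE passes the
        # incoming gradient through unchanged: dc/dr_i = dc/dr_o (None is for k).
        return grad_output, None

x = torch.rand(4, requires_grad=True)   # inputs in [0, 1]
y = QuantizeK.apply(x, 2)               # 2-bit: values in {0, 1/3, 2/3, 1}
y.sum().backward()
print(x.grad)                           # all ones: gradient passed straight through
```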