
Low-bit Quantization for Deep Learning Models

PD Dr. Haojin Yang


Multimedia and Machine Learning Group
Hasso Plattner Institute
Low-bit Quantization

• A neural network consists of floating-point operations and parameters.
   • E.g., FP32 (32-bit) covers the range [-(2 - 2^-23) x 2^127, (2 - 2^-23) x 2^127]; the number of possible values is approximately 2^32.

• Quantization in digital signal processing refers to mapping the continuous values of a signal to a finite number of discrete values.
• Neural network quantization refers to the use of low-bit values and operations instead of their full-precision counterparts.
   • E.g., a fixed-point representation such as INT8 (8-bit) covers the range [-128, 127]; the number of possible values is 2^8 = 256 (see the sketch below).

[Image credit: Gregory Maxwell at English Wikipedia, CC BY-SA 3.0]
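To make the FP32-to-INT8 mapping concrete, here is a minimal sketch of uniform affine quantization and dequantization in NumPy; the scale/zero-point scheme is the standard textbook formulation, and the function names are ours, not any framework's API.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine quantization of an FP32 tensor to INT8."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)      # FP32 step per INT8 step
    zero_point = np.round(qmin - x.min() / scale)    # INT8 code that maps back to 0.0
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate FP32 values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32)         # weights are roughly zero-centered
q, s, z = quantize_int8(x)
x_hat = dequantize(q, s, z)
print("max abs error:", np.abs(x - x_hat).max())      # on the order of the step size s
print("memory: 4 bytes -> 1 byte per value")
```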
Neural Network Quantization

• Why does quantization work for deep neural networks?
   • Deep neural networks are likely overparameterized.
   • Neural network weights have a narrow distribution and are concentrated close to zero.
• Advantages of neural network quantization
   • Significantly saves memory and improves inference speed
   • Enables more applications on edge devices
• Types of quantization methods (see the PTQ sketch after this list)
   • Post-training quantization (PTQ)
   • Quantization aware training (QAT)
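To make the PTQ/QAT distinction concrete, the sketch below applies post-training quantization to a single trained weight matrix: the FP32 weights are quantized directly, with a symmetric per-tensor scale and no retraining (a common PTQ choice; the helper name is ours). QAT, in contrast, simulates this rounding during training so the network can adapt to it (see the Quantization Aware Training slide below).

```python
import numpy as np

def ptq_symmetric_int8(w: np.ndarray):
    """Post-training quantization: symmetric per-tensor INT8, no retraining."""
    scale = np.abs(w).max() / 127.0                  # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# A trained layer's weights are typically narrow and zero-centered,
# which is exactly why this simple mapping loses little information.
w = np.random.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)
q, scale = ptq_symmetric_int8(w)
w_hat = q.astype(np.float32) * scale

print("quantization error (mean abs):", np.abs(w - w_hat).mean())
print("memory: %.0f KB -> %.0f KB" % (w.nbytes / 1024, q.nbytes / 1024))
```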



Low-bit Model Architectures

• Binary Neural Networks (BNNs): using 1-bit weights and activations
• Ternary Weight Networks (TWNs): using -1, 0, +1 weight values (see the sketch after this list)
• Quantized Neural Networks (QNNs): 2-8 bit precision models
• Mixed-precision architectures: different bit-widths for different layers
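As a sketch of one of these architectures, the snippet below ternarizes a weight matrix into alpha * {-1, 0, +1}, using the threshold heuristic delta ≈ 0.7 * E[|W|] and the scale alpha from the Ternary Weight Networks paper; the random weights and helper name are illustrative.

```python
import numpy as np

def ternarize(w: np.ndarray):
    """Ternary Weight Networks: approximate W by alpha * T with T in {-1, 0, +1}."""
    delta = 0.7 * np.abs(w).mean()          # threshold heuristic from the TWN paper
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0   # scale over non-zero positions
    return alpha, t

w = np.random.normal(0.0, 0.05, size=(128, 128)).astype(np.float32)
alpha, t = ternarize(w)
print("alpha:", alpha)
print("sparsity (zeros):", (t == 0).mean())
print("approximation error (mean abs):", np.abs(w - alpha * t).mean())
```

Only the ternary codes and one FP32 scale per tensor need to be stored, so each weight costs about 2 bits instead of 32.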



Computing Engines and Optimizers

• Specialized hardware accelerators for low-bit operations (TPUs, NPUs); see the sketch after this list
• Software frameworks optimized for quantized computations
• Bit-serial computation techniques for flexible precision
• Energy efficiency gains through custom computing engines
• Training optimizers designed for low-precision gradients
• Memory bandwidth reduction through computation-in-memory approaches

[Images: Google’s Cloud TPU; Nvidia GPU]
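To illustrate why such engines benefit from low-bit operands, here is a sketch of a 1-bit dot product computed with bitwise XNOR and popcount on packed bits, the kind of primitive that binary/low-bit kernels and accelerators exploit; the packing scheme and helper names are illustrative, not any particular library's layout.

```python
import numpy as np

def pack_bits(v: np.ndarray) -> int:
    """Pack a {-1, +1} vector into one integer (bit i = 1 iff v[i] == +1)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(pa: int, pb: int, n: int) -> int:
    """1-bit dot product: matches minus mismatches, via XNOR + popcount."""
    matches = bin(~(pa ^ pb) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

n = 64
a = np.random.choice([-1, 1], size=n)
b = np.random.choice([-1, 1], size=n)

assert xnor_popcount_dot(pack_bits(a), pack_bits(b), n) == int(a @ b)
print("binary dot product:", int(a @ b))
```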



Quantization Aware Training

• Ultra-low-bit quantization (< 8-bit) causes a significant accuracy drop.
• Train the neural network using quantized weights and activations so it learns to compensate (implemented in the sketch after the formulas below).
• Upcoming video: We will explain how we train binary neural networks (1-bit).

Forward:  r_o = round((2^k - 1) · r_i) / (2^k - 1)

Backward: ∂c/∂r_i = ∂c/∂r_o   (straight-through estimator)
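The forward/backward pair above can be written as a custom autograd function; below is a minimal PyTorch sketch (the class name QuantizeK is ours), assuming the input r_i has already been scaled to [0, 1].

```python
import torch

class QuantizeK(torch.autograd.Function):
    """k-bit uniform quantizer with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, r_i, k):
        n = 2 ** k - 1
        return torch.round(n * r_i) / n      # r_o = round((2^k - 1) * r_i) / (2^k - 1)

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient through the non-differentiable round() unchanged,
        # i.e. dc/dr_i = dc/dr_o; the bit-width k receives no gradient.
        return grad_output, None

x = torch.rand(5, requires_grad=True)        # activations scaled to [0, 1]
y = QuantizeK.apply(x, 2)                    # 2-bit values: {0, 1/3, 2/3, 1}
y.sum().backward()
print(y)
print(x.grad)                                # all ones: gradient passed straight through
```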

