Model Quantization

Definition

Reducing a neural network's numerical precision (e.g., from 32-bit to 8-bit or 4-bit) to decrease model size and increase inference speed.

Model quantization converts a neural network's weights and activations from high-precision floating-point numbers (FP32) to lower-precision formats (INT8, INT4). This typically reduces model size by 2-8x and increases inference speed by 2-4x, with minimal accuracy loss when done carefully.
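The FP32-to-INT8 conversion described above is commonly implemented as symmetric linear quantization: each weight is divided by a per-tensor scale and rounded to an 8-bit integer. The function names below are illustrative, and this is a minimal NumPy sketch rather than any specific library's implementation:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map FP32 weights to INT8 codes."""
    scale = np.abs(w).max() / 127.0                # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes, q.nbytes)        # 64 vs 16 bytes: the 4x size reduction
print(np.abs(w - w_hat).max())   # rounding error, bounded by scale / 2
```

The per-tensor scale is the simplest choice; production toolchains often use per-channel scales and calibration data to pick ranges that better preserve accuracy.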

Quantization is essential for on-device deployment where memory and compute are limited. A 1.5GB FP32 model shrinks to ~375MB at INT8 precision (a 4x reduction, since each parameter drops from 4 bytes to 1), making it feasible to run on a laptop or phone. Post-training quantization (PTQ) applies the conversion after training, while quantization-aware training (QAT) simulates quantization during training so the network learns weights that better tolerate the reduced precision.
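Quantization-aware training is often implemented by inserting "fake quantization" into the forward pass: weights are quantized and immediately dequantized, so the network trains against the rounding it will see at inference time (gradients typically bypass the non-differentiable rounding via a straight-through estimator). The sketch below is a simplified illustration, not a specific framework's API:

```python
import numpy as np

def fake_quantize(w: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Quantize-dequantize round trip used in QAT forward passes.

    The output stays in FP32 but takes only 2**num_bits - 1 distinct
    values, exposing the rounding error to the training loss.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
print(fake_quantize(w, num_bits=4))   # coarser grid than the FP32 input
```

Lower bit widths (e.g., 4 bits) make the grid coarser, which is why QAT matters more there than at INT8.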
