Quantization is the process of storing model weights at lower numerical precision.
A full-precision model typically represents each weight as a 16- or 32-bit floating-point number. A quantized model uses fewer bits per weight, often 8 or 4, which reduces memory usage and can make inference faster or cheaper to run.
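To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization using NumPy. The weight array, the per-tensor scale, and the round-trip are illustrative; real quantization schemes (per-channel scales, zero points, group-wise formats) are more involved.

```python
import numpy as np

# Stand-in for one layer's weights (illustrative, not a real model).
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric int8 quantization: map the float range to [-127, 127]
# using a single scale factor for the whole tensor.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)   # stored form: 1 byte per weight
deq = q.astype(np.float32) * scale              # values used at inference time

print(weights.nbytes, q.nbytes)     # 64 bytes -> 16 bytes: 4x smaller
print(np.abs(weights - deq).max())  # rounding error, at most scale / 2
```

The storage savings come entirely from the narrower integer type; the scale factor is the only extra bookkeeping needed to recover approximate float values.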
Why people quantize models
Quantization is especially important for local use. It can make the difference between:
- a model that only runs on a large GPU and one that fits on a laptop
- a model that responds too slowly to be useful and one that is practical for experimentation
The tradeoff
Lower precision can slightly reduce output quality, reasoning consistency, or factual reliability. The exact impact depends on the model, the quantization method, and the task.
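One way to see the tradeoff is to measure how reconstruction error grows as the bit width shrinks. The sketch below reuses the symmetric-quantization idea on a random array; the data and the mean-absolute-error metric are illustrative stand-ins for real model weights and real quality benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(10_000).astype(np.float32)

def quantize_error(w, bits):
    # Symmetric quantization onto a signed grid with `bits` bits,
    # returning the mean absolute round-trip error.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    q = np.round(w / scale)
    return float(np.abs(w - q * scale).mean())

for bits in (8, 4, 2):
    print(bits, quantize_error(weights, bits))
```

Halving the bit width roughly doubles the spacing between representable values, so error rises steeply at very low precision. Which point on that curve is acceptable depends on the task, which is exactly the practical framing described above.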
That tradeoff is why quantization is often discussed in practical terms rather than abstract ones. The question is not “is quantization good or bad?” The better question is “what loss in quality is acceptable for this hardware budget and use case?”