Running DeepSeek-R1 Locally: Quantizing the 671B-Parameter Model
Dynamic Quantization: A Tailored Approach for Large Language Models
The Unsloth AI team’s approach uses dynamic quantization, assigning variable bit precisions to different network components according to their sensitivity to precision loss. Key technical insights include:
Selective Precision Assignment
The initial dense layers and the down-projection (down_proj) matrices, critical for setting up stable representations and managing the scaling properties in SwiGLU activations, are maintained at higher precisions (4-bit or 6-bit). Conversely, the bulk of the parameters — primarily within the Mixture-of-Experts (MoE) layers, which constitute about 88% of the model — are quantized aggressively to 1.5–2 bits.
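Conceptually, this assignment can be expressed as a name-based policy over the model’s weight tensors. The sketch below is a minimal illustration only: the tensor names follow common DeepSeek/Hugging Face conventions (`down_proj`, `experts`, `lm_head`), but the exact thresholds and the `choose_bits` helper are assumptions for illustration, not Unsloth’s actual rules.

```python
def choose_bits(tensor_name: str, layer_idx: int) -> float:
    """Return a target bit width for one weight tensor (illustrative policy)."""
    # Early dense layers set up stable representations: keep them high precision.
    if layer_idx < 3:
        return 6.0
    # down_proj matrices govern SwiGLU output scaling: never below 4-bit.
    if "down_proj" in tensor_name:
        return 4.0
    # Attention, embeddings, and the output head are precision-sensitive.
    if any(key in tensor_name for key in ("attn", "embed_tokens", "lm_head")):
        return 4.0
    # MoE expert weights (~88% of all parameters) tolerate 1.5-2 bits.
    if "experts" in tensor_name:
        return 1.5
    return 2.0

# Example: route a few representative tensors through the policy.
for name, idx in [
    ("model.layers.0.mlp.down_proj.weight", 0),
    ("model.layers.40.mlp.experts.7.gate_proj.weight", 40),
    ("model.layers.40.self_attn.q_proj.weight", 40),
]:
    print(f"{name}: {choose_bits(name, idx)} bits")
```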
Importance Matrix Calibration
Incorporating an importance matrix during the quantization process allows the method to adjust precision levels per layer dynamically. This calibration prevents common failure modes of naive uniform quantization, such as endless repetition loops during generation or outright nonsensical outputs.
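The sketch below illustrates the general idea, in the spirit of llama.cpp’s imatrix tooling: accumulate per-channel activation statistics over a calibration corpus, then weight each layer’s quantization error by those statistics so channels that carry large activations are protected. Both helpers and the toy quantizer are illustrative assumptions; the exact weighting Unsloth applies is not reproduced here.

```python
import numpy as np

def accumulate_importance(acts: np.ndarray, running: np.ndarray) -> np.ndarray:
    """Add per-channel squared activations from one calibration batch.
    acts: (tokens, in_features) inputs feeding a given weight matrix."""
    return running + np.square(acts).sum(axis=0)

def weighted_quant_error(w: np.ndarray, w_q: np.ndarray,
                         importance: np.ndarray) -> float:
    """Importance-weighted reconstruction error for one layer.
    Columns whose inputs carry large activations dominate the error,
    so the quantizer effectively spends more precision on them."""
    per_column = np.square(w - w_q).sum(axis=0)   # shape: (in_features,)
    return float((per_column * importance).sum() / importance.sum())

# Usage: accumulate importance over calibration batches, then compare
# candidate quantizations of a layer and keep the lowest weighted error.
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64))
importance = np.zeros(64)
for _ in range(10):                               # ten calibration batches
    importance = accumulate_importance(rng.normal(size=(32, 64)), importance)
w_q = np.round(w * 2) / 2                         # toy quantizer, illustration only
print(weighted_quant_error(w, w_q, importance))
```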
Layer-Specific Sensitivity Analysis
Technical evaluations indicated that while MoE layers tolerate lower precision, components like the attention mechanisms, embedding layers, and final output heads require more bits to preserve activation distributions. This nuanced strategy ensures that critical paths in the computation graph retain sufficient fidelity.
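One way to perform such a sensitivity analysis is to quantize one layer at a time and measure how far the output distribution drifts from a full-precision baseline. The probe below is a hedged sketch of that procedure: `run_model` and `quantize_layer` are hypothetical stand-ins for a real evaluation harness, not an actual API.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis (the vocabulary)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """Mean KL(P || Q) across positions, comparing next-token distributions."""
    log_p, log_q = log_softmax(p_logits), log_softmax(q_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def sensitivity_sweep(model, calib_tokens, layer_names, bits=2):
    """Quantize one layer at a time and score the damage to the logits.
    `run_model` and `quantize_layer` are hypothetical helpers.
    A high KL divergence marks a layer as precision-sensitive."""
    baseline = run_model(model, calib_tokens)      # full-precision logits
    scores = {}
    for name in layer_names:
        with quantize_layer(model, name, bits):    # temporarily quantize one layer
            scores[name] = mean_kl(baseline, run_model(model, calib_tokens))
    return scores
```

Under this kind of probe, attention projections, embeddings, and the output head would be expected to show the highest divergence at low bit widths, matching the evaluations described above.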
