GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Published in , 2024
Summary of GPTQ
Optimal Brain Surgeon via Lagrangian multipliers
To approximate the cost function $\mathcal{C}_{\text{av}}$ locally, we expand it in a Taylor series about the operating point $\mathbf{w}$:
\[\mathcal{C}_{\text{av}}(\mathbf{w} + \Delta \mathbf{w}) = \mathcal{C}_{\text{av}}(\mathbf{w}) + \nabla \mathcal{C}_{\text{av}} (\mathbf{w})^T \Delta \mathbf{w} + \frac{1}{2} \Delta \mathbf{w}^T \nabla^2 \mathcal{C}_{\text{av}}(\mathbf{w}) \Delta \mathbf{w} + O(\|\Delta \mathbf{w}\|^3)\]
Under the extremal approximation, i.e. assuming the network has already been trained to a local minimum of the cost, the gradient $\nabla \mathcal{C}_{\text{av}}(\mathbf{w})$ is zero and the first-order term can be dropped. Neglecting the third-order and higher terms as well, the change in cost simplifies to:
\[\Delta \mathcal{C}_{\text{av}} = \mathcal{C}_{\text{av}}(\mathbf{w} + \Delta \mathbf{w}) - \mathcal{C}_{\text{av}}(\mathbf{w}) \simeq \frac{1}{2} \Delta \mathbf{w}^T \mathbf{H} \Delta \mathbf{w}\]
where
\[\mathbf{H} = \nabla^2 \mathcal{C}_{\text{av}}(\mathbf{w})\]
is the Hessian of the cost evaluated at $\mathbf{w}$. Eliminating a single weight $w_i$, i.e. setting it to zero, is equivalent to the constraint
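\[\Delta w_i + w_i = 0\]
or equivalently, with $\mathbf{I}_i$ denoting the unit vector whose $i$-th entry is one and all other entries are zero,
\[\mathbf{I}_i^T \Delta \mathbf{w} + w_i = 0\]
Using a Lagrange multiplier $\lambda$ to minimize the increase in cost subject to this constraint, we form the Lagrangian: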
\[L = \frac{1}{2} \Delta \mathbf{w}^T \mathbf{H} \Delta \mathbf{w} - \lambda (\mathbf{I}_i^T \Delta \mathbf{w} + w_i)\]
Setting the derivative of $L$ with respect to $\Delta \mathbf{w}$ to zero, and using the symmetry of $\mathbf{H}$:
\[\frac{\partial}{\partial \Delta \mathbf{w}} \left( \frac{1}{2} \Delta \mathbf{w}^T \mathbf{H} \Delta \mathbf{w} - \lambda (\mathbf{I}_i^T \Delta \mathbf{w} + w_i) \right) = 0\]
\[\frac{\partial}{\partial \Delta \mathbf{w}} \left( \frac{1}{2} \Delta \mathbf{w}^T \mathbf{H} \Delta \mathbf{w} \right) - \frac{\partial}{\partial \Delta \mathbf{w}} \left( \lambda (\mathbf{I}_i^T \Delta \mathbf{w} + w_i) \right) = 0\]
\[\mathbf{H} \Delta \mathbf{w} - \lambda \mathbf{I}_i = 0\]
\[\mathbf{H} \Delta \mathbf{w} = \lambda \mathbf{I}_i\]
\[\Delta \mathbf{w} = \lambda \mathbf{H}^{-1} \mathbf{I}_i\]
Substituting this into the constraint determines $\lambda$:
\[\mathbf{I}_i^T \Delta \mathbf{w} + w_i = 0\]
\[\mathbf{I}_i^T (\lambda \mathbf{H}^{-1} \mathbf{I}_i) + w_i = 0\]
\[\lambda [\mathbf{H}^{-1}]_{ii} + w_i = 0\]
\[\lambda = -\frac{w_i}{[\mathbf{H}^{-1}]_{ii}}\]
The optimal weight change is therefore
\[\Delta \mathbf{w} = -\frac{w_i}{[\mathbf{H}^{-1}]_{ii}} \mathbf{H}^{-1} \mathbf{I}_i\]
At this optimum the constraint term vanishes, so the Lagrangian equals the increase in cost caused by removing weight $i$, i.e. its saliency:
\[L_i = \frac{1}{2} \Delta \mathbf{w}^T \mathbf{H} \Delta \mathbf{w} = \frac{1}{2} \lambda^2 \, \mathbf{I}_i^T \mathbf{H}^{-1} \mathbf{H} \mathbf{H}^{-1} \mathbf{I}_i = \frac{w_i^2}{2 [\mathbf{H}^{-1}]_{ii}}\]
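The result above can be checked numerically on a toy problem. The following is a minimal sketch using only the Python standard library. The 3x3 Hessian `H`, the weight vector `w`, and the helper names `invert`, `quadratic_cost_change`, and `obs_prune` are all illustrative assumptions, not taken from the GPTQ paper or any library. It applies the update $\Delta \mathbf{w} = -\frac{w_i}{[\mathbf{H}^{-1}]_{ii}} \mathbf{H}^{-1} \mathbf{I}_i$ and compares the saliency formula with a direct evaluation of $\frac{1}{2} \Delta \mathbf{w}^T \mathbf{H} \Delta \mathbf{w}$.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [list(map(float, row)) + [float(r == c) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def quadratic_cost_change(hessian, delta_w):
    """Evaluate the quadratic approximation 0.5 * dw^T H dw."""
    n = len(delta_w)
    return 0.5 * sum(delta_w[r] * hessian[r][c] * delta_w[c]
                     for r in range(n) for c in range(n))


def obs_prune(weights, hessian, i):
    """Return (delta_w, saliency) for removing weight i with the OBS update."""
    h_inv = invert(hessian)
    lam = -weights[i] / h_inv[i][i]            # Lagrange multiplier
    # H^{-1} I_i is simply column i of H^{-1}.
    delta_w = [lam * h_inv[r][i] for r in range(len(weights))]
    saliency = weights[i] ** 2 / (2.0 * h_inv[i][i])
    return delta_w, saliency


# Hypothetical toy data: a symmetric positive-definite Hessian and a weight vector.
H = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]
w = [0.8, -1.5, 0.3]

delta_w, saliency = obs_prune(w, H, i=2)
new_w = [a + b for a, b in zip(w, delta_w)]

# Baseline for comparison: zero the weight without adjusting the others.
naive_cost = quadratic_cost_change(H, [0.0, 0.0, -w[2]])

print("updated weights:", new_w)
print("saliency L_i:", saliency)
print("0.5 * dw^T H dw:", quadratic_cost_change(H, delta_w))
print("cost of naive zeroing:", naive_cost)
```

The pruned entry of `new_w` comes out as zero up to rounding, while the remaining weights shift to compensate. The saliency matches the directly evaluated quadratic cost and does not exceed the cost of zeroing the weight without compensation. For Hessians of realistic size, a numerical library would replace the hand-written `invert`.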