What is the partial sum in block_q8_1_mmq? Is it for reducing the quantization error during MMA?
#13507
The struct of q8_1_mmq is: struct block_q8_1_mmq { I was wondering why we need this
Replies: 1 comment 8 replies
The quantization used for A is decoded as `Ad*a - Am`, where Ad and Am are the scale/bias for the block, and a is the element. The q8_1 quantization is decoded as `Bd*b`. So the matrix multiply dots a row of A and column of B, computing `sum{(Ad*a-Am)*b*Bd}`. If you expand this out, you can rewrite it as `Ad*Bd*sum{a*b} - Am*Bd*sum{b}`. The partial sum is this `sum{b}` term, precomputed to make the matrix multiply faster.
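To make the expansion concrete, here is a minimal Python sketch that checks the two forms agree for a single block. The function names, the block size of 32, the value ranges, and the scale/bias values are all assumptions chosen for illustration; this is not the actual layout of `block_q8_1_mmq` or the kernel code.

```python
import random

def dot_direct(a, b, Ad, Am, Bd):
    """Dequantize both operands element by element, then take the dot product."""
    return sum((Ad * ai - Am) * (Bd * bi) for ai, bi in zip(a, b))

def dot_factored(a, b, Ad, Am, Bd, sum_b):
    """Integer dot product plus a bias correction that reuses a precomputed sum{b}."""
    int_dot = sum(ai * bi for ai, bi in zip(a, b))  # sum{a*b}, integer-only work
    return Ad * Bd * int_dot - Am * Bd * sum_b

random.seed(0)
a = [random.randint(0, 15) for _ in range(32)]      # A quants (assumed 4-bit range)
b = [random.randint(-127, 127) for _ in range(32)]  # B quants (assumed 8-bit range)
Ad, Am, Bd = 0.5, 4.0, 0.25                         # illustrative scale/bias values

sum_b = sum(b)  # the "partial sum": computed once, when B is quantized

direct = dot_direct(a, b, Ad, Am, Bd)
factored = dot_factored(a, b, Ad, Am, Bd, sum_b)
print(direct, factored)
```

In the factored form the inner loop only touches integers, and the `Am` correction costs one multiply per block instead of one subtraction per element. The same `sum_b` can also be reused for every row of A that is dotted against this block of B.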