[ML] bitsandbytes NF4 quantize, dequantize analysis
Analyzing the bitsandbytes quantize & dequantize process with real examples
- model: polyglot-ko-5.8b (GPTNeoX)
1. bitsandbytes NF4
Information based on the paper:
The NF (NormalFloat) data type builds on Quantile Quantization
[Quantile Quantization]
- an information-theoretically optimal data type
- ensures each quantization bin has an equal number of values assigned
- estimates the quantiles through the empirical cumulative distribution function
[Limitations of QQ]
- fast quantile approximation algorithms such as SRAM quantiles are used to estimate them
- due to their approximate nature → large quantization errors, especially for outliers
[Premise of NF4]
- expensive quantile estimates can be avoided when input tensors come from a distribution fixed up to a quantization constant
- pretrained NN weights usually have a zero-centered normal distribution
- so all weights can be transformed to a single fixed distribution
- verified with the Shapiro-Wilk test in the paper (about 7.5% of LLaMA weights are non-normally distributed)
→ NF4 normalizes weights into the fixed range [-1, 1]
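To make this concrete, below is a rough sketch (my own illustration, not the bitsandbytes implementation) of how an NF4-style codebook can be built from quantiles of N(0, 1) and normalized into [-1, 1]. The offset is, to my knowledge, the one used by bitsandbytes’ create_normal_map (which additionally pads the table to 256 entries); the result should match the 16 NF4 code values we will see in the quant_state later.
import torch
from scipy.stats import norm

def nf4_like_codebook(offset=0.9677083):
    # 8 positive levels from standard-normal quantiles in (0.5, offset]
    pos = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
    # 7 negative levels (the "extra" 16th value goes to the positive side, so NF4 is asymmetric)
    neg = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    levels = torch.tensor(sorted(neg + [0.0] + pos))
    return levels / levels.max()    # normalize so the largest level is exactly 1.0

print(nf4_like_codebook())
# ~ tensor([-1.0000, -0.6962, -0.5251, ..., 0.7230, 1.0000])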
2. Analyzing with a GPTNeoX-based model
2–1. Loading the model in NF4
Loading the model with
- NF4 quantization (no double quant)
- bfloat16 computation
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=False,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
gpu_num = 0
model = AutoModelForCausalLM.from_pretrained(
"EleutherAI/polyglot-ko-5.8b",
device_map = {"": "cuda:" + str(gpu_num)},
quantization_config=bnb_config
)
Example of the loaded model weights
print(model.gpt_neox.layers[0].attention.query_key_value.weight.dtype)
# torch.uint8
print(model.gpt_neox.layers[0].attention.query_key_value.weight)
'''
Parameter containing:
Parameter(Params4bit([[103],
[184],
[110],
...,
[ 9],
[ 58],
[125]], device='cuda:0', dtype=torch.uint8))
'''
The quantized weights are stored as torch.uint8 (bitsandbytes Params4bit) tensors.
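As a quick sanity check (Linear4bit comes from bitsandbytes, get_memory_footprint from transformers), the linear layers should have been replaced by 4-bit modules and the memory footprint should shrink accordingly:
import bitsandbytes as bnb
print(isinstance(model.gpt_neox.layers[0].attention.query_key_value, bnb.nn.Linear4bit))
# True
print(round(model.get_memory_footprint() / 1024**3, 2), "GiB")
# roughly a quarter of the ~11 GiB fp16 footprint (embeddings and norms stay unquantized)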
2–2. Analyzing Sample query_key_value weight
I’ll be dequantizing the following “query_key_value” weight
sample_layer_weight = model.gpt_neox.layers[0].attention.query_key_value.weight
print("SAMPLE LAYER", sample_layer_weight.shape, sample_layer_weight.dtype)
# SAMPLE LAYER torch.Size([25165824, 1]) torch.uint8
print(sample_layer_weight)
'''
Parameter containing:
Parameter(Params4bit([[103],
[184],
[110],
...,
[ 9],
[ 58],
[125]], device='cuda:0', dtype=torch.uint8))
'''
The shape of the quantized layer is [25165824, 1]. It represents the following (see the quick check after this list):
- the dtype is torch.uint8 → each element packs two 4-bit values
- therefore there are 2 * 25,165,824 = 50,331,648 quantized values
- this can be represented as 3 * 4096 * 4096
- since GPTNeoX’s query_key_value weight has shape (3*hidden_dim, hidden_dim)
- this means the weight values are quantized and flattened into a single column of packed bytes
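A quick check of the arithmetic above (hidden_size is taken from the model config):
hidden_size = model.config.hidden_size    # 4096 for polyglot-ko-5.8b
n_packed = sample_layer_weight.numel()    # 25,165,824 packed uint8 values
print(n_packed * 2 == 3 * hidden_size * hidden_size)
# True -> 50,331,648 4-bit values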
Let’s see how two 4-bit values are packed into one uint8 element
print([v[0] for v in sample_layer_weight[:2].cpu().numpy().tolist()])
# [103, 184]
print(format(sample_layer_weight[0].item(), '08b'))
# 01100111
print(format(sample_layer_weight[1].item(), '08b'))
# 10111000
The first uint8 value “01100111” can be split into the upper nibble “0110” (6) and the lower nibble “0111” (7).
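The two 4-bit codes can be recovered from a packed byte with bit operations (upper nibble first, matching the split above):
packed = sample_layer_weight[0].item()    # 103 -> 0b01100111
high, low = packed >> 4, packed & 0x0F
print(high, low)
# 6 7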
2–3. Dequantizing Sample query_key_value weight
To dequantize the tensor, we need the “quant_state”, which contains the following fields:
quant_state = sample_layer_weight.quant_state
absmax, shape, dtype, blocksize, compressed_stats, quant_type, data_type = quant_state
print("absmax",absmax,absmax.shape)
# absmax tensor([0.0492, 0.0487, 0.0561, ..., 0.0257, 0.0217, 0.0180], device='cuda:0') torch.Size([786432])
print("shape",shape) # (3*4096,4096) -> original shape
print("dtype",dtype) # dtype torch.float16
print("blocksize",blocksize) # 64
print("quant_type",quant_type) # quant_type nf4
print("data_type",data_type, len(data_type))
'''
data_type tensor([-1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0000],
device='cuda:0') 16
'''
shape and dtype describe the dequantized tensor
the shape of “absmax” can be interpreted as follows (checked in the one-liner below):
- absmax holds the absolute maximum value of each quantization block
- absmax size * blocksize = number of 4-bit values = 2 * packed tensor length
- 786,432 * 64 = 50,331,648 = 2 * 25,165,824
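The bookkeeping in one line:
print(absmax.numel() * blocksize == 2 * sample_layer_weight.numel())
# True -> 786,432 blocks of 64 values each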
data_type holds the 16 NF4 code values, one per 4-bit bin
- ex. a scaled weight close to -0.3949 (between the midpoints -0.4600 and -0.3397) is assigned to bin 3
- the bin boundaries are the midpoints of adjacent code values, which is exactly what the thresholds in dQuantizeNF4 below are (e.g. 0.0398 ≈ (0.0000 + 0.0796) / 2)
Sample of the binning code:
- you can see the full mapping in the dQuantizeNF4 function at https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/csrc/kernels.cu#L830
if(x > 0.03979014977812767f)
  if(x > 0.3893125355243683f) // 1
    if(x > 0.6427869200706482f) // 11
      if(x > 0.8614784181118011f) // 111
        return 0b1111;
      else
        return 0b1110;
...
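For reference, here is a small Python sketch of the same binning rule (my own illustration, not library code): since the kernel’s thresholds are the midpoints of adjacent code values, a scaled value can be binned with a searchsorted over those midpoints, using the data_type tensor from the quant_state above.
codes = data_type.cpu()                     # the 16 NF4 code values
midpoints = (codes[1:] + codes[:-1]) / 2    # 15 decision thresholds: ..., 0.0398, 0.1203, ...
def bin_nf4(x):
    # number of midpoints below x = the 4-bit bin index
    return int(torch.searchsorted(midpoints, torch.tensor(x)))
print(bin_nf4(-0.0913), bin_nf4(0.0))
# 6 7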
I’ll be dequantizing the weight using bitsandbytes’ function
import bitsandbytes.functional as F
dequantized = F.dequantize_4bit(sample_layer_weight, quant_state).to(torch.bfloat16)
print(dequantized.shape) # (3*4096, 4096) - (3*hidden_size, hidden_size)
# torch.Size([12288, 4096])
print(dequantized[0])
'''
tensor([-0.0045, 0.0000, 0.0166, ..., 0.0703, 0.0239, 0.0239],
device='cuda:0', dtype=torch.bfloat16)
'''
3. Reproducing the Quantize, Dequantize process
I’ll be re-quantizing → dequantizing the dequantized weight from 2–3
The functions implement the block-wise quantization equation from the paper:
X_quant = round(127 / absmax(X) * X) = round(c * X), with dequantization X ≈ X_quant / c
Since NF4 uses the range [-1, 1], the 127 is replaced with 1: each weight is scaled by c = 1/absmax and mapped to the nearest NF4 code value
Quantizing the first and second values, [-0.0045, 0.0000]:
- they were quantized into the single uint8 value 01100111
- 0110: 6, 0111: 7
First, take the weight block that contains these values and compute its absolute maximum:
def get_absmax(x):
    # absolute maximum of the block (the per-block quantization statistic)
    return x.abs().max()
weight_block = dequantized[0][:blocksize].clone().cpu()
absmax = get_absmax(weight_block)
print("absmax of block:", absmax)
# absmax of block: tensor(0.0491, dtype=torch.bfloat16)
(slightly different from the 0.0492 stored in quant_state because the dequantized weights were cast to bfloat16)
Quantizing the weights:
get_quantile simply takes the argmin of the absolute difference to the 16 code values. Since the block was already restored from quantized values, every scaled weight sits (almost) exactly on a code value, so this is equivalent to the kernel’s midpoint thresholds and keeps the example simple.
- bin 0 ↔ -1.0000
- bin 1 ↔ -0.6962
- ...
- bin 6 ↔ -0.0911
- bin 7 ↔ 0.0000
import numpy as np

# the 16 NF4 code values from quant_state, as plain Python floats
data_type_np = data_type.cpu().tolist()
# [-1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
#   0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0000]

def get_quantile(x, data_type):
    # index of the closest NF4 code value
    return int(np.argmin([abs(dt - x) for dt in data_type]))

def simple_quant(x, absmax, data_type):
    c = 1 / float(absmax)    # quantization constant of the block
    scaled = float(x) * c    # scale the weight into [-1, 1]
    q = get_quantile(scaled, data_type)
    return q
quantized1 = simple_quant(weight_block[0], absmax, data_type_np)
# c = 1/absmax: 20.3750
# scaled: -0.0913
# quantized: 6
quantized2 = simple_quant(weight_block[1], absmax, data_type_np)
# c = 1/absmax: 20.3750
# scaled: 0.
# quantized: 7
Looking at the scaled value -0.0913, we can see that it maps to bin 6
- it is close to the code value -0.0911 but not exactly the same due to bfloat16 rounding in the dequantize → requantize round trip
this matches the original packed nibble “0110” (6)
Then dequantizing the quantized values is as follows:
def simple_dequant(x_q, absmax, data_type):
    dq = data_type[x_q]    # look up the NF4 code value of the bin
    c = 1/absmax           # quantization constant of the block
    return dq/c            # rescale back: code value * absmax
dequantized1 = simple_dequant(quantized1, absmax, data_type_np)
# dq: data_type[6] -> -0.0911
# c: 20.3750
# DEQUANTIZED tensor(-0.0045)
# Before Quantization: -0.0045
dequantized2 = simple_dequant(quantized2, absmax, data_type_np)
# dq: data_type[7] -> 0.0000
# c: 20.3750
# DEQUANTIZED tensor(0.)
# Before Quantization: 0.
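Finally, a hypothetical end-to-end check using the helpers above: re-quantize the entire first block and compare the recovered 4-bit codes against the nibbles of the packed uint8 storage, assuming the bytes are packed linearly in the same order observed for the first byte (upper nibble first).
packed_block = sample_layer_weight[:blocksize // 2].flatten().cpu().tolist()    # 32 bytes -> 64 codes
stored_codes = [nib for byte in packed_block for nib in (byte >> 4, byte & 0x0F)]
recovered_codes = [simple_quant(w, absmax, data_type_np) for w in weight_block]
print(stored_codes[:4], recovered_codes[:4])
# [6, 7, 11, 8] [6, 7, 11, 8]    (bytes 103 and 184 -> nibbles 6,7 and 11,8)
print(stored_codes == recovered_codes)
# expected: True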
Sources:
- https://github.com/TimDettmers/bitsandbytes
- matmul_4bit function: https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/autograd/_functions.py#L566
- quantize_4bit function: https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L775
- dQuantizeNF4 function: https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/csrc/kernels.cu#L277
- https://huggingface.co/docs/transformers/main_classes/quantization#bitsandbytes-integration
- QLoRA: Efficient Finetuning of Quantized LLMs, Dettmers et al.: https://arxiv.org/pdf/2305.14314.pdf