Description
Before submitting an issue, please make sure it hasn't been already addressed by searching through the existing and past issues.
Describe the bug
I am trying to apply LoRA before QAT and, after that, quantize and compress the weights (as far as I know, that is how I need to implement QLoRA).
But if I compress the weights, I get a CUDA out-of-memory error when calling mtq.compress on the model after QAT.
(The model is gpt-oss-120b.)
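For context, the model and tokenizer referenced in the snippet below are loaded in the standard transformers way. A rough sketch (the dtype and device_map arguments here are assumptions, not copied verbatim from my script):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights
    device_map="auto",           # assumption: sharded across the 8 A100s
)

The reproduction itself: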
from trl import SFTTrainer
from peft import LoraConfig

# model, tokenizer, training_args, dataset and script_args are defined earlier (not shown)
peft_config = LoraConfig(r=256, lora_alpha=16, target_modules="all-linear")
trainer = SFTTrainer(
    peft_config=peft_config,
    model=model,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=dataset[script_args.dataset_test_split],
    processing_class=tokenizer,
)

import torch
import modelopt.torch.quantization as mtq

quantization_config = mtq.MXFP4_MLP_WEIGHT_ONLY_CFG

# Calibration data: first 128 eval samples
calib_size = 128
dataset = torch.utils.data.Subset(
    trainer.eval_dataset, list(range(min(len(trainer.eval_dataset), calib_size)))
)
data_loader = trainer.get_eval_dataloader(dataset)

def forward_loop(model):
    for data in data_loader:
        model(**data)

q_model = mtq.quantize(model, quantization_config, forward_loop)
qc_model = mtq.compress(model)  # <-- CUDA out of memory happens here

trainer.train()
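The snippet above is exactly what fails. As a diagnostic I have not yet tried, the following sketch could be dropped in right before the mtq.compress call to release calibration buffers and log per-GPU memory (only standard gc / torch.cuda calls; whether this avoids the OOM is an open question):

import gc
import torch

# Drop references to calibration activations and clear the CUDA caching allocator
gc.collect()
torch.cuda.empty_cache()

# Log per-GPU memory right before the compress call
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i} allocated={allocated:.1f} GiB reserved={reserved:.1f} GiB")

qc_model = mtq.compress(model)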
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 22.04.5 LTS
- CPU architecture (x86_64, aarch64): x86_64
- GPU name (e.g. H100, A100, L40S): NVIDIA A100-SXM4-80GB
- GPU memory size: 80.0 GB
- Number of GPUs: 8
- Library versions (if applicable):
  - Python: 3.10.12
  - ModelOpt version or commit hash: 0.40.0
  - CUDA: 12.2
  - PyTorch: 2.6.0+cu124
  - Transformers: 4.57.3
  - TensorRT-LLM: ?
  - ONNXRuntime: 1.22.0
  - TensorRT: ?