
CUDA out of memory when calling mtq.compress on model after QAT #696

@alrolo3


Describe the bug

I am trying to apply LoRA before QAT and, after that, QAT + compress weights (as far as I know, that is how I need to implement QLoRA).

But when I compress the weights with mtq.compress after quantization, I get a CUDA out-of-memory error.
(Model is gpt-oss-120b)
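For context, here is a rough back-of-the-envelope estimate of the weight memory involved. It is only a sketch: the parameter count (`ASSUMED_PARAMS`) is an approximation I am assuming for gpt-oss-120b, the MXFP4 figure treats every weight as quantized (the config above only targets MLP weights, so the real number would be higher), and it ignores LoRA adapters, activations, and optimizer state. How the model is sharded across the 8 GPUs is not covered here either.

```python
# Rough weight-memory estimate. All model figures are assumptions, not measurements.
GIB = 1024 ** 3


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


# MXFP4 stores 4-bit elements plus one 8-bit shared scale per 32-element block.
MXFP4_BITS = 4 + 8 / 32
BF16_BITS = 16

ASSUMED_PARAMS = 117e9  # assumption: approximate total parameter count
GPU_MEM_GIB = 80        # one A100-SXM4-80GB

for name, bits in [("bf16", BF16_BITS), ("mxfp4 (all weights)", MXFP4_BITS)]:
    gib = weight_bytes(ASSUMED_PARAMS, bits) / GIB
    fits = "fits" if gib <= GPU_MEM_GIB else "does not fit"
    print(f"{name}: ~{gib:.0f} GiB of weights, {fits} on one {GPU_MEM_GIB} GiB GPU")
```

If the compress step holds the high-precision weights and the packed weights at the same time, even briefly, the peak would be above either figure on its own. That is a guess about the cause, not something I have confirmed.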

from trl import SFTTrainer
from peft import LoraConfig

peft_config = LoraConfig(r=256, lora_alpha=16, target_modules="all-linear")

trainer = SFTTrainer(
    peft_config=peft_config,
    model=model,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=dataset[script_args.dataset_test_split],
    processing_class=tokenizer,
)

import torch

import modelopt.torch.quantization as mtq

quantization_config = mtq.MXFP4_MLP_WEIGHT_ONLY_CFG
calib_size = 128

# Use at most calib_size evaluation samples for calibration.
dataset = torch.utils.data.Subset(
    trainer.eval_dataset, list(range(min(len(trainer.eval_dataset), calib_size)))
)
data_loader = trainer.get_eval_dataloader(dataset)

def forward_loop(model):
    # Run the calibration batches through the model.
    for data in data_loader:
        model(**data)

q_model = mtq.quantize(model, quantization_config, forward_loop)
qc_model = mtq.compress(model)
trainer.train()
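To narrow down which step runs out of memory, a small wrapper like the one below could report the peak CUDA allocation around a single call. This is a minimal sketch: `run_with_peak_memory` is a hypothetical helper name, it only uses standard `torch.cuda` memory statistics, and it reports the current device only, so a model spread over several GPUs would need the same check per device.

```python
try:
    import torch
except ImportError:  # lets the helper be imported on machines without PyTorch
    torch = None


def run_with_peak_memory(fn, *args, **kwargs):
    """Call fn and return (result, peak_bytes); peak_bytes is None without CUDA."""
    cuda_ok = torch is not None and torch.cuda.is_available()
    peak = None
    if cuda_ok:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()  # start counting from this point
    try:
        result = fn(*args, **kwargs)
    finally:
        # Runs even if fn raises (e.g. on OOM), so the peak is still printed.
        if cuda_ok:
            peak = torch.cuda.max_memory_allocated()
            print(f"peak allocated on current device: {peak / 1024**3:.2f} GiB")
    return result, peak


# Intended use with the script above (not executed here):
#   _, peak_quant = run_with_peak_memory(mtq.quantize, model, quantization_config, forward_loop)
#   _, peak_comp = run_with_peak_memory(mtq.compress, model)
```

Comparing the two printed peaks should show whether the allocation spike happens during calibration or during compression.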

System information

  • Container used (if applicable): ?
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 22.04.5 LTS
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): NVIDIA A100-SXM4-80GB
  • GPU memory size: 80.0 GB
  • Number of GPUs: 8
  • Library versions (if applicable):
    • Python: 3.10.12
    • ModelOpt version or commit hash: 0.40.0
    • CUDA: 12.2
    • PyTorch: 2.6.0+cu124
    • Transformers: 4.57.3
    • TensorRT-LLM: ?
    • ONNXRuntime: 1.22.0
    • TensorRT: ?
