4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/1_bug_report.md
@@ -6,7 +6,7 @@ labels: bug
assignees: ''
---

**Before submitting an issue, please make sure it hasn't been already addressed by searching through the [existing and past issues](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues?q=is%3Aissue).**
**Before submitting an issue, please make sure it hasn't been already addressed by searching through the [existing and past issues](https://github.com/NVIDIA/Model-Optimizer/issues?q=is%3Aissue).**

## Describe the bug
<!-- Description of what the bug is, its impact (blocker, should have, nice to have) and any stack traces or error messages. -->
@@ -30,7 +30,7 @@ If you are unsure about whom to tag, you can leave it blank, and we will make su

## System information

<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->
<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->

- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/3_question.md
@@ -6,7 +6,7 @@ labels: question
assignees: ''
---

Make sure you already checked the [examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples) and [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) before submitting an issue.
Make sure you already checked the [examples](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples) and [documentation](https://nvidia.github.io/Model-Optimizer/) before submitting an issue.

## How would you like to use ModelOpt

@@ -23,7 +23,7 @@ If you are unsure about whom to tag, you can leave it blank, and we will make su

## System information

<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->
<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->

- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
4 changes: 2 additions & 2 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -17,11 +17,11 @@
## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open `Draft` PR. -->

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No <!--- If No, explain why. -->
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No <!--- Only for new features, API changes, critical bug fixes or bw breaking changes. -->
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No <!--- Only for new features, API changes, critical bug fixes or bw breaking changes. -->

## Additional Information
<!-- E.g. related issue. -->
17 changes: 8 additions & 9 deletions CHANGELOG-Windows.rst
@@ -1,34 +1,33 @@
===================================
Model Optimizer Changelog (Windows)
===================================
NVIDIA Model Optimizer Changelog (Windows)
==========================================

0.33 (2025-07-21)
^^^^^^^^^^^^^^^^^

**New Features**

- TensorRT Model Optimizer for Windows now supports `NvTensorRtRtx <https://onnxruntime.ai/docs/execution-providers/TensorRTRTX-ExecutionProvider.html>`_ execution-provider.
- Model Optimizer for Windows now supports `NvTensorRtRtx <https://onnxruntime.ai/docs/execution-providers/TensorRTRTX-ExecutionProvider.html>`_ execution-provider.


0.27 (2025-04-30)
^^^^^^^^^^^^^^^^^

**New Features**

- New LLM models like DeepSeek etc. are supported with ONNX INT4 AWQ quantization on Windows. Refer `Windows Support Matrix <https://nvidia.github.io/TensorRT-Model-Optimizer/guides/0_support_matrix.html>`_ for details about supported features and models.
- TensorRT Model Optimizer for Windows now supports ONNX INT8 and FP8 quantization (W8A8) of SAM2 and Whisper models. Check `example scripts <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq>`_ for getting started with quantizing these models.
- New LLM models like DeepSeek etc. are supported with ONNX INT4 AWQ quantization on Windows. Refer `Windows Support Matrix <https://nvidia.github.io/Model-Optimizer/guides/0_support_matrix.html>`_ for details about supported features and models.
- Model Optimizer for Windows now supports ONNX INT8 and FP8 quantization (W8A8) of SAM2 and Whisper models. Check `example scripts <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/onnx_ptq>`_ for getting started with quantizing these models.


0.19 (2024-11-18)
^^^^^^^^^^^^^^^^^

**New Features**

- This is the first official release of TensorRT Model Optimizer for Windows
- This is the first official release of Model Optimizer for Windows
- **ONNX INT4 Quantization:** :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` now supports ONNX INT4 quantization for DirectML and TensorRT* deployment. See :ref:`Support_Matrix` for details about supported features and models.
- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer `example <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-tensorrt-model-optimizer>`_
- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer `example <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_
- **DirectML Deployment Guide:** Added DML deployment guide. Refer :ref:`DirectML_Deployment`.
- **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/README.md>`_ for accuracy evaluation of ONNX models on DirectML (DML).
- **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/README.md>`_ for accuracy evaluation of ONNX models on DirectML (DML).
- **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_.


32 changes: 16 additions & 16 deletions CHANGELOG.rst
@@ -1,5 +1,5 @@
Model Optimizer Changelog (Linux)
=================================
NVIDIA Model Optimizer Changelog (Linux)
========================================

0.40 (2025-12-11)
^^^^^^^^^^^^^^^^^
@@ -12,7 +12,7 @@ Model Optimizer Changelog (Linux)
**New Features**

- Add MoE (e.g. Qwen3-30B-A3B, gpt-oss-20b) pruning support for ``num_moe_experts``, ``moe_ffn_hidden_size`` and ``moe_shared_expert_intermediate_size`` parameters in Minitron pruning (``mcore_minitron``).
- Add ``specdec_bench`` example to benchmark speculative decoding performance. See `examples/specdec_bench/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/specdec_bench#speculative-decoding-benchmark>`_ for more details.
- Add ``specdec_bench`` example to benchmark speculative decoding performance. See `examples/specdec_bench/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench#speculative-decoding-benchmark>`_ for more details.
- Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
- Add flag ``trt_plugins_precision`` in ONNX autocast to indicate custom ops precision. This is similar to the flag already existing in the quantization workflow.
- Add support for PyTorch Geometric quantization.
@@ -30,7 +30,7 @@ Model Optimizer Changelog (Linux)

**Deprecations**

- Deprecated ``modelopt.torch._deploy.utils.get_onnx_bytes`` API. Please use ``modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata`` instead to access the ONNX model bytes with external data. see `examples/onnx_ptq/download_example_onnx.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/onnx_ptq/download_example_onnx.py>`_ for example usage.
- Deprecated ``modelopt.torch._deploy.utils.get_onnx_bytes`` API. Please use ``modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata`` instead to access the ONNX model bytes with external data. see `examples/onnx_ptq/download_example_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/onnx_ptq/download_example_onnx.py>`_ for example usage.
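  For reference, a minimal migration sketch for this deprecation is shown below. The call and return shapes are assumptions for illustration only; the linked ``examples/onnx_ptq/download_example_onnx.py`` shows the actual usage.

  ```python
  # Hedged sketch: get_onnx_bytes_and_metadata replaces the deprecated get_onnx_bytes.
  # The exact signature and return layout are assumptions; consult the linked example.
  import torch

  from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

  model = torch.nn.Linear(16, 8)    # toy model, stands in for a real network
  dummy_input = torch.randn(1, 16)

  # Assumed to return the serialized ONNX model (with external data) plus export metadata.
  onnx_bytes, metadata = get_onnx_bytes_and_metadata(model, dummy_input)

  with open("model.onnx", "wb") as f:
      f.write(onnx_bytes)           # illustrative only; external-data handling may differ
  ```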

**New Features**

@@ -40,7 +40,7 @@ Model Optimizer Changelog (Linux)
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to a mix of ``cnn_dailymail`` and ``nemotron-post-training-dataset-v2`` (gated dataset accessed using ``HF_TOKEN`` environment variable) if no dataset is specified.
- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
- Add support for MCore MoE PTQ/QAT/QAD.
- Add support for multi-node PTQ and export with FSDP2 in ``examples/llm_ptq/multinode_ptq.py``. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq#multi-node-post-training-quantization-with-fsdp2>`_ for more details.
- Add support for multi-node PTQ and export with FSDP2 in ``examples/llm_ptq/multinode_ptq.py``. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#multi-node-post-training-quantization-with-fsdp2>`_ for more details.
- Add support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
- Add flags ``nodes_to_include`` and ``op_types_to_include`` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
- Add support for ``torch.compile`` and benchmarking in ``examples/diffusers/quantization/diffusion_trt.py``.
@@ -51,15 +51,15 @@

**Documentation**

- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.
- Added example for exporting QLoRA checkpoint for vLLM deployment. Refer to `examples/llm_qat/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/79ef31bc7269ba4da0cfab446da5b64509cbfcef/examples/llm_qat/README.md#qlora-deployment>`_ for more details
- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.
- Added example for exporting QLoRA checkpoint for vLLM deployment. Refer to `examples/llm_qat/README.md <https://github.com/NVIDIA/Model-Optimizer/blob/79ef31bc7269ba4da0cfab446da5b64509cbfcef/examples/llm_qat/README.md#qlora-deployment>`_ for more details

0.37 (2025-10-08)
^^^^^^^^^^^^^^^^^

**Deprecations**

- Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM or TensorRT docker image directly or refer to the `installation guide <https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html>`_ for more details.
- Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM or TensorRT docker image directly or refer to the `installation guide <https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html>`_ for more details.
- Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strongly typing. Use ``engine_precision`` instead.
- Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Tasks ``build`` and ``benchmark`` support are removed and replaced with ``quant``. ``engine_dir`` is replaced with ``checkpoint_dir`` in ``examples/llm_ptq`` and ``examples/vlm_ptq``. For performance evaluation, please use ``trtllm-bench`` directly.
- ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
@@ -226,16 +226,16 @@ Model Optimizer Changelog (Linux)
- Disabled saving modelopt state in unified hf export APIs by default, i.e., added ``save_modelopt_state`` flag in ``export_hf_checkpoint`` API and by default set to False.
- Add FP8 and NVFP4 real quantization support with LLM QLoRA example.
- The :class:`modelopt.deploy.llm.LLM` now support use the :class:`tensorrt_llm._torch.LLM` backend for the quantized HuggingFace checkpoints.
- Add `NVFP4 PTQ example for DeepSeek-R1 <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/deepseek>`_.
- Add end-to-end `AutoDeploy example for AutoQuant LLM models <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_autodeploy>`_.
- Add `NVFP4 PTQ example for DeepSeek-R1 <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/deepseek>`_.
- Add end-to-end `AutoDeploy example for AutoQuant LLM models <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_autodeploy>`_.

0.23 (2025-01-29)
^^^^^^^^^^^^^^^^^

**Backward Breaking Changes**

- Support TensorRT-LLM to 0.17. Examples (e.g. benchmark task in llm_ptq) may not be fully compatible with TensorRT-LLM 0.15.
- Nvidia TensorRT Model Optimizer has changed its LICENSE from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Nvidia Model Optimizer has changed its LICENSE from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Deprecate Python 3.8, Torch 2.0, and Cuda 11.x support.
- ONNX Runtime dependency upgraded to 1.20 which no longer supports Python 3.9.
- In the Huggingface examples, the ``trust_remote_code`` is by default set to false and require users to explicitly turning it on with ``--trust_remote_code`` flag.
@@ -283,7 +283,7 @@ Model Optimizer Changelog (Linux)
**Backward Breaking Changes**

- Deprecated the summarize task in the ``llm_ptq`` example.
- Deprecated the ``type`` flag in the `huggingface_example.sh <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq/scripts/huggingface_example.sh>`_
- Deprecated the ``type`` flag in the `huggingface_example.sh <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq/scripts/huggingface_example.sh>`_
- Deprecated Python plugin support in ONNX.
- Support TensorRT-LLM 0.13. Examples not compatible with TensorRT-LLM 0.12.
- :meth:`mtq.auto_quantize <modelopt.torch.quantization.model_quant.auto_quantize>` API has been updated. The API now
@@ -320,7 +320,7 @@ Model Optimizer Changelog (Linux)
- New APIs and examples: :mod:`modelopt.torch.prune` for pruning Conv, Linear, and Attention heads for
NVIDIA Megatron-core GPT-style models (e.g. Llama 3), PyTorch Computer Vision models, and HuggingFace Bert/GPT-J models.
- New API: :mod:`modelopt.torch.distill` for knowledge distillation, along with guides and example.
- New Example: `HF BERT Prune, Distill & Quantize <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/chained_optimizations>`_
- New Example: `HF BERT Prune, Distill & Quantize <https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/chained_optimizations>`_
showcasing how to chain pruning, distillation, and quantization to achieve the best performance on a given model.
- Added INT8/FP8 DQ-only support for ONNX model.
- New API: :mod:`modelopt.torch.speculative` for end-to-end support of Medusa models.
@@ -383,13 +383,13 @@

**Backward Breaking Changes**

- `PTQ examples <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq>`_ have been
- `PTQ examples <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq>`_ have been
upgraded to use TensorRT-LLM 0.10.

**New Features**

- Adding TensorRT-LLM checkpoint export support for Medusa decoding (official ``MedusaModel`` and Megatron Core ``GPTModel``).
- Enable support for mixtral, recurrentgemma, starcoder, qwen in `PTQ examples <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq>`_.
- Enable support for mixtral, recurrentgemma, starcoder, qwen in `PTQ examples <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq>`_.
- Adding TensorRT-LLM checkpoint export and engine building support for sparse models.
- Import scales from TensorRT calibration cache and use them for quantization.
- (Experimental) Enable low GPU memory FP8 calibration for the Hugging Face models when the original model size does not fit into the GPU memory.
@@ -403,7 +403,7 @@ Model Optimizer Changelog (Linux)
**Backward Breaking Changes**

- [!!!] The package was renamed from ``ammo`` to ``modelopt``. The new full product
name is *Nvidia TensorRT Model Optimizer*. PLEASE CHANGE ALL YOUR REFERENCES FROM ``ammo`` to
name is *Nvidia Model Optimizer*. PLEASE CHANGE ALL YOUR REFERENCES FROM ``ammo`` to
``modelopt`` including any paths and links!
- Default installation ``pip install nvidia-modelopt`` will now only install minimal core
dependencies. Following optional dependencies are available depending on the features that are
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -1,17 +1,17 @@
# Contributing to TensorRT Model Optimizer
# Contributing to Model Optimizer

Thanks for your interest in contributing to TensorRT Model Optimizer (ModelOpt)!
Thanks for your interest in contributing to Model Optimizer (ModelOpt)!

## 🛠️ Setting up your environment

Ensure that TensorRT Model Optimizer (ModelOpt) is installed in editable mode and that all `dev` optional requirements are installed:
Ensure that Model Optimizer (ModelOpt) is installed in editable mode and that all `dev` optional requirements are installed:

```bash
pip install -e ".[dev]"
```

If you are working on features that require dependencies like TensorRT-LLM or Megatron-Core, consider using a docker container to simplify the setup process.
Visit our [installation docs](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more information.
Visit our [installation docs](https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html) for more information.

## 🧹 Code linting and formatting
