[ONNX][Autotune] Replace CUDA memory management from CUDART to PyTorch#998
gcunhase wants to merge 5 commits into NVIDIA:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough

Replaced cudart usage in the TensorRT Python benchmark with PyTorch CUDA primitives: pinned host tensors, CUDA device tensors, and torch.cuda.Stream. Buffer allocation, host↔device transfers, stream handling, and CUDA availability checks now use torch.cuda and tensor.data_ptr() for TensorRT execution.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Host as Host (CPU)
    participant PyTorch as PyTorch CUDA
    participant TRT as TensorRT ExecutionContext
    participant GPU as GPU (CUDA)
    Host->>PyTorch: allocate pinned host tensor (pin_memory=True)
    PyTorch->>GPU: create device tensor (torch.empty(..., device='cuda'))
    Host->>PyTorch: non-blocking copy host -> device (tensor.to(device, non_blocking=True))
    PyTorch->>TRT: execute_async_v3(binding_ptrs using data_ptr(), stream=stream)
    TRT->>GPU: run kernels
    GPU->>PyTorch: non-blocking copy device -> host (tensor.copy_(..., non_blocking=True))
    PyTorch->>Host: synchronize stream and return results
```
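The flow in the diagram can be sketched with PyTorch alone. This is a minimal illustration, not the PR's actual implementation: the TensorRT call is replaced by a comment, the function name is made up, and a CPU fallback is included so the sketch runs without a GPU.

```python
import torch

def pinned_roundtrip(n: int = 4) -> torch.Tensor:
    """Sketch of the diagram's buffer flow; falls back to CPU when CUDA is absent."""
    host_in = torch.arange(n, dtype=torch.float32)
    if not torch.cuda.is_available():
        return host_in.clone()  # illustration only; the real path needs CUDA
    host_in = host_in.pin_memory()                       # pinned host tensor
    host_out = torch.empty(n, dtype=torch.float32).pin_memory()
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        dev = host_in.to("cuda", non_blocking=True)      # host -> device, async
        # TensorRT would run here: execute_async_v3 with dev.data_ptr()
        host_out.copy_(dev, non_blocking=True)           # device -> host, async
    stream.synchronize()                                 # wait before reading host_out
    return host_out
```

Pinned (page-locked) host memory is what makes the `non_blocking=True` copies truly asynchronous; with pageable memory they silently degrade to synchronous transfers.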
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 error, 1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (2)
modelopt/onnx/quantization/autotune/benchmark.py (2)
567-569: Duplicate dtype conversion pattern. The `torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype` pattern is duplicated here and in `_alloc_pinned_host`. Consider extracting it to a helper function or using the mapping suggested earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 567 - 569, The duplicated dtype conversion torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype used here and inside _alloc_pinned_host should be replaced with a shared helper (e.g., a function like np_to_torch_dtype) or the existing dtype mapping to centralize numpy->torch dtype conversion; add that helper (or mapping) and call it from both the _alloc_pinned_host implementation and the benchmark allocation site to return torch_dtype for a given np_dtype, then use the returned torch_dtype when creating device_tensor.
535-537: Consider using `torch.as_tensor` or a dtype mapping for cleaner conversion. The current approach creates temporary numpy arrays just to derive the torch dtype. A direct mapping or `torch.as_tensor` would be more efficient, though the current approach is functionally correct.

♻️ Optional: Cleaner dtype conversion

```diff
+# At module level or as class attribute
+_NP_TO_TORCH_DTYPE = {
+    np.float32: torch.float32,
+    np.float16: torch.float16,
+    np.int32: torch.int32,
+    np.int64: torch.int64,
+    np.int8: torch.int8,
+    np.uint8: torch.uint8,
+    np.bool_: torch.bool,
+}

 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view)."""
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = _NP_TO_TORCH_DTYPE.get(dtype, torch.from_numpy(np.empty(0, dtype=dtype)).dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 535 - 537, Replace the temporary numpy-to-torch conversion by mapping the incoming numpy dtype to a torch dtype (or use torch.as_tensor) to avoid creating empty numpy arrays; specifically, replace the expression that computes torch_dtype (currently torch.from_numpy(np.empty(0, dtype=dtype)).dtype) with a direct mapping like NUMPY_TO_TORCH_DTYPE[dtype] or obtain torch dtype via torch.as_tensor(...).dtype, then create host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory() and return host_tensor, host_tensor.numpy(); update or add a small helper (e.g., NUMPY_TO_TORCH_DTYPE or get_torch_dtype_from_numpy) to centralize dtype mapping and reference the variables torch_dtype, host_tensor, dtype, and size so the change is local and clear.
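The mapping the reviewer suggests can be written as a small standalone helper. This is an illustrative sketch, not the repository's code: the helper name and the dict are assumptions, and it keeps the review's empty-array trick only as a fallback for dtypes outside the map.

```python
import numpy as np
import torch

# Hypothetical consolidation of the duplicated conversion; keys are normalized
# np.dtype objects so both np.float32 and np.dtype("float32") hit the map.
_NP_TO_TORCH = {
    np.dtype(np.float32): torch.float32,
    np.dtype(np.float16): torch.float16,
    np.dtype(np.int32): torch.int32,
    np.dtype(np.int64): torch.int64,
    np.dtype(np.int8): torch.int8,
    np.dtype(np.uint8): torch.uint8,
    np.dtype(np.bool_): torch.bool,
}

def np_to_torch_dtype(np_dtype) -> torch.dtype:
    """Map a numpy dtype to a torch dtype without allocating a temporary array."""
    key = np.dtype(np_dtype)
    # Fall back to the original empty-array trick for unmapped dtypes.
    return _NP_TO_TORCH.get(key, torch.from_numpy(np.empty(0, dtype=key)).dtype)
```

Both `_alloc_pinned_host` and the device-tensor allocation site could then call `np_to_torch_dtype(dtype)` instead of repeating the conversion.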
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 567-569: The duplicated dtype conversion
torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype used here and inside
_alloc_pinned_host should be replaced with a shared helper (e.g., a function
like np_to_torch_dtype) or the existing dtype mapping to centralize numpy->torch
dtype conversion; add that helper (or mapping) and call it from both the
_alloc_pinned_host implementation and the benchmark allocation site to return
torch_dtype for a given np_dtype, then use the returned torch_dtype when
creating device_tensor.
- Around line 535-537: Replace the temporary numpy-to-torch conversion by
mapping the incoming numpy dtype to a torch dtype (or use torch.as_tensor) to
avoid creating empty numpy arrays; specifically, replace the expression that
computes torch_dtype (currently torch.from_numpy(np.empty(0,
dtype=dtype)).dtype) with a direct mapping like NUMPY_TO_TORCH_DTYPE[dtype] or
obtain torch dtype via torch.as_tensor(...).dtype, then create host_tensor =
torch.empty(int(size), dtype=torch_dtype).pin_memory() and return host_tensor,
host_tensor.numpy(); update or add a small helper (e.g., NUMPY_TO_TORCH_DTYPE or
get_torch_dtype_from_numpy) to centralize dtype mapping and reference the
variables torch_dtype, host_tensor, dtype, and size so the change is local and
clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eda3a9f4-ff6f-45c7-959d-282a05441deb
📒 Files selected for processing (1)
modelopt/onnx/quantization/autotune/benchmark.py
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #998      +/-   ##
==========================================
+ Coverage   72.08%   72.20%   +0.11%
==========================================
  Files         209      209
  Lines       23638    23599      -39
==========================================
  Hits        17040    17040
+ Misses       6598     6559      -39
```

☔ View full report in Codecov by Sentry.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/gpu/onnx/quantization/autotune/test_benchmark.py (1)
356-423: ⚠️ Potential issue | 🟠 Major

Move these end-to-end tests to `tests/examples` to keep the GPU unit suite fast. TestTensorRTPyBenchmark and TestTrtExecBenchmark (lines 356-423) execute real TensorRT engine builds and model inference with no mocking. These operations are inherently slow and will exceed the "few seconds" expectation for `tests/gpu` unit tests. Per coding guidelines, integration tests belong in `tests/examples`, while `tests/gpu` should contain fast GPU unit tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/onnx/quantization/autotune/test_benchmark.py` around lines 356 - 423, The end-to-end integration tests implemented by TestTensorRTPyBenchmark and TestTrtExecBenchmark should be moved out of the fast GPU unit suite into the integration/examples test folder; relocate the two classes (and their fixtures) into tests/examples, update any imports and relative fixtures they rely on, and add an appropriate pytest marker (e.g., `@pytest.mark.integration` or `@pytest.mark.slow`) if you want them discoverable separately; ensure the original file removes these classes and that any temp-file/cache cleanup and autouse fixtures (like _require_tensorrt and _require_trtexec) are preserved or adapted in the new location.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/gpu/onnx/quantization/autotune/test_benchmark.py`:
- Around line 356-423: The end-to-end integration tests implemented by
TestTensorRTPyBenchmark and TestTrtExecBenchmark should be moved out of the fast
GPU unit suite into the integration/examples test folder; relocate the two
classes (and their fixtures) into tests/examples, update any imports and
relative fixtures they rely on, and add an appropriate pytest marker (e.g.,
`@pytest.mark.integration` or `@pytest.mark.slow`) if you want them discoverable
separately; ensure the original file removes these classes and that any
temp-file/cache cleanup and autouse fixtures (like _require_tensorrt and
_require_trtexec) are preserved or adapted in the new location.
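The marker suggested in the relocation advice above looks like this in practice. The class and test bodies are placeholder sketches, not the repository's actual tests; note that a custom `slow` marker should also be registered in `pytest.ini`/`pyproject.toml` to avoid an unknown-mark warning.

```python
import pytest

# Hypothetical relocated test class: only the marker mechanics are shown.
@pytest.mark.slow
class TestTensorRTPyBenchmark:
    def test_engine_roundtrip(self):
        pass  # real engine build + inference would live here
```

With the marker in place, the fast suite can deselect these tests via `pytest -m "not slow"`, while CI integration runs select them with `pytest -m slow`.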
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ffe06eac-a45d-4cf3-b3e0-dab2d17c5f98
📒 Files selected for processing (1)
tests/gpu/onnx/quantization/autotune/test_benchmark.py
Pull request overview
Refactors the ONNX autotune benchmarking utilities to use PyTorch’s CUDA APIs (pinned host buffers, device buffers, streams, and async copies) instead of cuda-python/CUDART, and expands GPU autotune benchmark test coverage accordingly.
Changes:
- Replace CUDART-based host/device allocation and stream/memcpy logic in `TensorRTPyBenchmark` with PyTorch CUDA tensors/streams.
- Simplify buffer cleanup to reference release (PyTorch-managed deallocation).
- Add/expand unit, mocked, and integration tests for latency parsing, shape validation, buffer logic, and CUDA stream integration points.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| modelopt/onnx/quantization/autotune/benchmark.py | Swaps CUDART memory/stream management for PyTorch CUDA tensors/streams and updates buffer allocation/copy paths. |
| tests/gpu/onnx/quantization/autotune/test_benchmark.py | Adds targeted unit tests and mocked/integration coverage for the updated benchmark behaviors. |
Actionable comments posted: 1
🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
530-531: Consider extracting dtype conversion to a helper. The pattern `torch.from_numpy(np.empty(0, dtype=dtype)).dtype` is repeated here and at line 563. While functional, this could be cleaner with a small helper or direct use of a dtype mapping.

♻️ Optional: Extract dtype conversion helper

```diff
+@staticmethod
+def _numpy_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+    """Convert numpy dtype to torch dtype."""
+    return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

     Returns:
         (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
     """
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = TensorRTPyBenchmark._numpy_to_torch_dtype(dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 530 - 531, The dtype conversion expression torch.from_numpy(np.empty(0, dtype=dtype)).dtype is duplicated; extract it into a small helper (e.g., numpy_dtype_to_torch or a dtype_map) and replace both uses (the torch_dtype assignment near host_tensor creation and the other occurrence around line 563) with a call to that helper to improve readability and avoid repetition; update any callers to use the helper when building torch tensors (e.g., where torch_dtype and host_tensor are created) and ensure the helper returns a torch.dtype for direct use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 534-537: The _free_buffers method is ineffective when invoked as
self._free_buffers(inputs + outputs) because inputs + outputs creates a
temporary list and bufs.clear() only clears that temp; either remove the unused
_free_buffers method entirely and rely on the existing del inputs, del outputs
cleanup, or change the call to clear each real list explicitly (e.g., call
self._free_buffers(inputs) and self._free_buffers(outputs) or replace
_free_buffers with explicit inputs.clear(); outputs.clear()) while keeping del
inputs/del outputs as appropriate; update references to the _free_buffers
function accordingly.
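The bug described in the inline comment above can be demonstrated with plain Python lists, no tensors required: `list.clear()` on the temporary produced by `inputs + outputs` never touches the real lists, so the buffer references survive.

```python
def free_buffers(bufs: list) -> None:
    """Clear a buffer list so the tensors it holds can be garbage-collected."""
    bufs.clear()

inputs, outputs = ["in_buf"], ["out_buf"]

# Bug: `inputs + outputs` builds a temporary list; clearing it leaves the
# originals untouched, so the buffers stay referenced.
free_buffers(inputs + outputs)
assert inputs == ["in_buf"] and outputs == ["out_buf"]

# Fix: clear each real list explicitly (or just `del inputs, outputs`).
free_buffers(inputs)
free_buffers(outputs)
assert inputs == [] and outputs == []
```

This is why the review suggests either calling the helper on each real list or dropping `_free_buffers` in favor of the existing `del` cleanup.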
---
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 530-531: The dtype conversion expression
torch.from_numpy(np.empty(0, dtype=dtype)).dtype is duplicated; extract it into
a small helper (e.g., numpy_dtype_to_torch or a dtype_map) and replace both uses
(the torch_dtype assignment near host_tensor creation and the other occurrence
around line 563) with a call to that helper to improve readability and avoid
repetition; update any callers to use the helper when building torch tensors
(e.g., where torch_dtype and host_tensor are created) and ensure the helper
returns a torch.dtype for direct use.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 23b00fbe-661b-4a3d-b4c7-43546e4ac679
📒 Files selected for processing (2)
modelopt/onnx/quantization/autotune/benchmark.py
tests/gpu/onnx/quantization/autotune/test_benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/gpu/onnx/quantization/autotune/test_benchmark.py
🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
524-532: Consider extracting dtype conversion to avoid duplication. The pattern `torch.from_numpy(np.empty(0, dtype=dtype)).dtype` appears at lines 530 and 563. While functional, this could be slightly cleaner with a small helper.

♻️ Suggested helper function

```diff
+@staticmethod
+def _np_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+    """Convert numpy dtype to torch dtype."""
+    return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

     Returns:
         (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
     """
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = TensorRTPyBenchmark._np_to_torch_dtype(dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

Then at line 563:

```diff
-torch_dtype = torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+torch_dtype = self._np_to_torch_dtype(np_dtype)
```

Also applies to: 563-564
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 524 - 532, Extract the repeated numpy-to-torch dtype conversion into a small helper (e.g., a function named numpy_to_torch_dtype or _np_to_torch_dtype) and use it wherever `torch.from_numpy(np.empty(0, dtype=dtype)).dtype` is currently used (notably in the `_alloc_pinned_host` function and the other occurrence around the later allocation code). Update `_alloc_pinned_host` to call this helper to obtain `torch_dtype` and replace the duplicated expression at the second site with the same helper to remove duplication and improve clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 524-532: Extract the repeated numpy-to-torch dtype conversion into
a small helper (e.g., a function named numpy_to_torch_dtype or
_np_to_torch_dtype) and use it wherever `torch.from_numpy(np.empty(0,
dtype=dtype)).dtype` is currently used (notably in the `_alloc_pinned_host`
function and the other occurrence around the later allocation code). Update
`_alloc_pinned_host` to call this helper to obtain `torch_dtype` and replace the
duplicated expression at the second site with the same helper to remove
duplication and improve clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5ff02876-5aa1-4eb3-b47d-97ad899bbaa3
📒 Files selected for processing (1)
modelopt/onnx/quantization/autotune/benchmark.py
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Force-pushed: 343c8a5 to d05634a (Compare)
🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
562-564: Consider extracting dtype conversion to avoid duplication. The `torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype` pattern appears both here and in `_alloc_pinned_host`. You could extract this to a small helper or have `_alloc_pinned_host` return the torch dtype as well.

♻️ Optional: Extract dtype conversion helper

```diff
+@staticmethod
+def _np_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+    """Convert numpy dtype to torch dtype."""
+    return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

     Returns:
         (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
     """
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = TensorRTPyBenchmark._np_to_torch_dtype(dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

Then in `_allocate_buffers`:

```diff
     host_tensor, host_mem = self._alloc_pinned_host(size, np_dtype)
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+    torch_dtype = self._np_to_torch_dtype(np_dtype)
     device_tensor = torch.empty(size, dtype=torch_dtype, device="cuda")
```
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 562 - 564, The dtype conversion torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype is duplicated; extract it into a small helper (e.g., _to_torch_dtype(np_dtype)) or change _alloc_pinned_host to also return the torch dtype so callers like _allocate_buffers can reuse it instead of recomputing; update usages in _allocate_buffers and _alloc_pinned_host to call the new helper or use the returned torch dtype to eliminate duplication and keep semantics identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 562-564: The dtype conversion torch.from_numpy(np.empty(0,
dtype=np_dtype)).dtype is duplicated; extract it into a small helper (e.g.,
_to_torch_dtype(np_dtype)) or change _alloc_pinned_host to also return the torch
dtype so callers like _allocate_buffers can reuse it instead of recomputing;
update usages in _allocate_buffers and _alloc_pinned_host to call the new helper
or use the returned torch dtype to eliminate duplication and keep semantics
identical.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e4ef052b-3932-474d-8665-8b53ece51682
📒 Files selected for processing (2)
modelopt/onnx/quantization/autotune/benchmark.py
tests/gpu/onnx/quantization/autotune/test_benchmark.py
What does this PR do?
Type of change: Bug fix
Overview: Replace CUDA memory management from CUDART to PyTorch (higher-level API).
Usage
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary of changes in `benchmark.py` — `TensorRTPyBenchmark`:

| Component | Before (CUDART) | After (PyTorch) |
|---|---|---|
| Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) |
| Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` |
| `__init__` guard | `CUDART_AVAILABLE or cudart is None` | `TORCH_CUDA_AVAILABLE` |
| `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` |
| `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()`; PyTorch GC handles deallocation |
| `_allocate_buffers` | `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address |
| `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` |
| `_run_timing` / `run`: stream lifecycle | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` |
| `run`: stream arg to TRT | | `stream.cuda_stream` (integer property) |
| Error handling | `cudaError_t` return codes | `RuntimeError`, caught by existing `except Exception` |

Related to #961
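The run-path changes summarized above can be sketched end-to-end. This is an illustrative shape, not the PR's code: `execute` stands in for `context.execute_async_v3`, the function name and CPU fallback are assumptions, and engine/binding setup is omitted.

```python
import torch

def run_once(execute, host_in, use_cuda=torch.cuda.is_available()):
    """Sketch of the PyTorch-managed buffer flow around a TRT-style execute call.

    `execute(binding_ptrs, stream_handle)` stands in for execute_async_v3;
    on CUDA it receives data_ptr() addresses and stream.cuda_stream.
    """
    if not use_cuda:
        # CPU fallback for illustration only; the real path requires CUDA.
        out = torch.empty_like(host_in)
        execute([host_in.data_ptr(), out.data_ptr()], 0)
        out.copy_(host_in)
        return out
    stream = torch.cuda.Stream()
    dev_in = torch.empty_like(host_in, device="cuda")
    dev_out = torch.empty_like(host_in, device="cuda")
    host_out = torch.empty_like(host_in).pin_memory()
    with torch.cuda.stream(stream):
        dev_in.copy_(host_in, non_blocking=True)                    # H2D, async
        execute([dev_in.data_ptr(), dev_out.data_ptr()],            # TRT kernels
                stream.cuda_stream)                                 # integer handle
        host_out.copy_(dev_out, non_blocking=True)                  # D2H, async
    stream.synchronize()
    return host_out
```

Because the tensors own their memory, there is no explicit `cudaFree`/`cudaFreeHost`: dropping the last reference is enough, which is what the simplified `_free_buffers` relies on.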