
[ONNX][Autotune] Replace CUDA memory management from CUDART to PyTorch #998

Open
gcunhase wants to merge 5 commits into NVIDIA:main from gcunhase:dev/gcunhasergio/autotune_remove_cudart

Conversation


@gcunhase gcunhase commented Mar 6, 2026

What does this PR do?

Type of change: Bug fix

Overview: Replace CUDART-based CUDA memory management with PyTorch's higher-level CUDA API.

Usage

# Add a code snippet demonstrating how to use this

Testing

  1. Added unittests.
  2. Tested that this PR does not break [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune #951 or [5951713] Fix benchmark allocation failure #978

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, using torch.load(..., weights_only=True), avoiding pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other source, did you follow IP policy in CONTRIBUTING.md?: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A

Additional Information

Summary of changes in benchmark.py — TensorRTPyBenchmark:

| What changed | Before | After |
| --- | --- | --- |
| Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) |
| Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` |
| `__init__` guard | checks `CUDART_AVAILABLE` or `cudart is None` | checks `TORCH_CUDA_AVAILABLE` |
| `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` |
| `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()` — PyTorch GC handles deallocation |
| `_allocate_buffers` | raw `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address |
| `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` |
| `_run_timing` | same cudart pattern | same torch pattern |
| `run` — stream lifecycle | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` |
| `run` — stream arg to TRT | raw integer handle | `stream.cuda_stream` (integer property) |
| Error handling | `cudaError_t` return codes | PyTorch raises `RuntimeError`, caught by existing `except Exception` |
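The allocation and stream rows above can be sketched in isolation. This is a rough illustration of the new pattern, not the PR's exact code: the helper names follow the summary table, but the fallback to pageable memory on CPU-only machines and the buffer-dict layout are assumptions for illustration.

```python
import numpy as np
import torch


def alloc_pinned_host(size: int, dtype: np.dtype):
    """Replaces cudaMallocHost + ctypes: returns (tensor, numpy_view)."""
    # Derive the torch dtype from the numpy dtype via an empty array.
    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
    host = torch.empty(int(size), dtype=torch_dtype)
    if torch.cuda.is_available():
        host = host.pin_memory()  # page-locked memory for fast async H2D copies
    return host, host.numpy()  # the numpy view shares the tensor's storage


def run_once(context, inputs, outputs):
    """Replaces cudaStreamCreate / cudaMemcpyAsync / cudaStreamSynchronize.

    Requires a CUDA device; `inputs`/`outputs` are lists of
    {"host": tensor, "device": tensor} dicts as described in the summary.
    """
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        for buf in inputs:
            buf["device"].copy_(buf["host"], non_blocking=True)
        # TensorRT takes the raw stream handle; tensor addresses come
        # from data_ptr() instead of raw device_ptr integers.
        context.execute_async_v3(stream_handle=stream.cuda_stream)
        for buf in outputs:
            buf["host"].copy_(buf["device"], non_blocking=True)
    stream.synchronize()
```

Because the tensors own their memory, dropping the last reference replaces the explicit `cudaFreeHost`/`cudaFree` calls.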

Related to #961

Summary by CodeRabbit

  • Refactor
    • TensorRT benchmarking migrated from direct CUDA runtime calls to PyTorch CUDA tensors, pinned memory, and CUDA stream primitives — simplifying buffer management, transfers, and timing semantics.
  • Tests
    • Expanded GPU autotune benchmark tests with broader unit and integration coverage for CUDA/TensorRT paths, pinned-host/device buffering, stream behavior, warmup/timing, and end-to-end latency scenarios.

@gcunhase gcunhase requested a review from a team as a code owner March 6, 2026 19:05

coderabbitai bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaced cudart usage in the TensorRT Python benchmark with PyTorch CUDA primitives: pinned host tensors, CUDA device tensors, and torch.cuda.Stream. Buffer allocation, host↔device transfers, stream handling, and CUDA availability checks now use torch.cuda and tensor.data_ptr() for TensorRT execution.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **TensorRT Python benchmark**<br>`modelopt/onnx/quantization/autotune/benchmark.py` | Converted `TensorRTPyBenchmark` from cudart to PyTorch CUDA: `_alloc_pinned_host` now returns a pinned torch tensor and numpy view; `_allocate_buffers` returns lists of dicts `{"host":..., "device":..., "name":...}` using torch tensors; `_free_buffers` clears refs; stream params use `torch.cuda.Stream`; execution uses `context.execute_async_v3` with `.data_ptr()` and non-blocking tensor copies. Removed cudart availability and error-code handling; updated signatures and docs. |
| **Autotune benchmark tests**<br>`tests/gpu/onnx/quantization/autotune/test_benchmark.py` | Expanded and refactored tests: added fixtures and helpers (`trtexec_bench`, `_make_bench`, `_make_buffers`) to mock TensorRT/CUDA flows and produce PyTorch-pinned buffers; added unit and integration tests for shape validation, pinned allocation, buffer free, CUDA stream warmup/timing, and trtexec subprocess behaviors; adjusted guards to rely on torch.cuda availability. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Host as Host (CPU)
    participant PyTorch as PyTorch CUDA
    participant TRT as TensorRT ExecutionContext
    participant GPU as GPU (CUDA)

    Host->>PyTorch: allocate pinned host tensor (pin_memory=True)
    PyTorch->>GPU: create device tensor (torch.empty(..., device='cuda'))
    Host->>PyTorch: non-blocking copy host -> device (tensor.to(device, non_blocking=True))
    PyTorch->>TRT: execute_async_v3(binding_ptrs using data_ptr(), stream=stream)
    TRT->>GPU: run kernels
    GPU->>PyTorch: non-blocking copy device -> host (tensor.copy_(..., non_blocking=True))
    PyTorch->>Host: synchronize stream and return results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | PR introduces three `# nosec` comments in benchmark.py to bypass Bandit security checks, violating SECURITY.md guidelines that explicitly prohibit this practice. | Remove all `# nosec` comments and follow the required approval process: add inline comments explaining security-sensitive patterns, add justification in the PR description, and request @NVIDIA/modelopt-setup-codeowners review. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.86%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and clearly describes the main change: replacing CUDA memory management from CUDART to PyTorch in the ONNX autotune module, which is the primary focus of the changeset. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
modelopt/onnx/quantization/autotune/benchmark.py (2)

567-569: Duplicate dtype conversion pattern.

The torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype pattern is duplicated here and in _alloc_pinned_host. Consider extracting to a helper function or using the mapping suggested earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 567 - 569, The
duplicated dtype conversion torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
used here and inside _alloc_pinned_host should be replaced with a shared helper
(e.g., a function like np_to_torch_dtype) or the existing dtype mapping to
centralize numpy->torch dtype conversion; add that helper (or mapping) and call
it from both the _alloc_pinned_host implementation and the benchmark allocation
site to return torch_dtype for a given np_dtype, then use the returned
torch_dtype when creating device_tensor.

535-537: Consider using torch.as_tensor or a dtype mapping for cleaner conversion.

The current approach creates temporary numpy arrays just to derive the torch dtype. A direct mapping or torch.as_tensor would be more efficient, though the current approach is functionally correct.

♻️ Optional: Cleaner dtype conversion
+# At module level or as class attribute
+_NP_TO_TORCH_DTYPE = {
+    np.float32: torch.float32,
+    np.float16: torch.float16,
+    np.int32: torch.int32,
+    np.int64: torch.int64,
+    np.int8: torch.int8,
+    np.uint8: torch.uint8,
+    np.bool_: torch.bool,
+}

 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view)."""
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = _NP_TO_TORCH_DTYPE.get(dtype, torch.from_numpy(np.empty(0, dtype=dtype)).dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 535 - 537,
Replace the temporary numpy-to-torch conversion by mapping the incoming numpy
dtype to a torch dtype (or use torch.as_tensor) to avoid creating empty numpy
arrays; specifically, replace the expression that computes torch_dtype
(currently torch.from_numpy(np.empty(0, dtype=dtype)).dtype) with a direct
mapping like NUMPY_TO_TORCH_DTYPE[dtype] or obtain torch dtype via
torch.as_tensor(...).dtype, then create host_tensor = torch.empty(int(size),
dtype=torch_dtype).pin_memory() and return host_tensor, host_tensor.numpy();
update or add a small helper (e.g., NUMPY_TO_TORCH_DTYPE or
get_torch_dtype_from_numpy) to centralize dtype mapping and reference the
variables torch_dtype, host_tensor, dtype, and size so the change is local and
clear.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eda3a9f4-ff6f-45c7-959d-282a05441deb

📥 Commits

Reviewing files that changed from the base of the PR and between 37d3f10 and 6dad93b.

📒 Files selected for processing (1)
  • modelopt/onnx/quantization/autotune/benchmark.py

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 6.52174% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.20%. Comparing base (be6dfad) to head (d05634a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/onnx/quantization/autotune/benchmark.py | 6.52% | 43 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #998      +/-   ##
==========================================
+ Coverage   72.08%   72.20%   +0.11%     
==========================================
  Files         209      209              
  Lines       23638    23599      -39     
==========================================
  Hits        17040    17040              
+ Misses       6598     6559      -39     

☔ View full report in Codecov by Sentry.


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/gpu/onnx/quantization/autotune/test_benchmark.py (1)

356-423: ⚠️ Potential issue | 🟠 Major

Move these end-to-end tests to tests/examples to keep GPU unit suite fast.

TestTensorRTPyBenchmark and TestTrtExecBenchmark (lines 356-423) execute real TensorRT engine builds and model inference with no mocking. These operations are inherently slow and will exceed the "few seconds" expectation for tests/gpu unit tests. Per coding guidelines, integration tests belong in tests/examples, while tests/gpu should contain fast GPU unit tests.
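The relocation suggested above is usually paired with a pytest marker so the slow tests stay discoverable. A sketch follows; the `slow` marker name and the registration snippet are assumptions, not conventions confirmed by this repo.

```python
import pytest

# Hypothetical marker for the relocated end-to-end tests. To avoid
# "unknown marker" warnings it would need registering, e.g. in pyproject.toml:
#   [tool.pytest.ini_options]
#   markers = ["slow: long-running TensorRT integration tests"]
pytestmark = pytest.mark.slow  # module-level: applies to every test below


class TestTensorRTPyBenchmark:
    def test_end_to_end_latency(self):
        ...  # real engine build + inference, excluded from the fast GPU suite
```

The fast suite can then run with `pytest -m "not slow"` while CI for integration tests selects `-m slow`.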

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/onnx/quantization/autotune/test_benchmark.py` around lines 356 -
423, The end-to-end integration tests implemented by TestTensorRTPyBenchmark and
TestTrtExecBenchmark should be moved out of the fast GPU unit suite into the
integration/examples test folder; relocate the two classes (and their fixtures)
into tests/examples, update any imports and relative fixtures they rely on, and
add an appropriate pytest marker (e.g., `@pytest.mark.integration` or
`@pytest.mark.slow`) if you want them discoverable separately; ensure the original
file removes these classes and that any temp-file/cache cleanup and autouse
fixtures (like _require_tensorrt and _require_trtexec) are preserved or adapted
in the new location.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ffe06eac-a45d-4cf3-b3e0-dab2d17c5f98

📥 Commits

Reviewing files that changed from the base of the PR and between 6dad93b and e5748b1.

📒 Files selected for processing (1)
  • tests/gpu/onnx/quantization/autotune/test_benchmark.py


Copilot AI left a comment


Pull request overview

Refactors the ONNX autotune benchmarking utilities to use PyTorch’s CUDA APIs (pinned host buffers, device buffers, streams, and async copies) instead of cuda-python/CUDART, and expands GPU autotune benchmark test coverage accordingly.

Changes:

  • Replace CUDART-based host/device allocation and stream/memcpy logic in TensorRTPyBenchmark with PyTorch CUDA tensors/streams.
  • Simplify buffer cleanup to reference release (PyTorch-managed deallocation).
  • Add/expand unit, mocked, and integration tests for latency parsing, shape validation, buffer logic, and CUDA stream integration points.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| modelopt/onnx/quantization/autotune/benchmark.py | Swaps CUDART memory/stream management for PyTorch CUDA tensors/streams and updates buffer allocation/copy paths. |
| tests/gpu/onnx/quantization/autotune/test_benchmark.py | Adds targeted unit tests and mocked/integration coverage for the updated benchmark behaviors. |


@gcunhase gcunhase requested a review from cjluo-nv March 6, 2026 20:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)

530-531: Consider extracting dtype conversion to a helper.

The pattern torch.from_numpy(np.empty(0, dtype=dtype)).dtype is repeated here and at line 563. While functional, this could be cleaner with a small helper or direct use of a dtype mapping.

♻️ Optional: Extract dtype conversion helper
+    @staticmethod
+    def _numpy_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+        """Convert numpy dtype to torch dtype."""
+        return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
     @staticmethod
     def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
         """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

         Returns:
             (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
         """
-        torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+        torch_dtype = TensorRTPyBenchmark._numpy_to_torch_dtype(dtype)
         host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
         return host_tensor, host_tensor.numpy()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 530 - 531, The
dtype conversion expression torch.from_numpy(np.empty(0, dtype=dtype)).dtype is
duplicated; extract it into a small helper (e.g., numpy_dtype_to_torch or a
dtype_map) and replace both uses (the torch_dtype assignment near host_tensor
creation and the other occurrence around line 563) with a call to that helper to
improve readability and avoid repetition; update any callers to use the helper
when building torch tensors (e.g., where torch_dtype and host_tensor are
created) and ensure the helper returns a torch.dtype for direct use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 534-537: The _free_buffers method is ineffective when invoked as
self._free_buffers(inputs + outputs) because inputs + outputs creates a
temporary list and bufs.clear() only clears that temp; either remove the unused
_free_buffers method entirely and rely on the existing del inputs, del outputs
cleanup, or change the call to clear each real list explicitly (e.g., call
self._free_buffers(inputs) and self._free_buffers(outputs) or replace
_free_buffers with explicit inputs.clear(); outputs.clear()) while keeping del
inputs/del outputs as appropriate; update references to the _free_buffers
function accordingly.
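The inline finding above reduces to a plain-Python pitfall, reproduced here with hypothetical buffer dicts standing in for the real host/device tensors:

```python
def free_buffers(bufs: list) -> None:
    """Mirror of the PR's _free_buffers: drops the references held by `bufs`."""
    bufs.clear()  # clears only this particular list object


# Stand-ins for the real {"host":..., "device":..., "name":...} buffer dicts.
inputs = [{"name": "input0"}]
outputs = [{"name": "output0"}]

free_buffers(inputs + outputs)  # `+` builds a temporary list; only it is cleared
assert inputs and outputs       # the original lists still hold their buffers

free_buffers(inputs)            # clearing each real list does release them
free_buffers(outputs)
assert not inputs and not outputs
```

This is why the suggested fix is either to call the helper on each real list or to drop the helper and rely on `del inputs, del outputs`.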


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23b00fbe-661b-4a3d-b4c7-43546e4ac679

📥 Commits

Reviewing files that changed from the base of the PR and between e5748b1 and 62bd6d7.

📒 Files selected for processing (2)
  • modelopt/onnx/quantization/autotune/benchmark.py
  • tests/gpu/onnx/quantization/autotune/test_benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/gpu/onnx/quantization/autotune/test_benchmark.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)

524-532: Consider extracting dtype conversion to avoid duplication.

The pattern torch.from_numpy(np.empty(0, dtype=dtype)).dtype appears at lines 530 and 563. While functional, this could be slightly cleaner with a small helper.

♻️ Suggested helper function
+    @staticmethod
+    def _np_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+        """Convert numpy dtype to torch dtype."""
+        return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
     @staticmethod
     def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
         """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

         Returns:
             (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
         """
-        torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+        torch_dtype = TensorRTPyBenchmark._np_to_torch_dtype(dtype)
         host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
         return host_tensor, host_tensor.numpy()

Then at line 563:

-            torch_dtype = torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+            torch_dtype = self._np_to_torch_dtype(np_dtype)

Also applies to: 563-564

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 524 - 532,
Extract the repeated numpy-to-torch dtype conversion into a small helper (e.g.,
a function named numpy_to_torch_dtype or _np_to_torch_dtype) and use it wherever
`torch.from_numpy(np.empty(0, dtype=dtype)).dtype` is currently used (notably in
the `_alloc_pinned_host` function and the other occurrence around the later
allocation code). Update `_alloc_pinned_host` to call this helper to obtain
`torch_dtype` and replace the duplicated expression at the second site with the
same helper to remove duplication and improve clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5ff02876-5aa1-4eb3-b47d-97ad899bbaa3

📥 Commits

Reviewing files that changed from the base of the PR and between 62bd6d7 and 343c8a5.

📒 Files selected for processing (1)
  • modelopt/onnx/quantization/autotune/benchmark.py

gcunhase added 5 commits March 6, 2026 15:56
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
@gcunhase gcunhase force-pushed the dev/gcunhasergio/autotune_remove_cudart branch from 343c8a5 to d05634a Compare March 6, 2026 20:56

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)

562-564: Consider extracting dtype conversion to avoid duplication.

The torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype pattern appears both here and in _alloc_pinned_host. You could extract this to a small helper or have _alloc_pinned_host return the torch dtype as well.

♻️ Optional: Extract dtype conversion helper
+    @staticmethod
+    def _np_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+        """Convert numpy dtype to torch dtype."""
+        return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
     @staticmethod
     def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
         """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

         Returns:
             (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
         """
-        torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+        torch_dtype = TensorRTPyBenchmark._np_to_torch_dtype(dtype)
         host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
         return host_tensor, host_tensor.numpy()

Then in _allocate_buffers:

             host_tensor, host_mem = self._alloc_pinned_host(size, np_dtype)
-            torch_dtype = torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+            torch_dtype = self._np_to_torch_dtype(np_dtype)
             device_tensor = torch.empty(size, dtype=torch_dtype, device="cuda")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 562 - 564, The
dtype conversion torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype is
duplicated; extract it into a small helper (e.g., _to_torch_dtype(np_dtype)) or
change _alloc_pinned_host to also return the torch dtype so callers like
_allocate_buffers can reuse it instead of recomputing; update usages in
_allocate_buffers and _alloc_pinned_host to call the new helper or use the
returned torch dtype to eliminate duplication and keep semantics identical.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4ef052b-3932-474d-8665-8b53ece51682

📥 Commits

Reviewing files that changed from the base of the PR and between 343c8a5 and d05634a.

📒 Files selected for processing (2)
  • modelopt/onnx/quantization/autotune/benchmark.py
  • tests/gpu/onnx/quantization/autotune/test_benchmark.py
