[ONNX][Autotune] Replace CUDA memory management from CUDART to PyTorch#998
gcunhase wants to merge 5 commits into NVIDIA:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough

Replaced cudart usage in the TensorRT Python benchmark with PyTorch CUDA primitives: pinned host tensors, CUDA device tensors, and torch.cuda.Stream. Buffer allocation, host↔device transfers, stream handling, and CUDA availability checks now use torch.cuda and tensor.data_ptr() for TensorRT execution.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Host as Host (CPU)
    participant PyTorch as PyTorch CUDA
    participant TRT as TensorRT ExecutionContext
    participant GPU as GPU (CUDA)
    Host->>PyTorch: allocate pinned host tensor (pin_memory=True)
    PyTorch->>GPU: create device tensor (torch.empty(..., device='cuda'))
    Host->>PyTorch: non-blocking copy host -> device (tensor.to(device, non_blocking=True))
    PyTorch->>TRT: execute_async_v3(binding_ptrs using data_ptr(), stream=stream)
    TRT->>GPU: run kernels
    GPU->>PyTorch: non-blocking copy device -> host (tensor.copy_(..., non_blocking=True))
    PyTorch->>Host: synchronize stream and return results
```
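The flow in the diagram can be sketched with PyTorch alone. This is a minimal illustration, not the PR's actual implementation: the TensorRT call is replaced by a comment, the function name is made up, and a CPU fallback is included so the sketch runs without a GPU.

```python
import torch

def pinned_roundtrip(n: int = 4) -> torch.Tensor:
    """Sketch of the diagram's buffer flow; falls back to CPU when CUDA is absent."""
    host_in = torch.arange(n, dtype=torch.float32)
    if not torch.cuda.is_available():
        return host_in.clone()  # illustration only; the real path needs CUDA
    host_in = host_in.pin_memory()                       # pinned host tensor
    host_out = torch.empty(n, dtype=torch.float32).pin_memory()
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        dev = host_in.to("cuda", non_blocking=True)      # host -> device, async
        # TensorRT would run here: execute_async_v3 with dev.data_ptr()
        host_out.copy_(dev, non_blocking=True)           # device -> host, async
    stream.synchronize()                                 # wait before reading host_out
    return host_out
```

Pinned (page-locked) host memory is what makes the `non_blocking=True` copies truly asynchronous; with pageable memory they silently degrade to synchronous transfers.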
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 error, 1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (2)
modelopt/onnx/quantization/autotune/benchmark.py (2)
567-569: Duplicate dtype conversion pattern. The `torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype` pattern is duplicated here and in `_alloc_pinned_host`. Consider extracting it to a helper function or using the mapping suggested earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 567 - 569, The duplicated dtype conversion torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype used here and inside _alloc_pinned_host should be replaced with a shared helper (e.g., a function like np_to_torch_dtype) or the existing dtype mapping to centralize numpy->torch dtype conversion; add that helper (or mapping) and call it from both the _alloc_pinned_host implementation and the benchmark allocation site to return torch_dtype for a given np_dtype, then use the returned torch_dtype when creating device_tensor.
535-537: Consider using `torch.as_tensor` or a dtype mapping for cleaner conversion. The current approach creates temporary numpy arrays just to derive the torch dtype. A direct mapping or `torch.as_tensor` would be more efficient, though the current approach is functionally correct.

♻️ Optional: Cleaner dtype conversion

```diff
+# At module level or as class attribute
+_NP_TO_TORCH_DTYPE = {
+    np.float32: torch.float32,
+    np.float16: torch.float16,
+    np.int32: torch.int32,
+    np.int64: torch.int64,
+    np.int8: torch.int8,
+    np.uint8: torch.uint8,
+    np.bool_: torch.bool,
+}

 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view)."""
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = _NP_TO_TORCH_DTYPE.get(dtype, torch.from_numpy(np.empty(0, dtype=dtype)).dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 535 - 537, Replace the temporary numpy-to-torch conversion by mapping the incoming numpy dtype to a torch dtype (or use torch.as_tensor) to avoid creating empty numpy arrays; specifically, replace the expression that computes torch_dtype (currently torch.from_numpy(np.empty(0, dtype=dtype)).dtype) with a direct mapping like NUMPY_TO_TORCH_DTYPE[dtype] or obtain torch dtype via torch.as_tensor(...).dtype, then create host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory() and return host_tensor, host_tensor.numpy(); update or add a small helper (e.g., NUMPY_TO_TORCH_DTYPE or get_torch_dtype_from_numpy) to centralize dtype mapping and reference the variables torch_dtype, host_tensor, dtype, and size so the change is local and clear.
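The mapping the reviewer suggests can be written as a small standalone helper. This is an illustrative sketch, not the repository's code: the helper name and the dict are assumptions, and it keeps the review's empty-array trick only as a fallback for dtypes outside the map.

```python
import numpy as np
import torch

# Hypothetical consolidation of the duplicated conversion; keys are normalized
# np.dtype objects so both np.float32 and np.dtype("float32") hit the map.
_NP_TO_TORCH = {
    np.dtype(np.float32): torch.float32,
    np.dtype(np.float16): torch.float16,
    np.dtype(np.int32): torch.int32,
    np.dtype(np.int64): torch.int64,
    np.dtype(np.int8): torch.int8,
    np.dtype(np.uint8): torch.uint8,
    np.dtype(np.bool_): torch.bool,
}

def np_to_torch_dtype(np_dtype) -> torch.dtype:
    """Map a numpy dtype to a torch dtype without allocating a temporary array."""
    key = np.dtype(np_dtype)
    # Fall back to the original empty-array trick for unmapped dtypes.
    return _NP_TO_TORCH.get(key, torch.from_numpy(np.empty(0, dtype=key)).dtype)
```

Both `_alloc_pinned_host` and the device-tensor allocation site could then call `np_to_torch_dtype(dtype)` instead of repeating the conversion.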
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 567-569: The duplicated dtype conversion
torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype used here and inside
_alloc_pinned_host should be replaced with a shared helper (e.g., a function
like np_to_torch_dtype) or the existing dtype mapping to centralize numpy->torch
dtype conversion; add that helper (or mapping) and call it from both the
_alloc_pinned_host implementation and the benchmark allocation site to return
torch_dtype for a given np_dtype, then use the returned torch_dtype when
creating device_tensor.
- Around line 535-537: Replace the temporary numpy-to-torch conversion by
mapping the incoming numpy dtype to a torch dtype (or use torch.as_tensor) to
avoid creating empty numpy arrays; specifically, replace the expression that
computes torch_dtype (currently torch.from_numpy(np.empty(0,
dtype=dtype)).dtype) with a direct mapping like NUMPY_TO_TORCH_DTYPE[dtype] or
obtain torch dtype via torch.as_tensor(...).dtype, then create host_tensor =
torch.empty(int(size), dtype=torch_dtype).pin_memory() and return host_tensor,
host_tensor.numpy(); update or add a small helper (e.g., NUMPY_TO_TORCH_DTYPE or
get_torch_dtype_from_numpy) to centralize dtype mapping and reference the
variables torch_dtype, host_tensor, dtype, and size so the change is local and
clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eda3a9f4-ff6f-45c7-959d-282a05441deb
📒 Files selected for processing (1)
modelopt/onnx/quantization/autotune/benchmark.py
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #998      +/-   ##
==========================================
+ Coverage   72.08%   72.20%   +0.11%
==========================================
  Files         209      209
  Lines       23638    23599      -39
==========================================
  Hits        17040    17040
+ Misses       6598     6559      -39
```

☔ View full report in Codecov by Sentry.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/gpu/onnx/quantization/autotune/test_benchmark.py (1)
356-423: ⚠️ Potential issue | 🟠 Major

Move these end-to-end tests to `tests/examples` to keep the GPU unit suite fast. TestTensorRTPyBenchmark and TestTrtExecBenchmark (lines 356-423) execute real TensorRT engine builds and model inference with no mocking. These operations are inherently slow and will exceed the "few seconds" expectation for `tests/gpu` unit tests. Per coding guidelines, integration tests belong in `tests/examples`, while `tests/gpu` should contain fast GPU unit tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/onnx/quantization/autotune/test_benchmark.py` around lines 356 - 423, The end-to-end integration tests implemented by TestTensorRTPyBenchmark and TestTrtExecBenchmark should be moved out of the fast GPU unit suite into the integration/examples test folder; relocate the two classes (and their fixtures) into tests/examples, update any imports and relative fixtures they rely on, and add an appropriate pytest marker (e.g., `@pytest.mark.integration` or `@pytest.mark.slow`) if you want them discoverable separately; ensure the original file removes these classes and that any temp-file/cache cleanup and autouse fixtures (like _require_tensorrt and _require_trtexec) are preserved or adapted in the new location.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/gpu/onnx/quantization/autotune/test_benchmark.py`:
- Around line 356-423: The end-to-end integration tests implemented by
TestTensorRTPyBenchmark and TestTrtExecBenchmark should be moved out of the fast
GPU unit suite into the integration/examples test folder; relocate the two
classes (and their fixtures) into tests/examples, update any imports and
relative fixtures they rely on, and add an appropriate pytest marker (e.g.,
`@pytest.mark.integration` or `@pytest.mark.slow`) if you want them discoverable
separately; ensure the original file removes these classes and that any
temp-file/cache cleanup and autouse fixtures (like _require_tensorrt and
_require_trtexec) are preserved or adapted in the new location.
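The marker suggested in the relocation advice above looks like this in practice. The class and test bodies are placeholder sketches, not the repository's actual tests; note that a custom `slow` marker should also be registered in `pytest.ini`/`pyproject.toml` to avoid an unknown-mark warning.

```python
import pytest

# Hypothetical relocated test class: only the marker mechanics are shown.
@pytest.mark.slow
class TestTensorRTPyBenchmark:
    def test_engine_roundtrip(self):
        pass  # real engine build + inference would live here
```

With the marker in place, the fast suite can deselect these tests via `pytest -m "not slow"`, while CI integration runs select them with `pytest -m slow`.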
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ffe06eac-a45d-4cf3-b3e0-dab2d17c5f98
📒 Files selected for processing (1)
tests/gpu/onnx/quantization/autotune/test_benchmark.py
Pull request overview
Refactors the ONNX autotune benchmarking utilities to use PyTorch’s CUDA APIs (pinned host buffers, device buffers, streams, and async copies) instead of cuda-python/CUDART, and expands GPU autotune benchmark test coverage accordingly.
Changes:
- Replace CUDART-based host/device allocation and stream/memcpy logic in `TensorRTPyBenchmark` with PyTorch CUDA tensors/streams.
- Simplify buffer cleanup to reference release (PyTorch-managed deallocation).
- Add/expand unit, mocked, and integration tests for latency parsing, shape validation, buffer logic, and CUDA stream integration points.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| modelopt/onnx/quantization/autotune/benchmark.py | Swaps CUDART memory/stream management for PyTorch CUDA tensors/streams and updates buffer allocation/copy paths. |
| tests/gpu/onnx/quantization/autotune/test_benchmark.py | Adds targeted unit tests and mocked/integration coverage for the updated benchmark behaviors. |
Actionable comments posted: 1
🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
530-531: Consider extracting dtype conversion to a helper. The pattern `torch.from_numpy(np.empty(0, dtype=dtype)).dtype` is repeated here and at line 563. While functional, this could be cleaner with a small helper or direct use of a dtype mapping.

♻️ Optional: Extract dtype conversion helper

```diff
+@staticmethod
+def _numpy_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+    """Convert numpy dtype to torch dtype."""
+    return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

     Returns:
         (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
     """
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = TensorRTPyBenchmark._numpy_to_torch_dtype(dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 530 - 531, The dtype conversion expression torch.from_numpy(np.empty(0, dtype=dtype)).dtype is duplicated; extract it into a small helper (e.g., numpy_dtype_to_torch or a dtype_map) and replace both uses (the torch_dtype assignment near host_tensor creation and the other occurrence around line 563) with a call to that helper to improve readability and avoid repetition; update any callers to use the helper when building torch tensors (e.g., where torch_dtype and host_tensor are created) and ensure the helper returns a torch.dtype for direct use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 534-537: The _free_buffers method is ineffective when invoked as
self._free_buffers(inputs + outputs) because inputs + outputs creates a
temporary list and bufs.clear() only clears that temp; either remove the unused
_free_buffers method entirely and rely on the existing del inputs, del outputs
cleanup, or change the call to clear each real list explicitly (e.g., call
self._free_buffers(inputs) and self._free_buffers(outputs) or replace
_free_buffers with explicit inputs.clear(); outputs.clear()) while keeping del
inputs/del outputs as appropriate; update references to the _free_buffers
function accordingly.
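The bug described in the inline comment above can be demonstrated with plain Python lists, no tensors required: `list.clear()` on the temporary produced by `inputs + outputs` never touches the real lists, so the buffer references survive.

```python
def free_buffers(bufs: list) -> None:
    """Clear a buffer list so the tensors it holds can be garbage-collected."""
    bufs.clear()

inputs, outputs = ["in_buf"], ["out_buf"]

# Bug: `inputs + outputs` builds a temporary list; clearing it leaves the
# originals untouched, so the buffers stay referenced.
free_buffers(inputs + outputs)
assert inputs == ["in_buf"] and outputs == ["out_buf"]

# Fix: clear each real list explicitly (or just `del inputs, outputs`).
free_buffers(inputs)
free_buffers(outputs)
assert inputs == [] and outputs == []
```

This is why the review suggests either calling the helper on each real list or dropping `_free_buffers` in favor of the existing `del` cleanup.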
---
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 530-531: The dtype conversion expression
torch.from_numpy(np.empty(0, dtype=dtype)).dtype is duplicated; extract it into
a small helper (e.g., numpy_dtype_to_torch or a dtype_map) and replace both uses
(the torch_dtype assignment near host_tensor creation and the other occurrence
around line 563) with a call to that helper to improve readability and avoid
repetition; update any callers to use the helper when building torch tensors
(e.g., where torch_dtype and host_tensor are created) and ensure the helper
returns a torch.dtype for direct use.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 23b00fbe-661b-4a3d-b4c7-43546e4ac679
📒 Files selected for processing (2)
modelopt/onnx/quantization/autotune/benchmark.py
tests/gpu/onnx/quantization/autotune/test_benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/gpu/onnx/quantization/autotune/test_benchmark.py
🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
524-532: Consider extracting dtype conversion to avoid duplication. The pattern `torch.from_numpy(np.empty(0, dtype=dtype)).dtype` appears at lines 530 and 563. While functional, this could be slightly cleaner with a small helper.

♻️ Suggested helper function

```diff
+@staticmethod
+def _np_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+    """Convert numpy dtype to torch dtype."""
+    return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

     Returns:
         (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
     """
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = TensorRTPyBenchmark._np_to_torch_dtype(dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

Then at line 563:

```diff
-torch_dtype = torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+torch_dtype = self._np_to_torch_dtype(np_dtype)
```

Also applies to: 563-564
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 524 - 532, Extract the repeated numpy-to-torch dtype conversion into a small helper (e.g., a function named numpy_to_torch_dtype or _np_to_torch_dtype) and use it wherever `torch.from_numpy(np.empty(0, dtype=dtype)).dtype` is currently used (notably in the `_alloc_pinned_host` function and the other occurrence around the later allocation code). Update `_alloc_pinned_host` to call this helper to obtain `torch_dtype` and replace the duplicated expression at the second site with the same helper to remove duplication and improve clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 524-532: Extract the repeated numpy-to-torch dtype conversion into
a small helper (e.g., a function named numpy_to_torch_dtype or
_np_to_torch_dtype) and use it wherever `torch.from_numpy(np.empty(0,
dtype=dtype)).dtype` is currently used (notably in the `_alloc_pinned_host`
function and the other occurrence around the later allocation code). Update
`_alloc_pinned_host` to call this helper to obtain `torch_dtype` and replace the
duplicated expression at the second site with the same helper to remove
duplication and improve clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5ff02876-5aa1-4eb3-b47d-97ad899bbaa3
📒 Files selected for processing (1)
modelopt/onnx/quantization/autotune/benchmark.py
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Force-pushed: 343c8a5 to d05634a (Compare)
🧹 Nitpick comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
562-564: Consider extracting dtype conversion to avoid duplication. The `torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype` pattern appears both here and in `_alloc_pinned_host`. You could extract this to a small helper or have `_alloc_pinned_host` return the torch dtype as well.

♻️ Optional: Extract dtype conversion helper

```diff
+@staticmethod
+def _np_to_torch_dtype(np_dtype: np.dtype) -> torch.dtype:
+    """Convert numpy dtype to torch dtype."""
+    return torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+
 @staticmethod
 def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
     """Allocate pinned host memory using PyTorch and return (tensor, numpy_view).

     Returns:
         (host_tensor, arr): Pinned PyTorch tensor and a numpy view over it.
     """
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=dtype)).dtype
+    torch_dtype = TensorRTPyBenchmark._np_to_torch_dtype(dtype)
     host_tensor = torch.empty(int(size), dtype=torch_dtype).pin_memory()
     return host_tensor, host_tensor.numpy()
```

Then in `_allocate_buffers`:

```diff
     host_tensor, host_mem = self._alloc_pinned_host(size, np_dtype)
-    torch_dtype = torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype
+    torch_dtype = self._np_to_torch_dtype(np_dtype)
     device_tensor = torch.empty(size, dtype=torch_dtype, device="cuda")
```
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 562 - 564, The dtype conversion torch.from_numpy(np.empty(0, dtype=np_dtype)).dtype is duplicated; extract it into a small helper (e.g., _to_torch_dtype(np_dtype)) or change _alloc_pinned_host to also return the torch dtype so callers like _allocate_buffers can reuse it instead of recomputing; update usages in _allocate_buffers and _alloc_pinned_host to call the new helper or use the returned torch dtype to eliminate duplication and keep semantics identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 562-564: The dtype conversion torch.from_numpy(np.empty(0,
dtype=np_dtype)).dtype is duplicated; extract it into a small helper (e.g.,
_to_torch_dtype(np_dtype)) or change _alloc_pinned_host to also return the torch
dtype so callers like _allocate_buffers can reuse it instead of recomputing;
update usages in _allocate_buffers and _alloc_pinned_host to call the new helper
or use the returned torch dtype to eliminate duplication and keep semantics
identical.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e4ef052b-3932-474d-8665-8b53ece51682
📒 Files selected for processing (2)
modelopt/onnx/quantization/autotune/benchmark.py
tests/gpu/onnx/quantization/autotune/test_benchmark.py
What does this PR do?
Type of change: Bug fix
Overview: Replace CUDA memory management from CUDART to PyTorch (higher-level API).
Usage
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary of changes in `benchmark.py` — `TensorRTPyBenchmark`:

| Component | Before (CUDART) | After (PyTorch) |
|---|---|---|
| Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) |
| Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` |
| `__init__` guard | `CUDART_AVAILABLE or cudart is None` | `TORCH_CUDA_AVAILABLE` |
| `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` |
| `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()`; PyTorch GC handles deallocation |
| `_allocate_buffers` | `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address |
| `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` |
| `_run_timing` / `run`: stream lifecycle | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` |
| `run`: stream arg to TRT | | `stream.cuda_stream` (integer property) |
| Error handling | `cudaError_t` return codes | `RuntimeError`, caught by existing `except Exception` |

Related to #961
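The run-path changes summarized above can be sketched end-to-end. This is an illustrative shape, not the PR's code: `execute` stands in for `context.execute_async_v3`, the function name and CPU fallback are assumptions, and engine/binding setup is omitted.

```python
import torch

def run_once(execute, host_in, use_cuda=torch.cuda.is_available()):
    """Sketch of the PyTorch-managed buffer flow around a TRT-style execute call.

    `execute(binding_ptrs, stream_handle)` stands in for execute_async_v3;
    on CUDA it receives data_ptr() addresses and stream.cuda_stream.
    """
    if not use_cuda:
        # CPU fallback for illustration only; the real path requires CUDA.
        out = torch.empty_like(host_in)
        execute([host_in.data_ptr(), out.data_ptr()], 0)
        out.copy_(host_in)
        return out
    stream = torch.cuda.Stream()
    dev_in = torch.empty_like(host_in, device="cuda")
    dev_out = torch.empty_like(host_in, device="cuda")
    host_out = torch.empty_like(host_in).pin_memory()
    with torch.cuda.stream(stream):
        dev_in.copy_(host_in, non_blocking=True)                    # H2D, async
        execute([dev_in.data_ptr(), dev_out.data_ptr()],            # TRT kernels
                stream.cuda_stream)                                 # integer handle
        host_out.copy_(dev_out, non_blocking=True)                  # D2H, async
    stream.synchronize()
    return host_out
```

Because the tensors own their memory, there is no explicit `cudaFree`/`cudaFreeHost`: dropping the last reference is enough, which is what the simplified `_free_buffers` relies on.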