Conversation

@galagam
Contributor

@galagam galagam commented Dec 21, 2025

What does this PR do?

Type of change: Bug fix

Overview:
When clearing type information for shape inference, preserve value_info for outer scope variables in subgraphs. Previously, all value_info entries were cleared indiscriminately, causing shape inference failures when subgraph nodes referenced outer scope variables.

Testing

pytest tests/unit/onnx/autocast/test_precisionconverter.py::test_if_subgraph_outer_scope_type_preservation

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: N/A
  • Did you update Changelog?: No

…during type clearing

When clearing type information for shape inference, preserve value_info
for outer scope variables in subgraphs. Previously, all value_info
entries were cleared indiscriminately, causing shape inference failures
when subgraph nodes referenced outer scope variables.

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
@galagam galagam requested a review from a team as a code owner December 21, 2025 13:56
@galagam galagam requested a review from ajrasane December 21, 2025 13:56
@codecov

codecov bot commented Dec 21, 2025

Codecov Report

❌ Patch coverage is 73.33333% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.68%. Comparing base (cb34335) to head (fd1dd71).

Files with missing lines Patch % Lines
modelopt/onnx/autocast/precisionconverter.py 73.33% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #717      +/-   ##
==========================================
- Coverage   74.69%   74.68%   -0.01%     
==========================================
  Files         192      192              
  Lines       18946    18956      +10     
==========================================
+ Hits        14152    14158       +6     
- Misses       4794     4798       +4     

☔ View full report in Codecov by Sentry.

@galagam
Contributor Author

galagam commented Dec 23, 2025

@ajrasane @gcunhase please help review

@jricker2

I am actually running into an issue that is slightly related. Hopefully I can get some clarification before looking further into creating a reproducer. I found that for a specific model with dynamic inputs, after converting to FP16, the FP16 graph cannot be run on CPU or CUDA. This is not the biggest deal, since we plan to deploy on TensorRT anyway. Looking into it further, the root cause is value_info for some operators being mostly "unk" rather than the real values added after shape inference, and I end up with shape mismatch errors at inference time. Removing value_info from the converted model and re-populating it with shape inference values resolves the issue. This leads me to wonder about the purpose of filling value_info with undefined types and generic dims before running shape inference?

@galagam
Contributor Author

galagam commented Dec 24, 2025

I am actually running into an issue that is slightly related. Hopefully I can get some clarification before looking further into creating a reproducer. I found that for a specific model with dynamic inputs, after converting to FP16, the FP16 graph cannot be run on CPU or CUDA. This is not the biggest deal, since we plan to deploy on TensorRT anyway. Looking into it further, the root cause is value_info for some operators being mostly "unk" rather than the real values added after shape inference, and I end up with shape mismatch errors at inference time. Removing value_info from the converted model and re-populating it with shape inference values resolves the issue. This leads me to wonder about the purpose of filling value_info with undefined types and generic dims before running shape inference?

Hey @jricker2 !

This is not the biggest deal since we plan to deploy on TensorRT anyway

Can you explain this statement? Do you mean deploy to TensorRT with weak typing (AKA --fp16 / kFP16 flag)? AutoCast is meant to prepare the model for deployment in TensorRT with strong typing. Please keep in mind that weak typing was deprecated and is planned to be removed in the next major release.

Looking into it further the root cause is value_info for some operators being mostly "unk", rather than the real values added after shape inference. I end up with shape mismatch errors at inference time.

This is a bug. Would you like to share the model so we can look into it?

This leads me to wonder about the purpose of filling value_info with undefined type and generic dims before running shape inference?

We want to ensure consistent type inference after adding cast nodes. The simplest way to do this is using ONNX's infer_shapes function which also performs type inference. If we leave the existing types, we'll get shape/type mismatches during shape inference.
The unnecessary coupling of type and shape inference is a known issue which should eventually be resolved in ONNX.
You might also be interested in this early draft for a workaround, using a standalone implementation for type inference only. This is still in draft mode, so no guarantees... #719

@jricker2

This is not the biggest deal since we plan to deploy on TensorRT anyway

Can you explain this statement? Do you mean deploy to TensorRT with weak typing (AKA --fp16 / kFP16 flag)? AutoCast is meant to prepare the model for deployment in TensorRT with strong typing. Please keep in mind that weak typing was deprecated and is planned to be removed in the next major release.

Apologies for the confusion here; I worded this poorly. We use the TensorRT EP in ONNX Runtime (which works with the converted model). We will not use the CUDA/CPU EP at any point, so the failure I'm mentioning is not a show stopper by any means. I am working on pushing folks towards strongly typed models via AutoCast.

Looking into it further the root cause is value_info for some operators being mostly "unk", rather than the real values added after shape inference. I end up with shape mismatch errors at inference time.

This is a bug. Would you like to share the model so we can look into it?

Unfortunately I cannot share the full model, I will hopefully have an IP-free reproducer soon and will file an issue with that.

This leads me to wonder about the purpose of filling value_info with undefined type and generic dims before running shape inference?

We want to ensure consistent type inference after adding cast nodes. The simplest way to do this is using ONNX's infer_shapes function which also performs type inference. If we leave the existing types, we'll get shape/type mismatches during shape inference. The unnecessary coupling of type and shape inference is a known issue which should eventually be resolved in ONNX. You might also be interested in this early draft for a workaround, using a standalone implementation for type inference only. This is still in draft mode, so no guarantees... #719

This makes sense. I was wondering about the difference between having value_info pre-populated with undefined types/generic dims (as it currently is) and having it be empty before running infer_shapes to populate it. This made the difference in making the model work for the CPU/CUDA EP; I don't know enough about infer_shapes to know why that is the case. As a note, shape inference passes on the converted model; it is actual inference on the CPU/CUDA EP which fails.

The local shape inference is interesting, if I have some time I will try it out, thanks for the response.
