Switch pytorch_inf from float32 to auto to fix OOM on 16GB machines #10
Merged
jewilder merged 1 commit into microsoft:main on Apr 2, 2026
Conversation
… example and fix OOM on 16GB machines

Two bugs fixed:

1. Wrong parameter name: used 'dtype=' instead of 'torch_dtype=' for from_pretrained(), which was likely silently ignored, causing the model to load in default (float32) precision.
2. Hardcoded dtype logic: float16 for CUDA / float32 for CPU instead of using the model's native bfloat16. Phi-4-mini weights are stored as bfloat16 per config.json.

Fix: Use torch_dtype='auto', which reads the dtype from the model's config.json (bfloat16), matching the official HuggingFace example. This halves CPU memory from ~15.2GB to ~7.6GB and uses the correct precision on all devices.

Changes:
- Fixed parameter name from 'dtype' to 'torch_dtype' in both setup_model() and main()
- Changed value to 'auto' (resolves to bfloat16 from model config)
- Applied to both Windows and macOS inference.py
- Bumped prep_version: Windows 9->10, macOS 5->6
jewilder approved these changes on Apr 2, 2026
Problem
The pytorch_inf `inference.py` script was failing with out-of-memory errors on 16GB devices. Investigation revealed two bugs in how the model was being loaded.

Bug 1: Wrong parameter name (`dtype` vs `torch_dtype`)

The script passed `dtype=` when calling `AutoModelForCausalLM.from_pretrained()`. The correct parameter is `torch_dtype`; the `dtype` kwarg was likely silently ignored via `**kwargs`, causing the model to load in its framework default precision (float32).
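A minimal sketch of the broken load path as described above (the model id and surrounding arguments are illustrative, not the script's exact code):

```python
import torch
from transformers import AutoModelForCausalLM

# Broken: 'dtype' was not the expected parameter name here, so it likely
# fell through to **kwargs unused and the weights loaded as float32.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",  # illustrative model id
    dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
```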
Bug 2: Hardcoded dtype instead of using the model's native precision

Even if the parameter name were correct, the logic hardcoded `float16` for CUDA and `float32` for CPU. Phi-4-mini-instruct's weights are natively stored as bfloat16 (per its config.json: `"torch_dtype": "bfloat16"`).
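You can confirm the checkpoint's native precision without loading any weights; a quick check, assuming the public `microsoft/Phi-4-mini-instruct` model id:

```python
from transformers import AutoConfig

# Fetches and parses config.json only; no weights are loaded into memory.
cfg = AutoConfig.from_pretrained("microsoft/Phi-4-mini-instruct")
print(cfg.torch_dtype)  # torch.bfloat16
```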
Fix

Replaced with `torch_dtype="auto"`, which reads the dtype directly from the model's `config.json`. This matches the official HuggingFace example:
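A sketch of the corrected call (same illustrative model id as above):

```python
from transformers import AutoModelForCausalLM

# Fixed: "auto" resolves to the dtype recorded in the model's
# config.json (bfloat16 for Phi-4-mini) on every device.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype="auto",
)
```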
Why `torch_dtype="auto"` is the right approach

| Option | Result |
| --- | --- |
| `dtype=torch.float32` (old, broken) | Parameter name likely ignored; model loads in float32 (~15.2GB on CPU) |
| `torch_dtype=torch.bfloat16` (hardcoded) | Correct precision, but pins a dtype instead of reading it from the model config |
| `torch_dtype="auto"` | Resolves to bfloat16 from config.json (~7.6GB); matches the official HuggingFace example |
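The memory figures line up with a simple bytes-per-parameter estimate, assuming Phi-4-mini's roughly 3.8B parameters:

```python
params = 3.8e9             # approximate Phi-4-mini-instruct parameter count
print(params * 4 / 1e9)    # float32: 4 bytes/param -> ~15.2 GB
print(params * 2 / 1e9)    # bfloat16: 2 bytes/param -> ~7.6 GB
```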
Changes

- Renamed `dtype` to `torch_dtype` in both the `setup_model()` and `main()` load paths
- Changed the value to `"auto"` (resolves to bfloat16 from Phi-4-mini config.json)
- Applied to both the Windows and macOS `inference.py`
- Bumped `prep_version`: Windows 9→10, macOS 5→6
Files Changed

- scenarios/windows/pytorch_inf/pytorch_inf_resources/inference.py
- scenarios/macos/mac_pytorch_inf/mac_pytorch_inf_resources/inference.py
- scenarios/windows/pytorch_inf/pytorch_inf.py
- scenarios/macos/mac_pytorch_inf/mac_pytorch_inf.py