
Switch pytorch_inf from float32 to auto to fix OOM on 16GB machines #10

Merged

jewilder merged 1 commit into microsoft:main from philnach:fix/pytorch-inf-bfloat16-memory on Apr 2, 2026
Conversation

@philnach (Member) commented Apr 2, 2026

Problem

The pytorch_inf inference.py script was failing with out-of-memory errors on 16GB devices. Investigation revealed two bugs in how the model was being loaded:

Bug 1: Wrong parameter name (dtype vs torch_dtype)

The script used dtype= when calling AutoModelForCausalLM.from_pretrained():

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16 if device == 'cuda' else torch.float32,  # wrong parameter name
    ...
)

The correct parameter is torch_dtype. The misspelled dtype kwarg was likely swallowed silently via **kwargs, so the model loaded in the framework's default precision (float32).
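
For illustration, a minimal sketch of this failure mode (not the real transformers internals; from_pretrained_sketch is a hypothetical stand-in for the loader):

import torch

def from_pretrained_sketch(model_name, torch_dtype=None, **kwargs):
    # An unknown key like 'dtype' lands in **kwargs and is never consulted,
    # so the loader quietly falls back to the framework default.
    effective = torch_dtype if torch_dtype is not None else torch.float32
    print(f"loading {model_name} as {effective}; ignored kwargs: {list(kwargs)}")

from_pretrained_sketch("microsoft/Phi-4-mini-instruct", dtype=torch.float16)
# loading microsoft/Phi-4-mini-instruct as torch.float32; ignored kwargs: ['dtype']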

Bug 2: Hardcoded dtype instead of using model's native precision

Even if the parameter name were correct, the logic hardcoded float16 for CUDA and float32 for CPU. Phi-4-mini-instruct's weights are natively stored as bfloat16 (per its config.json: "torch_dtype": "bfloat16").

  • float32 on CPU: Upcasts every parameter, doubling weight memory from ~7.6 GB to ~15.2 GB and leaving virtually nothing for the OS plus inference on a 16 GB machine
  • float16 on CUDA: Narrower exponent range than the native bfloat16, risking overflow/underflow (see the sketch below)
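
A quick sketch to make both bullets concrete (the ~3.8B parameter count for Phi-4-mini is an assumption based on its published model size, not taken from this PR):

import torch

# Exponent range: float16 tops out near 65k, while bfloat16 keeps float32's range.
for dt in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dt)
    print(f"{str(dt):16} bits={info.bits:2}  max={info.max:.3e}")

# Rough weight-memory arithmetic, assuming ~3.8B parameters:
params = 3.8e9
print(f"bfloat16: ~{params * 2 / 1e9:.1f} GB   float32: ~{params * 4 / 1e9:.1f} GB")
# bfloat16: ~7.6 GB   float32: ~15.2 GB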

Fix

Replaced with torch_dtype="auto", which reads the dtype directly from the model's config.json. This matches the official HuggingFace example:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # resolves to bfloat16 from model config
    ...
)
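
One way to sanity-check the fix locally (a sketch; the Hub ID microsoft/Phi-4-mini-instruct is assumed here, and the download requires enough free RAM for the bfloat16 weights):

from transformers import AutoModelForCausalLM

# After loading with torch_dtype="auto", the resolved precision is visible on the model.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype="auto",
)
print(model.dtype)  # expected: torch.bfloat16, taken from config.json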

Why torch_dtype="auto" is the right approach

Approach                                  CPU Memory   CUDA Precision           Future-proof
dtype=torch.float32 (old, broken)         ~15.2 GB     N/A (wrong param name)   No
torch_dtype=torch.bfloat16 (hardcoded)    ~7.6 GB      Correct                  No (breaks if model changes)
torch_dtype="auto"                        ~7.6 GB      Correct                  Yes (reads from config.json)

Changes

  • Fixed parameter name from dtype to torch_dtype in both setup_model() and main() load paths
  • Changed value to "auto" (resolves to bfloat16 from Phi-4-mini config.json)
  • Applied to both Windows and macOS inference.py
  • Bumped prep_version: Windows 9→10, macOS 5→6

Files Changed

  • scenarios/windows/pytorch_inf/pytorch_inf_resources/inference.py
  • scenarios/macos/mac_pytorch_inf/mac_pytorch_inf_resources/inference.py
  • scenarios/windows/pytorch_inf/pytorch_inf.py
  • scenarios/macos/mac_pytorch_inf/mac_pytorch_inf.py

… example and fix OOM on 16GB machines

Two bugs fixed:

1. Wrong parameter name: used 'dtype=' instead of 'torch_dtype=' for from_pretrained(), which was likely silently ignored, causing the model to load in default (float32) precision.

2. Hardcoded dtype logic: float16 for CUDA / float32 for CPU instead of using the model's native bfloat16. Phi-4-mini weights are stored as bfloat16 per config.json.

Fix: Use torch_dtype='auto' which reads the dtype from the model's config.json (bfloat16), matching the official HuggingFace example. This halves CPU memory from ~15.2GB to ~7.6GB and uses the correct precision on all devices.

Changes:

- Fixed parameter name from 'dtype' to 'torch_dtype' in both setup_model() and main()
- Changed value to 'auto' (resolves to bfloat16 from model config)
- Applied to both Windows and macOS inference.py
- Bumped prep_version: Windows 9->10, macOS 5->6

@jewilder merged commit 3315415 into microsoft:main on Apr 2, 2026
5 of 6 checks passed