Add anymodel-core to feature/puzzletron#974

Open
danielkorzekwa wants to merge 15 commits into feature/puzzletron from dkorzekwa/anymodel_core
Conversation

@danielkorzekwa danielkorzekwa commented Mar 4, 2026

What does this PR do?

  • Add converter, model_descriptor, puzzformer, and llama model support

Summary by CodeRabbit

  • New Features

    • CLI to convert HF models to AnyModel format
    • Unified package exports
    • Llama support
    • Pruning toolkit (FFN, KV-heads, MoE expert removal) with multiple init strategies and runtime hooks
    • Per-layer patching and no-op primitives
    • Improved checkpoint export flow
  • Documentation

    • Comprehensive AnyModel README and SPDX/license headers added.
  • Tests

    • Expanded parameterized end-to-end tests and new tokenizer/model config resources.
  • Chores

    • Package initializers consolidated public API; lightweight dummy modules for testing.

@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 4, 2026 19:30
coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds an AnyModel conversion and pruning subsystem: package-level converters and registries, ModelDescriptor abstractions with a Llama implementation, per-layer patcher and no-op utilities, extensive pruning mixins/hooks/utilities (FFN/KV/MoE), checkpoint/save refactor, dummy test modules, many test resources, and Megatron MoE hooks.

Changes

Cohort / File(s) | Summary
AnyModel converter package
modelopt/torch/puzzletron/anymodel/converter/__init__.py, modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py, modelopt/torch/puzzletron/anymodel/converter/converter.py, modelopt/torch/puzzletron/anymodel/converter/converter_factory.py
Adds package init with license/docstring and wildcard re-exports; adds CLI convert_model; implements abstract Converter (weight-map resolution, config/weights conversion, subblock writing) and ConverterFactory registry with registration/decorator/get.
ModelDescriptor package
modelopt/torch/puzzletron/anymodel/model_descriptor/__init__.py, .../model_descriptor.py, .../model_descriptor_factory.py
Adds package init and re-exports; introduces ModelDescriptor abstract API (layer predicates, overrides, weight grouping, dummy-block helpers) and ModelDescriptorFactory registry plus pretrained resolver.
Llama concrete implementations
modelopt/torch/puzzletron/anymodel/models/llama/__init__.py, .../llama_converter.py, .../llama_model_descriptor.py
Adds LlamaConverter (registered) and LlamaModelDescriptor with Llama-specific naming/predicates, no-op post-init hooks, rotary init helper, and Llama-specific FFN intermediate descriptor dataclass.
Puzzformer no-op & patcher utilities
modelopt/torch/puzzletron/anymodel/puzzformer/__init__.py, .../no_op.py, .../utils.py
Adds no-op modules (MatchingZeros, Same, return_tuple_of_size), deci_x_patcher context manager to patch decoder init for per-layer BlockConfig application, stack-inspection helper, and override_config_with_block_configs.
AnyModel package root & docs
modelopt/torch/puzzletron/anymodel/__init__.py, modelopt/torch/puzzletron/anymodel/README.md
Adds package initializer exposing converters/descriptors/patchers/no-op utilities and convert_model; README documents conversion/compression workflows and usage examples.
Pruning core & utilities
modelopt/torch/puzzletron/pruning/pruning_mixin.py, .../pruning_utils.py, .../pruning_ckpts.py
Adds LayerDescriptor and PruningMixIn abstractions, enums for init/pruning modes, resolve_pruning_mixin, heavy-weight init helpers for MLP/attention/MoE, activations/expert-score loaders, and updates to prune-ckpt launchers to accept descriptors/pruning mixins.
Pruning mixins (FFN / KV / Expert removal)
modelopt/torch/puzzletron/pruning/ffn_intermediate_pruning_mixin.py, .../kv_heads_pruning_mixin.py, .../expert_removal_pruning_mixin.py
Adds descriptors and pruning mixins for FFN intermediate layers, KV-heads, and expert-removal MoE with prune_single_layer implementations, key-generation helpers, and support for multiple init/prune modes.
Megatron hooks (MoE / voting)
modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py
Adds many forward hooks for MoE expert importance estimation and ranked-choice voting (base + model-specific variants) with accumulation, serialization, and state_dict handling.
Checkpoint & saver updates
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
Refactors checkpoint save/load to use transformers AutoConfig/PreTrainedModel, adds force_cache_dynamic_modules, integrates ModelDescriptor weight grouping, changes save_checkpoint/_save_checkpoint/save_subblocks signatures and safetensors index flow.
BlockConfig and casting utilities
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py
Modifies AttentionConfig/FFNConfig/BlockConfig fields and validations (introduces num_key_value_heads, llama4 handling), adds BlockConfig.to_dict and maybe_cast_block_configs, supports parallel_blocks casting.
Dummy modules & test helpers
modelopt/torch/puzzletron/utils/dummy_modules.py, tests/_test_utils/...
Adds DummyModule/DummyBlock/DummyWTE/DummyLMHead for tests; updates test utilities to create/save small HF models from HF configs and adjust test scaffolding.
Models package registration
modelopt/torch/puzzletron/anymodel/models/__init__.py
Adds side-effect imports to register model families (llama and others commented/available).
Bypassed training / child init
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py
Extends per-layer processing and child state dict creation to accept descriptor and pruning_mixin, adds _get_head_dim, updates typing and signatures to support optional external pruning flows.
Tests, resources & tokenizer
tests/_test_utils/..., tests/gpu/torch/puzzletron/resources/..., tests/gpu/torch/puzzletron/test_puzzletron.py
Adds many test resources (tokenizer files, HF config JSONs, YAML workflows for pruning), truncation script, updates test harness to call convert_model, parameterizes tests, and introduces new test helpers/assertions for pruning/activations.
Removed tests
tests/gpu/torch/puzzletron/decilm/converters/test_convert_llama3_config_to_decilm_config.py
Removes an older Llama->DeciLM conversion test file.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CLI as convert_model
    participant DescFactory as ModelDescriptorFactory
    participant ConvFactory as ConverterFactory
    participant Converter
    participant FS as File System

    Client->>CLI: convert_model(input_dir, output_dir, converter)
    CLI->>DescFactory: get(converter)
    DescFactory-->>CLI: ModelDescriptor
    CLI->>ConvFactory: get(converter)
    ConvFactory-->>CLI: Converter class/instance
    CLI->>Converter: convert(descriptor, input_dir, output_dir)
    Converter->>FS: copy_checkpoint_files(input_dir, output_dir)
    Converter->>FS: read/write configs
    Converter->>Converter: convert_configs_in_dirs()
    Converter->>FS: read weight map (index or single file)
    Converter->>Converter: convert_model_weights()
    Converter->>FS: write subblock safetensors and index
    Converter-->>CLI: conversion complete
    CLI-->>Client: success
sequenceDiagram
    participant Layer as Decoder Layer Init
    participant Patcher as deci_x_patcher
    participant Stack as Call Stack
    participant Descriptor as ModelDescriptor
    participant Config as Layer Config

    Patcher->>Patcher: backup original __init__
    Patcher->>Patcher: install patched __init__
    Layer->>Patcher: __init__ invoked
    Patcher->>Stack: inspect caller stack for layer index
    Stack-->>Patcher: layer_index
    Patcher->>Descriptor: block_config_to_layer_overrides(block_config[layer_index])
    Descriptor-->>Patcher: overrides dict
    Patcher->>Config: override_config_with_block_configs(config, overrides)
    Config-->>Patcher: merged_config
    Patcher->>Patcher: call original __init__(merged_config)
    Patcher->>Descriptor: apply attn/mlp no-op post-init if configured
    Patcher->>Patcher: restore original __init__ on exit

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@danielkorzekwa danielkorzekwa force-pushed the dkorzekwa/anymodel_core branch from c45647f to e82164f on March 4, 2026 19:34
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 6

🧹 Nitpick comments (4)
modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py (1)

89-120: Consider caching layer_name_predicates result for repeated calls.

The method creates 2 + 2*num_layers compiled regex patterns on each call. For models with many layers (e.g., 80+ layers), this generates 160+ patterns. If get_weight_groups is called multiple times with the same num_layers, caching could improve performance.

This is a minor optimization since conversion is typically a one-time operation.

♻️ Optional caching approach
from functools import lru_cache

@staticmethod
@lru_cache(maxsize=4)
def layer_name_predicates(num_layers: int) -> Dict[str, re.Pattern]:
    # ... existing implementation
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py`
around lines 89 - 120, The layer_name_predicates method recompiles the same
regexes on every call which is wasteful for repeated get_weight_groups runs;
make layer_name_predicates cached by adding functools.lru_cache (or similar) to
the static method so calls with the same num_layers return the prebuilt Dict
without rebuilding build_ffn_predicates/build_attention_predicates; import
lru_cache, apply it to layer_name_predicates (keeping it static) and ensure the
return type remains Dict[str, re.Pattern].
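The caching idea can be sketched in isolation. This is a hypothetical stand-in for the descriptor's predicate builder, not the real LlamaModelDescriptor method; the pattern names are illustrative:

```python
import re
from functools import lru_cache


@lru_cache(maxsize=4)
def layer_name_predicates(num_layers: int) -> dict[str, re.Pattern]:
    # Patterns are compiled once per distinct num_layers value;
    # repeated calls return the cached dict without recompiling.
    preds = {"embed": re.compile(r"model\.embed_tokens\.weight")}
    for i in range(num_layers):
        preds[f"layer_{i}"] = re.compile(rf"model\.layers\.{i}\..*")
    return preds


first = layer_name_predicates(4)
second = layer_name_predicates(4)
assert first is second  # cached: the very same dict object
```

A small maxsize is enough here because a conversion run typically touches only one or two distinct layer counts.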
modelopt/torch/puzzletron/anymodel/converter/converter.py (1)

116-131: Consider using logging instead of print for library code.

Similar to the descriptor factory, the converter uses print() statements for progress output. Using Python's logging module would provide better control for library consumers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/converter/converter.py` around lines 116 -
131, The converter uses print() calls for progress in the block that saves
subblocks and writes model.safetensors.index.json (see the prints around the
loop over tensors, the "subblock_file" usage, and the final print referencing
output_dir); replace these prints with a module logger (e.g., add import logging
and logger = logging.getLogger(__name__)) and call appropriate logger methods
(logger.info or logger.debug) for the group/layer lines and the final status
line so consumers can configure verbosity; keep the rest of the behavior (saving
via save_file, updating new_index, writing the JSON) unchanged.
modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py (1)

49-51: Consider using logging instead of print for library code.

Using print() in library code makes it difficult for consumers to control output verbosity. Consider using Python's logging module for better configurability.

♻️ Suggested change
+import logging
+
+logger = logging.getLogger(__name__)
+
 # ... in the function:
-        print(
-            f"[resolve_descriptor_from_pretrained] Auto-detected model_type='{model_type}' → descriptor='{detected}'"
-        )
+        logger.info(
+            f"Auto-detected model_type='{model_type}' → descriptor='{detected}'"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`
around lines 49 - 51, The print call in resolve_descriptor_from_pretrained
should be replaced with the logging module so callers can control verbosity;
import logging (or use module-level logger = logging.getLogger(__name__)) and
change the print(...) that reports model_type and detected to a logger.debug or
logger.info call (include model_type and detected in the message) so the same
message is emitted via Python logging rather than stdout.
modelopt/torch/puzzletron/anymodel/puzzformer/utils.py (1)

32-40: Stack inspection is fragile but acceptable for this internal use case.

The _get_variable_from_stack function relies on inspecting caller stack frames for layer_idx or idx variables. This works but is inherently fragile:

  • Refactoring the caller code could break this silently
  • Variable renaming would cause RuntimeError

Consider documenting the expected caller contract or using a thread-local context as an alternative if stability becomes a concern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py` around lines 32 - 40,
The _get_variable_from_stack function relies on fragile stack inspection for
names like "layer_idx" and "idx"; update its docstring to explicitly document
the caller contract (which variable names callers must set) and the expected
call pattern, and as an alternative implement a stable thread-local context API
(e.g., functions to set/get a current layer index using threading.local) and
update callers to use that API instead of relying on stack variables so
refactors/renames won't silently break; reference _get_variable_from_stack and
the "layer_idx"/"idx" identifiers when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py`:
- Around line 31-62: The converter parameter is documented as accepting a
Converter class but the implementation calls ModelDescriptorFactory.get and
ConverterFactory.get and will break if a class is passed; change the API
contract to require a registered converter name (string) by validating that
converter is an instance of str at the start of convert_model and raising a
clear TypeError if not, update the docstring to state "converter is a registered
converter name (str)" and adjust any variable names/comments around
ModelDescriptorFactory.get and ConverterFactory.get to reflect that they expect
a registry key rather than a class.

In `@modelopt/torch/puzzletron/anymodel/converter/converter_factory.py`:
- Around line 70-75: The get method in converter_factory.py currently returns
unresolved strings which leads to late/opaque failures; change get(cls, value:
str | ModelDescriptor) to validate string names against CLASS_MAPPING and raise
a clear exception (e.g., ValueError or KeyError) when value is a str that is not
present in cls.CLASS_MAPPING, including the unknown name in the message; keep
returning the original object when value is already a ModelDescriptor/resolved
converter. Ensure you update the behavior of get and its docstring accordingly
so callers receive an immediate, descriptive failure for unknown converter
names.

In `@modelopt/torch/puzzletron/anymodel/converter/converter.py`:
- Around line 100-113: The MoE conversion is indexing the source checkpoint dict
using converted_name which can differ from the original checkpoint keys; update
the branch in converter.py (inside the loop that handles MoE packed tensors) to
read from data using the original parameter name (the variable name) when
fetching the "_blocks" and "_scales" entries, and continue to use converted_name
only as the key for storing the result into tensors (i.e., use name to build the
data lookups like the "_blocks"/"_scales" access and assign
tensors[converted_name] = convert_moe_packed_tensors(...)).

In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`:
- Around line 39-44: The defaulting of trust_remote_code to True in
resolve_descriptor_from_pretrained is unsafe; change the function signature to
default trust_remote_code to False and ensure the call to
AutoConfig.from_pretrained(pretrained, trust_remote_code=trust_remote_code)
continues to pass the parameter through so callers must opt in to remote code
execution; update any related callers/tests that relied on the old True default
to explicitly pass True when they intend to trust remote code.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py`:
- Around line 44-50: The code currently builds a single BlockConfig dict
(block_config) and repeats its reference via [block_config] * num_hidden_layers
causing shared aliasing across layers; change this to produce independent dicts
per layer (e.g., build each layer's config inside a list comprehension or use a
deepcopy per iteration) so that block_configs is a list of distinct dicts,
referencing the BlockConfig/AttentionConfig/FFNConfig construction used earlier
(BlockConfig, block_config, block_configs, num_hidden_layers).

In `@modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py`:
- Around line 43-47: Wrapped.forward currently does outputs[0] = result[0] which
breaks when result is a torch.Tensor (it slices the batch); change the
assignment to check the type of result and set outputs[0] to the entire tensor
when result is a Tensor, otherwise keep the original tuple/list element
semantics. Update the logic in the forward method of the wrapper
(function/method name forward, variables result and outputs, and size) to use an
isinstance(result, torch.Tensor) guard (import torch if needed) and assign
outputs[0] = result for tensors, else outputs[0] = result[0].
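A torch-free sketch of that guard, using a `TensorLike` stand-in for `torch.Tensor` (real code would use `isinstance(result, torch.Tensor)`):

```python
class TensorLike(list):
    """Stand-in for torch.Tensor in this sketch."""


def write_output(outputs: list, result):
    # Assign the whole tensor when result is tensor-like; indexing a
    # tensor with [0] would slice off the batch dimension instead.
    if isinstance(result, TensorLike):
        outputs[0] = result
    else:
        outputs[0] = result[0]  # tuple/list: keep first-element semantics
    return outputs
```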

---

Nitpick comments:
In `@modelopt/torch/puzzletron/anymodel/converter/converter.py`:
- Around line 116-131: The converter uses print() calls for progress in the
block that saves subblocks and writes model.safetensors.index.json (see the
prints around the loop over tensors, the "subblock_file" usage, and the final
print referencing output_dir); replace these prints with a module logger (e.g.,
add import logging and logger = logging.getLogger(__name__)) and call
appropriate logger methods (logger.info or logger.debug) for the group/layer
lines and the final status line so consumers can configure verbosity; keep the
rest of the behavior (saving via save_file, updating new_index, writing the
JSON) unchanged.

In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`:
- Around line 49-51: The print call in resolve_descriptor_from_pretrained should
be replaced with the logging module so callers can control verbosity; import
logging (or use module-level logger = logging.getLogger(__name__)) and change
the print(...) that reports model_type and detected to a logger.debug or
logger.info call (include model_type and detected in the message) so the same
message is emitted via Python logging rather than stdout.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py`:
- Around line 89-120: The layer_name_predicates method recompiles the same
regexes on every call which is wasteful for repeated get_weight_groups runs;
make layer_name_predicates cached by adding functools.lru_cache (or similar) to
the static method so calls with the same num_layers return the prebuilt Dict
without rebuilding build_ffn_predicates/build_attention_predicates; import
lru_cache, apply it to layer_name_predicates (keeping it static) and ensure the
return type remains Dict[str, re.Pattern].

In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py`:
- Around line 32-40: The _get_variable_from_stack function relies on fragile
stack inspection for names like "layer_idx" and "idx"; update its docstring to
explicitly document the caller contract (which variable names callers must set)
and the expected call pattern, and as an alternative implement a stable
thread-local context API (e.g., functions to set/get a current layer index using
threading.local) and update callers to use that API instead of relying on stack
variables so refactors/renames won't silently break; reference
_get_variable_from_stack and the "layer_idx"/"idx" identifiers when making these
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ce8becd5-0dda-4383-8211-5200a7397c00

📥 Commits

Reviewing files that changed from the base of the PR and between 5812777 and c45647f.

📒 Files selected for processing (13)
  • modelopt/torch/puzzletron/anymodel/converter/__init__.py
  • modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py
  • modelopt/torch/puzzletron/anymodel/converter/converter.py
  • modelopt/torch/puzzletron/anymodel/converter/converter_factory.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/__init__.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
  • modelopt/torch/puzzletron/anymodel/models/llama/__init__.py
  • modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py
  • modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py
  • modelopt/torch/puzzletron/anymodel/puzzformer/__init__.py
  • modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py
  • modelopt/torch/puzzletron/anymodel/puzzformer/utils.py

Comment on lines +31 to +62
    converter: Converter | str,
):
    """Convert a HuggingFace model to AnyModel format.

    This function converts a HuggingFace checkpoint to the AnyModel format used
    for compression. The conversion process:

    1. Copies non-weight files (config, tokenizer, etc.)
    2. Creates block_configs for each layer
    3. Reorganizes weights into subblock checkpoints

    Args:
        input_dir: Path to the input HuggingFace checkpoint directory.
        output_dir: Path to the output AnyModel checkpoint directory.
        converter: Either a converter name (e.g., "llama") or a Converter class.

    Example:
        >>> convert_model(
        ...     input_dir="/path/to/Llama-3.1-8B-Instruct",
        ...     output_dir="/path/to/output/ckpts/teacher",
        ...     converter="llama",
        ... )
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Get descriptor and converter from factories (they use the same name)
    descriptor = ModelDescriptorFactory.get(converter)
    converter = ConverterFactory.get(converter)

    converter.convert(descriptor=descriptor, input_dir=input_dir, output_dir=output_dir)
⚠️ Potential issue | 🟠 Major

converter API contract is currently inconsistent with implementation.

The function documents support for passing a Converter class, but the current resolution path expects a registry key and can produce an invalid descriptor object when a class is passed.

✅ Proposed fix (restrict and validate)
 def convert_model(
     input_dir: str,
     output_dir: str,
-    converter: Converter | str,
+    converter: str,
 ):
@@
-    descriptor = ModelDescriptorFactory.get(converter)
-    converter = ConverterFactory.get(converter)
-
-    converter.convert(descriptor=descriptor, input_dir=input_dir, output_dir=output_dir)
+    descriptor = ModelDescriptorFactory.get(converter)
+    converter_cls = ConverterFactory.get(converter)
+    converter_cls.convert(descriptor=descriptor, input_dir=input_dir, output_dir=output_dir)

Also update the docstring to state converter is a registered converter name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py` around
lines 31 - 62, The converter parameter is documented as accepting a Converter
class but the implementation calls ModelDescriptorFactory.get and
ConverterFactory.get and will break if a class is passed; change the API
contract to require a registered converter name (string) by validating that
converter is an instance of str at the start of convert_model and raising a
clear TypeError if not, update the docstring to state "converter is a registered
converter name (str)" and adjust any variable names/comments around
ModelDescriptorFactory.get and ConverterFactory.get to reflect that they expect
a registry key rather than a class.
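The validation the prompt asks for amounts to a type check at the top of `convert_model`. A minimal sketch, with the helper name `validate_converter_arg` being hypothetical:

```python
def validate_converter_arg(converter) -> str:
    """Require a registered converter name (str), per the review's API contract."""
    if not isinstance(converter, str):
        raise TypeError(
            "converter must be a registered converter name (str), "
            f"got {type(converter).__name__}"
        )
    return converter
```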

Comment on lines +70 to +75
def get(cls, value: str | ModelDescriptor):
    """Get a registered converter by name or return the converter if already resolved."""
    if isinstance(value, str):
        if value in cls.CLASS_MAPPING:
            return cls.CLASS_MAPPING[value]
    return value
⚠️ Potential issue | 🟠 Major

Fail fast on unknown converter names in the factory.

Returning unresolved strings from get causes late, opaque runtime errors in downstream conversion calls.

🛠️ Proposed fix
     @classmethod
     def get(cls, value: str | ModelDescriptor):
         """Get a registered converter by name or return the converter if already resolved."""
         if isinstance(value, str):
-            if value in cls.CLASS_MAPPING:
-                return cls.CLASS_MAPPING[value]
+            if value in cls.CLASS_MAPPING:
+                return cls.CLASS_MAPPING[value]
+            available = ", ".join(sorted(cls.CLASS_MAPPING.keys()))
+            raise KeyError(f"Unknown converter `{value}`. Available converters: [{available}]")
         return value
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/converter/converter_factory.py` around
lines 70 - 75, The get method in converter_factory.py currently returns
unresolved strings which leads to late/opaque failures; change get(cls, value:
str | ModelDescriptor) to validate string names against CLASS_MAPPING and raise
a clear exception (e.g., ValueError or KeyError) when value is a str that is not
present in cls.CLASS_MAPPING, including the unknown name in the message; keep
returning the original object when value is already a ModelDescriptor/resolved
converter. Ensure you update the behavior of get and its docstring accordingly
so callers receive an immediate, descriptive failure for unknown converter
names.
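The fail-fast behavior can be demonstrated with a minimal registry sketch; the real ConverterFactory's fields and decorator may differ, so treat these names as illustrative:

```python
class ConverterFactory:
    CLASS_MAPPING: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        """Decorator that registers a converter class under a name."""
        def deco(converter_cls: type) -> type:
            cls.CLASS_MAPPING[name] = converter_cls
            return converter_cls
        return deco

    @classmethod
    def get(cls, value):
        if isinstance(value, str):
            if value in cls.CLASS_MAPPING:
                return cls.CLASS_MAPPING[value]
            # Fail fast with a descriptive message instead of returning
            # the unresolved string to downstream conversion code.
            available = ", ".join(sorted(cls.CLASS_MAPPING))
            raise KeyError(f"Unknown converter '{value}'. Available: [{available}]")
        return value  # already a resolved converter


@ConverterFactory.register("llama")
class LlamaConverter:
    pass
```

With this change, `ConverterFactory.get("typo")` raises immediately at the call site rather than failing later inside `convert`.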

Comment on lines +100 to +113
# Convert MoE packed tensors if quantized is mxfp4 //gpt-oss-20b
if getattr(cls, "quantized", None) == "mxfp4":
    if name.endswith("_blocks"):
        converted_name = converted_name.replace("_blocks", "")
        tensors[converted_name] = convert_moe_packed_tensors(
            data[converted_name + "_blocks"],
            data[converted_name + "_scales"],
        )
    elif name.endswith("_scales"):
        continue
    else:
        tensors[converted_name] = data[name]
else:
    tensors[converted_name] = data[name]
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against NVIDIA/Model-Optimizer (outputs collapsed):

#!/bin/bash
# First, let's locate and examine the converter.py file
fd -t f "converter.py" -p "modelopt/torch/puzzletron"

# Examine the converter.py file around lines 100-113 with broader context
cat -n modelopt/torch/puzzletron/anymodel/converter/converter.py | head -150 | tail -100

# Search for convert_moe_packed_tensors function definition and usage
rg -n "def convert_moe_packed_tensors|convert_moe_packed_tensors" --type=py -B2 -A5

# Look for tests related to MoE conversion
fd -t f "test_*.py" -o "*_test.py" | xargs rg -l "moe|MoE" 2>/dev/null

# Search for convert_weight_name implementation and overriding subclasses
rg -n "def convert_weight_name" --type=py -A10

# Search for mxfp4 quantization usage and MoE conversion tests
rg -n "mxfp4|convert_moe_packed_tensors" --type=py -B3 -A3

# Look for llama converter which might use the base Converter
cat -n modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py | head -50

# Search for actual MoE checkpoint structure documentation or examples
rg -n "moe.*blocks.*scales|_blocks.*_scales" --type=py -B2 -A2 | head -50

# Check if there are any actual tests that exercise the MoE conversion path with real data
rg -n "convert_model_weights.*mxfp4|gpt-oss-20b" --type=py -B3 -A3

# Verify the logic by examining the exact flow: what names come from param_to_file
cat -n modelopt/torch/puzzletron/anymodel/converter/converter.py | sed -n '65,130p'

# Check how param_to_file is built to understand what names it contains
cat -n modelopt/torch/puzzletron/anymodel/converter/converter.py | sed -n '43,63p'

# Verify by checking if any existing code might override convert_weight_name for MoE models
rg -n "class.*Converter.*:|quantized.*=" modelopt/torch/puzzletron --type=py -A5 | grep -A5 "quantized"

# Final verification: check the exact structure of what gets passed to convert_moe_packed_tensors
rg -n "def convert_moe_packed_tensors" --type=py -A10 -B2

Use original parameter name to construct data dict keys in MoE conversion logic.

Lines 105-106 access data[converted_name + "_blocks"] and data[converted_name + "_scales"], but the data dict is keyed by original checkpoint parameter names. The converted_name is derived from applying convert_weight_name() transformation, so any subclass override that transforms the parameter prefix would cause a KeyError at runtime.

Fix: Construct data keys using the original name parameter:

data[name],
data[name.replace("_blocks", "_scales")]

Reserve converted_name for storing results in the tensors output dict. This ensures data access always uses original checkpoint names while transformations are applied only to output keys.
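A minimal sketch of the corrected lookup, using a hypothetical `convert_weight_name` that strips a `model.` prefix (the names and the transform are illustrative, not the converter's real ones):

```python
def convert_weight_name(name: str) -> str:
    # Hypothetical subclass override: strips a "model." prefix.
    return name.removeprefix("model.")

# The checkpoint dict is keyed by ORIGINAL parameter names.
data = {"model.experts.w_blocks": "blocks", "model.experts.w_scales": "scales"}

tensors = {}
for name in data:
    if name.endswith("_blocks"):
        converted_name = convert_weight_name(name).replace("_blocks", "")
        # Look up with the original name; data[converted_name + "_blocks"]
        # would raise KeyError because the prefix was stripped.
        tensors[converted_name] = (data[name], data[name.replace("_blocks", "_scales")])
```

The converted name only ever appears as an output key, never as a lookup key into the source checkpoint.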

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/converter/converter.py` around lines 100 -
113, The MoE conversion is indexing the source checkpoint dict using
converted_name which can differ from the original checkpoint keys; update the
branch in converter.py (inside the loop that handles MoE packed tensors) to read
from data using the original parameter name (the variable name) when fetching
the "_blocks" and "_scales" entries, and continue to use converted_name only as
the key for storing the result into tensors (i.e., use name to build the data
lookups like the "_blocks"/"_scales" access and assign tensors[converted_name] =
convert_moe_packed_tensors(...)).

Comment on lines +44 to +50
block_config = BlockConfig(
attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
).to_dict()

block_configs = [block_config] * num_hidden_layers
return block_configs
Contributor

@coderabbitai coderabbitai bot Mar 4, 2026


⚠️ Potential issue | 🟠 Major

Avoid shared dict aliasing across all layer block configs.

[block_config] * num_hidden_layers duplicates references, not values. A mutation in one layer config will affect all layers.

♻️ Proposed fix
-        block_config = BlockConfig(
-            attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
-            ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
-        ).to_dict()
-
-        block_configs = [block_config] * num_hidden_layers
+        block_configs = [
+            BlockConfig(
+                attention=AttentionConfig(
+                    no_op=False, num_key_value_heads=config.num_key_value_heads
+                ),
+                ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
+            ).to_dict()
+            for _ in range(num_hidden_layers)
+        ]
         return block_configs
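The aliasing pitfall is reproducible in a few lines of plain Python (dict contents are illustrative):

```python
base = {"no_op": False, "intermediate_size": 4096}

aliased = [base] * 3  # three references to ONE dict
aliased[0]["intermediate_size"] = 1024  # mutation is visible through every element

independent = [dict(base) for _ in range(3)]  # fresh copy per layer
independent[0]["intermediate_size"] = 2048  # other layers are untouched
```

A list comprehension (or a per-iteration copy) is the standard fix whenever the repeated element is mutable.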
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py` around
lines 44 - 50, The code currently builds a single BlockConfig dict
(block_config) and repeats its reference via [block_config] * num_hidden_layers
causing shared aliasing across layers; change this to produce independent dicts
per layer (e.g., build each layer's config inside a list comprehension or use a
deepcopy per iteration) so that block_configs is a list of distinct dicts,
referencing the BlockConfig/AttentionConfig/FFNConfig construction used earlier
(BlockConfig, block_config, block_configs, num_hidden_layers).

Collaborator

@danielkorzekwa is this fine?

Contributor
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py (1)

39-44: ⚠️ Potential issue | 🔴 Critical

CRITICAL: trust_remote_code defaults to True, violating security guidelines.

Per coding guidelines and SECURITY.md, trust_remote_code=True must not be hardcoded. This flag enables execution of arbitrary Python shipped with a checkpoint, which is an RCE vector if the model source is untrusted. The parameter should default to False.

🔒 Proposed fix
-def resolve_descriptor_from_pretrained(pretrained: str | None, trust_remote_code: bool = True):
+def resolve_descriptor_from_pretrained(pretrained: str | None, trust_remote_code: bool = False):
     """Resolve the model descriptor by loading the checkpoint config and mapping model_type."""

Based on learnings: "trust_remote_code parameter for transformers model or tokenizer loading must be exposed as a caller-configurable parameter defaulting to False, not hardcoded to True"
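A hedged sketch of the safer signature (the `AutoConfig.from_pretrained` call mirrors the factory's usage; the wrapper itself is illustrative):

```python
def resolve_descriptor_from_pretrained(pretrained, trust_remote_code: bool = False):
    """Resolve a descriptor from a checkpoint config; remote code execution is opt-in."""
    from transformers import AutoConfig  # assumed dependency, imported lazily

    # Callers must explicitly pass trust_remote_code=True for untrusted-by-default
    # checkpoints that ship their own Python code.
    return AutoConfig.from_pretrained(pretrained, trust_remote_code=trust_remote_code)
```

The key property is that the default is `False`, so every opt-in is visible at the call site.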

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`
around lines 39 - 44, The function resolve_descriptor_from_pretrained currently
hardcodes trust_remote_code=True which is a security risk; change its default to
False and ensure the value is forwarded to AutoConfig.from_pretrained so callers
must explicitly opt-in when they trust the remote code. Update the function
signature trust_remote_code: bool = False in
model_descriptor_factory.resolve_descriptor_from_pretrained and keep using that
variable in the AutoConfig.from_pretrained call; audit any callsites of
resolve_descriptor_from_pretrained and only pass trust_remote_code=True when the
model source is trusted.
modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py (1)

44-50: ⚠️ Potential issue | 🟠 Major

Avoid shared dict aliasing across all layer block configs.

[block_config] * num_hidden_layers creates a list where all elements reference the same dict object. Any mutation to one layer's config would affect all layers.

♻️ Proposed fix
-        block_config = BlockConfig(
-            attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
-            ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
-        ).to_dict()
-
-        block_configs = [block_config] * num_hidden_layers
+        block_configs = [
+            BlockConfig(
+                attention=AttentionConfig(
+                    no_op=False, num_key_value_heads=config.num_key_value_heads
+                ),
+                ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
+            ).to_dict()
+            for _ in range(num_hidden_layers)
+        ]
         return block_configs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py` around
lines 44 - 50, The code builds a single block_config dict and repeats it with
[block_config] * num_hidden_layers causing aliasing; instead produce distinct
dicts per layer by creating a fresh BlockConfig.to_dict() for each hidden layer
(e.g., use a list comprehension or loop) so that block_configs =
[BlockConfig(attention=AttentionConfig(...), ffn=FFNConfig(...)).to_dict() for _
in range(num_hidden_layers)] or at minimum use shallow/deep copy of block_config
per iteration; update the code around BlockConfig, AttentionConfig, FFNConfig,
block_config and block_configs to avoid shared references.
modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py (1)

42-47: ⚠️ Potential issue | 🔴 Critical

Fix wrapper output handling for tensor-returning modules.

Wrapped.forward unconditionally accesses result[0], but when wrapping MatchingZeros, the result is a tensor (not a tuple). This causes result[0] to slice the batch dimension instead of returning the full tensor.

🐛 Proposed fix
     class Wrapped(cls):
         def forward(self, *args, **kwargs):
             result = super().forward(*args, **kwargs)
+            first_output = result[0] if isinstance(result, (tuple, list)) else result
             outputs = [None] * size
-            outputs[0] = result[0]
+            outputs[0] = first_output
             return tuple(outputs)
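The normalization can be isolated into a small helper; this sketch uses plain Python values in place of tensors (the helper name is illustrative):

```python
def pad_first_output(result, size):
    # A tensor indexed with [0] would slice its batch dimension, so only
    # unwrap when the module actually returned a tuple/list of outputs.
    first = result[0] if isinstance(result, (tuple, list)) else result
    outputs = [None] * size
    outputs[0] = first
    return tuple(outputs)
```

Both return shapes then normalize to a fixed-length tuple, which is what downstream unpacking expects.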
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py` around lines 42 - 47,
Wrapped.forward assumes result is subscriptable tuple and does result[0], which
breaks when the wrapped module (e.g., MatchingZeros) returns a tensor; change
forward in class Wrapped to detect whether result is a tuple/list
(isinstance(result, (tuple, list))) and if so set outputs[0] = result[0],
otherwise set outputs[0] = result (for single-tensor returns) before returning
tuple(outputs); ensure this logic uses the existing variables (Wrapped.forward,
result, outputs, size) so the wrapper returns a tuple of length size in all
cases.
modelopt/torch/puzzletron/anymodel/converter/converter.py (1)

100-113: ⚠️ Potential issue | 🟠 Major

Use original parameter name to construct data dict keys in MoE conversion logic.

Lines 105-106 access data[converted_name + "_blocks"] and data[converted_name + "_scales"], but the data dict is keyed by original checkpoint parameter names. If convert_weight_name() transforms the name, this would cause a KeyError at runtime.

🐛 Proposed fix
                         if name.endswith("_blocks"):
-                                converted_name = converted_name.replace("_blocks", "")
+                                base_name = name.replace("_blocks", "")
+                                converted_name = cls.convert_weight_name(base_name)
                                 tensors[converted_name] = convert_moe_packed_tensors(
-                                    data[converted_name + "_blocks"],
-                                    data[converted_name + "_scales"],
+                                    data[name],
+                                    data[base_name + "_scales"],
                                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/converter/converter.py` around lines 100 -
113, The MoE conversion uses converted_name to build keys into the original
checkpoint dict which can cause KeyError if convert_weight_name changed the
name; update the logic in the block that checks getattr(cls, "quantized", None)
== "mxfp4" so that when calling convert_moe_packed_tensors you reference the
original parameter names (use name + "_blocks" and name + "_scales") instead of
converted_name + "_blocks"/"_scales", and keep the rest of the assignment to
tensors[converted_name] intact; adjust the elif handling for
name.endswith("_scales") to continue as before and ensure
convert_moe_packed_tensors is invoked with data[name + "_blocks"], data[name +
"_scales"].
🧹 Nitpick comments (3)
modelopt/torch/puzzletron/anymodel/puzzformer/utils.py (1)

32-40: Stack inspection is fragile; consider documenting expected calling convention.

_get_variable_from_stack relies on callers having layer_idx or idx in their local scope. This works but is fragile if calling code changes. Consider adding a docstring note about the expected variable names in calling contexts.
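A self-contained sketch of the pattern with the docstring the comment asks for (`forward_layer` is a stand-in caller, not the real call site):

```python
import inspect

def _get_variable_from_stack(names=("layer_idx", "idx")):
    """Return the first local named ``layer_idx`` or ``idx`` found by
    walking caller frames outward.

    Callers are expected to have one of these names bound in scope; if
    the calling convention changes, the helper silently returns None.
    """
    frame = inspect.currentframe().f_back
    while frame is not None:
        for name in names:
            if name in frame.f_locals:
                return frame.f_locals[name]
        frame = frame.f_back
    return None

def forward_layer():
    layer_idx = 7  # consumed implicitly by the helper below
    return _get_variable_from_stack()
```

The implicit contract (local names `layer_idx`/`idx`) is exactly what the docstring needs to spell out for future maintainers.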

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py` around lines 32 - 40,
The helper _get_variable_from_stack currently searches caller frames for local
names "layer_idx" or "idx", which is fragile; update its docstring to document
the expected calling convention and the exact local variable names it looks for
(e.g., "layer_idx" and "idx"), including why those names are required and
advising callers to set one of them in scope (or to call an alternative API) if
they cannot; mention the function name _get_variable_from_stack and the expected
variable names in the docstring so future maintainers know the dependency.
modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py (2)

15-15: Avoid file-wide mypy suppression in a new public descriptor module.

Line 15 (# mypy: ignore-errors) suppresses all type diagnostics and can mask interface drift for ModelDescriptor overrides. Prefer localized # type: ignore[...] on unavoidable lines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py` at
line 15, Remove the file-wide "# mypy: ignore-errors" from
llama_model_descriptor.py and instead fix type issues by making the class and
method signatures conform to the ModelDescriptor interface (e.g., ensure
LlamaModelDescriptor methods match ModelDescriptor's parameter and return
types); if there are specific unavoidable mypy failures, add localized comments
like "# type: ignore[assignment]" or "# type: ignore[arg-type]" only on the
offending lines (target the exact methods/attributes in LlamaModelDescriptor or
other ModelDescriptor overrides) and run mypy to verify no other hidden issues
remain.

60-72: Add inline comments documenting the Transformers API contracts for clarity.

The size=2 at line 62 correctly matches LlamaAttention.forward() return structure (attn_output, attn_weights), and the LlamaRotaryEmbedding constructor at line 71 uses the correct current signature. Both align with the pinned Transformers versions (4.48–5.0 range). While the codebase already documents these patterns in ModelDescriptor and no_op.py, adding brief inline comments here would make the API contract assumptions explicit for future maintainers. Consider:

decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()  # LlamaAttention returns (attn_output, attn_weights)

and

model.model.rotary_emb = LlamaRotaryEmbedding(model.config, runtime.device)  # LlamaRotaryEmbedding(config, device) constructor
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py`
around lines 60 - 72, Update attn_no_op_post_init and init_rotary_embedding with
brief inline comments documenting the Transformers API contracts: next to
decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)() add a
comment that LlamaAttention.forward returns (attn_output, attn_weights) to
justify size=2; next to model.model.rotary_emb =
LlamaRotaryEmbedding(model.config, runtime.device) add a comment that the
constructor signature is LlamaRotaryEmbedding(config, device). Keep comments
concise and place them immediately after the referenced expressions in the
attn_no_op_post_init and init_rotary_embedding functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py`:
- Around line 114-122: The inline comment in override_config_with_block_configs
has a typo ("Model initialization requires fails with None"); update it to a
clear, concise statement such as "Model initialization requires non-None values;
filter out None overrides" (or similar) to explain why block_configs entries
that are None are removed before updating _config; keep the rest of the function
unchanged and ensure the comment appears above the _config_overrides line
referencing the None-filtering behavior.

---

Duplicate comments:
In `@modelopt/torch/puzzletron/anymodel/converter/converter.py`:
- Around line 100-113: The MoE conversion uses converted_name to build keys into
the original checkpoint dict which can cause KeyError if convert_weight_name
changed the name; update the logic in the block that checks getattr(cls,
"quantized", None) == "mxfp4" so that when calling convert_moe_packed_tensors
you reference the original parameter names (use name + "_blocks" and name +
"_scales") instead of converted_name + "_blocks"/"_scales", and keep the rest of
the assignment to tensors[converted_name] intact; adjust the elif handling for
name.endswith("_scales") to continue as before and ensure
convert_moe_packed_tensors is invoked with data[name + "_blocks"], data[name +
"_scales"].

In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`:
- Around line 39-44: The function resolve_descriptor_from_pretrained currently
hardcodes trust_remote_code=True which is a security risk; change its default to
False and ensure the value is forwarded to AutoConfig.from_pretrained so callers
must explicitly opt-in when they trust the remote code. Update the function
signature trust_remote_code: bool = False in
model_descriptor_factory.resolve_descriptor_from_pretrained and keep using that
variable in the AutoConfig.from_pretrained call; audit any callsites of
resolve_descriptor_from_pretrained and only pass trust_remote_code=True when the
model source is trusted.

In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py`:
- Around line 44-50: The code builds a single block_config dict and repeats it
with [block_config] * num_hidden_layers causing aliasing; instead produce
distinct dicts per layer by creating a fresh BlockConfig.to_dict() for each
hidden layer (e.g., use a list comprehension or loop) so that block_configs =
[BlockConfig(attention=AttentionConfig(...), ffn=FFNConfig(...)).to_dict() for _
in range(num_hidden_layers)] or at minimum use shallow/deep copy of block_config
per iteration; update the code around BlockConfig, AttentionConfig, FFNConfig,
block_config and block_configs to avoid shared references.

In `@modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py`:
- Around line 42-47: Wrapped.forward assumes result is subscriptable tuple and
does result[0], which breaks when the wrapped module (e.g., MatchingZeros)
returns a tensor; change forward in class Wrapped to detect whether result is a
tuple/list (isinstance(result, (tuple, list))) and if so set outputs[0] =
result[0], otherwise set outputs[0] = result (for single-tensor returns) before
returning tuple(outputs); ensure this logic uses the existing variables
(Wrapped.forward, result, outputs, size) so the wrapper returns a tuple of
length size in all cases.

---

Nitpick comments:
In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py`:
- Line 15: Remove the file-wide "# mypy: ignore-errors" from
llama_model_descriptor.py and instead fix type issues by making the class and
method signatures conform to the ModelDescriptor interface (e.g., ensure
LlamaModelDescriptor methods match ModelDescriptor's parameter and return
types); if there are specific unavoidable mypy failures, add localized comments
like "# type: ignore[assignment]" or "# type: ignore[arg-type]" only on the
offending lines (target the exact methods/attributes in LlamaModelDescriptor or
other ModelDescriptor overrides) and run mypy to verify no other hidden issues
remain.
- Around line 60-72: Update attn_no_op_post_init and init_rotary_embedding with
brief inline comments documenting the Transformers API contracts: next to
decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)() add a
comment that LlamaAttention.forward returns (attn_output, attn_weights) to
justify size=2; next to model.model.rotary_emb =
LlamaRotaryEmbedding(model.config, runtime.device) add a comment that the
constructor signature is LlamaRotaryEmbedding(config, device). Keep comments
concise and place them immediately after the referenced expressions in the
attn_no_op_post_init and init_rotary_embedding functions.

In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py`:
- Around line 32-40: The helper _get_variable_from_stack currently searches
caller frames for local names "layer_idx" or "idx", which is fragile; update its
docstring to document the expected calling convention and the exact local
variable names it looks for (e.g., "layer_idx" and "idx"), including why those
names are required and advising callers to set one of them in scope (or to call
an alternative API) if they cannot; mention the function name
_get_variable_from_stack and the expected variable names in the docstring so
future maintainers know the dependency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f1dc323b-8f82-4f6a-a989-849919373f3b

📥 Commits

Reviewing files that changed from the base of the PR and between c45647f and e82164f.

📒 Files selected for processing (13)
  • modelopt/torch/puzzletron/anymodel/converter/__init__.py
  • modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py
  • modelopt/torch/puzzletron/anymodel/converter/converter.py
  • modelopt/torch/puzzletron/anymodel/converter/converter_factory.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/__init__.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
  • modelopt/torch/puzzletron/anymodel/models/llama/__init__.py
  • modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py
  • modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py
  • modelopt/torch/puzzletron/anymodel/puzzformer/__init__.py
  • modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py
  • modelopt/torch/puzzletron/anymodel/puzzformer/utils.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • modelopt/torch/puzzletron/anymodel/puzzformer/__init__.py
  • modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py
  • modelopt/torch/puzzletron/anymodel/converter/converter_factory.py
  • modelopt/torch/puzzletron/anymodel/models/llama/__init__.py

Comment on lines +114 to +122
def override_config_with_block_configs(
config: PretrainedConfig, block_configs: Dict[str, Any]
) -> PretrainedConfig:
"""Create a copy of config with block_config overrides applied."""
_config = copy.deepcopy(config)
# Model initialization requires fails with None in case of no-ops
_config_overrides = {k: v for k, v in block_configs.items() if v is not None}
_config.update(_config_overrides)
return _config
Contributor

⚠️ Potential issue | 🟡 Minor

Fix typo in comment.

Line 119 comment reads "Model initialization requires fails with None" which appears to be a typo. Consider clarifying.

📝 Proposed fix
     _config = copy.deepcopy(config)
-    # Model initialization requires fails with None in case of no-ops
+    # Model initialization fails with None values in case of no-ops
     _config_overrides = {k: v for k, v in block_configs.items() if v is not None}
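The None-filtering behavior the comment describes, sketched with a plain dict standing in for `PretrainedConfig`:

```python
import copy

config = {"hidden_size": 4096, "intermediate_size": 14336}
# None marks a no-op block whose value must not be pushed into the config.
block_configs = {"intermediate_size": 1024, "num_key_value_heads": None}

patched = copy.deepcopy(config)
patched.update({k: v for k, v in block_configs.items() if v is not None})
```

The deep copy keeps the caller's config untouched while the filtered overrides are applied to the copy.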
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py` around lines 114 -
122, The inline comment in override_config_with_block_configs has a typo ("Model
initialization requires fails with None"); update it to a clear, concise
statement such as "Model initialization requires non-None values; filter out
None overrides" (or similar) to explain why block_configs entries that are None
are removed before updating _config; keep the rest of the function unchanged and
ensure the comment appears above the _config_overrides line referencing the
None-filtering behavior.

@codecov

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.10%. Comparing base (5812777) to head (ee8f538).

Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/puzzletron     #974      +/-   ##
======================================================
- Coverage               72.12%   72.10%   -0.03%     
======================================================
  Files                     209      209              
  Lines                   23628    23628              
======================================================
- Hits                    17042    17036       -6     
- Misses                   6586     6592       +6     

☔ View full report in Codecov by Sentry.
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 5, 2026 12:08
@danielkorzekwa danielkorzekwa requested a review from AAnoosheh March 5, 2026 12:08
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
modelopt/torch/puzzletron/pruning/pruning_ckpts.py (3)

319-337: ⚠️ Potential issue | 🟠 Major

Handle list-valued pruning mixins before type dispatch.

Line 319 resolves pruning mixins dynamically, but Lines 332-337 only branch on single-instance isinstance(...). If resolver returns a list, valid configs will hit the NotImplementedError path.

Proposed fix
     cfg.pruning.pruning_mixin = resolve_pruning_mixin(cfg.pruning.pruning_mixin, cfg.descriptor)
     pruning_mixin = cfg.pruning.pruning_mixin
+    if isinstance(pruning_mixin, list):
+        if len(pruning_mixin) != 1:
+            raise NotImplementedError(
+                f"checkpoint pruning currently supports exactly one pruning mixin, got {len(pruning_mixin)}"
+            )
+        pruning_mixin = pruning_mixin[0]
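An alternative to special-casing the single-element list is to normalize up front and dispatch over a list (helper name is illustrative):

```python
def normalize_mixins(mixin):
    # Accept a single mixin or a list/tuple of them; always return a list
    # so the dispatch loop below can treat both shapes uniformly.
    return list(mixin) if isinstance(mixin, (list, tuple)) else [mixin]
```

Each element can then be type-checked individually, with a clear error for unsupported mixin types.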
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py` around lines 319 - 337,
The code currently assumes resolve_pruning_mixin(cfg.pruning.pruning_mixin,
cfg.descriptor) returns a single object (pruning_mixin) but may return a list;
update the dispatch to handle list-valued mixins by detecting if pruning_mixin
is a list/tuple and iterating over its elements, calling the appropriate
launcher for each element (use the existing launch_ffn_intermediates_prune_ckpt,
launch_attn_groups_prune_ckpt, and launch_experts_prune_ckpt), otherwise fall
back to the single-instance isinstance checks; also validate element types and
raise a clear NotImplementedError for unsupported mixin types so configs that
return lists are processed correctly.

309-310: ⚠️ Potential issue | 🟡 Minor

Create puzzle_dir/ckpts before symlink creation in MOE FFN path.

Line 310 creates a symlink under cfg.puzzle_dir/ckpts without ensuring the directory exists. This can fail on first run.

Proposed fix
-        # Create symlink in puzzle_dir/ckpts
-        os.symlink(output_dir, os.path.join(cfg.puzzle_dir, "ckpts", dirname))
+        # Create symlink in puzzle_dir/ckpts
+        ckpt_path = os.path.join(cfg.puzzle_dir, "ckpts")
+        os.makedirs(ckpt_path, exist_ok=True)
+        os.symlink(output_dir, os.path.join(ckpt_path, dirname))
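The guard can be exercised end-to-end in a temporary directory (POSIX symlink semantics assumed; path names are illustrative):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as puzzle_dir:
    output_dir = os.path.join(puzzle_dir, "pruned_model")
    os.makedirs(output_dir)

    ckpt_path = os.path.join(puzzle_dir, "ckpts")
    os.makedirs(ckpt_path, exist_ok=True)  # safe on first run and on reruns
    link = os.path.join(ckpt_path, "pruned_model")
    os.symlink(output_dir, link)

    link_ok = os.path.islink(link) and os.path.samefile(link, output_dir)
```

Without the `makedirs` call, `os.symlink` raises `FileNotFoundError` on a fresh `puzzle_dir` because the `ckpts` parent does not exist yet.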
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py` around lines 309 - 310,
The symlink creation using os.symlink(output_dir, os.path.join(cfg.puzzle_dir,
"ckpts", dirname)) can fail if cfg.puzzle_dir/ckpts does not exist; before
calling os.symlink in pruning_ckpts.py ensure the target directory exists by
creating cfg.puzzle_dir/ckpts (e.g., call
os.makedirs(os.path.join(cfg.puzzle_dir, "ckpts"), exist_ok=True) or equivalent)
and then create the symlink, keeping the same variables output_dir and dirname.

179-203: ⚠️ Potential issue | 🟠 Major

Pass a dict (not a JSON string) to model_config_overrides_dict in hidden-size pruning.

Line 179 serializes overrides with json.dumps(...), but Line 202 passes that value to model_config_overrides_dict. Other launch paths pass dictionaries; this hidden-dim path is likely sending the wrong type and can break override application.

Proposed fix
-        model_config_overrides_json = json.dumps(
-            {
-                "hidden_size": hidden_size,
-                "ffn": [
-                    {
-                        "intermediate_size": intermediate_size,
-                    }
-                    for intermediate_size in intermediate_sizes
-                ],
-            }
-        )
+        model_config_overrides_dict = {
+            "hidden_size": hidden_size,
+            "ffn": [
+                {"intermediate_size": intermediate_size}
+                for intermediate_size in intermediate_sizes
+            ],
+        }
@@
-        mprint(f"Model config overrides: {model_config_overrides_json}")
+        mprint(f"Model config overrides: {json.dumps(model_config_overrides_dict)}")
@@
-            model_config_overrides_dict=model_config_overrides_json,
+            model_config_overrides_dict=model_config_overrides_dict,
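The type mismatch is easy to demonstrate: `json.dumps` yields a `str`, which is not interchangeable with the dict the other launch paths pass (values here are illustrative):

```python
import json

model_config_overrides = {
    "hidden_size": 1024,
    "ffn": [{"intermediate_size": s} for s in (2048, 4096)],
}
as_json = json.dumps(model_config_overrides)  # a str, not a dict
# A consumer expecting a dict cannot call .items()/.get() on this string;
# keep the dict for the API call and serialize only for logging.
```

Serializing for the log line while passing the dict to the API keeps both readers and callers happy.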
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py` around lines 179 - 203,
The code currently serializes the model overrides into
model_config_overrides_json with json.dumps(...) and then passes that JSON
string into init_child_from_parent via the model_config_overrides_dict
parameter; change model_config_overrides_json to be a Python dict (e.g., name it
model_config_overrides or keep the variable but remove json.dumps) that contains
{"hidden_size": hidden_size, "ffn": [ {"intermediate_size": intermediate_size}
for intermediate_size in intermediate_sizes ]} and pass that dict to
init_child_from_parent's model_config_overrides_dict so the override argument is
a dict (not a JSON string); keep the rest (dirname, output_dir,
mlp_init_config_yaml, and the call to init_child_from_parent) unchanged.
🧹 Nitpick comments (13)
modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py (4)

15-15: Consider removing or scoping the mypy ignore directive.

A blanket # mypy: ignore-errors suppresses all type checking in this file, which reduces type safety for the substantial new code being added (the MoE expert hooks). Consider either removing this directive and fixing type issues, or using targeted # type: ignore[specific-error] comments only where necessary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` at line 15, Remove
the blanket module-level "# mypy: ignore-errors" at the top of base_hooks.py and
either (a) delete it entirely and fix the reported type issues in the new MoE
expert hooks, or (b) replace it with targeted per-line/type ignores (e.g., "#
type: ignore[xxx]") only on the specific functions/classes that legitimately
need suppression (the new MoE expert hook implementations). Run mypy to identify
the exact errors, apply minimal scoped "# type: ignore[...]" comments next to
the offending symbols (or adjust signatures/annotations in the MoE hook
classes/functions) so the rest of the file retains full type checking.

1144-1166: Placeholder implementation needs completion.

This is a no-op placeholder that returns zeroed router logits and unmodified hidden states, which means expert importance scoring won't work for Qwen3-VL models. The TODO at line 1147 should be tracked.

Would you like me to open an issue to track implementing get_router_logits_and_routed_experts based on the Qwen3VLMoeSparseMoe forward pass?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 1144 -
1166, The current
Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts returns
dummy zeros and unmodified hidden states; replace this placeholder with code
that mirrors the Qwen3VLMoeSparseMoe forward pass to extract actual router
logits and per-token routed expert outputs: locate the
Qwen3VLMoeSparseMoe.forward implementation and reproduce its routing steps
(compute router logits if router_logits arg is None, apply top-k gating/softmax
or routing mask, gather the expert outputs into routed_experts with shape
(batch*seq, hidden_dim) and return router_logits tensor shaped (batch*seq,
self.num_local_experts) and routed_experts on the same device), ensure
batch_size calculation matches that forward, respect self.num_local_experts and
devices, and add unit tests or assertions validating shapes and non-zero router
logits for routed cases in the
Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts method.

1031-1049: Unbounded memory growth from accumulated tensors.

Every forward pass appends a tensor to self.router_argsort with no bound, so long calibration runs with many tokens can lead to OOM. Consider:

  • Adding a configurable limit with periodic aggregation
  • Using memory-mapped storage for large-scale runs
  • Documenting memory requirements based on expected token count

The int16 conversion is a good memory optimization, though it limits num_experts to ≤32767.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 1031 -
1049, The hook currently appends an unbounded tensor to self.router_argsort on
every forward in __call__, causing OOM for long runs; change this by adding a
configurable cap (e.g., max_entries or max_tokens) and implement
eviction/aggregation behavior when the cap is reached (e.g., drop oldest
entries, aggregate into a single summary tensor, or flush to disk/memory-map),
expose the cap via activation_hooks_kwargs or the class init, and update
__call__ to respect the cap when pushing router_argsort (ensure conversions to
int16 remain but add a runtime check that num_experts ≤ 32767 and fall back to a
larger dtype or error if exceeded); reference self.router_argsort, __call__, and
the local router_argsort variable for where to make these changes.
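A minimal sketch of the cap-and-evict idea. Plain Python lists stand in for the int16 tensors the hook would store, and the names `CappedRouterLog`, `max_entries`, and `push` are illustrative, not from the PR:

```python
class CappedRouterLog:
    """FIFO-capped accumulator for per-token expert rankings.

    Sketch of the bounded-growth fix: in the real hook each entry would be
    an int16 torch tensor of argsorted router logits; plain lists stand in
    here so the cap/eviction logic stays self-contained.
    """

    INT16_MAX = 32767  # int16 storage can only index experts up to this value

    def __init__(self, num_experts: int, max_entries: int = 1024):
        if num_experts > self.INT16_MAX:
            raise ValueError(f"num_experts={num_experts} exceeds int16 range")
        self.num_experts = num_experts
        self.max_entries = max_entries
        self.router_argsort: list[list[int]] = []

    def push(self, router_logits: list[float]) -> None:
        # Rank experts by descending logit, mirroring torch.argsort(..., descending=True).
        ranking = sorted(range(self.num_experts), key=lambda e: -router_logits[e])
        if len(self.router_argsort) >= self.max_entries:
            self.router_argsort.pop(0)  # evict the oldest entry (simple FIFO)
        self.router_argsort.append(ranking)


log = CappedRouterLog(num_experts=4, max_entries=2)
for logits in ([0.1, 0.9, 0.2, 0.0], [0.5, 0.1, 0.8, 0.3], [0.0, 0.0, 0.0, 1.0]):
    log.push(logits)
print(len(log.router_argsort))  # capped at max_entries
```

FIFO eviction keeps only the newest entries; aggregating evicted entries into a running summary (or flushing them to disk) would instead preserve information from the whole run.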

1169-1202: Placeholder implementation needs completion.

Similar to Qwen3VLRemoveExpertsIndependentHook, this placeholder won't provide meaningful expert importance scores for GPT-OSS models. The extensive docstring is helpful but the implementation needs to be completed based on GptOssSparseMoeBlock.forward().

Would you like me to open an issue to track implementing this hook, or would you like assistance sketching out the implementation based on the GPT-OSS MoE forward pass?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 1169 -
1202, Summary: The placeholder get_router_logits_and_routed_experts in class
GptOssRemoveExpertsIndependentHook returns zeros and no-op expert outputs; it
must be implemented to mirror GptOssSparseMoeBlock.forward() to produce
meaningful router logits and routed expert outputs. Fix: update
GptOssRemoveExpertsIndependentHook.get_router_logits_and_routed_experts to call
or replicate the routing logic from GptOssSparseMoeBlock.forward() — accept an
optional router_logits input or compute router logits using the same gating
layer(s) used in GptOssSparseMoeBlock, return router_logits shaped (num_tokens,
num_local_experts) and routed_experts as the actual per-token expert outputs
(flatten/reshape to match forward’s routed expert representation), preserve
device/dtype and batch/token ordering, and handle cases where hidden_states is
(batch, seq_len, hidden_dim) or (num_tokens, hidden_dim); reference
GptOssSparseMoeBlock.forward(), the get_router_logits_and_routed_experts method,
and self.num_local_experts to locate where to implement the logic.
modelopt/torch/puzzletron/utils/dummy_modules.py (2)

44-50: Align DummyBlock.forward return type with actual behavior.

The method always returns x, so tuple[torch.Tensor, None] in the annotation is misleading and weakens static checks.

Proposed fix
-    ) -> torch.Tensor | tuple[torch.Tensor, None]:
+    ) -> torch.Tensor:
         return x
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/utils/dummy_modules.py` around lines 44 - 50, The
return type annotation of DummyBlock.forward is incorrect: it currently declares
torch.Tensor | tuple[torch.Tensor, None] but the implementation always returns
x; update the type hint on DummyBlock.forward to return torch.Tensor (remove the
tuple union) to match actual behavior and improve static typing checks,
adjusting any import/type references if needed.

30-33: Use a Protocol to type-hint the incompatible_keys parameter for PyTorch version resilience.

torch.nn.modules.module._IncompatibleKeys is a private API that may break across PyTorch versions. The public contract for register_load_state_dict_post_hook guarantees an object with .missing_keys and .unexpected_keys attributes. Define and use a Protocol instead:

from typing import Protocol

class IncompatibleKeys(Protocol):
    missing_keys: list[str]
    unexpected_keys: list[str]

@staticmethod
def load_state_dict_post_hook(
    module: torch.nn.Module,
    incompatible_keys: IncompatibleKeys,
) -> None:

This decouples from the private implementation detail while maintaining type safety.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/utils/dummy_modules.py` around lines 30 - 33,
Replace the use of the private torch type in the load_state_dict_post_hook
signature by defining a small Protocol (e.g., IncompatibleKeys with attributes
missing_keys: list[str] and unexpected_keys: list[str]) and change the function
parameter type to that Protocol; import Protocol from typing, remove the
torch.nn.modules.module._IncompatibleKeys annotation, and keep the function name
load_state_dict_post_hook and its behavior unchanged so the hook accepts any
object exposing .missing_keys and .unexpected_keys.
tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml (1)

13-14: Track TODO items for init mode compatibility.

These TODO comments indicate known limitations with the mlp_init_mode and gqa_init_mode settings. Consider creating issues to track these improvements so they aren't forgotten.

Would you like me to open issues to track the CopyAsIs/FromTeacher compatibility work for these init modes?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml`
around lines 13 - 14, Add tracked issues for the TODOs so the compatibility work
for init modes isn't lost: create two issues referencing mlp_init_mode and
gqa_init_mode respectively, describe current limitation ("Truncate" and
"AverageKV" only) and request support for CopyAsIs and FromTeacher
initialization paths, link to the config file and mention symbols CopyAsIs and
FromTeacher, and assign/label them as enhancement and technical-debt so they can
be prioritized.
tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py (1)

30-62: Add if __name__ == "__main__" guard to prevent execution on import.

This script executes immediately when imported, which could cause unintended side effects. Wrapping the execution logic in a main guard is a best practice, especially for utility scripts.

♻️ Proposed fix to add main guard
-with open(in_path, encoding="utf-8") as f:
-    tokenizer_data = json.load(f)
-
-# Get and sort the original vocab by index (frequency proxy)
-orig_vocab = tokenizer_data["model"]["vocab"]
-
-# Sort tokens by their original index (lowest index = assumed most common/important)
-sorted_tokens = sorted(orig_vocab.items(), key=lambda item: item[1])
-
-# Keep the top N tokens
-tokens_to_keep = [tok for tok, idx in sorted_tokens[:NUM_TO_KEEP]]
-
-# Re-index the selected tokens: 0..N-1
-small_vocab = {tok: i for i, tok in enumerate(tokens_to_keep)}
-tokenizer_data["model"]["vocab"] = small_vocab
-
-# Update vocab size
-if "vocab_size" in tokenizer_data["model"]:
-    tokenizer_data["model"]["vocab_size"] = len(small_vocab)
-
-# Optionally remove merges if present and unneeded (mostly for BPE/WordPiece)
-if "merges" in tokenizer_data["model"]:
-    tokenizer_data["model"]["merges"] = []
-
-# Remove added_tokens if not needed
-if "added_tokens" in tokenizer_data:
-    tokenizer_data["added_tokens"] = []
-
-# Write out the truncated tokenizer.json
-with open(out_path, "w", encoding="utf-8") as f:
-    json.dump(tokenizer_data, f, indent=2, ensure_ascii=False)
-
-print(f"Truncated tokenizer saved to: {out_path}")
+if __name__ == "__main__":
+    with open(in_path, encoding="utf-8") as f:
+        tokenizer_data = json.load(f)
+
+    # Get and sort the original vocab by index (frequency proxy)
+    orig_vocab = tokenizer_data["model"]["vocab"]
+
+    # Sort tokens by their original index (lowest index = assumed most common/important)
+    sorted_tokens = sorted(orig_vocab.items(), key=lambda item: item[1])
+
+    # Keep the top N tokens
+    tokens_to_keep = [tok for tok, idx in sorted_tokens[:NUM_TO_KEEP]]
+
+    # Re-index the selected tokens: 0..N-1
+    small_vocab = {tok: i for i, tok in enumerate(tokens_to_keep)}
+    tokenizer_data["model"]["vocab"] = small_vocab
+
+    # Update vocab size
+    if "vocab_size" in tokenizer_data["model"]:
+        tokenizer_data["model"]["vocab_size"] = len(small_vocab)
+
+    # Optionally remove merges if present and unneeded (mostly for BPE/WordPiece)
+    if "merges" in tokenizer_data["model"]:
+        tokenizer_data["model"]["merges"] = []
+
+    # Remove added_tokens if not needed
+    if "added_tokens" in tokenizer_data:
+        tokenizer_data["added_tokens"] = []
+
+    # Write out the truncated tokenizer.json
+    with open(out_path, "w", encoding="utf-8") as f:
+        json.dump(tokenizer_data, f, indent=2, ensure_ascii=False)
+
+    print(f"Truncated tokenizer saved to: {out_path}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py` around
lines 30 - 62, The file executes its truncation logic at import time; wrap the
top-level code (the open(in_path)... json.dump(...) and print(...) sequence)
into a main() function (or similar) and add an if __name__ == "__main__": guard
that calls main(), ensuring variables like in_path, out_path and NUM_TO_KEEP are
obtained/used inside main() so importing this module does not run the truncation
immediately.
modelopt/torch/puzzletron/pruning/pruning_mixin.py (1)

52-73: Remove or implement the commented-out code.

The commented-out prune_single_layer abstract method (lines 63-73) should either be implemented or removed to avoid confusion about the intended interface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_mixin.py` around lines 52 - 73, The
commented-out abstract method prune_single_layer in class PruningMixIn is
ambiguous and should be resolved: either uncomment and declare it as an abstract
method with the given signature (def prune_single_layer(self, layer_idx: int,
parent_state_dict: dict, new_state_dict: dict, original_config:
PretrainedConfig, new_config: PretrainedConfig, **kwargs) -> None: raise
NotImplementedError) so implementers must provide pruning logic, or remove the
entire commented block to avoid suggesting an interface that doesn't exist;
update the class docstring or supported_hooks() docstring to document the chosen
interface change and ensure references in callers (e.g., any code expecting
PruningMixIn.prune_single_layer) are adjusted accordingly.
modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py (1)

115-120: Document in-place mutation of keys parameter.

The method mutates the keys dict passed as an argument by popping entries. This side effect should be documented in the method's docstring or the parameter should be copied if mutation is unintended.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py` around
lines 115 - 120, The loop mutates the passed-in dict parameter `keys` by popping
entries (see loops referencing `parent_router_keys` and `orig_experts_keys`);
either update the method/class docstring to explicitly state that `keys` is
modified in-place, or avoid the side effect by operating on a shallow copy
(e.g., work with a local copy of `keys` before the two pop loops) so callers'
dicts are not mutated unexpectedly—reference the method in
expert_removal_pruning_mixin (where `keys`, `parent_router_keys`, and
`orig_experts_keys` are used) when making the change.
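A sketch of the copy-first alternative. The helper name `split_expert_keys` and the `".router."` / `".experts."` key patterns are illustrative assumptions, not taken from the PR:

```python
def split_expert_keys(keys: dict[str, object]) -> tuple[dict, dict, dict]:
    """Partition state-dict keys into router / expert / remainder groups.

    Operates on a shallow copy so the caller's `keys` dict is never
    mutated -- the guarantee the comment above asks for.
    """
    remaining = dict(keys)  # shallow copy guards the caller's dict
    router_keys = {k: remaining.pop(k) for k in list(remaining) if ".router." in k}
    expert_keys = {k: remaining.pop(k) for k in list(remaining) if ".experts." in k}
    return router_keys, expert_keys, remaining


keys = {
    "layers.0.moe.router.weight": 1,
    "layers.0.moe.experts.0.w1": 2,
    "layers.0.norm.weight": 3,
}
routers, experts, rest = split_expert_keys(keys)
print(len(keys))  # still 3: the original dict is untouched
```

If in-place mutation is actually intended, the cheaper fix is a one-line docstring note ("`keys` is consumed by this call") so callers know not to reuse the dict.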
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py (1)

260-275: Consider validating list homogeneity.

maybe_cast_block_configs checks only the first element to determine if the list contains dicts. If the list contains mixed types (e.g., [dict, BlockConfig]), this could lead to unexpected behavior.

♻️ Proposed fix to handle mixed types
 def maybe_cast_block_configs(
     block_configs: List[BlockConfig | dict] | None,
 ) -> List[BlockConfig] | None:
     if not block_configs:
         return block_configs
-    if isinstance(block_configs[0], dict):
-        return [BlockConfig(**conf) for conf in block_configs]
-    return block_configs
+    return [
+        BlockConfig(**conf) if isinstance(conf, dict) else conf
+        for conf in block_configs
+    ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py` around
lines 260 - 275, The function maybe_cast_block_configs currently inspects only
block_configs[0]; change it to validate homogeneity across the list: iterate
over block_configs and if every element is a dict, return [BlockConfig(**conf)
for conf in block_configs]; if every element is a BlockConfig, return the list
as-is; if the list contains mixed types, raise a TypeError (or ValueError) with
a clear message mentioning maybe_cast_block_configs and BlockConfig so callers
know the input must be all dicts or all BlockConfig instances; keep None/empty
behavior unchanged.
modelopt/torch/puzzletron/pruning/pruning_utils.py (1)

186-186: Consider thread-safety for global cache.

ACTIVATIONS_LOG is a global mutable dict: concurrent hook callbacks on multiple threads can race on its updates, and in multi-process distributed training each rank only sees its own copy. Consider a lock-protected accessor (or per-rank files merged after the run), or document the single-process, single-thread assumption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` at line 186,
ACTIVATIONS_LOG is a global mutable dict that can race in multi-thread/process
training; make it thread/process-safe by replacing it with a protected/shared
structure (e.g., wrap accesses with a module-level threading.RLock or use a
multiprocessing.Manager().dict() if you need cross-process visibility) and
update every read/write to acquire the lock (or access the Manager dict) where
ACTIVATIONS_LOG is used, or explicitly document in the module docstring that the
module assumes single-process execution if you choose not to change behavior.
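A minimal lock-protected variant for the in-process case, assuming hook callbacks may fire from multiple threads (`record_activation` and `snapshot` are illustrative names, not from the PR):

```python
import threading

_ACTIVATIONS_LOCK = threading.RLock()
ACTIVATIONS_LOG: dict[str, list] = {}


def record_activation(module_name: str, value: float) -> None:
    """Append under a module-level lock so hook callbacks running on
    different threads cannot interleave the dict/list updates."""
    with _ACTIVATIONS_LOCK:
        ACTIVATIONS_LOG.setdefault(module_name, []).append(value)


def snapshot() -> dict[str, list]:
    """Copy the log under the lock for a consistent read-out."""
    with _ACTIVATIONS_LOCK:
        return {name: list(vals) for name, vals in ACTIVATIONS_LOG.items()}


threads = [
    threading.Thread(target=lambda: [record_activation("mlp.0", 1.0) for _ in range(100)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(snapshot()["mlp.0"]))
```

The lock only guards threads within one process; across ranks, the existing per-rank `rank_*.pth` files (merged after the run) already sidestep sharing, and a `multiprocessing.Manager().dict()` would be the alternative if true cross-process visibility were needed.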
modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py (1)

119-124: Fail fast on unknown descriptor names in ModelDescriptorFactory.get.

Returning the raw string for unknown descriptors delays errors to downstream code paths. Raising here gives clearer diagnostics at the source.

Proposed fix
     @classmethod
     def get(cls, value: str | ModelDescriptor):
         """Get a registered model descriptor by name or return the descriptor if already resolved."""
         if isinstance(value, str):
             if value in cls.CLASS_MAPPING:
                 return cls.CLASS_MAPPING[value]
+            raise KeyError(
+                f"Unknown model descriptor '{value}'. "
+                f"Known descriptors: {sorted(cls.CLASS_MAPPING.keys())}"
+            )
         return value
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`
around lines 119 - 124, The get method in ModelDescriptorFactory currently
returns the raw string when an unknown descriptor name is passed, which delays
errors; modify ModelDescriptorFactory.get to check if value is a str and, if
it's not present in ModelDescriptorFactory.CLASS_MAPPING, raise a clear
exception (e.g., KeyError or ValueError) that includes the unknown name and a
brief hint about valid keys, otherwise return the mapped descriptor; keep
existing behavior of returning the passed-through ModelDescriptor when value is
already resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py`:
- Around line 849-856: Avoid calling next(self.moe.parameters()) without a
fallback and stop reusing self.diffs as both list and dict: first extract the
keys into a local variable (e.g., diff_keys = ["mse", "cosine"]), then get a
safe param sample via sample_param = next(self.moe.parameters(), None) and if
None raise a clear error or handle it (e.g., create tensors on cpu or choose
device explicitly); finally set self.diffs = {k:
torch.zeros(size=(self.num_local_experts,), dtype=torch.float32,
device=sample_param.device if sample_param is not None else torch.device("cpu"))
for k in diff_keys} so initialization is clear and robust.

In `@modelopt/torch/puzzletron/anymodel/README.md`:
- Around line 97-99: Update the README.md end-to-end example link which
currently points to a non-existent
tests/gpu/torch/puzzletron/test_compress_model.py (or test_compress.py); change
the reference to the actual test file
tests/gpu/torch/puzzletron/test_puzzletron.py so the example link and
description in README.md correctly point to test_puzzletron.py.

In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py`:
- Around line 122-132: The code in the fused-experts branch (guarded by
layer_descriptor.is_fused_experts) assumes fused_keys lists are non-empty and
directly indexes fused_keys[0], which can raise IndexError; update the logic in
the block that populates orig_experts_weights and new_experts_weights to
validate fused_keys before using fused_keys[0] (e.g., skip or raise a clear
error when fused_keys is empty), referencing the variables
orig_experts_keys/new_experts_keys, parent_state_dict/new_state_dict and the
dicts orig_experts_weights/new_experts_weights so the fix is applied where those
are constructed.

In `@modelopt/torch/puzzletron/pruning/ffn_intermediate_pruning_mixin.py`:
- Around line 84-100: The code accesses parent_state_dict[mlp_key] without
verifying the key exists, risking a KeyError; update the loop so before calling
_init_mlp_module you retrieve parent_val = parent_state_dict.get(mlp_key) (or
handle missing key explicitly) and pass parent_val instead of
parent_state_dict[mlp_key]; adjust _init_mlp_module invocation in this block
(the call that assigns mlp_module_weight, pruned_filters, projection_matrix) to
accept a possibly None parent value and ensure layer_out_state_dict[mlp_key] is
set only after a successful init.
- Around line 29-32: The imported _init_mlp_module is coming from
modelopt.torch.puzzletron.tools.bypassed_training.child_init but the code calls
it using the pruning_utils signature; update the import to pull _init_mlp_module
from modelopt.torch.puzzletron.pruning.pruning_utils (or the module where
pruning_utils lives) instead of child_init, keeping MlpInitMode import as needed
so the call in this file (which passes mlp_prefix, expanded_dim, layer_idx,
new_item, new_config, orig_item, original_config, mlp_init_config, etc.) matches
the function signature in pruning_utils; simply change the from ... import line
to import _init_mlp_module from pruning_utils so no other call-site changes are
required.

In `@modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py`:
- Line 75: The code in kv_heads_pruning_mixin.py directly reads
new_config.head_dim into head_size, which can raise if that attribute is absent;
update the assignment in the function/method where head_size is set (referencing
new_config and head_size in kv_heads_pruning_mixin) to use a fallback via
getattr(new_config, "head_dim", new_config.hidden_size //
new_config.num_attention_heads) so head_size is computed safely when head_dim is
not defined.
- Around line 71-73: The unpacking of q_name, k_name, v_name, o_name assumes
self.layer_descriptor.qkvo_weight_names has exactly four entries; validate the
length before unpacking (e.g., check
len(self.layer_descriptor.qkvo_weight_names) == 4) and raise a clear error or
fall back to a safe behavior; alternatively implement defensive unpacking by
iterating over qkvo_weight_names to build names or using default/fallback
values. Locate the code using self.layer_descriptor.qkvo_weight_names (in
KVHeadsLayerDescriptor and the KVHeadsPruningMixin where q_name, k_name, v_name,
o_name are assigned) and add the length check or defensive logic to prevent
ValueError on unexpected lengths.

In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Around line 195-201: The torch.load call inside the ACTIVATIONS_LOG.update
comprehension (iterating Path(activations_log_dir).glob("rank_*.pth")) must
either pass weights_only=True or include an inline comment justifying why
loading with executable objects is safe; update the torch.load(...) invocation
used to read module logs (the expression for module_name, module_log in
torch.load(p).items()) to torch.load(p, weights_only=True) or add a concise
comment immediately after the call explaining that files are internally
generated and never user-supplied, referencing activations_log_dir and the
ACTIVATIONS_LOG update for context.
- Around line 369-372: The code accesses a non-existent attribute
attention.n_heads_in_group (used to compute n_heads_in_group,
orig_n_heads_in_group, num_kv_heads, orig_num_kv_heads) which will raise
AttributeError; update pruning_utils.py to defensively read n_heads_in_group by
first checking hasattr or getattr on
new_config.block_configs[layer_idx].attention and
original_config.block_configs[layer_idx].attention, and if missing derive the
value from the existing num_key_value_heads field (or via the mapping used in
configuration_decilm.py) before computing num_kv_heads/orig_num_kv_heads so the
logic works when AttentionConfig only provides num_key_value_heads. Ensure you
reference the attention config objects used in pruning_utils.py
(new_config.block_configs[layer_idx].attention and
original_config.block_configs[layer_idx].attention) when implementing this
fallback.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 184-193: The code writes the safetensors index (index_path /
SAFE_WEIGHTS_INDEX_NAME via _write_file_process_safe) before removing tied
output-embedding entries, and it checks the hardcoded "lm_head.weight" instead
of using the descriptor; move the tied-embedding handling block to run before
building/writing the index_json and index_path, and replace the hardcoded string
with the descriptor-derived name computed as lm_head_weight_name =
f"{descriptor.output_embedding_name()}.weight"; then filter both state_dict and
weight_map to drop that lm_head_weight_name so the persisted index and
weight_map are consistent with the saved tensors.
- Around line 130-132: The call to AutoConfig.from_pretrained hardcodes
trust_remote_code=True; update the surrounding function signature to accept a
new parameter trust_remote_code: bool = False (expose it to callers) and pass
that parameter into AutoConfig.from_pretrained instead of the literal True,
preserving existing kwargs (checkpoint_dir, model_config_overrides,
return_unused_kwargs=True) so callers can opt in to remote code only when
needed; ensure any callers of the function are updated or that the parameter is
backwards-compatible by using the default False.

In `@tests/gpu/torch/puzzletron/test_puzzletron.py`:
- Around line 122-182: The test disabled the core Puzzletron pipeline
(puzzletron.puzzletron invocation) and all downstream assertions, so restore the
call to puzzletron.puzzletron(str(hydra_config_dir), hydra_config_subdir,
str(puzzle_dir), str(dataset_path)) and re-enable the assertions block that
checks both MoE and non-MoE paths; ensure the branches referencing
has_moe_layers and the helper validators (_assert_lm_loss,
_assert_score_pruning_activations, _assert_mip_solutions) are active again, plus
the file/directory existence checks (replacement_library.json,
subblock_stats.json, solution_0.json and the mip solution directory checks) so
the test verifies pruning/MIP/realization behavior end-to-end.
- Around line 248-269: The test currently indexes EXPECTED_PRUNING_VALUES with
hf_config_name which raises KeyError before the None branch can run; change the
lookup to use EXPECTED_PRUNING_VALUES.get(hf_config_name) (or
.get(hf_config_name, None)) when assigning expected so missing keys return None
and the existing fallback branch executes; update the reference where expected
is set (the assignment using EXPECTED_PRUNING_VALUES and hf_config_name) and
leave the subsequent logic that checks if expected is not None unchanged.

---

Outside diff comments:
In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py`:
- Around line 319-337: The code currently assumes
resolve_pruning_mixin(cfg.pruning.pruning_mixin, cfg.descriptor) returns a
single object (pruning_mixin) but may return a list; update the dispatch to
handle list-valued mixins by detecting if pruning_mixin is a list/tuple and
iterating over its elements, calling the appropriate launcher for each element
(use the existing launch_ffn_intermediates_prune_ckpt,
launch_attn_groups_prune_ckpt, and launch_experts_prune_ckpt), otherwise fall
back to the single-instance isinstance checks; also validate element types and
raise a clear NotImplementedError for unsupported mixin types so configs that
return lists are processed correctly.
- Around line 309-310: The symlink creation using os.symlink(output_dir,
os.path.join(cfg.puzzle_dir, "ckpts", dirname)) can fail if cfg.puzzle_dir/ckpts
does not exist; before calling os.symlink in pruning_ckpts.py ensure the target
directory exists by creating cfg.puzzle_dir/ckpts (e.g., call
os.makedirs(os.path.join(cfg.puzzle_dir, "ckpts"), exist_ok=True) or equivalent)
and then create the symlink, keeping the same variables output_dir and dirname.
- Around line 179-203: The code currently serializes the model overrides into
model_config_overrides_json with json.dumps(...) and then passes that JSON
string into init_child_from_parent via the model_config_overrides_dict
parameter; change model_config_overrides_json to be a Python dict (e.g., name it
model_config_overrides or keep the variable but remove json.dumps) that contains
{"hidden_size": hidden_size, "ffn": [ {"intermediate_size": intermediate_size}
for intermediate_size in intermediate_sizes ]} and pass that dict to
init_child_from_parent's model_config_overrides_dict so the override argument is
a dict (not a JSON string); keep the rest (dirname, output_dir,
mlp_init_config_yaml, and the call to init_child_from_parent) unchanged.

---

Nitpick comments:
In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py`:
- Line 15: Remove the blanket module-level "# mypy: ignore-errors" at the top of
base_hooks.py and either (a) delete it entirely and fix the reported type issues
in the new MoE expert hooks, or (b) replace it with targeted per-line/type
ignores (e.g., "# type: ignore[xxx]") only on the specific functions/classes
that legitimately need suppression (the new MoE expert hook implementations).
Run mypy to identify the exact errors, apply minimal scoped "# type:
ignore[...]" comments next to the offending symbols (or adjust
signatures/annotations in the MoE hook classes/functions) so the rest of the
file retains full type checking.
- Around line 1144-1166: The current
Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts returns
dummy zeros and unmodified hidden states; replace this placeholder with code
that mirrors the Qwen3VLMoeSparseMoe forward pass to extract actual router
logits and per-token routed expert outputs: locate the
Qwen3VLMoeSparseMoe.forward implementation and reproduce its routing steps
(compute router logits if router_logits arg is None, apply top-k gating/softmax
or routing mask, gather the expert outputs into routed_experts with shape
(batch*seq, hidden_dim) and return router_logits tensor shaped (batch*seq,
self.num_local_experts) and routed_experts on the same device), ensure
batch_size calculation matches that forward, respect self.num_local_experts and
devices, and add unit tests or assertions validating shapes and non-zero router
logits for routed cases in the
Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts method.
- Around line 1031-1049: The hook currently appends an unbounded tensor to
self.router_argsort on every forward in __call__, causing OOM for long runs;
change this by adding a configurable cap (e.g., max_entries or max_tokens) and
implement eviction/aggregation behavior when the cap is reached (e.g., drop
oldest entries, aggregate into a single summary tensor, or flush to
disk/memory-map), expose the cap via activation_hooks_kwargs or the class init,
and update __call__ to respect the cap when pushing router_argsort (ensure
conversions to int16 remain but add a runtime check that num_experts ≤ 32767 and
fall back to a larger dtype or error if exceeded); reference
self.router_argsort, __call__, and the local router_argsort variable for where
to make these changes.
- Around line 1169-1202: Summary: The placeholder
get_router_logits_and_routed_experts in class GptOssRemoveExpertsIndependentHook
returns zeros and no-op expert outputs; it must be implemented to mirror
GptOssSparseMoeBlock.forward() to produce meaningful router logits and routed
expert outputs. Fix: update
GptOssRemoveExpertsIndependentHook.get_router_logits_and_routed_experts to call
or replicate the routing logic from GptOssSparseMoeBlock.forward() — accept an
optional router_logits input or compute router logits using the same gating
layer(s) used in GptOssSparseMoeBlock, return router_logits shaped (num_tokens,
num_local_experts) and routed_experts as the actual per-token expert outputs
(flatten/reshape to match forward’s routed expert representation), preserve
device/dtype and batch/token ordering, and handle cases where hidden_states is
(batch, seq_len, hidden_dim) or (num_tokens, hidden_dim); reference
GptOssSparseMoeBlock.forward(), the get_router_logits_and_routed_experts method,
and self.num_local_experts to locate where to implement the logic.

In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`:
- Around line 119-124: The get method in ModelDescriptorFactory currently
returns the raw string when an unknown descriptor name is passed, which delays
errors; modify ModelDescriptorFactory.get to check if value is a str and, if
it's not present in ModelDescriptorFactory.CLASS_MAPPING, raise a clear
exception (e.g., KeyError or ValueError) that includes the unknown name and a
brief hint about valid keys, otherwise return the mapped descriptor; keep
existing behavior of returning the passed-through ModelDescriptor when value is
already resolved.
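A sketch of the stricter lookup, under the assumption that `CLASS_MAPPING` maps descriptor names to descriptor classes (a string stands in for the class here):

```python
class ModelDescriptorFactory:
    """Illustrative factory; the real CLASS_MAPPING holds descriptor classes."""

    CLASS_MAPPING = {"llama": "LlamaModelDescriptor"}

    @classmethod
    def get(cls, value):
        if isinstance(value, str):
            if value not in cls.CLASS_MAPPING:
                # Fail fast with the unknown name and a hint at valid keys
                raise KeyError(
                    f"Unknown model descriptor {value!r}; "
                    f"valid names: {sorted(cls.CLASS_MAPPING)}"
                )
            return cls.CLASS_MAPPING[value]
        # Already-resolved ModelDescriptor instances pass through unchanged
        return value
```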

In `@modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py`:
- Around line 260-275: The function maybe_cast_block_configs currently inspects
only block_configs[0]; change it to validate homogeneity across the list:
iterate over block_configs and if every element is a dict, return
[BlockConfig(**conf) for conf in block_configs]; if every element is a
BlockConfig, return the list as-is; if the list contains mixed types, raise a
TypeError (or ValueError) with a clear message mentioning
maybe_cast_block_configs and BlockConfig so callers know the input must be all
dicts or all BlockConfig instances; keep None/empty behavior unchanged.
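The homogeneity check can be sketched as follows; the `BlockConfig` dataclass here is a stand-in for the real class:

```python
from dataclasses import dataclass


@dataclass
class BlockConfig:  # stand-in for the real BlockConfig
    name: str = ""


def maybe_cast_block_configs(block_configs):
    if not block_configs:
        return block_configs  # None/empty behavior unchanged
    if all(isinstance(c, dict) for c in block_configs):
        return [BlockConfig(**c) for c in block_configs]
    if all(isinstance(c, BlockConfig) for c in block_configs):
        return block_configs
    raise TypeError(
        "maybe_cast_block_configs expects all dicts or all BlockConfig instances"
    )
```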

In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py`:
- Around line 115-120: The loop mutates the passed-in dict parameter `keys` by
popping entries (see loops referencing `parent_router_keys` and
`orig_experts_keys`); either update the method/class docstring to explicitly
state that `keys` is modified in-place, or avoid the side effect by operating on
a shallow copy (e.g., work with a local copy of `keys` before the two pop loops)
so callers' dicts are not mutated unexpectedly—reference the method in
expert_removal_pruning_mixin (where `keys`, `parent_router_keys`, and
`orig_experts_keys` are used) when making the change.
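The copy-based variant can be as small as this; the function name and key names are hypothetical:

```python
def prune_experts(keys: dict) -> dict:
    """Work on a shallow copy so the caller's dict is never mutated."""
    keys = dict(keys)  # local copy; pops below no longer affect the caller
    parent_router_keys = keys.pop("router", [])
    orig_experts_keys = keys.pop("experts", [])
    return {"router": parent_router_keys, "experts": orig_experts_keys}
```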

In `@modelopt/torch/puzzletron/pruning/pruning_mixin.py`:
- Around line 52-73: The commented-out abstract method prune_single_layer in
class PruningMixIn is ambiguous and should be resolved: either uncomment and
declare it as an abstract method with the given signature (def
prune_single_layer(self, layer_idx: int, parent_state_dict: dict,
new_state_dict: dict, original_config: PretrainedConfig, new_config:
PretrainedConfig, **kwargs) -> None: raise NotImplementedError) so implementers
must provide pruning logic, or remove the entire commented block to avoid
suggesting an interface that doesn't exist; update the class docstring or
supported_hooks() docstring to document the chosen interface change and ensure
references in callers (e.g., any code expecting PruningMixIn.prune_single_layer)
are adjusted accordingly.
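Option (a), declaring the method abstract, looks roughly like this; the `PretrainedConfig` annotations are dropped so the sketch stays self-contained:

```python
from abc import ABC, abstractmethod


class PruningMixIn(ABC):
    """Sketch: prune_single_layer declared as part of the interface."""

    @abstractmethod
    def prune_single_layer(
        self,
        layer_idx: int,
        parent_state_dict: dict,
        new_state_dict: dict,
        original_config,
        new_config,
        **kwargs,
    ) -> None:
        raise NotImplementedError
```

Declaring it abstract means subclasses that forget to implement it fail at instantiation time rather than deep inside a pruning run.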

In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Line 186: ACTIVATIONS_LOG is a global mutable dict that can race in
multi-thread/process training; make it thread/process-safe by replacing it with
a protected/shared structure (e.g., wrap accesses with a module-level
threading.RLock or use a multiprocessing.Manager().dict() if you need
cross-process visibility) and update every read/write to acquire the lock (or
access the Manager dict) where ACTIVATIONS_LOG is used, or explicitly document
in the module docstring that the module assumes single-process execution if you
choose not to change behavior.
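A lock-protected wrapper is one way to do it; the accessor names here are illustrative:

```python
import threading

_ACTIVATIONS_LOG: dict = {}
_ACTIVATIONS_LOCK = threading.RLock()


def log_activation(key: str, value) -> None:
    # Every write goes through the lock, so concurrent hooks cannot race
    with _ACTIVATIONS_LOCK:
        _ACTIVATIONS_LOG.setdefault(key, []).append(value)


def read_activations(key: str) -> list:
    # Return a copy so callers never iterate a dict being mutated elsewhere
    with _ACTIVATIONS_LOCK:
        return list(_ACTIVATIONS_LOG.get(key, []))
```

Note this only covers threads; cross-process visibility would still need a `multiprocessing.Manager().dict()` or per-rank files, as the comment suggests.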

In `@modelopt/torch/puzzletron/utils/dummy_modules.py`:
- Around line 44-50: The return type annotation of DummyBlock.forward is
incorrect: it currently declares torch.Tensor | tuple[torch.Tensor, None] but
the implementation always returns x; update the type hint on DummyBlock.forward
to return torch.Tensor (remove the tuple union) to match actual behavior and
improve static typing checks, adjusting any import/type references if needed.
- Around line 30-33: Replace the use of the private torch type in the
load_state_dict_post_hook signature by defining a small Protocol (e.g.,
IncompatibleKeys with attributes missing_keys: list[str] and unexpected_keys:
list[str]) and change the function parameter type to that Protocol; import
Protocol from typing, remove the torch.nn.modules.module._IncompatibleKeys
annotation, and keep the function name load_state_dict_post_hook and its
behavior unchanged so the hook accepts any object exposing .missing_keys and
.unexpected_keys.
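A sketch of the Protocol-based signature; the hook body shown (clearing both key lists) is illustrative, not the real implementation:

```python
from typing import Protocol


class IncompatibleKeys(Protocol):
    """Structural stand-in for torch's private _IncompatibleKeys."""

    missing_keys: list[str]
    unexpected_keys: list[str]


def load_state_dict_post_hook(module, incompatible_keys: IncompatibleKeys) -> None:
    # Illustrative body: treat dummy-module weights as expected
    incompatible_keys.missing_keys.clear()
    incompatible_keys.unexpected_keys.clear()
```

Because `Protocol` uses structural typing, the real torch object satisfies the annotation without importing anything private.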

In
`@tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml`:
- Around line 13-14: Add tracked issues for the TODOs so the compatibility work
for init modes isn't lost: create two issues referencing mlp_init_mode and
gqa_init_mode respectively, describe current limitation ("Truncate" and
"AverageKV" only) and request support for CopyAsIs and FromTeacher
initialization paths, link to the config file and mention symbols CopyAsIs and
FromTeacher, and assign/label them as enhancement and technical-debt so they can
be prioritized.

In `@tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py`:
- Around line 30-62: The file executes its truncation logic at import time; wrap
the top-level code (the open(in_path)... json.dump(...) and print(...) sequence)
into a main() function (or similar) and add an if __name__ == "__main__": guard
that calls main(), ensuring variables like in_path, out_path and NUM_TO_KEEP are
obtained/used inside main() so importing this module does not run the truncation
immediately.
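The guard itself is mechanical; the truncation logic below is a simplified stand-in for the real script's, and `NUM_TO_KEEP` plus the path defaults are assumed names:

```python
import json

NUM_TO_KEEP = 2000  # illustrative default


def main(in_path: str, out_path: str, num_to_keep: int = NUM_TO_KEEP) -> None:
    with open(in_path) as f:
        tok = json.load(f)
    vocab = tok["model"]["vocab"]
    # Keep the num_to_keep lowest-id tokens (simplified stand-in logic)
    tok["model"]["vocab"] = dict(
        sorted(vocab.items(), key=lambda kv: kv[1])[:num_to_keep]
    )
    with open(out_path, "w") as f:
        json.dump(tok, f)
    print(f"Wrote {out_path} with {num_to_keep} tokens")


if __name__ == "__main__":
    main("tokenizer.json", "tokenizer.trunc.json")
```

With the guard in place, `import truncate_tokenizer` becomes a no-op and tests can call `main()` with explicit paths.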

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 33561e3e-db24-44f3-b8b6-dc7786856232

📥 Commits

Reviewing files that changed from the base of the PR and between e82164f and 2099df3.

📒 Files selected for processing (28)
  • modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py
  • modelopt/torch/puzzletron/anymodel/README.md
  • modelopt/torch/puzzletron/anymodel/__init__.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
  • modelopt/torch/puzzletron/anymodel/models/__init__.py
  • modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py
  • modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py
  • modelopt/torch/puzzletron/pruning/ffn_intermediate_pruning_mixin.py
  • modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py
  • modelopt/torch/puzzletron/pruning/pruning_ckpts.py
  • modelopt/torch/puzzletron/pruning/pruning_mixin.py
  • modelopt/torch/puzzletron/pruning/pruning_utils.py
  • modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
  • modelopt/torch/puzzletron/utils/dummy_modules.py
  • tests/_test_utils/torch/puzzletron/utils.py
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/llama_3_1_8b_instruct.yaml
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/attn_pruning.yaml
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/ffn_pruning.yaml
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/pruning_defaults.yaml
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/validate_model_defaults.yaml
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/validate_solutions_defaults.yaml
  • tests/gpu/torch/puzzletron/resources/hf_configs/llama_3_1_8b_instruct/config.json
  • tests/gpu/torch/puzzletron/resources/tokenizer/special_tokens_map.json
  • tests/gpu/torch/puzzletron/resources/tokenizer/tokenizer.json
  • tests/gpu/torch/puzzletron/resources/tokenizer/tokenizer_config.json
  • tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py
  • tests/gpu/torch/puzzletron/test_puzzletron.py
✅ Files skipped from review due to trivial changes (2)
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/pruning_defaults.yaml
  • tests/gpu/torch/puzzletron/resources/tokenizer/special_tokens_map.json

Comment on lines +849 to +856
        self.diffs = ["mse", "cosine"]
        some_param = next(self.moe.parameters())
        self.diffs = {
            k: torch.zeros(
                size=(self.num_local_experts,), dtype=torch.float32, device=some_param.device
            )
            for k in self.diffs
        }
Contributor

⚠️ Potential issue | 🟡 Minor

Potential StopIteration and confusing initialization pattern.

  1. next(self.moe.parameters()) will raise StopIteration if the MoE module has no parameters, causing a cryptic error. Consider using next(..., None) with a fallback or explicit validation.

  2. The initialization of self.diffs as a list then immediately overwriting it as a dict is confusing. Extract the keys to a local variable for clarity.

Proposed fix
-        self.diffs = ["mse", "cosine"]
-        some_param = next(self.moe.parameters())
+        diff_keys = ["mse", "cosine"]
+        try:
+            some_param = next(self.moe.parameters())
+        except StopIteration:
+            raise ValueError("MoE module has no parameters - cannot determine device")
         self.diffs = {
             k: torch.zeros(
                 size=(self.num_local_experts,), dtype=torch.float32, device=some_param.device
             )
-            for k in self.diffs
+            for k in diff_keys
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 849 -
856, Avoid calling next(self.moe.parameters()) without a fallback and stop
reusing self.diffs as both list and dict: first extract the keys into a local
variable (e.g., diff_keys = ["mse", "cosine"]), then get a safe param sample via
sample_param = next(self.moe.parameters(), None) and if None raise a clear error
or handle it (e.g., create tensors on cpu or choose device explicitly); finally
set self.diffs = {k: torch.zeros(size=(self.num_local_experts,),
dtype=torch.float32, device=sample_param.device if sample_param is not None else
torch.device("cpu")) for k in diff_keys} so initialization is clear and robust.

Comment on lines +97 to +99
## End-to-end example

See [test_compress_model.py](../../../../tests/gpu/torch/puzzletron/test_compress.py) for a complete example that runs both convert and compression steps.
Contributor

@coderabbitai coderabbitai bot Mar 5, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "test_compress" tests/gpu/torch/puzzletron/

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

find . -name "*test_compress*" -type f 2>/dev/null | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

ls -la tests/gpu/torch/puzzletron/ 2>/dev/null || echo "Directory not found"
ls -la tests/ 2>/dev/null | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 1010


🏁 Script executed:

head -100 tests/gpu/torch/puzzletron/test_puzzletron.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3245


🏁 Script executed:

sed -n '95,105p' modelopt/torch/puzzletron/anymodel/README.md

Repository: NVIDIA/Model-Optimizer

Length of output: 443


Fix incorrect test file reference.

The documentation references a non-existent file path. Neither test_compress_model.py nor test_compress.py exists in tests/gpu/torch/puzzletron/. The actual test file containing compression examples in that directory is test_puzzletron.py. Update the reference to point to the correct file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/anymodel/README.md` around lines 97 - 99, Update
the README.md end-to-end example link which currently points to a non-existent
tests/gpu/torch/puzzletron/test_compress_model.py (or test_compress.py); change
the reference to the actual test file
tests/gpu/torch/puzzletron/test_puzzletron.py so the example link and
description in README.md correctly point to test_puzzletron.py.

Collaborator

@danielkorzekwa PTAL. We renamed from compress to puzzletron

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +122 to +132
        if self.layer_descriptor.is_fused_experts:
            # Fused format: unbundle single tensor [num_experts, ...] into list of per-expert tensors
            orig_experts_weights = {}
            for name, fused_keys in orig_experts_keys.items():
                fused_tensor = parent_state_dict[fused_keys[0]]  # Single fused tensor
                orig_experts_weights[name] = [fused_tensor[i] for i in range(orig_num_experts)]

            new_experts_weights = {}
            for name, fused_keys in new_experts_keys.items():
                fused_tensor = new_state_dict[fused_keys[0]]  # Single fused tensor
                new_experts_weights[name] = [fused_tensor[i] for i in range(new_num_experts)]
Contributor

⚠️ Potential issue | 🟡 Minor

Add guard for empty fused_keys list.

Accessing fused_keys[0] (lines 126, 131) will raise an IndexError if fused_expert_weights is empty. Consider adding validation.

🛡️ Proposed fix
         if self.layer_descriptor.is_fused_experts:
             # Fused format: unbundle single tensor [num_experts, ...] into list of per-expert tensors
+            if not self.layer_descriptor.fused_expert_weights:
+                raise ValueError("is_fused_experts=True but fused_expert_weights is empty")
             orig_experts_weights = {}
             for name, fused_keys in orig_experts_keys.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py` around
lines 122 - 132, The code in the fused-experts branch (guarded by
layer_descriptor.is_fused_experts) assumes fused_keys lists are non-empty and
directly indexes fused_keys[0], which can raise IndexError; update the logic in
the block that populates orig_experts_weights and new_experts_weights to
validate fused_keys before using fused_keys[0] (e.g., skip or raise a clear
error when fused_keys is empty), referencing the variables
orig_experts_keys/new_experts_keys, parent_state_dict/new_state_dict and the
dicts orig_experts_weights/new_experts_weights so the fix is applied where those
are constructed.

Comment on lines +84 to +100
        for mlp_key in mlp_keys:
            expanded_dim = 1 if self.layer_descriptor.down_proj_name in mlp_key else 0
            if mlp_key in new_state_dict.keys():
                mlp_module_weight, pruned_filters, projection_matrix = _init_mlp_module(
                    mlp_init_mode,
                    mlp_prefix,
                    expanded_dim,
                    layer_idx,
                    new_state_dict[mlp_key],
                    new_config,
                    parent_state_dict[mlp_key],
                    original_config,
                    mlp_init_config,
                    pruned_filters,
                    projection_matrix,
                )
                layer_out_state_dict[mlp_key] = mlp_module_weight
Contributor

⚠️ Potential issue | 🟡 Minor

Add validation for parent_state_dict key existence.

The code checks if mlp_key exists in new_state_dict (line 86) but doesn't verify it exists in parent_state_dict before accessing it (line 94). This could cause a KeyError if the parent model has different keys.

🛡️ Proposed fix
         for mlp_key in mlp_keys:
             expanded_dim = 1 if self.layer_descriptor.down_proj_name in mlp_key else 0
-            if mlp_key in new_state_dict.keys():
+            if mlp_key in new_state_dict.keys() and mlp_key in parent_state_dict.keys():
                 mlp_module_weight, pruned_filters, projection_matrix = _init_mlp_module(
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        for mlp_key in mlp_keys:
            expanded_dim = 1 if self.layer_descriptor.down_proj_name in mlp_key else 0
            if mlp_key in new_state_dict.keys():
                mlp_module_weight, pruned_filters, projection_matrix = _init_mlp_module(
                    mlp_init_mode,
                    mlp_prefix,
                    expanded_dim,
                    layer_idx,
                    new_state_dict[mlp_key],
                    new_config,
                    parent_state_dict[mlp_key],
                    original_config,
                    mlp_init_config,
                    pruned_filters,
                    projection_matrix,
                )
                layer_out_state_dict[mlp_key] = mlp_module_weight
        for mlp_key in mlp_keys:
            expanded_dim = 1 if self.layer_descriptor.down_proj_name in mlp_key else 0
            if mlp_key in new_state_dict.keys() and mlp_key in parent_state_dict.keys():
                mlp_module_weight, pruned_filters, projection_matrix = _init_mlp_module(
                    mlp_init_mode,
                    mlp_prefix,
                    expanded_dim,
                    layer_idx,
                    new_state_dict[mlp_key],
                    new_config,
                    parent_state_dict[mlp_key],
                    original_config,
                    mlp_init_config,
                    pruned_filters,
                    projection_matrix,
                )
                layer_out_state_dict[mlp_key] = mlp_module_weight
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/ffn_intermediate_pruning_mixin.py` around
lines 84 - 100, The code accesses parent_state_dict[mlp_key] without verifying
the key exists, risking a KeyError; update the loop so before calling
_init_mlp_module you retrieve parent_val = parent_state_dict.get(mlp_key) (or
handle missing key explicitly) and pass parent_val instead of
parent_state_dict[mlp_key]; adjust _init_mlp_module invocation in this block
(the call that assigns mlp_module_weight, pruned_filters, projection_matrix) to
accept a possibly None parent value and ensure layer_out_state_dict[mlp_key] is
set only after a successful init.

Comment on lines +184 to +193
    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
    index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
    index_json = json_dumps(index)
    _write_file_process_safe(index_json, index_path)

    # Phase 4: Copy HF code
    phase4_start_time = time.time()
    copy_deci_lm_hf_code(checkpoint_dir)
    phase4_time = time.time() - phase4_start_time
    mprint(f"Phase 4 - Copy HF code: {phase4_time:.2f}s")
    # Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
    if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
        lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
        state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
        weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
Contributor

@coderabbitai coderabbitai bot Mar 5, 2026

⚠️ Potential issue | 🟠 Major

Fix tied-embedding handling before writing the safetensors index.

Lines 184-187 write the index before Lines 190-193 remove tied output embedding entries. That can leave stale entries in the persisted index. Also, Line 190 checks hardcoded "lm_head.weight" instead of descriptor-derived output embedding name.

Proposed fix
-    # Write index
-    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
-    index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
-    index_json = json_dumps(index)
-    _write_file_process_safe(index_json, index_path)
-
     # Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
-    if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
-        lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
-        state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
-        weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
+    output_weight_name = f"{descriptor.output_embedding_name()}.weight"
+    if getattr(model_config, "tie_word_embeddings", False) and output_weight_name in state_dict:
+        state_dict = {k: v for k, v in state_dict.items() if k != output_weight_name}
+        weight_map = {k: v for k, v in weight_map.items() if k != output_weight_name}
+
+    # Write index after finalizing weight_map
+    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
+    index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
+    index_json = json_dumps(index)
+    _write_file_process_safe(index_json, index_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 184 -
193, The code writes the safetensors index (index_path / SAFE_WEIGHTS_INDEX_NAME
via _write_file_process_safe) before removing tied output-embedding entries, and
it checks the hardcoded "lm_head.weight" instead of using the descriptor; move
the tied-embedding handling block to run before building/writing the index_json
and index_path, and replace the hardcoded string with the descriptor-derived
name computed as lm_head_weight_name =
f"{descriptor.output_embedding_name()}.weight"; then filter both state_dict and
weight_map to drop that lm_head_weight_name so the persisted index and
weight_map are consistent with the saved tensors.

Collaborator

@danielkorzekwa this bug fix seems important

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +122 to 182
    # TODO commented for the duration of merging process from dkorzekwa/any_model to feature/puzzletron
    # # Compress the model using a one-click approach
    # puzzletron.puzzletron(
    #     str(hydra_config_dir), hydra_config_subdir, str(puzzle_dir), str(dataset_path)
    # )

    #
    # Check assertions
    #
    # assertions for the score_pruning_activations step 1
    _assert_score_pruning_activations(puzzle_dir)
    if rank == 0:
        # assertions for the pruning_ckpts step 2
        assert (puzzle_dir / "ckpts/ffn_256_attn_no_op").exists()
        # #
        # # Check assertions
        # #
        # if rank == 0:
        #     if has_moe_layers:
        #         # assertions for the score_pruning_activations step 1 (MoE models only)
        #         rank_filepath = (
        #             f"pruning/pruning_scores/expert_removal/10samples_diverse_mini/rank_{rank}.pth"
        #         )
        #         assert (puzzle_dir / rank_filepath).is_file(), f"Expected {rank_filepath} to exist"

        # assertions for the build_library_and_stats step 4
        # # assertions for the pruning_ckpts step 2
        # assert (puzzle_dir / "ckpts/num_experts_8").exists()

        assert (puzzle_dir / "replacement_library.json").is_file()
        assert (puzzle_dir / "subblock_stats.json").is_file()
        # # assertions for the mip_and_realize_models step 6
        # # Find the MIP solution directory dynamically (e.g., stats_num_local_experts_*)
        # mip_solutions_dir = puzzle_dir / "mip/puzzle_solutions"
        # solution_dirs = [
        #     d
        #     for d in mip_solutions_dir.iterdir()
        #     if d.is_dir() and d.name.startswith("stats_num_local_experts_")
        # ]
        # assert len(solution_dirs) == 1, (
        #     f"Expected exactly one stats_num_local_experts_* directory, found: {[d.name for d in solution_dirs]}"
        # )
        # solution_dir = solution_dirs[0]

        # assertions for the scoring step 5
        solution_0_filepath = (
            puzzle_dir / "single_sequence_replacement_solutions--validation/solution_0.json"
        )
        # solution_0_ckpt_config_path = (
        #     solution_dir / "solutions--checkpoints/solution_0/config.json"
        # )
        # assert solution_0_ckpt_config_path.exists()
        # assert (solution_dir / "solutions.json").exists()

        assert solution_0_filepath.exists()
        # # Validate lm_loss
        # _assert_lm_loss(puzzle_dir, hf_config_name)
        # else:
        #     # assertions for the score_pruning_activations step 1 (FFN pruning)
        #     _assert_score_pruning_activations(puzzle_dir, hf_config_name)

        # assertions for the mip_and_realize_models step 6
        solution_0_ckpt_config_path = (
            puzzle_dir
            / "mip/puzzle_solutions/target_memory_780000MiB/solutions--checkpoints/solution_0/config.json"
        )
        # # assertions for the pruning_ckpts step 2
        # assert (puzzle_dir / "ckpts/ffn_256_attn_no_op").exists()

        # # assertions for the mip_and_realize_models step 6
        # _assert_mip_solutions(puzzle_dir, hf_config_name)

        assert solution_0_ckpt_config_path.exists()
        assert (puzzle_dir / "mip/puzzle_solutions/target_memory_780000MiB/solutions.json").exists()
        # # assertions for the build_library_and_stats step 4
        # assert (puzzle_dir / "replacement_library.json").is_file()
        # assert (puzzle_dir / "subblock_stats.json").is_file()

        # # assertions for the scoring step 5
        # solution_0_filepath = (
        #     puzzle_dir / "single_sequence_replacement_solutions--validation/solution_0.json"
        # )
        # assert solution_0_filepath.exists()

Contributor

⚠️ Potential issue | 🟠 Major

This test no longer validates Puzzletron pipeline behavior.

Lines 122-182 comment out compression and all assertions, so the test mainly verifies conversion and can pass while pruning/MIP/realization regress.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 122 - 182, The
test disabled the core Puzzletron pipeline (puzzletron.puzzletron invocation)
and all downstream assertions, so restore the call to
puzzletron.puzzletron(str(hydra_config_dir), hydra_config_subdir,
str(puzzle_dir), str(dataset_path)) and re-enable the assertions block that
checks both MoE and non-MoE paths; ensure the branches referencing
has_moe_layers and the helper validators (_assert_lm_loss,
_assert_score_pruning_activations, _assert_mip_solutions) are active again, plus
the file/directory existence checks (replacement_library.json,
subblock_stats.json, solution_0.json and the mip solution directory checks) so
the test verifies pruning/MIP/realization behavior end-to-end.

Comment on lines +248 to +269
    expected = EXPECTED_PRUNING_VALUES[hf_config_name]
    size = dist.size()

    if expected is not None:
        # In multi-GPU: layers are distributed across ranks
        # Each rank processes len(expected) // size layers
        expected_layers_per_rank = len(expected) // size
        assert len(layer_names) == expected_layers_per_rank, (
            f"Expected {expected_layers_per_rank} FFN layers on rank {rank}/{size}, got {len(layer_names)}"
        )
        # Check each layer's values
        for i, layer_name in enumerate(layer_names):
            layer_data = pruning_scores[layer_name]
            # Calculate global layer index from rank and local index
            global_idx = rank * expected_layers_per_rank + i
            assert layer_data["score"][0].item() == expected[global_idx]["score"]
            assert (
                layer_data["channels_importance_ascending"][0].item()
                == expected[global_idx]["channels"]
            )
    else:
        # Print values for new models - update EXPECTED_PRUNING_VALUES with these
Contributor

⚠️ Potential issue | 🟡 Minor

Use .get() for expected pruning values before None fallback.

Line 248 uses direct indexing (EXPECTED_PRUNING_VALUES[hf_config_name]), which raises KeyError before the fallback branch on Line 251 can run.

Proposed fix
-    expected = EXPECTED_PRUNING_VALUES[hf_config_name]
+    expected = EXPECTED_PRUNING_VALUES.get(hf_config_name)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    expected = EXPECTED_PRUNING_VALUES[hf_config_name]
    size = dist.size()

    if expected is not None:
        # In multi-GPU: layers are distributed across ranks
        # Each rank processes len(expected) // size layers
        expected_layers_per_rank = len(expected) // size
        assert len(layer_names) == expected_layers_per_rank, (
            f"Expected {expected_layers_per_rank} FFN layers on rank {rank}/{size}, got {len(layer_names)}"
        )
        # Check each layer's values
        for i, layer_name in enumerate(layer_names):
            layer_data = pruning_scores[layer_name]
            # Calculate global layer index from rank and local index
            global_idx = rank * expected_layers_per_rank + i
            assert layer_data["score"][0].item() == expected[global_idx]["score"]
            assert (
                layer_data["channels_importance_ascending"][0].item()
                == expected[global_idx]["channels"]
            )
    else:
        # Print values for new models - update EXPECTED_PRUNING_VALUES with these
    expected = EXPECTED_PRUNING_VALUES.get(hf_config_name)
    size = dist.size()

    if expected is not None:
        # In multi-GPU: layers are distributed across ranks
        # Each rank processes len(expected) // size layers
        expected_layers_per_rank = len(expected) // size
        assert len(layer_names) == expected_layers_per_rank, (
            f"Expected {expected_layers_per_rank} FFN layers on rank {rank}/{size}, got {len(layer_names)}"
        )
        # Check each layer's values
        for i, layer_name in enumerate(layer_names):
            layer_data = pruning_scores[layer_name]
            # Calculate global layer index from rank and local index
            global_idx = rank * expected_layers_per_rank + i
            assert layer_data["score"][0].item() == expected[global_idx]["score"]
            assert (
                layer_data["channels_importance_ascending"][0].item()
                == expected[global_idx]["channels"]
            )
    else:
        # Print values for new models - update EXPECTED_PRUNING_VALUES with these
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 248 - 269, The
test currently indexes EXPECTED_PRUNING_VALUES with hf_config_name which raises
KeyError before the None branch can run; change the lookup to use
EXPECTED_PRUNING_VALUES.get(hf_config_name) (or .get(hf_config_name, None)) when
assigning expected so missing keys return None and the existing fallback branch
executes; update the reference where expected is set (the assignment using
EXPECTED_PRUNING_VALUES and hf_config_name) and leave the subsequent logic that
checks if expected is not None unchanged.

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py (1)

360-364: ⚠️ Potential issue | 🟠 Major

max_layer_workers can become 0 and crash executor creation.

If owned_block_indexes is empty, Line 363 computes max_layer_workers = 0; ThreadPoolExecutor(max_workers=0) raises ValueError at Line 452.

Proposed fix
     if max_layer_workers is None:
         cpu_count = os.cpu_count() or 1
         num_layers = len(owned_block_indexes)
-        max_layer_workers = min(cpu_count, num_layers)
+        max_layer_workers = max(1, min(cpu_count, num_layers))
         mprint(
             f"Auto-calculated layer workers: min({cpu_count} CPUs, {num_layers} layers) = {max_layer_workers}"
         )

Also applies to: 452-453

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines
360 - 364, The code computes max_layer_workers from cpu_count and num_layers
(len(owned_block_indexes)) and can set it to 0 when owned_block_indexes is
empty; change the computation to ensure at least 1 (e.g., max_layer_workers =
max(1, min(cpu_count, num_layers))) and also guard the ThreadPoolExecutor
creation (ThreadPoolExecutor(max_workers=...)) to never be called with 0—either
use the adjusted max_layer_workers or skip/avoid creating the executor when
num_layers == 0. Update all occurrences where max_layer_workers is derived or
passed into ThreadPoolExecutor to use this nonzero-safe logic (references:
max_layer_workers, owned_block_indexes, ThreadPoolExecutor).
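The failure mode is easy to reproduce in isolation. A minimal sketch (plain Python, unrelated to the PR's actual worker-pool code) showing why the `max(1, ...)` clamp matters:

```python
from concurrent.futures import ThreadPoolExecutor

def safe_worker_count(cpu_count: int, num_layers: int) -> int:
    # Clamp to at least 1 so ThreadPoolExecutor never sees max_workers=0.
    return max(1, min(cpu_count, num_layers))

# ThreadPoolExecutor rejects max_workers=0, which is exactly what happens
# when min(cpu_count, len(owned_block_indexes)) is computed over an empty list.
crashed = False
try:
    ThreadPoolExecutor(max_workers=0)
except ValueError:
    crashed = True

# With the clamp, an empty work list still gets a valid (idle) executor.
with ThreadPoolExecutor(max_workers=safe_worker_count(8, 0)) as ex:
    results = list(ex.map(lambda x: x * 2, []))
```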
🧹 Nitpick comments (2)
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py (2)

71-72: descriptor is currently unused in per-layer processing.

Line 72 adds descriptor, but it is not consumed in the fallback path and not forwarded to prune_single_layer. Consider removing it from _process_single_layer or passing it through to keep the extension API consistent.

Also applies to: 96-112

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines
71 - 72, The parameter descriptor was added but never used or forwarded in
per-layer processing; either remove descriptor from _process_single_layer's
signature or pass it through into prune_single_layer and the fallback path so
extensions receive it. Update _process_single_layer (and its call sites) to
either drop descriptor entirely or include descriptor when invoking
prune_single_layer and any fallback branches (e.g., the branch handling
pruning_mixin) to keep the extension API consistent with the new parameter.

900-902: owned_block_indexes is typed optional but used as required.

Line 901 defaults to None, yet Line 968 iterates it directly. Either remove optionality from the signature or initialize a default inside the function for standalone calls.

Also applies to: 968-968

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines
900 - 902, The function signature declares owned_block_indexes:
Optional[list[int]] = None but the function (in child_init.py, the function that
returns dict[str, torch.Tensor]) iterates owned_block_indexes at runtime; fix by
either removing Optional and making the parameter required (owned_block_indexes:
list[int]) or keep the optional but initialize a safe default before use (e.g.,
inside the function set owned_block_indexes = owned_block_indexes or [] / if
owned_block_indexes is None: owned_block_indexes = []) so the later iteration
over owned_block_indexes at the iteration site (where it's used) will never
fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`:
- Around line 549-550: The function accepts mlp_init_config: Optional[dict] but
the score-based expert pruning branch uses it unguarded and will raise on None
when concatenating routed experts; add a guard in the pruning path (where
score-based expert pruning is executed) to validate mlp_init_config is not None
before using it, and either (a) skip/disable the score-based pruning logic when
mlp_init_config is None, or (b) raise a clear ValueError indicating
mlp_init_config is required, or (c) supply a safe default dict; update the code
around the score-based pruning block (the logic referencing mlp_init_config and
the routed experts concatenation) accordingly so no NoneType access occurs.
- Around line 130-137: The code filters attn_keys to only those present in
new_state_dict then immediately re-checks those same filtered keys, which always
makes the branch true and prevents the linear-attn fallback; fix by preserving
the original list of attention candidate keys (e.g., orig_attn_keys = attn_keys
before filtering) and change the second conditional to test against
orig_attn_keys (or explicitly check for the presence of q/k/v/o names in
new_state_dict) instead of the already-filtered attn_keys so the fallback can be
selected when q/k/v/o are absent; make the same correction for the analogous
logic around is_student_and_teacher_have_same_attention_implementation in the
second block (lines ~177-183) and update references to keys, keys_to_remove,
ignored_keys, new_state_dict, and
is_student_and_teacher_have_same_attention_implementation accordingly.
- Around line 41-55: The call site incorrectly invokes
_load_expert_scores(mlp_init_config)[layer_idx] even though _load_expert_scores
requires (mlp_init_config, layer_idx); replace that pattern with the canonical
helper _select_expert_indices(mlp_init_config, layer_idx) which is already
imported and encapsulates the correct expert selection logic (use the same
layer_idx variable in the current context where the faulty call occurs).

---

Outside diff comments:
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`:
- Around line 360-364: The code computes max_layer_workers from cpu_count and
num_layers (len(owned_block_indexes)) and can set it to 0 when
owned_block_indexes is empty; change the computation to ensure at least 1 (e.g.,
max_layer_workers = max(1, min(cpu_count, num_layers))) and also guard the
ThreadPoolExecutor creation (ThreadPoolExecutor(max_workers=...)) to never be
called with 0—either use the adjusted max_layer_workers or skip/avoid creating
the executor when num_layers == 0. Update all occurrences where
max_layer_workers is derived or passed into ThreadPoolExecutor to use this
nonzero-safe logic (references: max_layer_workers, owned_block_indexes,
ThreadPoolExecutor).

---

Nitpick comments:
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`:
- Around line 71-72: The parameter descriptor was added but never used or
forwarded in per-layer processing; either remove descriptor from
_process_single_layer's signature or pass it through into prune_single_layer and
the fallback path so extensions receive it. Update _process_single_layer (and
its call sites) to either drop descriptor entirely or include descriptor when
invoking prune_single_layer and any fallback branches (e.g., the branch handling
pruning_mixin) to keep the extension API consistent with the new parameter.
- Around line 900-902: The function signature declares owned_block_indexes:
Optional[list[int]] = None but the function (in child_init.py, the function that
returns dict[str, torch.Tensor]) iterates owned_block_indexes at runtime; fix by
either removing Optional and making the parameter required (owned_block_indexes:
list[int]) or keep the optional but initialize a safe default before use (e.g.,
inside the function set owned_block_indexes = owned_block_indexes or [] / if
owned_block_indexes is None: owned_block_indexes = []) so the later iteration
over owned_block_indexes at the iteration site (where it's used) will never
fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a6e21cfa-5a6c-477b-be0e-01128b089f6f

📥 Commits

Reviewing files that changed from the base of the PR and between 2099df3 and eb5cf8a.

📒 Files selected for processing (1)
  • modelopt/torch/puzzletron/tools/bypassed_training/child_init.py

Comment on lines +130 to 137
        attn_keys = [key for key in attn_keys if key in new_state_dict.keys()]
        if len(attn_keys) > 0 and all(key in keys for key in attn_keys):
            for key in attn_keys:
                keys_to_remove[key] = keys[key]
            if all(key not in ignored_keys for key in attn_keys):
                is_student_and_teacher_have_same_attention_implementation = all(
-                   key in new_state_dict for key in attn_keys
+                   key in new_state_dict.keys() for key in attn_keys
                )

⚠️ Potential issue | 🟠 Major

Linear-attention initialization path is effectively bypassed.

Line 130 filters attn_keys to keys present in new_state_dict, then Line 136 checks the same condition again, so that branch is always true for non-empty attn_keys. This prevents the linear_attn fallback path from being selected when q/k/v/o are absent.

Proposed fix
-        attn_keys = [q_key, k_key, v_key, o_key]
-        # Drop attn keys that don't exist and required to be in the new state_dict
-        attn_keys = [key for key in attn_keys if key in new_state_dict.keys()]
-        if len(attn_keys) > 0 and all(key in keys for key in attn_keys):
+        full_attn_keys = [q_key, k_key, v_key, o_key]
+        has_qkvo_in_student = all(key in new_state_dict for key in full_attn_keys)
+        if has_qkvo_in_student and all(key in keys for key in full_attn_keys):
-            for key in attn_keys:
+            for key in full_attn_keys:
                 keys_to_remove[key] = keys[key]
-            if all(key not in ignored_keys for key in attn_keys):
-                is_student_and_teacher_have_same_attention_implementation = all(
-                    key in new_state_dict.keys() for key in attn_keys
-                )
-                if is_student_and_teacher_have_same_attention_implementation:
+            if all(key not in ignored_keys for key in full_attn_keys):
+                if has_qkvo_in_student:
                     ...
                 else:
                     ...
+        else:
+            linear_attn_key = f"{attn_prefix}.linear_attn.weight"
+            if linear_attn_key in new_state_dict:
+                ...

Also applies to: 177-183

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines
130 - 137, The code filters attn_keys to only those present in new_state_dict
then immediately re-checks those same filtered keys, which always makes the
branch true and prevents the linear-attn fallback; fix by preserving the
original list of attention candidate keys (e.g., orig_attn_keys = attn_keys
before filtering) and change the second conditional to test against
orig_attn_keys (or explicitly check for the presence of q/k/v/o names in
new_state_dict) instead of the already-filtered attn_keys so the fallback can be
selected when q/k/v/o are absent; make the same correction for the analogous
logic around is_student_and_teacher_have_same_attention_implementation in the
second block (lines ~177-183) and update references to keys, keys_to_remove,
ignored_keys, new_state_dict, and
is_student_and_teacher_have_same_attention_implementation accordingly.
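The core logic error is a self-fulfilling check and can be shown without any model code. A stripped-down sketch (key names are made up for illustration):

```python
# Student checkpoint with a linear-attention block, no q/k/v/o projections.
student_state_dict = {"layer.0.linear_attn.weight": "..."}

attn_keys = ["layer.0.q_proj.weight", "layer.0.k_proj.weight"]

# Step 1: filter to keys present in the student state dict -> empty list.
filtered = [k for k in attn_keys if k in student_state_dict]

# Step 2: re-checking membership over the *filtered* list is vacuously True,
# so the "same attention implementation" branch is taken despite no matches.
same_impl = all(k in student_state_dict for k in filtered)

# Checking the *original* candidate list gives the intended answer and lets
# the linear-attn fallback path run.
same_impl_fixed = all(k in student_state_dict for k in attn_keys)
```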

Comment on lines +549 to 550
        mlp_init_config: Optional[dict],
        hidden_size: int,

⚠️ Potential issue | 🟠 Major

mlp_init_config is optional now, but score-based pruning path assumes it is populated.

Line 549 allows None, but Line 595 enters score-based expert pruning without validating config availability. This can fail with NoneType access when concatenating routed experts.

Proposed fix
     else:
         print(
             f"Concatenating {num_concatenated_routed_experts} routed experts to the shared expert in layer {layer_idx}"
         )
+        if mlp_init_config is None:
+            raise ValueError(
+                "mlp_init_config is required for expert-score-based concatenation."
+            )
         router_key, orig_experts_keys = _generate_moe_keys(
             layer_idx, parent_moe_config.num_local_experts
         )

Also applies to: 595-601

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines
549 - 550, The function accepts mlp_init_config: Optional[dict] but the
score-based expert pruning branch uses it unguarded and will raise on None when
concatenating routed experts; add a guard in the pruning path (where score-based
expert pruning is executed) to validate mlp_init_config is not None before using
it, and either (a) skip/disable the score-based pruning logic when
mlp_init_config is None, or (b) raise a clear ValueError indicating
mlp_init_config is required, or (c) supply a safe default dict; update the code
around the score-based pruning block (the logic referencing mlp_init_config and
the routed experts concatenation) accordingly so no NoneType access occurs.
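Option (b) from the prompt, a loud guard clause, can be sketched as follows (function and config-key names are illustrative, not the module's real API):

```python
from typing import Optional

def select_routed_experts(mlp_init_config: Optional[dict], num_experts: int) -> list[int]:
    # Guard before the score-based path touches the config, instead of
    # failing later with a NoneType attribute access mid-concatenation.
    if mlp_init_config is None:
        raise ValueError(
            "mlp_init_config is required for expert-score-based concatenation."
        )
    scores = mlp_init_config.get("expert_scores", list(range(num_experts)))
    # Rank experts by score, highest first.
    return sorted(range(num_experts), key=lambda i: scores[i], reverse=True)

ok = select_routed_experts({"expert_scores": [0.1, 0.9, 0.5]}, 3)
try:
    select_routed_experts(None, 3)
except ValueError as e:
    err = str(e)
```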

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
modelopt/torch/puzzletron/pruning/pruning_utils.py (1)

195-200: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Harden torch.load usage for activation logs.

At Line 199, torch.load(p) deserializes pickle content without weights_only=True or an inline safety justification. This is still a security risk for .pth ingestion.

🔒 Proposed fix
         ACTIVATIONS_LOG.update(
             {
                 module_name: module_log
                 for p in Path(activations_log_dir).glob("rank_*.pth")
-                for module_name, module_log in torch.load(p).items()
+                # Safe: rank_*.pth files are internally generated by hook dumps, not user-supplied.
+                for module_name, module_log in torch.load(p, weights_only=True).items()
             }
         )
#!/bin/bash
rg -nP 'torch\.load\(' modelopt/torch/puzzletron/pruning/pruning_utils.py -C2

As per coding guidelines: "torch.load(..., weights_only=False) with no inline comment justifying why it is safe (e.g. confirming the file is internally-generated and not user-supplied)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 195 - 200,
The ACTIVATIONS_LOG population uses torch.load(p) unsafely; change the call
inside the dict comprehension that iterates
Path(activations_log_dir).glob("rank_*.pth") to use torch.load(p,
weights_only=True) to avoid pickle deserialization, updating the code that
builds module_name/module_log entries; if weights_only=True is infeasible, add
an inline comment justifying why full deserialization is safe and instead
validate the file is internal/trusted before calling torch.load, or replace the
load with a safe parser that only reads tensors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Around line 556-559: The loop assumes orig_router_weights[name] always exists
which causes KeyError for router keys present only in new_router_weights; update
the code in the loop over new_router_weights to first try
orig_router_weights.get(name) and, if missing, attempt a parent-key fallback
(e.g., parent_name = name.rsplit('.', 1)[0] and use
orig_router_weights.get(parent_name)); if neither exists, handle gracefully
(skip, use an empty/default list, or raise a clear error). Ensure you reference
and use the same variables: new_router_weights, orig_router_weights,
result_router_weights, name, and tensor_to_slice so the selection expression
remains result_router_weights[name] = [tensor_to_slice[selected_experts] for
tensor_to_slice in chosen_orig_list].
- Around line 599-603: selected_experts currently treats NaN scores as
float("inf") which makes NaNs rank as best when
mlp_init_config.get("higher_is_better", True) is True; fix by computing
higher_is_better = mlp_init_config.get("higher_is_better", True) before the sort
and change the key so NaNs map to (-inf if higher_is_better else +inf) instead
of always +inf; update the lambda used in the sorted call that references
expert_scores, orig_num_experts, and new_num_experts to return math.nan-safe
sentinel values based on higher_is_better so NaNs are treated as worst rather
than best.
- Around line 189-201: _cache_activations_log currently never refreshes
ACTIVATIONS_LOG after the first call, causing stale activations to be reused;
update the function to detect a new activations_log_dir and reload by clearing
or replacing ACTIVATIONS_LOG (and optionally store the last-loaded dir in a
module-level variable) before repopulating it via the
Path(...).glob("rank_*.pth") + torch.load loop; reference the function name
_cache_activations_log and the ACTIVATIONS_LOG global and ensure you either
clear ACTIVATIONS_LOG.update(...) or reassign ACTIVATIONS_LOG so subsequent
calls with a different activations_log_dir actually reload the files.

---

Duplicate comments:
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Around line 195-200: The ACTIVATIONS_LOG population uses torch.load(p)
unsafely; change the call inside the dict comprehension that iterates
Path(activations_log_dir).glob("rank_*.pth") to use torch.load(p,
weights_only=True) to avoid pickle deserialization, updating the code that
builds module_name/module_log entries; if weights_only=True is infeasible, add
an inline comment justifying why full deserialization is safe and instead
validate the file is internal/trusted before calling torch.load, or replace the
load with a safe parser that only reads tensors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 08e6808f-b3bc-4046-8904-f7a5deaa4014

📥 Commits

Reviewing files that changed from the base of the PR and between eb5cf8a and c9de41c.

📒 Files selected for processing (1)
  • modelopt/torch/puzzletron/pruning/pruning_utils.py

Comment on lines +189 to +201
def _cache_activations_log(mlp_init_config: dict[str, Any]) -> None:
    if len(ACTIVATIONS_LOG) == 0:
        assert "activations_log_dir" in mlp_init_config
        activations_log_dir = mlp_init_config["activations_log_dir"]
        print(f"Loading activations_log from {activations_log_dir}")
        # Only load rank_*.pth files to avoid loading hook_states_*.pth checkpoint files
        ACTIVATIONS_LOG.update(
            {
                module_name: module_log
                for p in Path(activations_log_dir).glob("rank_*.pth")
                for module_name, module_log in torch.load(p).items()
            }
        )

⚠️ Potential issue | 🟠 Major

Cache invalidation bug: activation logs never refresh for a new directory.

After first load, _cache_activations_log ignores later activations_log_dir changes, so pruning can silently use stale scores.

♻️ Proposed fix
 ACTIVATIONS_LOG = dict()
+ACTIVATIONS_LOG_DIR: Optional[str] = None
 
 
 def _cache_activations_log(mlp_init_config: dict[str, Any]) -> None:
-    if len(ACTIVATIONS_LOG) == 0:
-        assert "activations_log_dir" in mlp_init_config
-        activations_log_dir = mlp_init_config["activations_log_dir"]
+    global ACTIVATIONS_LOG_DIR
+    assert "activations_log_dir" in mlp_init_config
+    activations_log_dir = mlp_init_config["activations_log_dir"]
+    if len(ACTIVATIONS_LOG) == 0 or ACTIVATIONS_LOG_DIR != activations_log_dir:
+        ACTIVATIONS_LOG.clear()
+        ACTIVATIONS_LOG_DIR = activations_log_dir
         print(f"Loading activations_log from {activations_log_dir}")
         # Only load rank_*.pth files to avoid loading hook_states_*.pth checkpoint files
         ACTIVATIONS_LOG.update(
             {
                 module_name: module_log
                 for p in Path(activations_log_dir).glob("rank_*.pth")
                 for module_name, module_log in torch.load(p).items()
             }
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 189 - 201,
_cache_activations_log currently never refreshes ACTIVATIONS_LOG after the first
call, causing stale activations to be reused; update the function to detect a
new activations_log_dir and reload by clearing or replacing ACTIVATIONS_LOG (and
optionally store the last-loaded dir in a module-level variable) before
repopulating it via the Path(...).glob("rank_*.pth") + torch.load loop;
reference the function name _cache_activations_log and the ACTIVATIONS_LOG
global and ensure you either clear ACTIVATIONS_LOG.update(...) or reassign
ACTIVATIONS_LOG so subsequent calls with a different activations_log_dir
actually reload the files.
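The invalidation pattern the fix proposes can be sketched with a plain dict cache keyed by the last-loaded directory (file loading is replaced by a stub; the names below are illustrative, not the module's real API):

```python
from typing import Optional

_CACHE: dict[str, str] = {}
_CACHE_DIR: Optional[str] = None

def _fake_load(directory: str) -> dict[str, str]:
    # Stand-in for the torch.load(...) loop over rank_*.pth files.
    return {"module": f"scores from {directory}"}

def cache_activations(directory: str) -> dict[str, str]:
    global _CACHE_DIR
    # Reload whenever the directory changes, not only when the cache is empty.
    if not _CACHE or _CACHE_DIR != directory:
        _CACHE.clear()
        _CACHE_DIR = directory
        _CACHE.update(_fake_load(directory))
    return _CACHE

first = dict(cache_activations("/run_a"))
second = dict(cache_activations("/run_b"))  # without the dir check this would be stale
```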

Comment on lines +556 to +559
    for name, new_list in new_router_weights.items():
        result_router_weights[name] = [
            tensor_to_slice[selected_experts] for tensor_to_slice in orig_router_weights[name]
        ]

⚠️ Potential issue | 🟠 Major

Router fallback is not implemented and can crash with missing parent key.

At Line 558, orig_router_weights[name] is always used, despite the comment saying child fallback is supported. If a router key exists only in new_router_weights, this raises KeyError.

🐛 Proposed fix
     result_router_weights: dict[str, list[torch.Tensor]] = {}
     for name, new_list in new_router_weights.items():
-        result_router_weights[name] = [
-            tensor_to_slice[selected_experts] for tensor_to_slice in orig_router_weights[name]
-        ]
+        if name in orig_router_weights:
+            result_router_weights[name] = [
+                tensor_to_slice[selected_experts] for tensor_to_slice in orig_router_weights[name]
+            ]
+        else:
+            # Child-only tensors (e.g., newly introduced bias) keep child initialization.
+            result_router_weights[name] = new_list
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 556 - 559,
The loop assumes orig_router_weights[name] always exists which causes KeyError
for router keys present only in new_router_weights; update the code in the loop
over new_router_weights to first try orig_router_weights.get(name) and, if
missing, attempt a parent-key fallback (e.g., parent_name = name.rsplit('.',
1)[0] and use orig_router_weights.get(parent_name)); if neither exists, handle
gracefully (skip, use an empty/default list, or raise a clear error). Ensure you
reference and use the same variables: new_router_weights, orig_router_weights,
result_router_weights, name, and tensor_to_slice so the selection expression
remains result_router_weights[name] = [tensor_to_slice[selected_experts] for
tensor_to_slice in chosen_orig_list].

Comment on lines +599 to +603
    selected_experts = sorted(
        range(orig_num_experts),
        key=lambda i: expert_scores[i] if not math.isnan(expert_scores[i]) else float("inf"),
        reverse=mlp_init_config.get("higher_is_better", True),
    )[:new_num_experts]
@coderabbitai coderabbitai bot Mar 5, 2026

⚠️ Potential issue | 🟠 Major

NaN ranking currently selects NaNs as “best” experts by default.

At Line 601, NaN is mapped to inf; with higher_is_better=True (Line 602), reverse sort puts NaNs first.

✅ Proposed fix
-    selected_experts = sorted(
+    higher_is_better = mlp_init_config.get("higher_is_better", True)
+    nan_sentinel = float("-inf") if higher_is_better else float("inf")
+    selected_experts = sorted(
         range(orig_num_experts),
-        key=lambda i: expert_scores[i] if not math.isnan(expert_scores[i]) else float("inf"),
-        reverse=mlp_init_config.get("higher_is_better", True),
+        key=lambda i: expert_scores[i] if not math.isnan(expert_scores[i]) else nan_sentinel,
+        reverse=higher_is_better,
     )[:new_num_experts]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 599 - 603,
selected_experts currently treats NaN scores as float("inf") which makes NaNs
rank as best when mlp_init_config.get("higher_is_better", True) is True; fix by
computing higher_is_better = mlp_init_config.get("higher_is_better", True)
before the sort and change the key so NaNs map to (-inf if higher_is_better else
+inf) instead of always +inf; update the lambda used in the sorted call that
references expert_scores, orig_num_experts, and new_num_experts to return
math.nan-safe sentinel values based on higher_is_better so NaNs are treated as
worst rather than best.
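The ranking behavior is easy to verify with plain Python: `reverse=True` puts the largest key first, so a NaN mapped to `+inf` wins. Mapping NaN to a sentinel on the "worst" side of the chosen direction fixes the ordering. A sketch with made-up scores:

```python
import math

scores = [0.9, float("nan"), 0.5, 0.7]

def top_k(scores, k, higher_is_better=True):
    # NaN must sort as worst, so the sentinel depends on the sort direction.
    sentinel = float("-inf") if higher_is_better else float("inf")
    return sorted(
        range(len(scores)),
        key=lambda i: scores[i] if not math.isnan(scores[i]) else sentinel,
        reverse=higher_is_better,
    )[:k]

# Buggy variant: NaN -> +inf ranks the NaN-scored expert first under reverse=True.
buggy = sorted(
    range(len(scores)),
    key=lambda i: scores[i] if not math.isnan(scores[i]) else float("inf"),
    reverse=True,
)[:2]

fixed = top_k(scores, 2)
```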

Collaborator

@danielkorzekwa this bug fix seems important. PTAL


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
tests/gpu/torch/puzzletron/test_puzzletron.py (2)

248-251: ⚠️ Potential issue | 🟡 Minor

Use .get() to preserve your fallback path.

Line 248 uses direct indexing, which can raise KeyError before Line 251’s None check executes.

Proposed fix
-    expected = EXPECTED_PRUNING_VALUES[hf_config_name]
+    expected = EXPECTED_PRUNING_VALUES.get(hf_config_name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 248 - 251,
Replace the direct dictionary access EXPECTED_PRUNING_VALUES[hf_config_name]
with a safe lookup using EXPECTED_PRUNING_VALUES.get(hf_config_name) so that the
variable expected can be None and the subsequent if expected is not None check
remains reachable; update the assignment to use .get(...) for hf_config_name
when computing expected in the test_puzzletron.py block.
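The difference between the two lookups takes only a few lines to demonstrate (config names below are hypothetical):

```python
EXPECTED_PRUNING_VALUES = {"llama_config": {"score": 1.0}}

def lookup(name: str):
    # Direct indexing raises KeyError for unknown models, so a later
    # "if expected is None" fallback branch would be unreachable.
    try:
        EXPECTED_PRUNING_VALUES[name]
    except KeyError:
        raised = True
    else:
        raised = False
    # .get() returns None instead, letting the fallback branch run.
    return raised, EXPECTED_PRUNING_VALUES.get(name)

known = lookup("llama_config")
unknown = lookup("new_model_config")
```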

122-182: ⚠️ Potential issue | 🟠 Major

Re-enable the core Puzzletron flow (currently false-green test).

Line 122 through Line 182 comment out the compression call and all end-to-end assertions, so this test can pass while pruning/MIP/realization behavior regresses. Please restore this block (or explicitly pytest.skip/xfail the test until it is restored).

Proposed fix
+from modelopt.torch import puzzletron
...
-    # TODO commented for the duration of merging process  from dkorzekwa/any_model to feature/puzzletron
-    # # Compress the model using a one-click approach
-    # puzzletron.puzzletron(
-    #     str(hydra_config_dir), hydra_config_subdir, str(puzzle_dir), str(dataset_path)
-    # )
+    puzzletron.puzzletron(
+        str(hydra_config_dir), hydra_config_subdir, str(puzzle_dir), str(dataset_path)
+    )
...
-    # if rank == 0:
-    #     ...
+    if rank == 0:
+        ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 122 - 182,
Restore the commented-out end-to-end Puzzletron flow by re-enabling the call to
puzzletron.puzzletron(...) and the subsequent rank==0 assertions (including
checks that reference _assert_lm_loss, _assert_score_pruning_activations,
_assert_mip_solutions, and file existence assertions for
replacement_library.json, subblock_stats.json, and solution_0.json); if the full
flow cannot be reliably executed in CI, replace the restored block with an
explicit pytest.skip or pytest.xfail that documents why the E2E flow is disabled
so the test suite fails loudly until the pruning/MIP/realize implementation is
fixed.
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py (1)

200-210: ⚠️ Potential issue | 🟠 Major

Write the index after tied-embedding pruning, and use descriptor-derived key consistently.

Line 204 persists weight_map before Line 206-210 prunes tied output embedding entries. Also, Line 207 checks hardcoded "lm_head.weight" instead of the descriptor-derived output key, which can miss non-LLaMA naming.

🧩 Proposed fix
-    # Write index
-    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
-    index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
-    index_json = json_dumps(index)
-    _write_file_process_safe(index_json, index_path)
-
     # Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
-    if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
-        lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
+    lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
+    if getattr(model_config, "tie_word_embeddings", False) and lm_head_weight_name in state_dict:
         state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
         weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
+
+    # Write index after finalizing weight_map
+    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
+    index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
+    index_json = json_dumps(index)
+    _write_file_process_safe(index_json, index_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 200 -
210, The index is written before tied-embedding pruning and the code checks the
hardcoded "lm_head.weight" key; update checkpoint_utils_hf.py so you first
perform the tie_word_embeddings pruning using model_config.tie_word_embeddings
and the descriptor-derived key lm_head_weight_name =
f"{descriptor.output_embedding_name()}.weight" to remove entries from state_dict
and weight_map, and only then build and write the index (index = {"metadata":
{"format": "pt"}, "weight_map": weight_map}, json_dumps(index),
_write_file_process_safe) using SAFE_WEIGHTS_INDEX_NAME; ensure you also stop
checking the literal "lm_head.weight" and use the descriptor-derived
lm_head_weight_name consistently.
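For reference, a minimal sketch of the ordering the fix asks for — prune the tied entry first, then build the index from the finalized weight_map. The helper name and the toy state_dict/weight_map values below are illustrative, not the module's real API:

```python
import json

def finalize_and_index(state_dict, weight_map, output_embedding_name, tie_word_embeddings):
    # Prune the tied output embedding FIRST, using the descriptor-derived key...
    lm_head_key = f"{output_embedding_name}.weight"
    if tie_word_embeddings and lm_head_key in state_dict:
        state_dict = {k: v for k, v in state_dict.items() if k != lm_head_key}
        weight_map = {k: v for k, v in weight_map.items() if k != lm_head_key}
    # ...then serialize the index, so it cannot reference a pruned tensor.
    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
    return state_dict, weight_map, json.dumps(index)

sd = {"model.embed_tokens.weight": "t0", "lm_head.weight": "t0"}
wm = {k: "model-00001-of-00001.safetensors" for k in sd}
sd, wm, index_json = finalize_and_index(sd, wm, "lm_head", tie_word_embeddings=True)
```

With the reversed order, the index would still list the pruned `lm_head.weight` shard, which loaders treat as a missing tensor.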
🧹 Nitpick comments (2)
tests/gpu/torch/puzzletron/test_puzzletron.py (2)

295-299: Fail fast for unknown EXPECTED_LM_LOSS entries.

Like the pruning helper, this print-only fallback lets tests pass without validating LM loss for newly added models.

Proposed fix
-    else:
-        # Print value for new models - update EXPECTED_LM_LOSS with this
-        print(f"\n=== LM_LOSS for {hf_config_name} ===")
-        print(f'"{hf_config_name}": {actual_lm_loss},')
-        print("===")
+    else:
+        pytest.fail(
+            f"Missing EXPECTED_LM_LOSS baseline for '{hf_config_name}'. "
+            f"Observed lm_loss={actual_lm_loss}. Add baseline before enabling this config."
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 295 - 299, The
current else branch that only prints LM loss for unknown hf_config_name lets
tests pass without validating against EXPECTED_LM_LOSS; replace the print-only
fallback with a hard fail: when hf_config_name is not in EXPECTED_LM_LOSS, raise
an AssertionError (or use pytest.fail) that includes hf_config_name and the
observed actual_lm_loss and instructs the maintainer to add an entry to
EXPECTED_LM_LOSS; update the code paths around EXPECTED_LM_LOSS lookup (where
hf_config_name and actual_lm_loss are used) so unknown configs do not silently
pass.

269-278: Avoid silent pass when pruning baselines are missing.

This branch only prints values and allows the test to pass, which weakens regression detection for new models. Prefer failing with a clear message so baseline updates are explicit.

Proposed fix
-    else:
-        # Print values for new models - update EXPECTED_PRUNING_VALUES with these
-        print(f"\n=== PRUNING VALUES for {hf_config_name} (num_layers={len(layer_names)}) ===")
-        print(f'"{hf_config_name}": [')
-        for layer_name in layer_names:
-            layer_data = pruning_scores[layer_name]
-            score = layer_data["score"][0].item()
-            channels = layer_data["channels_importance_ascending"][0].item()
-            print(f'    {{"score": {score}, "channels": {channels}}},')
-        print("],")
-        print("===")
+    else:
+        pytest.fail(
+            f"Missing EXPECTED_PRUNING_VALUES baseline for '{hf_config_name}'. "
+            "Add expected score/channel values before enabling this config."
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 269 - 278,
Replace the silent-print-only branch that prints pruning values with an explicit
check against the expected baselines: verify that the current hf_config_name
exists in EXPECTED_PRUNING_VALUES and that each layer's reported pruning
score/channels in pruning_scores matches the stored baseline (or raise an
AssertionError with a clear message if the baseline is missing or mismatched).
Locate where pruning_scores and layer_names are used (the print block) and
change it to assert hf_config_name in EXPECTED_PRUNING_VALUES, then iterate
layer_names comparing pruning_scores[layer_name]["score"] and
["channels_importance_ascending"] to the EXPECTED_PRUNING_VALUES entry; if
absent or different, raise an error instructing the reviewer to update
EXPECTED_PRUNING_VALUES. Ensure the error message includes hf_config_name and
which layer failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 108-117: The function force_cache_dynamic_modules currently always
loads remote dynamic modules via get_class_from_dynamic_module; update its
signature to accept a trust_remote_code: bool parameter, then add an early
return when trust_remote_code is False to avoid invoking
get_class_from_dynamic_module; update the caller in load_model_config to pass
through its trust_remote_code argument so dynamic-module caching only runs when
trust_remote_code is True.
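A sketch of the gating this prompt describes, with a stub standing in for `get_class_from_dynamic_module` (the stub, class reference, and repo id are placeholders):

```python
calls = []

def get_class_from_dynamic_module_stub(class_ref: str, repo_id: str) -> None:
    # Stand-in for transformers' remote-module loader; records each invocation.
    calls.append((class_ref, repo_id))

def force_cache_dynamic_modules(repo_id: str, trust_remote_code: bool) -> None:
    # Early return: never fetch or execute remote code without explicit opt-in.
    if not trust_remote_code:
        return
    get_class_from_dynamic_module_stub("modeling.AnyModel", repo_id)

force_cache_dynamic_modules("org/model", trust_remote_code=False)  # no-op
force_cache_dynamic_modules("org/model", trust_remote_code=True)   # caches
```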

---

Duplicate comments:

In `@tests/gpu/torch/puzzletron/test_puzzletron.py`:
- Around line 248-251: Replace the direct dictionary access
EXPECTED_PRUNING_VALUES[hf_config_name] with a safe lookup using
EXPECTED_PRUNING_VALUES.get(hf_config_name) so that the variable expected can be
None and the subsequent if expected is not None check remains reachable; update
the assignment to use .get(...) for hf_config_name when computing expected in
the test_puzzletron.py block.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0d963718-efe6-4ef4-9905-984c62cc8508

📥 Commits

Reviewing files that changed from the base of the PR and between c9de41c and 6cc2194.

📒 Files selected for processing (2)
  • modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
  • tests/gpu/torch/puzzletron/test_puzzletron.py

Comment on lines +471 to 477
def save_model_config(model_config: PretrainedConfig, checkpoint_dir: Path | str) -> None:
if hasattr(model_config, "block_configs"):
model_config.block_configs = [
dataclasses.asdict(conf) if dataclasses.is_dataclass(conf) else conf
for conf in model_config.block_configs
]
model_config.save_pretrained(checkpoint_dir)
Contributor

⚠️ Potential issue | 🟠 Major

Avoid mutating model.config while preparing it for serialization.

save_model_config rewrites model_config.block_configs in place. Since _save_checkpoint passes model.config, this mutates the live model config after save and can break callers that expect dataclass entries.

♻️ Proposed fix
+import copy
@@
 def save_model_config(model_config: PretrainedConfig, checkpoint_dir: Path | str) -> None:
-    if hasattr(model_config, "block_configs"):
-        model_config.block_configs = [
+    config_to_save = copy.deepcopy(model_config)
+    if hasattr(config_to_save, "block_configs"):
+        config_to_save.block_configs = [
             dataclasses.asdict(conf) if dataclasses.is_dataclass(conf) else conf
-            for conf in model_config.block_configs
+            for conf in config_to_save.block_configs
         ]
-    model_config.save_pretrained(checkpoint_dir)
+    config_to_save.save_pretrained(checkpoint_dir)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 471 -
477, The function save_model_config currently mutates model_config.block_configs
in place which alters the live model.config; instead, create a deep copy of the
PretrainedConfig (use copy.deepcopy) and perform the dataclass-to-dict
conversion on the copy’s block_configs, then call save_pretrained on the copied
config (reference: save_model_config, model_config.block_configs, and
model_config.save_pretrained). This preserves the original model_config and
avoids side effects for callers.
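The shape of the fix, sketched with a stand-in for `PretrainedConfig` (the `FakeConfig` class and helper name are illustrative):

```python
import copy
import dataclasses

@dataclasses.dataclass
class BlockConfig:
    no_op: bool = False

class FakeConfig:
    # Stand-in for transformers.PretrainedConfig; serialization elided.
    def __init__(self):
        self.block_configs = [BlockConfig(), BlockConfig()]

def config_for_saving(model_config):
    # Deep-copy before converting dataclasses to dicts, so the live config
    # passed in by the caller keeps its dataclass entries.
    config_to_save = copy.deepcopy(model_config)
    config_to_save.block_configs = [
        dataclasses.asdict(c) if dataclasses.is_dataclass(c) else c
        for c in config_to_save.block_configs
    ]
    return config_to_save

live = FakeConfig()
to_save = config_for_saving(live)
```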

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged)

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/gpu/torch/puzzletron/nas/plugins/test_nas_search.py (1)

30-31: Avoid an unconditional long-lived skip for this integration test.

Line 30 disables this test's coverage entirely, which can hide regressions if re-enabling is forgotten. Prefer skipif with a concrete gate (or xfail) plus a tracking reference.

Proposed tweak
-@pytest.mark.skip(reason="Temporarily disabled")
+@pytest.mark.skipif(
+    True,  # TODO: replace with real gate (e.g., feature flag/env check)
+    reason="Temporarily disabled; re-enable after AnyModel merge (track in issue `#XXXX`)",
+)
 def test_nas_search(project_root_path: Path, tmp_path: Path):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/puzzletron/nas/plugins/test_nas_search.py` around lines 30 -
31, The unconditional decorator `@pytest.mark.skip` on the test_nas_search
function should be replaced with a conditional gate (e.g., pytest.mark.skipif)
or pytest.mark.xfail so the test is only skipped when a concrete condition is
met; update the decorator on test_nas_search to use a clear runtime condition
(for example an environment flag like os.getenv("RUN_INTEGRATION_TESTS") or a
pytest config marker) and include a concise reason string that references a
tracking ticket/PR ID (e.g., "skip unless RUN_INTEGRATION_TESTS=true — see
ISSUE-1234") so the skip is explicit, gated, and traceable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b8f4094-9fc4-4ad2-998a-624d29b70d8a

📥 Commits

Reviewing files that changed from the base of the PR and between 449b523 and fafe5a3.

📒 Files selected for processing (4)
  • modelopt/torch/puzzletron/nas/plugins/puzzletron_nas_plugin.py
  • tests/gpu/torch/puzzletron/nas/plugins/test_nas_convert.py
  • tests/gpu/torch/puzzletron/nas/plugins/test_nas_search.py
  • tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/llama_3_1_8b_instruct-attn-pruning.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/gpu/torch/puzzletron/nas/plugins/test_nas_convert.py

- stats.attention_num_params

human_constraints:
target_memory: 780_000 # 78_000
Contributor

⚠️ Potential issue | 🟡 Minor

Resolve the target_memory value/comment mismatch.

Line 80 has 780_000 but the inline comment says 78_000. This 10x discrepancy can materially alter solution feasibility and test behavior.

Proposed change (pick one, value or comment)
-    target_memory: 780_000 # 78_000
+    target_memory: 78_000 # 78_000
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
target_memory: 780_000 # 78_000
target_memory: 78_000 # 78_000
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/llama_3_1_8b_instruct-attn-pruning.yaml`
at line 80, The target_memory value and inline comment disagree: keep them
consistent by updating the inline comment or the numeric literal in
llama_3_1_8b_instruct-attn-pruning.yaml so both represent the same amount; e.g.,
if the intended memory is seven hundred eighty thousand, change the comment next
to target_memory: 780_000 to "# 780_000", or if the intended value is
seventy‑eight thousand, change the numeric literal to target_memory: 78_000;
ensure underscore grouping matches the chosen value.

@danielkorzekwa danielkorzekwa changed the title Add anymodel directories to feature/puzzletron Add anymodel-core to feature/puzzletron Mar 6, 2026
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
from torch import nn

import modelopt.torch.utils.distributed as dist
from modelopt.torch.puzzletron.decilm.deci_lm_hf_code.block_config import BlockConfig # noqa: TC001
Collaborator

Is the BlockConfig the only thing from deci_lm_hf_code we still use? Or are there other things also?

Comment on lines +1172 to +1173
TODO: Implement get_router_logits_and_routed_experts based on GPT-OSS MoE forward pass.
This is a placeholder implementation that allows the framework to run.
Collaborator

Will this be added in follow-up PRs or still not supported?

Comment on lines +54 to +55
if not pretrained:
raise ValueError("pretrained must be provided")
Collaborator

If `pretrained` is a required argument, why don't we just remove `| None` from the type annotation?

Comment on lines +44 to +50
block_config = BlockConfig(
attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
).to_dict()

block_configs = [block_config] * num_hidden_layers
return block_configs
Collaborator

@danielkorzekwa is this fine?
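One reason to double-check the `[block_config] * num_hidden_layers` pattern: `to_dict()` returns a single dict here, so every layer shares one object, and a later in-place edit to one layer's config would reach all of them. A minimal illustration, with plain dicts standing in for the real config:

```python
import copy

cfg = {"ffn": {"no_op": False}}

shared = [cfg] * 3                     # three references to the same dict
shared[0]["ffn"]["no_op"] = True       # in-place edit reaches every "layer"

fresh = {"ffn": {"no_op": False}}
independent = [copy.deepcopy(fresh) for _ in range(3)]  # one dict per layer
independent[0]["ffn"]["no_op"] = True  # only layer 0 changes
```

If the dicts are treated as read-only downstream, sharing is harmless; per-layer copies are only needed once configs are edited in place.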

Collaborator

Can you add a one-line docstring describing what the puzzformer module does?

Comment on lines +184 to +193
index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
index_json = json_dumps(index)
_write_file_process_safe(index_json, index_path)

# Phase 4: Copy HF code
phase4_start_time = time.time()
copy_deci_lm_hf_code(checkpoint_dir)
phase4_time = time.time() - phase4_start_time
mprint(f"Phase 4 - Copy HF code: {phase4_time:.2f}s")
# Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
Collaborator

@danielkorzekwa this bug fix seems important

) -> tuple[Path, Path, Path]:
"""
Setup the test model and data for the puzzletron NAS search.
Setup the test model and data for the compress NAS search.
Collaborator

Please revert this rename in the diff — change "compress" back to "puzzletron".

# Create and save the Llama model
model = LlamaForCausalLM(llama_config)
# VL models have nested configs (text_config, vision_config)
if hf_config_name == "qwen3-vl-30b-a3b-instruct":
Collaborator

What if we instead do if hasattr(config, "text_config") and hasattr(config, "vision_config")? That would be more robust
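The suggested structural check, sketched with `SimpleNamespace` stand-ins for the nested HF configs (the helper name is illustrative):

```python
from types import SimpleNamespace

def resolve_text_config(config):
    # Detect VL models by structure rather than by hardcoded model name.
    if hasattr(config, "text_config") and hasattr(config, "vision_config"):
        return config.text_config
    return config

vl = SimpleNamespace(
    text_config=SimpleNamespace(hidden_size=64),
    vision_config=SimpleNamespace(hidden_size=32),
)
plain = SimpleNamespace(hidden_size=128)
```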

Comment on lines +8 to +9
layer_descriptor:
_target_: modelopt.torch.puzzletron.anymodel.models.llama.llama_model_descriptor.LlamaFFNIntermediateLayerDescriptor
Collaborator

Future improvement: infer layer descriptor automatically based on ckpt config

Collaborator

Is this file for ffn_pruning? Can you rename it so that's clear? Perhaps we can move the common attn/ffn pruning components into a separate yaml instead of copying them in two places?
