Add anymodel-core to feature/puzzletron#974

danielkorzekwa wants to merge 15 commits into feature/puzzletron from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the

Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Adds an AnyModel conversion and pruning subsystem: package-level converters and registries, ModelDescriptor abstractions with a Llama implementation, a per-layer patcher and no-op utilities, extensive pruning mixins/hooks/utilities (FFN/KV/MoE), a checkpoint/save refactor, dummy test modules, many test resources, and Megatron MoE hooks.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant CLI as convert_model
    participant DescFactory as ModelDescriptorFactory
    participant ConvFactory as ConverterFactory
    participant Converter
    participant FS as File System
    Client->>CLI: convert_model(input_dir, output_dir, converter)
    CLI->>DescFactory: get(converter)
    DescFactory-->>CLI: ModelDescriptor
    CLI->>ConvFactory: get(converter)
    ConvFactory-->>CLI: Converter class/instance
    CLI->>Converter: convert(descriptor, input_dir, output_dir)
    Converter->>FS: copy_checkpoint_files(input_dir, output_dir)
    Converter->>FS: read/write configs
    Converter->>Converter: convert_configs_in_dirs()
    Converter->>FS: read weight map (index or single file)
    Converter->>Converter: convert_model_weights()
    Converter->>FS: write subblock safetensors and index
    Converter-->>CLI: conversion complete
    CLI-->>Client: success
```
```mermaid
sequenceDiagram
    participant Layer as Decoder Layer Init
    participant Patcher as deci_x_patcher
    participant Stack as Call Stack
    participant Descriptor as ModelDescriptor
    participant Config as Layer Config
    Patcher->>Patcher: backup original __init__
    Patcher->>Patcher: install patched __init__
    Layer->>Patcher: __init__ invoked
    Patcher->>Stack: inspect caller stack for layer index
    Stack-->>Patcher: layer_index
    Patcher->>Descriptor: block_config_to_layer_overrides(block_config[layer_index])
    Descriptor-->>Patcher: overrides dict
    Patcher->>Config: override_config_with_block_configs(config, overrides)
    Config-->>Patcher: merged_config
    Patcher->>Patcher: call original __init__(merged_config)
    Patcher->>Descriptor: apply attn/mlp no-op post-init if configured
    Patcher->>Patcher: restore original __init__ on exit
```
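The backup/patch/restore cycle in the diagram can be illustrated with a generic context manager. The names here (`patch_init`, `make_overrides`) are hypothetical stand-ins: `make_overrides` plays the role of the descriptor's `block_config_to_layer_overrides` lookup, and the real patcher additionally infers the layer index from the call stack:

```python
from contextlib import contextmanager


@contextmanager
def patch_init(cls, make_overrides):
    """Temporarily wrap cls.__init__ so per-layer overrides are merged
    into the config before the original constructor runs."""
    original_init = cls.__init__  # back up the original __init__

    def patched_init(self, config, *args, **kwargs):
        merged = {**config, **make_overrides(config)}  # merge overrides into config
        original_init(self, merged, *args, **kwargs)   # call the original __init__

    cls.__init__ = patched_init  # install the patched __init__
    try:
        yield cls
    finally:
        cls.__init__ = original_init  # restore the original on exit
```

On exit the class behaves exactly as before, which keeps the patch scoped to model construction.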
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Force-pushed c45647f to e82164f (Compare)
Actionable comments posted: 6
🧹 Nitpick comments (4)
modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py (1)

89-120: Consider caching the `layer_name_predicates` result for repeated calls.

The method creates `2 + 2*num_layers` compiled regex patterns on each call. For models with many layers (e.g., 80+ layers), this generates 160+ patterns. If `get_weight_groups` is called multiple times with the same `num_layers`, caching could improve performance. This is a minor optimization since conversion is typically a one-time operation.

♻️ Optional caching approach

```python
from functools import lru_cache

@staticmethod
@lru_cache(maxsize=4)
def layer_name_predicates(num_layers: int) -> Dict[str, re.Pattern]:
    # ... existing implementation
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py` around lines 89-120: layer_name_predicates recompiles the same regexes on every call, which is wasteful for repeated get_weight_groups runs. Make layer_name_predicates cached by adding functools.lru_cache (or similar) to the static method so calls with the same num_layers return the prebuilt dict without rebuilding build_ffn_predicates/build_attention_predicates; import lru_cache, apply it to layer_name_predicates (keeping it static), and ensure the return type remains Dict[str, re.Pattern].

modelopt/torch/puzzletron/anymodel/converter/converter.py (1)
116-131: Consider using `logging` instead of `print`.

Similar to the descriptor factory, the converter uses `print()` statements for progress output. Using Python's `logging` module would provide better control for library consumers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/puzzletron/anymodel/converter/converter.py` around lines 116-131: the converter uses print() calls for progress in the block that saves subblocks and writes model.safetensors.index.json (see the prints around the loop over tensors, the "subblock_file" usage, and the final print referencing output_dir). Replace these prints with a module logger (e.g., add import logging and logger = logging.getLogger(__name__)) and call appropriate logger methods (logger.info or logger.debug) for the group/layer lines and the final status line so consumers can configure verbosity; keep the rest of the behavior (saving via save_file, updating new_index, writing the JSON) unchanged.

modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py (1)
49-51: Consider using `logging` instead of `print`.

Using `print()` in library code makes it difficult for consumers to control output verbosity. Consider using Python's `logging` module for better configurability.

♻️ Suggested change

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 # ... in the function:
-    print(
-        f"[resolve_descriptor_from_pretrained] Auto-detected model_type='{model_type}' → descriptor='{detected}'"
-    )
+    logger.info(
+        f"Auto-detected model_type='{model_type}' → descriptor='{detected}'"
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py` around lines 49-51: the print call in resolve_descriptor_from_pretrained should be replaced with the logging module so callers can control verbosity. Import logging (or use a module-level logger = logging.getLogger(__name__)) and change the print(...) that reports model_type and detected to a logger.debug or logger.info call (include model_type and detected in the message) so the same message is emitted via Python logging rather than stdout.

modelopt/torch/puzzletron/anymodel/puzzformer/utils.py (1)
32-40: Stack inspection is fragile but acceptable for this internal use case.

The `_get_variable_from_stack` function relies on inspecting caller stack frames for `layer_idx` or `idx` variables. This works but is inherently fragile:

- Refactoring the caller code could break this silently
- Variable renaming would cause a `RuntimeError`

Consider documenting the expected caller contract, or using a thread-local context as an alternative if stability becomes a concern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/puzzletron/anymodel/puzzformer/utils.py` around lines 32-40: _get_variable_from_stack relies on fragile stack inspection for names like "layer_idx" and "idx". Update its docstring to explicitly document the caller contract (which variable names callers must set) and the expected call pattern, and as an alternative implement a stable thread-local context API (e.g., functions to set/get a current layer index using threading.local) and update callers to use that API instead of relying on stack variables so refactors/renames won't silently break.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ce8becd5-0dda-4383-8211-5200a7397c00
📒 Files selected for processing (13)
- modelopt/torch/puzzletron/anymodel/converter/__init__.py
- modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py
- modelopt/torch/puzzletron/anymodel/converter/converter.py
- modelopt/torch/puzzletron/anymodel/converter/converter_factory.py
- modelopt/torch/puzzletron/anymodel/model_descriptor/__init__.py
- modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor.py
- modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
- modelopt/torch/puzzletron/anymodel/models/llama/__init__.py
- modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py
- modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py
- modelopt/torch/puzzletron/anymodel/puzzformer/__init__.py
- modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py
- modelopt/torch/puzzletron/anymodel/puzzformer/utils.py
```python
    converter: Converter | str,
):
    """Convert a HuggingFace model to AnyModel format.

    This function converts a HuggingFace checkpoint to the AnyModel format used
    for compression. The conversion process:

    1. Copies non-weight files (config, tokenizer, etc.)
    2. Creates block_configs for each layer
    3. Reorganizes weights into subblock checkpoints

    Args:
        input_dir: Path to the input HuggingFace checkpoint directory.
        output_dir: Path to the output AnyModel checkpoint directory.
        converter: Either a converter name (e.g., "llama") or a Converter class.

    Example:
        >>> convert_model(
        ...     input_dir="/path/to/Llama-3.1-8B-Instruct",
        ...     output_dir="/path/to/output/ckpts/teacher",
        ...     converter="llama",
        ... )
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Get descriptor and converter from factories (they use the same name)
    descriptor = ModelDescriptorFactory.get(converter)
    converter = ConverterFactory.get(converter)

    converter.convert(descriptor=descriptor, input_dir=input_dir, output_dir=output_dir)
```
The converter API contract is currently inconsistent with the implementation.

The function documents support for passing a Converter class, but the current resolution path expects a registry key and can produce an invalid descriptor object when a class is passed.

✅ Proposed fix (restrict and validate)

```diff
 def convert_model(
     input_dir: str,
     output_dir: str,
-    converter: Converter | str,
+    converter: str,
 ):
@@
-    descriptor = ModelDescriptorFactory.get(converter)
-    converter = ConverterFactory.get(converter)
-
-    converter.convert(descriptor=descriptor, input_dir=input_dir, output_dir=output_dir)
+    descriptor = ModelDescriptorFactory.get(converter)
+    converter_cls = ConverterFactory.get(converter)
+    converter_cls.convert(descriptor=descriptor, input_dir=input_dir, output_dir=output_dir)
```

Also update the docstring to state that `converter` is a registered converter name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py` around
lines 31 - 62, The converter parameter is documented as accepting a Converter
class but the implementation calls ModelDescriptorFactory.get and
ConverterFactory.get and will break if a class is passed; change the API
contract to require a registered converter name (string) by validating that
converter is an instance of str at the start of convert_model and raising a
clear TypeError if not, update the docstring to state "converter is a registered
converter name (str)" and adjust any variable names/comments around
ModelDescriptorFactory.get and ConverterFactory.get to reflect that they expect
a registry key rather than a class.
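The tightened contract described in the prompt can be sketched as below. This is illustrative, not the repository's code: the factory lookups and the actual conversion are elided, and the return value exists only so the sketch is observable:

```python
def convert_model(input_dir: str, output_dir: str, converter: str) -> str:
    """Require a registered converter name (str) and fail fast otherwise."""
    if not isinstance(converter, str):
        raise TypeError(
            "converter must be a registered converter name (str), "
            f"got {type(converter).__name__}"
        )
    # ... ModelDescriptorFactory.get(converter) / ConverterFactory.get(converter)
    # and the actual conversion would follow here ...
    return converter
```

Raising at the API boundary gives callers an immediate, descriptive error instead of an opaque failure deep inside the factories.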
```python
def get(cls, value: str | ModelDescriptor):
    """Get a registered converter by name or return the converter if already resolved."""
    if isinstance(value, str):
        if value in cls.CLASS_MAPPING:
            return cls.CLASS_MAPPING[value]
    return value
```
Fail fast on unknown converter names in the factory.

Returning unresolved strings from `get` causes late, opaque runtime errors in downstream conversion calls.

🛠️ Proposed fix

```diff
 @classmethod
 def get(cls, value: str | ModelDescriptor):
     """Get a registered converter by name or return the converter if already resolved."""
     if isinstance(value, str):
-        if value in cls.CLASS_MAPPING:
-            return cls.CLASS_MAPPING[value]
+        if value in cls.CLASS_MAPPING:
+            return cls.CLASS_MAPPING[value]
+        available = ", ".join(sorted(cls.CLASS_MAPPING.keys()))
+        raise KeyError(f"Unknown converter `{value}`. Available converters: [{available}]")
     return value
```

🤖 Prompt for AI Agents
return value🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/anymodel/converter/converter_factory.py` around
lines 70 - 75, The get method in converter_factory.py currently returns
unresolved strings which leads to late/opaque failures; change get(cls, value:
str | ModelDescriptor) to validate string names against CLASS_MAPPING and raise
a clear exception (e.g., ValueError or KeyError) when value is a str that is not
present in cls.CLASS_MAPPING, including the unknown name in the message; keep
returning the original object when value is already a ModelDescriptor/resolved
converter. Ensure you update the behavior of get and its docstring accordingly
so callers receive an immediate, descriptive failure for unknown converter
names.
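Put together, the fail-fast factory behaves like this self-contained sketch. The `register` method and exact class layout are assumptions for illustration; only the `get` semantics mirror the proposed fix:

```python
class ConverterFactory:
    """Minimal registry sketch: names map to converter classes."""

    CLASS_MAPPING: dict = {}

    @classmethod
    def register(cls, name: str, converter_cls) -> None:
        cls.CLASS_MAPPING[name] = converter_cls

    @classmethod
    def get(cls, value):
        """Resolve a registered name; pass already-resolved objects through."""
        if isinstance(value, str):
            if value in cls.CLASS_MAPPING:
                return cls.CLASS_MAPPING[value]
            # Fail fast with the unknown name and the valid options.
            available = ", ".join(sorted(cls.CLASS_MAPPING))
            raise KeyError(f"Unknown converter '{value}'. Available converters: [{available}]")
        return value
```

Listing the available names in the error message turns a typo into a one-glance fix for the caller.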
```python
# Convert MoE packed tensors if quantized is mxfp4 //gpt-oss-20b
if getattr(cls, "quantized", None) == "mxfp4":
    if name.endswith("_blocks"):
        converted_name = converted_name.replace("_blocks", "")
        tensors[converted_name] = convert_moe_packed_tensors(
            data[converted_name + "_blocks"],
            data[converted_name + "_scales"],
        )
    elif name.endswith("_scales"):
        continue
    else:
        tensors[converted_name] = data[name]
else:
    tensors[converted_name] = data[name]
```
🧩 Analysis chain

🏁 Scripts executed against NVIDIA/Model-Optimizer:

```bash
# First, let's locate and examine the converter.py file
fd -t f "converter.py" -p "modelopt/torch/puzzletron"

# Examine the converter.py file around lines 100-113 with broader context
cat -n modelopt/torch/puzzletron/anymodel/converter/converter.py | head -150 | tail -100

# Search for convert_moe_packed_tensors function definition and usage
rg -n "def convert_moe_packed_tensors|convert_moe_packed_tensors" --type=py -B2 -A5

# Look for tests related to MoE conversion
fd -t f "test_*.py" -o "*_test.py" | xargs rg -l "moe|MoE" 2>/dev/null

# Search for convert_weight_name implementation
rg -n "def convert_weight_name" --type=py -A10

# Search for subclasses that override convert_weight_name
rg -n "def convert_weight_name" --type=py -A5

# Search for mxfp4 quantization usage and MoE conversion tests
rg -n "mxfp4|convert_moe_packed_tensors" --type=py -B3 -A3

# Look for llama converter which might use the base Converter
cat -n modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py | head -50

# Search for actual MoE checkpoint structure documentation or examples
rg -n "moe.*blocks.*scales|_blocks.*_scales" --type=py -B2 -A2 | head -50

# Check if there are any actual tests that exercise the MoE conversion path with real data
rg -n "convert_model_weights.*mxfp4|gpt-oss-20b" --type=py -B3 -A3

# Verify the logic by examining the exact flow: what names come from param_to_file
cat -n modelopt/torch/puzzletron/anymodel/converter/converter.py | sed -n '65,130p'

# Check how param_to_file is built to understand what names it contains
cat -n modelopt/torch/puzzletron/anymodel/converter/converter.py | sed -n '43,63p'

# Verify by checking if any existing code might override convert_weight_name for MoE models
rg -n "class.*Converter.*:|quantized.*=" modelopt/torch/puzzletron --type=py -A5 | grep -A5 "quantized"

# Final verification: check the exact structure of what gets passed to convert_moe_packed_tensors
rg -n "def convert_moe_packed_tensors" --type=py -A10 -B2
```
Use the original parameter name to construct data dict keys in the MoE conversion logic.

Lines 105-106 access data[converted_name + "_blocks"] and data[converted_name + "_scales"], but the data dict is keyed by original checkpoint parameter names. The converted_name is derived from applying the convert_weight_name() transformation, so any subclass override that transforms the parameter prefix would cause a KeyError at runtime.

Fix: construct data keys using the original name parameter:

```python
data[name],
data[name.replace("_blocks", "_scales")]
```

Reserve converted_name for storing results in the tensors output dict. This ensures data access always uses original checkpoint names while transformations are applied only to output keys.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/anymodel/converter/converter.py` around lines 100 -
113, The MoE conversion is indexing the source checkpoint dict using
converted_name which can differ from the original checkpoint keys; update the
branch in converter.py (inside the loop that handles MoE packed tensors) to read
from data using the original parameter name (the variable name) when fetching
the "_blocks" and "_scales" entries, and continue to use converted_name only as
the key for storing the result into tensors (i.e., use name to build the data
lookups like the "_blocks"/"_scales" access and assign tensors[converted_name] =
convert_moe_packed_tensors(...)).
modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py — outdated comment (resolved)
```python
block_config = BlockConfig(
    attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
    ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
).to_dict()

block_configs = [block_config] * num_hidden_layers
return block_configs
```
Avoid shared dict aliasing across all layer block configs.

[block_config] * num_hidden_layers duplicates references, not values. A mutation in one layer config will affect all layers.

♻️ Proposed fix

```diff
-    block_config = BlockConfig(
-        attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
-        ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
-    ).to_dict()
-
-    block_configs = [block_config] * num_hidden_layers
+    block_configs = [
+        BlockConfig(
+            attention=AttentionConfig(
+                no_op=False, num_key_value_heads=config.num_key_value_heads
+            ),
+            ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
+        ).to_dict()
+        for _ in range(num_hidden_layers)
+    ]
     return block_configs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py` around
lines 44 - 50, The code currently builds a single BlockConfig dict
(block_config) and repeats its reference via [block_config] * num_hidden_layers
causing shared aliasing across layers; change this to produce independent dicts
per layer (e.g., build each layer's config inside a list comprehension or use a
deepcopy per iteration) so that block_configs is a list of distinct dicts,
referencing the BlockConfig/AttentionConfig/FFNConfig construction used earlier
(BlockConfig, block_config, block_configs, num_hidden_layers).
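The aliasing pitfall is easy to reproduce with plain dicts; this snippet only demonstrates Python's list-multiplication semantics, not the project's code:

```python
def make_config():
    # Stand-in for BlockConfig(...).to_dict()
    return {"ffn": {"no_op": False}}


aliased = [make_config()] * 3               # three references to ONE dict
aliased[0]["ffn"]["no_op"] = True           # mutate "one" layer...
assert aliased[1]["ffn"]["no_op"] is True   # ...and every layer changes

independent = [make_config() for _ in range(3)]  # a distinct dict per layer
independent[0]["ffn"]["no_op"] = True
assert independent[1]["ffn"]["no_op"] is False   # other layers unaffected
```

The list comprehension in the proposed fix is exactly the `independent` pattern: the constructor runs once per layer, so no references are shared.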
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Actionable comments posted: 1
♻️ Duplicate comments (4)
modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py (1)

39-44: ⚠️ Potential issue | 🔴 Critical

CRITICAL: `trust_remote_code` defaults to `True`, violating security guidelines.

Per coding guidelines and SECURITY.md, `trust_remote_code=True` must not be hardcoded. This flag enables execution of arbitrary Python shipped with a checkpoint, which is an RCE vector if the model source is untrusted. The parameter should default to `False`.

🔒 Proposed fix

```diff
-def resolve_descriptor_from_pretrained(pretrained: str | None, trust_remote_code: bool = True):
+def resolve_descriptor_from_pretrained(pretrained: str | None, trust_remote_code: bool = False):
     """Resolve the model descriptor by loading the checkpoint config and mapping model_type."""
```

Based on learnings: "trust_remote_code parameter for transformers model or tokenizer loading must be exposed as a caller-configurable parameter defaulting to False, not hardcoded to True"
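The opt-in default can be sketched without pulling in transformers. Here `load_config` is a stand-in for AutoConfig.from_pretrained, so the signature and dict-shaped return value below are illustrative only:

```python
def resolve_descriptor_from_pretrained(pretrained, load_config, trust_remote_code: bool = False):
    """Forward trust_remote_code so callers must opt in explicitly.

    load_config is a hypothetical injection point standing in for
    AutoConfig.from_pretrained in this sketch.
    """
    config = load_config(pretrained, trust_remote_code=trust_remote_code)
    return config["model_type"]
```

With the default at `False`, a checkpoint that needs custom code fails loudly unless the caller passes `trust_remote_code=True` on purpose.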
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py` around lines 39-44: resolve_descriptor_from_pretrained currently hardcodes trust_remote_code=True, which is a security risk. Change its default to False and ensure the value is forwarded to AutoConfig.from_pretrained so callers must explicitly opt in when they trust the remote code. Update the signature to trust_remote_code: bool = False and keep using that variable in the AutoConfig.from_pretrained call; audit any call sites of resolve_descriptor_from_pretrained and only pass trust_remote_code=True when the model source is trusted.

modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py (1)
44-50: ⚠️ Potential issue | 🟠 Major

Avoid shared dict aliasing across all layer block configs.

[block_config] * num_hidden_layers creates a list where all elements reference the same dict object. Any mutation to one layer's config would affect all layers.

♻️ Proposed fix

```diff
-    block_config = BlockConfig(
-        attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
-        ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
-    ).to_dict()
-
-    block_configs = [block_config] * num_hidden_layers
+    block_configs = [
+        BlockConfig(
+            attention=AttentionConfig(
+                no_op=False, num_key_value_heads=config.num_key_value_heads
+            ),
+            ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
+        ).to_dict()
+        for _ in range(num_hidden_layers)
+    ]
     return block_configs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py` around lines 44-50: the code builds a single block_config dict and repeats it with [block_config] * num_hidden_layers, causing aliasing. Instead, produce a distinct dict per layer by creating a fresh BlockConfig(attention=AttentionConfig(...), ffn=FFNConfig(...)).to_dict() for each hidden layer (e.g., a list comprehension over range(num_hidden_layers)), or at minimum use a shallow/deep copy of block_config per iteration, so block_configs contains no shared references.

modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py (1)
42-47: ⚠️ Potential issue | 🔴 Critical

Fix wrapper output handling for tensor-returning modules.

Wrapped.forward unconditionally accesses result[0], but when wrapping MatchingZeros, the result is a tensor (not a tuple). This causes result[0] to slice the batch dimension instead of returning the full tensor.

🐛 Proposed fix

```diff
 class Wrapped(cls):
     def forward(self, *args, **kwargs):
         result = super().forward(*args, **kwargs)
+        first_output = result[0] if isinstance(result, (tuple, list)) else result
         outputs = [None] * size
-        outputs[0] = result[0]
+        outputs[0] = first_output
         return tuple(outputs)
```

🤖 Prompt for AI Agents
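The guard generalizes to any module output. The helper below mirrors it using a plain Python object in place of a torch tensor; `first_output` is a hypothetical name used only for this demonstration:

```python
def first_output(result):
    """Tuples/lists yield their first element; anything else (e.g. a bare
    tensor) is returned whole instead of being sliced by result[0]."""
    return result[0] if isinstance(result, (tuple, list)) else result
```

Note that the `isinstance` check matters precisely because tensors are themselves subscriptable: without it, `result[0]` silently returns the first batch row instead of raising.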
Verify each finding against the current code and only fix it if needed. In `modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py` around lines 42-47: Wrapped.forward assumes result is a subscriptable tuple and does result[0], which breaks when the wrapped module (e.g., MatchingZeros) returns a tensor. Change forward in class Wrapped to check isinstance(result, (tuple, list)) and, if so, set outputs[0] = result[0]; otherwise set outputs[0] = result (for single-tensor returns) before returning tuple(outputs), so the wrapper returns a tuple of length size in all cases.

modelopt/torch/puzzletron/anymodel/converter/converter.py (1)
100-113: ⚠️ Potential issue | 🟠 Major

Use original parameter name to construct data dict keys in MoE conversion logic.
Lines 105-106 access `data[converted_name + "_blocks"]` and `data[converted_name + "_scales"]`, but the `data` dict is keyed by original checkpoint parameter names. If `convert_weight_name()` transforms the name, this would cause a KeyError at runtime.

🐛 Proposed fix
```diff
 if name.endswith("_blocks"):
-    converted_name = converted_name.replace("_blocks", "")
+    base_name = name.replace("_blocks", "")
+    converted_name = cls.convert_weight_name(base_name)
     tensors[converted_name] = convert_moe_packed_tensors(
-        data[converted_name + "_blocks"],
-        data[converted_name + "_scales"],
+        data[name],
+        data[base_name + "_scales"],
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/anymodel/converter/converter.py` around lines 100 - 113, The MoE conversion uses converted_name to build keys into the original checkpoint dict which can cause KeyError if convert_weight_name changed the name; update the logic in the block that checks getattr(cls, "quantized", None) == "mxfp4" so that when calling convert_moe_packed_tensors you reference the original parameter names (use name + "_blocks" and name + "_scales") instead of converted_name + "_blocks"/"_scales", and keep the rest of the assignment to tensors[converted_name] intact; adjust the elif handling for name.endswith("_scales") to continue as before and ensure convert_moe_packed_tensors is invoked with data[name + "_blocks"], data[name + "_scales"].
🧹 Nitpick comments (3)
modelopt/torch/puzzletron/anymodel/puzzformer/utils.py (1)
32-40: Stack inspection is fragile; consider documenting expected calling convention.
`_get_variable_from_stack` relies on callers having `layer_idx` or `idx` in their local scope. This works but is fragile if calling code changes. Consider adding a docstring note about the expected variable names in calling contexts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py` around lines 32 - 40, The helper _get_variable_from_stack currently searches caller frames for local names "layer_idx" or "idx", which is fragile; update its docstring to document the expected calling convention and the exact local variable names it looks for (e.g., "layer_idx" and "idx"), including why those names are required and advising callers to set one of them in scope (or to call an alternative API) if they cannot; mention the function name _get_variable_from_stack and the expected variable names in the docstring so future maintainers know the dependency.

modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py (2)
15-15: Avoid file-wide mypy suppression in a new public descriptor module.

Line 15 (`# mypy: ignore-errors`) suppresses all type diagnostics and can mask interface drift for `ModelDescriptor` overrides. Prefer localized `# type: ignore[...]` on unavoidable lines.
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py` at line 15, Remove the file-wide "# mypy: ignore-errors" from llama_model_descriptor.py and instead fix type issues by making the class and method signatures conform to the ModelDescriptor interface (e.g., ensure LlamaModelDescriptor methods match ModelDescriptor's parameter and return types); if there are specific unavoidable mypy failures, add localized comments like "# type: ignore[assignment]" or "# type: ignore[arg-type]" only on the offending lines (target the exact methods/attributes in LlamaModelDescriptor or other ModelDescriptor overrides) and run mypy to verify no other hidden issues remain.
60-72: Add inline comments documenting the Transformers API contracts for clarity.

The `size=2` at line 62 correctly matches the `LlamaAttention.forward()` return structure (attn_output, attn_weights), and the `LlamaRotaryEmbedding` constructor at line 71 uses the correct current signature. Both align with the pinned Transformers versions (4.48–5.0 range). While the codebase already documents these patterns in `ModelDescriptor` and `no_op.py`, adding brief inline comments here would make the API contract assumptions explicit for future maintainers. Consider:

```python
decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)()  # LlamaAttention returns (attn_output, attn_weights)
```

and

```python
model.model.rotary_emb = LlamaRotaryEmbedding(model.config, runtime.device)  # LlamaRotaryEmbedding(config, device) constructor
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py` around lines 60 - 72, Update attn_no_op_post_init and init_rotary_embedding with brief inline comments documenting the Transformers API contracts: next to decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)() add a comment that LlamaAttention.forward returns (attn_output, attn_weights) to justify size=2; next to model.model.rotary_emb = LlamaRotaryEmbedding(model.config, runtime.device) add a comment that the constructor signature is LlamaRotaryEmbedding(config, device). Keep comments concise and place them immediately after the referenced expressions in the attn_no_op_post_init and init_rotary_embedding functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py`:
- Around line 114-122: The inline comment in override_config_with_block_configs
has a typo ("Model initialization requires fails with None"); update it to a
clear, concise statement such as "Model initialization requires non-None values;
filter out None overrides" (or similar) to explain why block_configs entries
that are None are removed before updating _config; keep the rest of the function
unchanged and ensure the comment appears above the _config_overrides line
referencing the None-filtering behavior.
---
Duplicate comments:
In `@modelopt/torch/puzzletron/anymodel/converter/converter.py`:
- Around line 100-113: The MoE conversion uses converted_name to build keys into
the original checkpoint dict which can cause KeyError if convert_weight_name
changed the name; update the logic in the block that checks getattr(cls,
"quantized", None) == "mxfp4" so that when calling convert_moe_packed_tensors
you reference the original parameter names (use name + "_blocks" and name +
"_scales") instead of converted_name + "_blocks"/"_scales", and keep the rest of
the assignment to tensors[converted_name] intact; adjust the elif handling for
name.endswith("_scales") to continue as before and ensure
convert_moe_packed_tensors is invoked with data[name + "_blocks"], data[name +
"_scales"].
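The keying rule the fix describes can be tried out with plain strings; the parameter name below is illustrative, not taken from a real checkpoint:

```python
def moe_lookup_keys(name: str) -> tuple[str, str, str]:
    """Given an original '*_blocks' parameter name, derive the lookup keys.

    The data dict is keyed by ORIGINAL checkpoint names, so both companion
    tensors must be looked up via the original name, not the converted one.
    """
    base_name = name.replace("_blocks", "")
    blocks_key = name                    # original name, as stored in data
    scales_key = base_name + "_scales"   # original companion tensor
    return base_name, blocks_key, scales_key


base, blocks, scales = moe_lookup_keys("model.layers.0.mlp.experts.gate_up_proj_blocks")
print(blocks)  # model.layers.0.mlp.experts.gate_up_proj_blocks
print(scales)  # model.layers.0.mlp.experts.gate_up_proj_scales
```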
In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`:
- Around line 39-44: The function resolve_descriptor_from_pretrained currently
hardcodes trust_remote_code=True which is a security risk; change its default to
False and ensure the value is forwarded to AutoConfig.from_pretrained so callers
must explicitly opt-in when they trust the remote code. Update the function
signature trust_remote_code: bool = False in
model_descriptor_factory.resolve_descriptor_from_pretrained and keep using that
variable in the AutoConfig.from_pretrained call; audit any callsites of
resolve_descriptor_from_pretrained and only pass trust_remote_code=True when the
model source is trusted.
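A minimal sketch of the safer default: the resolver takes an explicit opt-in flag and forwards it to the config loader. The loader here is a stand-in for `AutoConfig.from_pretrained`, used only to show the forwarding; names and signature are illustrative:

```python
def resolve_descriptor_from_pretrained(
    model_name_or_path: str,
    trust_remote_code: bool = False,  # safe default; callers must opt in
    config_loader=None,
):
    """Resolve a model config, forwarding trust_remote_code explicitly."""
    if config_loader is None:
        # In the real code this would be transformers.AutoConfig.from_pretrained.
        raise ValueError("a config loader is required in this sketch")
    return config_loader(model_name_or_path, trust_remote_code=trust_remote_code)


def fake_loader(path, trust_remote_code):
    # Stub that records what it was called with, standing in for AutoConfig.
    return {"path": path, "trust_remote_code": trust_remote_code}


cfg = resolve_descriptor_from_pretrained("some/model", config_loader=fake_loader)
print(cfg["trust_remote_code"])  # False unless explicitly enabled by the caller
```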
In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py`:
- Around line 44-50: The code builds a single block_config dict and repeats it
with [block_config] * num_hidden_layers causing aliasing; instead produce
distinct dicts per layer by creating a fresh BlockConfig.to_dict() for each
hidden layer (e.g., use a list comprehension or loop) so that block_configs =
[BlockConfig(attention=AttentionConfig(...), ffn=FFNConfig(...)).to_dict() for _
in range(num_hidden_layers)] or at minimum use shallow/deep copy of block_config
per iteration; update the code around BlockConfig, AttentionConfig, FFNConfig,
block_config and block_configs to avoid shared references.
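The aliasing pitfall behind this comment is easy to reproduce with plain dicts; multiplying a list of mutable objects copies references, not values:

```python
# Repeating one dict aliases the same object in every slot.
block_config = {"ffn": {"intermediate_size": 1024}}
aliased = [block_config] * 3
aliased[0]["ffn"]["intermediate_size"] = 512
print(aliased[1]["ffn"]["intermediate_size"])  # 512 (every layer changed)

# Building a fresh dict per layer keeps the configs independent.
distinct = [{"ffn": {"intermediate_size": 1024}} for _ in range(3)]
distinct[0]["ffn"]["intermediate_size"] = 512
print(distinct[1]["ffn"]["intermediate_size"])  # 1024 (unaffected)
```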
In `@modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py`:
- Around line 42-47: Wrapped.forward assumes result is subscriptable tuple and
does result[0], which breaks when the wrapped module (e.g., MatchingZeros)
returns a tensor; change forward in class Wrapped to detect whether result is a
tuple/list (isinstance(result, (tuple, list))) and if so set outputs[0] =
result[0], otherwise set outputs[0] = result (for single-tensor returns) before
returning tuple(outputs); ensure this logic uses the existing variables
(Wrapped.forward, result, outputs, size) so the wrapper returns a tuple of
length size in all cases.
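The normalization the fix asks for can be exercised without torch; the return values below stand in for a tensor and an (output, weights) tuple:

```python
def normalize_first_output(result, size):
    """Place the wrapped module's primary output at slot 0 of a size-N tuple.

    Handles both tuple/list returns (take element 0) and single-value returns
    (use the value itself), mirroring the proposed Wrapped.forward fix.
    """
    first_output = result[0] if isinstance(result, (tuple, list)) else result
    outputs = [None] * size
    outputs[0] = first_output
    return tuple(outputs)


print(normalize_first_output(("attn_out", "attn_weights"), size=2))  # ('attn_out', None)
print(normalize_first_output("plain_tensor", size=2))                # ('plain_tensor', None)
```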
---
Nitpick comments:
In `@modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py`:
- Line 15: Remove the file-wide "# mypy: ignore-errors" from
llama_model_descriptor.py and instead fix type issues by making the class and
method signatures conform to the ModelDescriptor interface (e.g., ensure
LlamaModelDescriptor methods match ModelDescriptor's parameter and return
types); if there are specific unavoidable mypy failures, add localized comments
like "# type: ignore[assignment]" or "# type: ignore[arg-type]" only on the
offending lines (target the exact methods/attributes in LlamaModelDescriptor or
other ModelDescriptor overrides) and run mypy to verify no other hidden issues
remain.
- Around line 60-72: Update attn_no_op_post_init and init_rotary_embedding with
brief inline comments documenting the Transformers API contracts: next to
decoder_layer.self_attn = return_tuple_of_size(MatchingZeros, size=2)() add a
comment that LlamaAttention.forward returns (attn_output, attn_weights) to
justify size=2; next to model.model.rotary_emb =
LlamaRotaryEmbedding(model.config, runtime.device) add a comment that the
constructor signature is LlamaRotaryEmbedding(config, device). Keep comments
concise and place them immediately after the referenced expressions in the
attn_no_op_post_init and init_rotary_embedding functions.
In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py`:
- Around line 32-40: The helper _get_variable_from_stack currently searches
caller frames for local names "layer_idx" or "idx", which is fragile; update its
docstring to document the expected calling convention and the exact local
variable names it looks for (e.g., "layer_idx" and "idx"), including why those
names are required and advising callers to set one of them in scope (or to call
an alternative API) if they cannot; mention the function name
_get_variable_from_stack and the expected variable names in the docstring so
future maintainers know the dependency.
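For reference, the kind of frame walk `_get_variable_from_stack` performs can be sketched with the stdlib `inspect` module; the variable names `layer_idx`/`idx` are the convention the review asks to document:

```python
import inspect


def get_variable_from_stack(names=("layer_idx", "idx")):
    """Walk caller frames and return the first local matching one of `names`.

    Fragile by design: callers must have `layer_idx` or `idx` in scope.
    """
    for frame_info in inspect.stack()[1:]:
        for name in names:
            if name in frame_info.frame.f_locals:
                return frame_info.frame.f_locals[name]
    return None


def build_layer():
    layer_idx = 7  # read via the stack walk, never passed explicitly
    return get_variable_from_stack()


print(build_layer())  # 7
```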
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f1dc323b-8f82-4f6a-a989-849919373f3b
📒 Files selected for processing (13)
- modelopt/torch/puzzletron/anymodel/converter/__init__.py
- modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py
- modelopt/torch/puzzletron/anymodel/converter/converter.py
- modelopt/torch/puzzletron/anymodel/converter/converter_factory.py
- modelopt/torch/puzzletron/anymodel/model_descriptor/__init__.py
- modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor.py
- modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
- modelopt/torch/puzzletron/anymodel/models/llama/__init__.py
- modelopt/torch/puzzletron/anymodel/models/llama/llama_converter.py
- modelopt/torch/puzzletron/anymodel/models/llama/llama_model_descriptor.py
- modelopt/torch/puzzletron/anymodel/puzzformer/__init__.py
- modelopt/torch/puzzletron/anymodel/puzzformer/no_op.py
- modelopt/torch/puzzletron/anymodel/puzzformer/utils.py
🚧 Files skipped from review as they are similar to previous changes (4)
- modelopt/torch/puzzletron/anymodel/puzzformer/__init__.py
- modelopt/torch/puzzletron/anymodel/converter/convert_any_model.py
- modelopt/torch/puzzletron/anymodel/converter/converter_factory.py
- modelopt/torch/puzzletron/anymodel/models/llama/__init__.py
```python
def override_config_with_block_configs(
    config: PretrainedConfig, block_configs: Dict[str, Any]
) -> PretrainedConfig:
    """Create a copy of config with block_config overrides applied."""
    _config = copy.deepcopy(config)
    # Model initialization requires fails with None in case of no-ops
    _config_overrides = {k: v for k, v in block_configs.items() if v is not None}
    _config.update(_config_overrides)
    return _config
```
Fix typo in comment.
Line 119 comment reads "Model initialization requires fails with None" which appears to be a typo. Consider clarifying.
📝 Proposed fix
```diff
 _config = copy.deepcopy(config)
-# Model initialization requires fails with None in case of no-ops
+# Model initialization fails with None values in case of no-ops
 _config_overrides = {k: v for k, v in block_configs.items() if v is not None}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def override_config_with_block_configs(
    config: PretrainedConfig, block_configs: Dict[str, Any]
) -> PretrainedConfig:
    """Create a copy of config with block_config overrides applied."""
    _config = copy.deepcopy(config)
    # Model initialization fails with None values in case of no-ops
    _config_overrides = {k: v for k, v in block_configs.items() if v is not None}
    _config.update(_config_overrides)
    return _config
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/anymodel/puzzformer/utils.py` around lines 114 -
122, The inline comment in override_config_with_block_configs has a typo ("Model
initialization requires fails with None"); update it to a clear, concise
statement such as "Model initialization requires non-None values; filter out
None overrides" (or similar) to explain why block_configs entries that are None
are removed before updating _config; keep the rest of the function unchanged and
ensure the comment appears above the _config_overrides line referencing the
None-filtering behavior.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff            @@
##   feature/puzzletron    #974   +/- ##
========================================
- Coverage        72.12%  72.10%  -0.03%
========================================
  Files              209     209
  Lines            23628   23628
========================================
- Hits             17042   17036      -6
- Misses            6586    6592      +6
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
modelopt/torch/puzzletron/pruning/pruning_ckpts.py (3)
319-337: ⚠️ Potential issue | 🟠 Major

Handle list-valued pruning mixins before type dispatch.
Line 319 resolves pruning mixins dynamically, but Lines 332-337 only branch on single-instance `isinstance(...)`. If the resolver returns a list, valid configs will hit the `NotImplementedError` path.

Proposed fix

```diff
 cfg.pruning.pruning_mixin = resolve_pruning_mixin(cfg.pruning.pruning_mixin, cfg.descriptor)
 pruning_mixin = cfg.pruning.pruning_mixin
+if isinstance(pruning_mixin, list):
+    if len(pruning_mixin) != 1:
+        raise NotImplementedError(
+            f"checkpoint pruning currently supports exactly one pruning mixin, got {len(pruning_mixin)}"
+        )
+    pruning_mixin = pruning_mixin[0]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py` around lines 319 - 337, The code currently assumes resolve_pruning_mixin(cfg.pruning.pruning_mixin, cfg.descriptor) returns a single object (pruning_mixin) but may return a list; update the dispatch to handle list-valued mixins by detecting if pruning_mixin is a list/tuple and iterating over its elements, calling the appropriate launcher for each element (use the existing launch_ffn_intermediates_prune_ckpt, launch_attn_groups_prune_ckpt, and launch_experts_prune_ckpt), otherwise fall back to the single-instance isinstance checks; also validate element types and raise a clear NotImplementedError for unsupported mixin types so configs that return lists are processed correctly.
309-310: ⚠️ Potential issue | 🟡 Minor

Create `puzzle_dir/ckpts` before symlink creation in MoE FFN path.

Line 310 creates a symlink under `cfg.puzzle_dir/ckpts` without ensuring the directory exists. This can fail on first run.

Proposed fix

```diff
 # Create symlink in puzzle_dir/ckpts
-os.symlink(output_dir, os.path.join(cfg.puzzle_dir, "ckpts", dirname))
+ckpt_path = os.path.join(cfg.puzzle_dir, "ckpts")
+os.makedirs(ckpt_path, exist_ok=True)
+os.symlink(output_dir, os.path.join(ckpt_path, dirname))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py` around lines 309 - 310, The symlink creation using os.symlink(output_dir, os.path.join(cfg.puzzle_dir, "ckpts", dirname)) can fail if cfg.puzzle_dir/ckpts does not exist; before calling os.symlink in pruning_ckpts.py ensure the target directory exists by creating cfg.puzzle_dir/ckpts (e.g., call os.makedirs(os.path.join(cfg.puzzle_dir, "ckpts"), exist_ok=True) or equivalent) and then create the symlink, keeping the same variables output_dir and dirname.
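A self-contained check of the makedirs-before-symlink pattern, wrapped in a helper and run against a temporary directory so it is safe anywhere (the helper name is illustrative):

```python
import os
import tempfile


def create_ckpt_symlink(puzzle_dir: str, output_dir: str, dirname: str) -> str:
    """Create puzzle_dir/ckpts/<dirname> -> output_dir, creating parents first."""
    ckpt_path = os.path.join(puzzle_dir, "ckpts")
    os.makedirs(ckpt_path, exist_ok=True)  # safe even if it already exists
    link = os.path.join(ckpt_path, dirname)
    os.symlink(output_dir, link)
    return link


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "output")
    os.makedirs(out)
    # puzzle_dir does not exist yet, mirroring a first run.
    link = create_ckpt_symlink(os.path.join(tmp, "puzzle"), out, "pruned_ckpt")
    print(os.path.islink(link))  # True
```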
179-203: ⚠️ Potential issue | 🟠 Major

Pass a dict (not a JSON string) to `model_config_overrides_dict` in hidden-size pruning.

Line 179 serializes overrides with `json.dumps(...)`, but Line 202 passes that value to `model_config_overrides_dict`. Other launch paths pass dictionaries; this hidden-dim path is likely sending the wrong type and can break override application.

Proposed fix

```diff
-model_config_overrides_json = json.dumps(
-    {
-        "hidden_size": hidden_size,
-        "ffn": [
-            {
-                "intermediate_size": intermediate_size,
-            }
-            for intermediate_size in intermediate_sizes
-        ],
-    }
-)
+model_config_overrides_dict = {
+    "hidden_size": hidden_size,
+    "ffn": [
+        {"intermediate_size": intermediate_size}
+        for intermediate_size in intermediate_sizes
+    ],
+}
@@
-mprint(f"Model config overrides: {model_config_overrides_json}")
+mprint(f"Model config overrides: {json.dumps(model_config_overrides_dict)}")
@@
-model_config_overrides_dict=model_config_overrides_json,
+model_config_overrides_dict=model_config_overrides_dict,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py` around lines 179 - 203, The code currently serializes the model overrides into model_config_overrides_json with json.dumps(...) and then passes that JSON string into init_child_from_parent via the model_config_overrides_dict parameter; change model_config_overrides_json to be a Python dict (e.g., name it model_config_overrides or keep the variable but remove json.dumps) that contains {"hidden_size": hidden_size, "ffn": [ {"intermediate_size": intermediate_size} for intermediate_size in intermediate_sizes ]} and pass that dict to init_child_from_parent's model_config_overrides_dict so the override argument is a dict (not a JSON string); keep the rest (dirname, output_dir, mlp_init_config_yaml, and the call to init_child_from_parent) unchanged.
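The type mismatch the fix targets is easy to demonstrate: a parameter named `*_dict` should receive a mapping, not its serialized form. The function below is illustrative, not the project's API:

```python
import json


def apply_overrides(config: dict, model_config_overrides_dict) -> dict:
    """Merge overrides into a config; expects a dict, as the parameter name says."""
    if not isinstance(model_config_overrides_dict, dict):
        raise TypeError(
            f"expected dict, got {type(model_config_overrides_dict).__name__}"
        )
    merged = dict(config)
    merged.update(model_config_overrides_dict)
    return merged


overrides = {"hidden_size": 2048, "ffn": [{"intermediate_size": 5504}]}

# Correct: pass the dict itself; serialize only for logging.
print(apply_overrides({"hidden_size": 4096}, overrides)["hidden_size"])  # 2048
print(json.dumps(overrides))  # log-friendly string form

# Wrong: apply_overrides({}, json.dumps(overrides)) would raise TypeError.
```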
🧹 Nitpick comments (13)
modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py (4)
15-15: Consider removing or scoping the mypy ignore directive.

A blanket `# mypy: ignore-errors` suppresses all type checking in this file, which reduces type safety for the substantial new code being added (the MoE expert hooks). Consider either removing this directive and fixing type issues, or using targeted `# type: ignore[specific-error]` comments only where necessary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` at line 15, Remove the blanket module-level "# mypy: ignore-errors" at the top of base_hooks.py and either (a) delete it entirely and fix the reported type issues in the new MoE expert hooks, or (b) replace it with targeted per-line/type ignores (e.g., "# type: ignore[xxx]") only on the specific functions/classes that legitimately need suppression (the new MoE expert hook implementations). Run mypy to identify the exact errors, apply minimal scoped "# type: ignore[...]" comments next to the offending symbols (or adjust signatures/annotations in the MoE hook classes/functions) so the rest of the file retains full type checking.
1144-1166: Placeholder implementation needs completion.

This is a no-op placeholder that returns zeroed router logits and unmodified hidden states, which means expert importance scoring won't work for Qwen3-VL models. The TODO at line 1147 should be tracked.

Would you like me to open an issue to track implementing `get_router_logits_and_routed_experts` based on the Qwen3VLMoeSparseMoe forward pass?
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 1144 - 1166, The current Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts returns dummy zeros and unmodified hidden states; replace this placeholder with code that mirrors the Qwen3VLMoeSparseMoe forward pass to extract actual router logits and per-token routed expert outputs: locate the Qwen3VLMoeSparseMoe.forward implementation and reproduce its routing steps (compute router logits if router_logits arg is None, apply top-k gating/softmax or routing mask, gather the expert outputs into routed_experts with shape (batch*seq, hidden_dim) and return router_logits tensor shaped (batch*seq, self.num_local_experts) and routed_experts on the same device), ensure batch_size calculation matches that forward, respect self.num_local_experts and devices, and add unit tests or assertions validating shapes and non-zero router logits for routed cases in the Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts method.
1031-1049: Unbounded memory growth from accumulated tensors.
`self.router_argsort` appends a tensor on every forward pass without bounds. For long calibration runs with many tokens, this could lead to OOM. Consider:
- Adding a configurable limit with periodic aggregation
- Using memory-mapped storage for large-scale runs
- Documenting memory requirements based on expected token count
The `int16` conversion is a good memory optimization, though it limits `num_experts` to ≤32767.
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 1031 - 1049, The hook currently appends an unbounded tensor to self.router_argsort on every forward in __call__, causing OOM for long runs; change this by adding a configurable cap (e.g., max_entries or max_tokens) and implement eviction/aggregation behavior when the cap is reached (e.g., drop oldest entries, aggregate into a single summary tensor, or flush to disk/memory-map), expose the cap via activation_hooks_kwargs or the class init, and update __call__ to respect the cap when pushing router_argsort (ensure conversions to int16 remain but add a runtime check that num_experts ≤ 32767 and fall back to a larger dtype or error if exceeded); reference self.router_argsort, __call__, and the local router_argsort variable for where to make these changes.
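A minimal sketch of the capped-accumulation idea. Automatic eviction of the oldest entries is only one of the options the comment lists (aggregation or spilling to disk are others), and the cap value here is illustrative:

```python
from collections import deque


class CappedAccumulator:
    """Accumulate per-step results with a hard cap to bound memory.

    Uses a deque with maxlen so the oldest entries are evicted automatically
    once the cap is reached; a real hook might aggregate or spill to disk
    instead of dropping.
    """

    def __init__(self, max_entries: int = 4):
        self.entries = deque(maxlen=max_entries)

    def push(self, item):
        self.entries.append(item)


acc = CappedAccumulator(max_entries=4)
for step in range(10):
    acc.push(step)

print(list(acc.entries))  # [6, 7, 8, 9]: only the newest 4 survive
```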
1169-1202: Placeholder implementation needs completion.

Similar to `Qwen3VLRemoveExpertsIndependentHook`, this placeholder won't provide meaningful expert importance scores for GPT-OSS models. The extensive docstring is helpful but the implementation needs to be completed based on `GptOssSparseMoeBlock.forward()`.

Would you like me to open an issue to track implementing this hook, or would you like assistance sketching out the implementation based on the GPT-OSS MoE forward pass?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 1169 - 1202, Summary: The placeholder get_router_logits_and_routed_experts in class GptOssRemoveExpertsIndependentHook returns zeros and no-op expert outputs; it must be implemented to mirror GptOssSparseMoeBlock.forward() to produce meaningful router logits and routed expert outputs. Fix: update GptOssRemoveExpertsIndependentHook.get_router_logits_and_routed_experts to call or replicate the routing logic from GptOssSparseMoeBlock.forward() — accept an optional router_logits input or compute router logits using the same gating layer(s) used in GptOssSparseMoeBlock, return router_logits shaped (num_tokens, num_local_experts) and routed_experts as the actual per-token expert outputs (flatten/reshape to match forward’s routed expert representation), preserve device/dtype and batch/token ordering, and handle cases where hidden_states is (batch, seq_len, hidden_dim) or (num_tokens, hidden_dim); reference GptOssSparseMoeBlock.forward(), the get_router_logits_and_routed_experts method, and self.num_local_experts to locate where to implement the logic.

modelopt/torch/puzzletron/utils/dummy_modules.py (2)
44-50: Align `DummyBlock.forward` return type with actual behavior.

The method always returns `x`, so `tuple[torch.Tensor, None]` in the annotation is misleading and weakens static checks.

Proposed fix

```diff
-    ) -> torch.Tensor | tuple[torch.Tensor, None]:
+    ) -> torch.Tensor:
         return x
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/utils/dummy_modules.py` around lines 44 - 50, The return type annotation of DummyBlock.forward is incorrect: it currently declares torch.Tensor | tuple[torch.Tensor, None] but the implementation always returns x; update the type hint on DummyBlock.forward to return torch.Tensor (remove the tuple union) to match actual behavior and improve static typing checks, adjusting any import/type references if needed.
30-33: Use a Protocol to type-hint the incompatible_keys parameter for PyTorch version resilience.
`torch.nn.modules.module._IncompatibleKeys` is a private API that may break across PyTorch versions. The public contract for `register_load_state_dict_post_hook` guarantees an object with `.missing_keys` and `.unexpected_keys` attributes. Define and use a Protocol instead:

```python
from typing import Protocol


class IncompatibleKeys(Protocol):
    missing_keys: list[str]
    unexpected_keys: list[str]


@staticmethod
def load_state_dict_post_hook(
    module: torch.nn.Module,
    incompatible_keys: IncompatibleKeys,
) -> None: ...
```

This decouples from the private implementation detail while maintaining type safety.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/utils/dummy_modules.py` around lines 30 - 33, Replace the use of the private torch type in the load_state_dict_post_hook signature by defining a small Protocol (e.g., IncompatibleKeys with attributes missing_keys: list[str] and unexpected_keys: list[str]) and change the function parameter type to that Protocol; import Protocol from typing, remove the torch.nn.modules.module._IncompatibleKeys annotation, and keep the function name load_state_dict_post_hook and its behavior unchanged so the hook accepts any object exposing .missing_keys and .unexpected_keys.tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml (1)
13-14: Track TODO items for init mode compatibility.

These TODO comments indicate known limitations with the `mlp_init_mode` and `gqa_init_mode` settings. Consider creating issues to track these improvements so they aren't forgotten.

Would you like me to open issues to track the CopyAsIs/FromTeacher compatibility work for these init modes?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml` around lines 13 - 14, Add tracked issues for the TODOs so the compatibility work for init modes isn't lost: create two issues referencing mlp_init_mode and gqa_init_mode respectively, describe current limitation ("Truncate" and "AverageKV" only) and request support for CopyAsIs and FromTeacher initialization paths, link to the config file and mention symbols CopyAsIs and FromTeacher, and assign/label them as enhancement and technical-debt so they can be prioritized.tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py (1)
30-62: Add `if __name__ == "__main__"` guard to prevent execution on import.

This script executes immediately when imported, which could cause unintended side effects. Wrapping the execution logic in a main guard is a best practice, especially for utility scripts.
♻️ Proposed fix to add main guard
```diff
-with open(in_path, encoding="utf-8") as f:
-    tokenizer_data = json.load(f)
-
-# Get and sort the original vocab by index (frequency proxy)
-orig_vocab = tokenizer_data["model"]["vocab"]
-
-# Sort tokens by their original index (lowest index = assumed most common/important)
-sorted_tokens = sorted(orig_vocab.items(), key=lambda item: item[1])
-
-# Keep the top N tokens
-tokens_to_keep = [tok for tok, idx in sorted_tokens[:NUM_TO_KEEP]]
-
-# Re-index the selected tokens: 0..N-1
-small_vocab = {tok: i for i, tok in enumerate(tokens_to_keep)}
-tokenizer_data["model"]["vocab"] = small_vocab
-
-# Update vocab size
-if "vocab_size" in tokenizer_data["model"]:
-    tokenizer_data["model"]["vocab_size"] = len(small_vocab)
-
-# Optionally remove merges if present and unneeded (mostly for BPE/WordPiece)
-if "merges" in tokenizer_data["model"]:
-    tokenizer_data["model"]["merges"] = []
-
-# Remove added_tokens if not needed
-if "added_tokens" in tokenizer_data:
-    tokenizer_data["added_tokens"] = []
-
-# Write out the truncated tokenizer.json
-with open(out_path, "w", encoding="utf-8") as f:
-    json.dump(tokenizer_data, f, indent=2, ensure_ascii=False)
-
-print(f"Truncated tokenizer saved to: {out_path}")
+if __name__ == "__main__":
+    with open(in_path, encoding="utf-8") as f:
+        tokenizer_data = json.load(f)
+
+    # Get and sort the original vocab by index (frequency proxy)
+    orig_vocab = tokenizer_data["model"]["vocab"]
+
+    # Sort tokens by their original index (lowest index = assumed most common/important)
+    sorted_tokens = sorted(orig_vocab.items(), key=lambda item: item[1])
+
+    # Keep the top N tokens
+    tokens_to_keep = [tok for tok, idx in sorted_tokens[:NUM_TO_KEEP]]
+
+    # Re-index the selected tokens: 0..N-1
+    small_vocab = {tok: i for i, tok in enumerate(tokens_to_keep)}
+    tokenizer_data["model"]["vocab"] = small_vocab
+
+    # Update vocab size
+    if "vocab_size" in tokenizer_data["model"]:
+        tokenizer_data["model"]["vocab_size"] = len(small_vocab)
+
+    # Optionally remove merges if present and unneeded (mostly for BPE/WordPiece)
+    if "merges" in tokenizer_data["model"]:
+        tokenizer_data["model"]["merges"] = []
+
+    # Remove added_tokens if not needed
+    if "added_tokens" in tokenizer_data:
+        tokenizer_data["added_tokens"] = []
+
+    # Write out the truncated tokenizer.json
+    with open(out_path, "w", encoding="utf-8") as f:
+        json.dump(tokenizer_data, f, indent=2, ensure_ascii=False)
+
+    print(f"Truncated tokenizer saved to: {out_path}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py` around lines 30 - 62, The file executes its truncation logic at import time; wrap the top-level code (the open(in_path)... json.dump(...) and print(...) sequence) into a main() function (or similar) and add an if __name__ == "__main__": guard that calls main(), ensuring variables like in_path, out_path and NUM_TO_KEEP are obtained/used inside main() so importing this module does not run the truncation immediately.modelopt/torch/puzzletron/pruning/pruning_mixin.py (1)
52-73: Remove or implement the commented-out code.

The commented-out `prune_single_layer` abstract method (lines 63-73) should either be implemented or removed to avoid confusion about the intended interface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/pruning/pruning_mixin.py` around lines 52 - 73, The commented-out abstract method prune_single_layer in class PruningMixIn is ambiguous and should be resolved: either uncomment and declare it as an abstract method with the given signature (def prune_single_layer(self, layer_idx: int, parent_state_dict: dict, new_state_dict: dict, original_config: PretrainedConfig, new_config: PretrainedConfig, **kwargs) -> None: raise NotImplementedError) so implementers must provide pruning logic, or remove the entire commented block to avoid suggesting an interface that doesn't exist; update the class docstring or supported_hooks() docstring to document the chosen interface change and ensure references in callers (e.g., any code expecting PruningMixIn.prune_single_layer) are adjusted accordingly.

modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py (1)
115-120: Document in-place mutation of the `keys` parameter.

The method mutates the `keys` dict passed as an argument by popping entries. This side effect should be documented in the method's docstring, or the parameter should be copied if mutation is unintended.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py` around lines 115 - 120, The loop mutates the passed-in dict parameter `keys` by popping entries (see loops referencing `parent_router_keys` and `orig_experts_keys`); either update the method/class docstring to explicitly state that `keys` is modified in-place, or avoid the side effect by operating on a shallow copy (e.g., work with a local copy of `keys` before the two pop loops) so callers' dicts are not mutated unexpectedly—reference the method in expert_removal_pruning_mixin (where `keys`, `parent_router_keys`, and `orig_experts_keys` are used) when making the change.

modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py (1)
260-275: Consider validating list homogeneity.

`maybe_cast_block_configs` checks only the first element to determine if the list contains dicts. If the list contains mixed types (e.g., `[dict, BlockConfig]`), this could lead to unexpected behavior.

♻️ Proposed fix to handle mixed types

```diff
 def maybe_cast_block_configs(
     block_configs: List[BlockConfig | dict] | None,
 ) -> List[BlockConfig] | None:
     if not block_configs:
         return block_configs
-    if isinstance(block_configs[0], dict):
-        return [BlockConfig(**conf) for conf in block_configs]
-    return block_configs
+    return [
+        BlockConfig(**conf) if isinstance(conf, dict) else conf
+        for conf in block_configs
+    ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py` around lines 260 - 275, The function maybe_cast_block_configs currently inspects only block_configs[0]; change it to validate homogeneity across the list: iterate over block_configs and if every element is a dict, return [BlockConfig(**conf) for conf in block_configs]; if every element is a BlockConfig, return the list as-is; if the list contains mixed types, raise a TypeError (or ValueError) with a clear message mentioning maybe_cast_block_configs and BlockConfig so callers know the input must be all dicts or all BlockConfig instances; keep None/empty behavior unchanged.

modelopt/torch/puzzletron/pruning/pruning_utils.py (1)
186-186: Consider thread-safety for global cache.

`ACTIVATIONS_LOG` is a global mutable dict that could cause race conditions in multi-process distributed training scenarios. Consider using a thread-safe approach or documenting the single-process assumption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` at line 186, ACTIVATIONS_LOG is a global mutable dict that can race in multi-thread/process training; make it thread/process-safe by replacing it with a protected/shared structure (e.g., wrap accesses with a module-level threading.RLock or use a multiprocessing.Manager().dict() if you need cross-process visibility) and update every read/write to acquire the lock (or access the Manager dict) where ACTIVATIONS_LOG is used, or explicitly document in the module docstring that the module assumes single-process execution if you choose not to change behavior.

modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py (1)
119-124: Fail fast on unknown descriptor names in `ModelDescriptorFactory.get`.

Returning the raw string for unknown descriptors delays errors to downstream code paths. Raising here gives clearer diagnostics at the source.

Proposed fix

```diff
 @classmethod
 def get(cls, value: str | ModelDescriptor):
     """Get a registered model descriptor by name or return the descriptor if already resolved."""
     if isinstance(value, str):
-        if value in cls.CLASS_MAPPING:
-            return cls.CLASS_MAPPING[value]
+        if value in cls.CLASS_MAPPING:
+            return cls.CLASS_MAPPING[value]
+        raise KeyError(
+            f"Unknown model descriptor '{value}'. "
+            f"Known descriptors: {sorted(cls.CLASS_MAPPING.keys())}"
+        )
     return value
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py` around lines 119 - 124, The get method in ModelDescriptorFactory currently returns the raw string when an unknown descriptor name is passed, which delays errors; modify ModelDescriptorFactory.get to check if value is a str and, if it's not present in ModelDescriptorFactory.CLASS_MAPPING, raise a clear exception (e.g., KeyError or ValueError) that includes the unknown name and a brief hint about valid keys, otherwise return the mapped descriptor; keep existing behavior of returning the passed-through ModelDescriptor when value is already resolved.
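The fail-fast registry lookup described above can be sketched as follows. The registry contents and the `register` helper are hypothetical stand-ins; only the `get` behavior mirrors the review's suggestion.

```python
class ModelDescriptorFactory:
    """Illustrative registry; CLASS_MAPPING and get() mirror the names in the review."""

    CLASS_MAPPING: dict = {}

    @classmethod
    def register(cls, name, descriptor_cls):
        cls.CLASS_MAPPING[name] = descriptor_cls

    @classmethod
    def get(cls, value):
        if isinstance(value, str):
            if value in cls.CLASS_MAPPING:
                return cls.CLASS_MAPPING[value]
            # Fail fast with the unknown name and the valid choices.
            raise KeyError(
                f"Unknown model descriptor '{value}'. "
                f"Known descriptors: {sorted(cls.CLASS_MAPPING)}"
            )
        return value  # already a resolved descriptor
```

Raising at the lookup site means a typo in a config surfaces immediately, instead of as a confusing `str`-has-no-attribute error deep in the conversion pipeline.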
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py`:
- Around line 849-856: Avoid calling next(self.moe.parameters()) without a
fallback and stop reusing self.diffs as both list and dict: first extract the
keys into a local variable (e.g., diff_keys = ["mse", "cosine"]), then get a
safe param sample via sample_param = next(self.moe.parameters(), None) and if
None raise a clear error or handle it (e.g., create tensors on cpu or choose
device explicitly); finally set self.diffs = {k:
torch.zeros(size=(self.num_local_experts,), dtype=torch.float32,
device=sample_param.device if sample_param is not None else torch.device("cpu"))
for k in diff_keys} so initialization is clear and robust.
In `@modelopt/torch/puzzletron/anymodel/README.md`:
- Around line 97-99: Update the README.md end-to-end example link which
currently points to a non-existent
tests/gpu/torch/puzzletron/test_compress_model.py (or test_compress.py); change
the reference to the actual test file
tests/gpu/torch/puzzletron/test_puzzletron.py so the example link and
description in README.md correctly point to test_puzzletron.py.
In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py`:
- Around line 122-132: The code in the fused-experts branch (guarded by
layer_descriptor.is_fused_experts) assumes fused_keys lists are non-empty and
directly indexes fused_keys[0], which can raise IndexError; update the logic in
the block that populates orig_experts_weights and new_experts_weights to
validate fused_keys before using fused_keys[0] (e.g., skip or raise a clear
error when fused_keys is empty), referencing the variables
orig_experts_keys/new_experts_keys, parent_state_dict/new_state_dict and the
dicts orig_experts_weights/new_experts_weights so the fix is applied where those
are constructed.
In `@modelopt/torch/puzzletron/pruning/ffn_intermediate_pruning_mixin.py`:
- Around line 84-100: The code accesses parent_state_dict[mlp_key] without
verifying the key exists, risking a KeyError; update the loop so before calling
_init_mlp_module you retrieve parent_val = parent_state_dict.get(mlp_key) (or
handle missing key explicitly) and pass parent_val instead of
parent_state_dict[mlp_key]; adjust _init_mlp_module invocation in this block
(the call that assigns mlp_module_weight, pruned_filters, projection_matrix) to
accept a possibly None parent value and ensure layer_out_state_dict[mlp_key] is
set only after a successful init.
- Around line 29-32: The imported _init_mlp_module is coming from
modelopt.torch.puzzletron.tools.bypassed_training.child_init but the code calls
it using the pruning_utils signature; update the import to pull _init_mlp_module
from modelopt.torch.puzzletron.pruning.pruning_utils (or the module where
pruning_utils lives) instead of child_init, keeping MlpInitMode import as needed
so the call in this file (which passes mlp_prefix, expanded_dim, layer_idx,
new_item, new_config, orig_item, original_config, mlp_init_config, etc.) matches
the function signature in pruning_utils; simply change the from ... import line
to import _init_mlp_module from pruning_utils so no other call-site changes are
required.
In `@modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py`:
- Line 75: The code in kv_heads_pruning_mixin.py directly reads
new_config.head_dim into head_size, which can raise if that attribute is absent;
update the assignment in the function/method where head_size is set (referencing
new_config and head_size in kv_heads_pruning_mixin) to use a fallback via
getattr(new_config, "head_dim", new_config.hidden_size //
new_config.num_attention_heads) so head_size is computed safely when head_dim is
not defined.
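The fallback described in this prompt can be sketched as a small helper; the config attribute names follow the Hugging Face convention the review references:

```python
def resolve_head_size(config) -> int:
    """Prefer an explicit head_dim; derive it when the config omits the attribute."""
    return getattr(
        config, "head_dim", config.hidden_size // config.num_attention_heads
    )
```

Configs that define `head_dim` (e.g., models with non-standard head sizes) use it directly; older configs that only define `hidden_size` and `num_attention_heads` get the derived value instead of an `AttributeError`.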
- Around line 71-73: The unpacking of q_name, k_name, v_name, o_name assumes
self.layer_descriptor.qkvo_weight_names has exactly four entries; validate the
length before unpacking (e.g., check
len(self.layer_descriptor.qkvo_weight_names) == 4) and raise a clear error or
fall back to a safe behavior; alternatively implement defensive unpacking by
iterating over qkvo_weight_names to build names or using default/fallback
values. Locate the code using self.layer_descriptor.qkvo_weight_names (in
KVHeadsLayerDescriptor and the KVHeadsPruningMixin where q_name, k_name, v_name,
o_name are assigned) and add the length check or defensive logic to prevent
ValueError on unexpected lengths.
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Around line 195-201: The torch.load call inside the ACTIVATIONS_LOG.update
comprehension (iterating Path(activations_log_dir).glob("rank_*.pth")) must
either pass weights_only=True or include an inline comment justifying why
loading with executable objects is safe; update the torch.load(...) invocation
used to read module logs (the expression for module_name, module_log in
torch.load(p).items()) to torch.load(p, weights_only=True) or add a concise
comment immediately after the call explaining that files are internally
generated and never user-supplied, referencing activations_log_dir and the
ACTIVATIONS_LOG update for context.
- Around line 369-372: The code accesses a non-existent attribute
attention.n_heads_in_group (used to compute n_heads_in_group,
orig_n_heads_in_group, num_kv_heads, orig_num_kv_heads) which will raise
AttributeError; update pruning_utils.py to defensively read n_heads_in_group by
first checking hasattr or getattr on
new_config.block_configs[layer_idx].attention and
original_config.block_configs[layer_idx].attention, and if missing derive the
value from the existing num_key_value_heads field (or via the mapping used in
configuration_decilm.py) before computing num_kv_heads/orig_num_kv_heads so the
logic works when AttentionConfig only provides num_key_value_heads. Ensure you
reference the attention config objects used in pruning_utils.py
(new_config.block_configs[layer_idx].attention and
original_config.block_configs[layer_idx].attention) when implementing this
fallback.
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 184-193: The code writes the safetensors index (index_path /
SAFE_WEIGHTS_INDEX_NAME via _write_file_process_safe) before removing tied
output-embedding entries, and it checks the hardcoded "lm_head.weight" instead
of using the descriptor; move the tied-embedding handling block to run before
building/writing the index_json and index_path, and replace the hardcoded string
with the descriptor-derived name computed as lm_head_weight_name =
f"{descriptor.output_embedding_name()}.weight"; then filter both state_dict and
weight_map to drop that lm_head_weight_name so the persisted index and
weight_map are consistent with the saved tensors.
- Around line 130-132: The call to AutoConfig.from_pretrained hardcodes
trust_remote_code=True; update the surrounding function signature to accept a
new parameter trust_remote_code: bool = False (expose it to callers) and pass
that parameter into AutoConfig.from_pretrained instead of the literal True,
preserving existing kwargs (checkpoint_dir, model_config_overrides,
return_unused_kwargs=True) so callers can opt in to remote code only when
needed; ensure any callers of the function are updated or that the parameter is
backwards-compatible by using the default False.
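The parameter-forwarding shape this prompt asks for can be sketched as below. The `loader` argument is an injected stand-in for transformers' `AutoConfig.from_pretrained` (so the sketch stays self-contained); the wrapper name is hypothetical.

```python
def load_model_config(checkpoint_dir, loader, *, trust_remote_code=False, **overrides):
    """Forward an opt-in flag instead of hardcoding trust_remote_code=True.

    `loader` stands in for transformers' AutoConfig.from_pretrained; only the
    parameter-forwarding shape is shown here.
    """
    return loader(
        checkpoint_dir,
        trust_remote_code=trust_remote_code,
        return_unused_kwargs=True,
        **overrides,
    )
```

Defaulting to `False` keeps the change backwards-compatible for library callers while making remote-code execution an explicit, auditable opt-in.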
In `@tests/gpu/torch/puzzletron/test_puzzletron.py`:
- Around line 122-182: The test disabled the core Puzzletron pipeline
(puzzletron.puzzletron invocation) and all downstream assertions, so restore the
call to puzzletron.puzzletron(str(hydra_config_dir), hydra_config_subdir,
str(puzzle_dir), str(dataset_path)) and re-enable the assertions block that
checks both MoE and non-MoE paths; ensure the branches referencing
has_moe_layers and the helper validators (_assert_lm_loss,
_assert_score_pruning_activations, _assert_mip_solutions) are active again, plus
the file/directory existence checks (replacement_library.json,
subblock_stats.json, solution_0.json and the mip solution directory checks) so
the test verifies pruning/MIP/realization behavior end-to-end.
- Around line 248-269: The test currently indexes EXPECTED_PRUNING_VALUES with
hf_config_name which raises KeyError before the None branch can run; change the
lookup to use EXPECTED_PRUNING_VALUES.get(hf_config_name) (or
.get(hf_config_name, None)) when assigning expected so missing keys return None
and the existing fallback branch executes; update the reference where expected
is set (the assignment using EXPECTED_PRUNING_VALUES and hf_config_name) and
leave the subsequent logic that checks if expected is not None unchanged.
---
Outside diff comments:
In `@modelopt/torch/puzzletron/pruning/pruning_ckpts.py`:
- Around line 319-337: The code currently assumes
resolve_pruning_mixin(cfg.pruning.pruning_mixin, cfg.descriptor) returns a
single object (pruning_mixin) but may return a list; update the dispatch to
handle list-valued mixins by detecting if pruning_mixin is a list/tuple and
iterating over its elements, calling the appropriate launcher for each element
(use the existing launch_ffn_intermediates_prune_ckpt,
launch_attn_groups_prune_ckpt, and launch_experts_prune_ckpt), otherwise fall
back to the single-instance isinstance checks; also validate element types and
raise a clear NotImplementedError for unsupported mixin types so configs that
return lists are processed correctly.
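The list-or-single dispatch described above can be sketched generically. The `launchers` mapping is an illustrative stand-in for the three `launch_*` functions named in the prompt:

```python
def dispatch_pruning_mixins(mixins, launchers):
    """Launch one or many pruning mixins via a class -> launcher mapping.

    `launchers` is a hypothetical stand-in for the launch_ffn/attn/experts functions.
    """
    # Normalize: configs may resolve to a single mixin or a list of them.
    items = list(mixins) if isinstance(mixins, (list, tuple)) else [mixins]
    for mixin in items:
        for mixin_cls, launch in launchers.items():
            if isinstance(mixin, mixin_cls):
                launch(mixin)
                break
        else:
            raise NotImplementedError(
                f"Unsupported pruning mixin type: {type(mixin).__name__}"
            )
```

Normalizing to a list up front keeps the per-element `isinstance` checks in one place and gives a clear error for unsupported types.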
- Around line 309-310: The symlink creation using os.symlink(output_dir,
os.path.join(cfg.puzzle_dir, "ckpts", dirname)) can fail if cfg.puzzle_dir/ckpts
does not exist; before calling os.symlink in pruning_ckpts.py ensure the target
directory exists by creating cfg.puzzle_dir/ckpts (e.g., call
os.makedirs(os.path.join(cfg.puzzle_dir, "ckpts"), exist_ok=True) or equivalent)
and then create the symlink, keeping the same variables output_dir and dirname.
- Around line 179-203: The code currently serializes the model overrides into
model_config_overrides_json with json.dumps(...) and then passes that JSON
string into init_child_from_parent via the model_config_overrides_dict
parameter; change model_config_overrides_json to be a Python dict (e.g., name it
model_config_overrides or keep the variable but remove json.dumps) that contains
{"hidden_size": hidden_size, "ffn": [ {"intermediate_size": intermediate_size}
for intermediate_size in intermediate_sizes ]} and pass that dict to
init_child_from_parent's model_config_overrides_dict so the override argument is
a dict (not a JSON string); keep the rest (dirname, output_dir,
mlp_init_config_yaml, and the call to init_child_from_parent) unchanged.
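The dict-not-JSON-string shape requested above can be sketched as a small builder (the helper name is hypothetical; the keys match the prompt's description):

```python
def build_model_config_overrides(hidden_size, intermediate_sizes):
    """Return the overrides as a plain dict rather than a json.dumps string."""
    return {
        "hidden_size": hidden_size,
        "ffn": [{"intermediate_size": size} for size in intermediate_sizes],
    }
```

Passing a real dict lets the receiving function merge or validate fields directly, instead of every consumer having to guess whether it must `json.loads` the argument first.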
---
Nitpick comments:
In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py`:
- Line 15: Remove the blanket module-level "# mypy: ignore-errors" at the top of
base_hooks.py and either (a) delete it entirely and fix the reported type issues
in the new MoE expert hooks, or (b) replace it with targeted per-line/type
ignores (e.g., "# type: ignore[xxx]") only on the specific functions/classes
that legitimately need suppression (the new MoE expert hook implementations).
Run mypy to identify the exact errors, apply minimal scoped "# type:
ignore[...]" comments next to the offending symbols (or adjust
signatures/annotations in the MoE hook classes/functions) so the rest of the
file retains full type checking.
- Around line 1144-1166: The current
Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts returns
dummy zeros and unmodified hidden states; replace this placeholder with code
that mirrors the Qwen3VLMoeSparseMoe forward pass to extract actual router
logits and per-token routed expert outputs: locate the
Qwen3VLMoeSparseMoe.forward implementation and reproduce its routing steps
(compute router logits if router_logits arg is None, apply top-k gating/softmax
or routing mask, gather the expert outputs into routed_experts with shape
(batch*seq, hidden_dim) and return router_logits tensor shaped (batch*seq,
self.num_local_experts) and routed_experts on the same device), ensure
batch_size calculation matches that forward, respect self.num_local_experts and
devices, and add unit tests or assertions validating shapes and non-zero router
logits for routed cases in the
Qwen3VLRemoveExpertsIndependentHook.get_router_logits_and_routed_experts method.
- Around line 1031-1049: The hook currently appends an unbounded tensor to
self.router_argsort on every forward in __call__, causing OOM for long runs;
change this by adding a configurable cap (e.g., max_entries or max_tokens) and
implement eviction/aggregation behavior when the cap is reached (e.g., drop
oldest entries, aggregate into a single summary tensor, or flush to
disk/memory-map), expose the cap via activation_hooks_kwargs or the class init,
and update __call__ to respect the cap when pushing router_argsort (ensure
conversions to int16 remain but add a runtime check that num_experts ≤ 32767 and
fall back to a larger dtype or error if exceeded); reference
self.router_argsort, __call__, and the local router_argsort variable for where
to make these changes.
- Around line 1169-1202: Summary: The placeholder
get_router_logits_and_routed_experts in class GptOssRemoveExpertsIndependentHook
returns zeros and no-op expert outputs; it must be implemented to mirror
GptOssSparseMoeBlock.forward() to produce meaningful router logits and routed
expert outputs. Fix: update
GptOssRemoveExpertsIndependentHook.get_router_logits_and_routed_experts to call
or replicate the routing logic from GptOssSparseMoeBlock.forward() — accept an
optional router_logits input or compute router logits using the same gating
layer(s) used in GptOssSparseMoeBlock, return router_logits shaped (num_tokens,
num_local_experts) and routed_experts as the actual per-token expert outputs
(flatten/reshape to match forward’s routed expert representation), preserve
device/dtype and batch/token ordering, and handle cases where hidden_states is
(batch, seq_len, hidden_dim) or (num_tokens, hidden_dim); reference
GptOssSparseMoeBlock.forward(), the get_router_logits_and_routed_experts method,
and self.num_local_experts to locate where to implement the logic.
In
`@modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py`:
- Around line 119-124: The get method in ModelDescriptorFactory currently
returns the raw string when an unknown descriptor name is passed, which delays
errors; modify ModelDescriptorFactory.get to check if value is a str and, if
it's not present in ModelDescriptorFactory.CLASS_MAPPING, raise a clear
exception (e.g., KeyError or ValueError) that includes the unknown name and a
brief hint about valid keys, otherwise return the mapped descriptor; keep
existing behavior of returning the passed-through ModelDescriptor when value is
already resolved.
In `@modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py`:
- Around line 260-275: The function maybe_cast_block_configs currently inspects
only block_configs[0]; change it to validate homogeneity across the list:
iterate over block_configs and if every element is a dict, return
[BlockConfig(**conf) for conf in block_configs]; if every element is a
BlockConfig, return the list as-is; if the list contains mixed types, raise a
TypeError (or ValueError) with a clear message mentioning
maybe_cast_block_configs and BlockConfig so callers know the input must be all
dicts or all BlockConfig instances; keep None/empty behavior unchanged.
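The homogeneity validation described above can be sketched as follows; `make_config` is a stand-in constructor, since `BlockConfig` itself is not reproduced here:

```python
def maybe_cast_block_configs(block_configs, make_config=dict):
    """Cast all-dict lists to config objects; reject mixed lists.

    `make_config` stands in for the real BlockConfig constructor (assumption).
    """
    if not block_configs:
        return block_configs  # preserve None/empty behavior
    if all(isinstance(conf, dict) for conf in block_configs):
        return [make_config(**conf) for conf in block_configs]
    if any(isinstance(conf, dict) for conf in block_configs):
        raise TypeError(
            "maybe_cast_block_configs: list mixes dicts and config objects; "
            "pass all dicts or all BlockConfig instances"
        )
    return block_configs
```

The `all`/`any` pair distinguishes the three cases (all dicts, all objects, mixed) in two passes, instead of trusting the first element to speak for the whole list.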
In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py`:
- Around line 115-120: The loop mutates the passed-in dict parameter `keys` by
popping entries (see loops referencing `parent_router_keys` and
`orig_experts_keys`); either update the method/class docstring to explicitly
state that `keys` is modified in-place, or avoid the side effect by operating on
a shallow copy (e.g., work with a local copy of `keys` before the two pop loops)
so callers' dicts are not mutated unexpectedly—reference the method in
expert_removal_pruning_mixin (where `keys`, `parent_router_keys`, and
`orig_experts_keys` are used) when making the change.
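The shallow-copy alternative described above can be sketched as follows; the helper name and the `.router.`/`.experts.` key patterns are illustrative assumptions, not the project's actual naming:

```python
def split_expert_keys(keys: dict) -> tuple[dict, dict]:
    """Partition weight keys into router/expert groups without mutating `keys`."""
    remaining = dict(keys)  # shallow copy: the caller's dict stays intact
    router_keys = {k: remaining.pop(k) for k in list(remaining) if ".router." in k}
    expert_keys = {k: remaining.pop(k) for k in list(remaining) if ".experts." in k}
    return router_keys, expert_keys
```

Popping from the copy keeps the convenient "consume as you classify" style while removing the surprise side effect on the caller's dict.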
In `@modelopt/torch/puzzletron/pruning/pruning_mixin.py`:
- Around line 52-73: The commented-out abstract method prune_single_layer in
class PruningMixIn is ambiguous and should be resolved: either uncomment and
declare it as an abstract method with the given signature (def
prune_single_layer(self, layer_idx: int, parent_state_dict: dict,
new_state_dict: dict, original_config: PretrainedConfig, new_config:
PretrainedConfig, **kwargs) -> None: raise NotImplementedError) so implementers
must provide pruning logic, or remove the entire commented block to avoid
suggesting an interface that doesn't exist; update the class docstring or
supported_hooks() docstring to document the chosen interface change and ensure
references in callers (e.g., any code expecting PruningMixIn.prune_single_layer)
are adjusted accordingly.
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Line 186: ACTIVATIONS_LOG is a global mutable dict that can race in
multi-thread/process training; make it thread/process-safe by replacing it with
a protected/shared structure (e.g., wrap accesses with a module-level
threading.RLock or use a multiprocessing.Manager().dict() if you need
cross-process visibility) and update every read/write to acquire the lock (or
access the Manager dict) where ACTIVATIONS_LOG is used, or explicitly document
in the module docstring that the module assumes single-process execution if you
choose not to change behavior.
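A minimal lock-protected wrapper along the lines suggested above could look like this. Note an `RLock` only covers threads within one process; cross-process sharing would need `multiprocessing.Manager().dict()` as the prompt notes. All names here are illustrative.

```python
import threading

_ACTIVATIONS_LOG: dict = {}
_ACTIVATIONS_LOG_LOCK = threading.RLock()

def update_activations_log(entries: dict) -> None:
    """Merge entries under the lock so concurrent threads cannot interleave writes."""
    with _ACTIVATIONS_LOG_LOCK:
        _ACTIVATIONS_LOG.update(entries)

def activations_log_snapshot() -> dict:
    """Return a copy taken under the lock; callers may mutate the copy freely."""
    with _ACTIVATIONS_LOG_LOCK:
        return dict(_ACTIVATIONS_LOG)
```

Handing out snapshots instead of the dict itself keeps all mutation behind the lock, which is usually the simplest way to retrofit safety onto a module-level cache.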
In `@modelopt/torch/puzzletron/utils/dummy_modules.py`:
- Around line 44-50: The return type annotation of DummyBlock.forward is
incorrect: it currently declares torch.Tensor | tuple[torch.Tensor, None] but
the implementation always returns x; update the type hint on DummyBlock.forward
to return torch.Tensor (remove the tuple union) to match actual behavior and
improve static typing checks, adjusting any import/type references if needed.
- Around line 30-33: Replace the use of the private torch type in the
load_state_dict_post_hook signature by defining a small Protocol (e.g.,
IncompatibleKeys with attributes missing_keys: list[str] and unexpected_keys:
list[str]) and change the function parameter type to that Protocol; import
Protocol from typing, remove the torch.nn.modules.module._IncompatibleKeys
annotation, and keep the function name load_state_dict_post_hook and its
behavior unchanged so the hook accepts any object exposing .missing_keys and
.unexpected_keys.
In
`@tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml`:
- Around line 13-14: Add tracked issues for the TODOs so the compatibility work
for init modes isn't lost: create two issues referencing mlp_init_mode and
gqa_init_mode respectively, describe current limitation ("Truncate" and
"AverageKV" only) and request support for CopyAsIs and FromTeacher
initialization paths, link to the config file and mention symbols CopyAsIs and
FromTeacher, and assign/label them as enhancement and technical-debt so they can
be prioritized.
In `@tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py`:
- Around line 30-62: The file executes its truncation logic at import time; wrap
the top-level code (the open(in_path)... json.dump(...) and print(...) sequence)
into a main() function (or similar) and add an if __name__ == "__main__": guard
that calls main(), ensuring variables like in_path, out_path and NUM_TO_KEEP are
obtained/used inside main() so importing this module does not run the truncation
immediately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 33561e3e-db24-44f3-b8b6-dc7786856232
📒 Files selected for processing (28)
- modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py
- modelopt/torch/puzzletron/anymodel/README.md
- modelopt/torch/puzzletron/anymodel/__init__.py
- modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
- modelopt/torch/puzzletron/anymodel/models/__init__.py
- modelopt/torch/puzzletron/decilm/deci_lm_hf_code/block_config.py
- modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py
- modelopt/torch/puzzletron/pruning/ffn_intermediate_pruning_mixin.py
- modelopt/torch/puzzletron/pruning/kv_heads_pruning_mixin.py
- modelopt/torch/puzzletron/pruning/pruning_ckpts.py
- modelopt/torch/puzzletron/pruning/pruning_mixin.py
- modelopt/torch/puzzletron/pruning/pruning_utils.py
- modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
- modelopt/torch/puzzletron/utils/dummy_modules.py
- tests/_test_utils/torch/puzzletron/utils.py
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/llama_3_1_8b_instruct.yaml
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/attn_pruning.yaml
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/ffn_pruning.yaml
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/hidden_dim_pruning.yaml
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/pruning_defaults.yaml
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/validate_model_defaults.yaml
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/validate_solutions_defaults.yaml
- tests/gpu/torch/puzzletron/resources/hf_configs/llama_3_1_8b_instruct/config.json
- tests/gpu/torch/puzzletron/resources/tokenizer/special_tokens_map.json
- tests/gpu/torch/puzzletron/resources/tokenizer/tokenizer.json
- tests/gpu/torch/puzzletron/resources/tokenizer/tokenizer_config.json
- tests/gpu/torch/puzzletron/resources/tokenizer/truncate_tokenizer.py
- tests/gpu/torch/puzzletron/test_puzzletron.py
✅ Files skipped from review due to trivial changes (2)
- tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/pruning_defaults.yaml
- tests/gpu/torch/puzzletron/resources/tokenizer/special_tokens_map.json
```python
self.diffs = ["mse", "cosine"]
some_param = next(self.moe.parameters())
self.diffs = {
    k: torch.zeros(
        size=(self.num_local_experts,), dtype=torch.float32, device=some_param.device
    )
    for k in self.diffs
}
```
Potential StopIteration and confusing initialization pattern.

- `next(self.moe.parameters())` will raise `StopIteration` if the MoE module has no parameters, causing a cryptic error. Consider using `next(..., None)` with a fallback or explicit validation.
- The initialization of `self.diffs` as a list then immediately overwriting it as a dict is confusing. Extract the keys to a local variable for clarity.
Proposed fix
```diff
-        self.diffs = ["mse", "cosine"]
-        some_param = next(self.moe.parameters())
+        diff_keys = ["mse", "cosine"]
+        try:
+            some_param = next(self.moe.parameters())
+        except StopIteration:
+            raise ValueError("MoE module has no parameters - cannot determine device")
         self.diffs = {
             k: torch.zeros(
                 size=(self.num_local_experts,), dtype=torch.float32, device=some_param.device
             )
-            for k in self.diffs
+            for k in diff_keys
         }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` around lines 849 -
856, Avoid calling next(self.moe.parameters()) without a fallback and stop
reusing self.diffs as both list and dict: first extract the keys into a local
variable (e.g., diff_keys = ["mse", "cosine"]), then get a safe param sample via
sample_param = next(self.moe.parameters(), None) and if None raise a clear error
or handle it (e.g., create tensors on cpu or choose device explicitly); finally
set self.diffs = {k: torch.zeros(size=(self.num_local_experts,),
dtype=torch.float32, device=sample_param.device if sample_param is not None else
torch.device("cpu")) for k in diff_keys} so initialization is clear and robust.
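The safe-default pattern in this prompt can be sketched without torch; `module.parameters()` is assumed to yield objects exposing `.device`, mirroring `torch.nn.Module`:

```python
def first_param_device(module, default="cpu"):
    """Device of the first parameter, or `default` for parameterless modules.

    `module` is any object whose parameters() yields items with a .device
    attribute (an assumption standing in for torch.nn.Module).
    """
    sample = next(module.parameters(), None)  # two-arg next(): no StopIteration
    return default if sample is None else sample.device
```

The two-argument form of `next` turns an empty iterator into an explicit sentinel, which is much easier to handle than a `StopIteration` escaping from `__init__`.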
```markdown
## End-to-end example

See [test_compress_model.py](../../../../tests/gpu/torch/puzzletron/test_compress.py) for a complete example that runs both convert and compression steps.
```
🧩 Analysis chain

🏁 Script executed:

```shell
fd -t f "test_compress" tests/gpu/torch/puzzletron/
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48

🏁 Script executed:

```shell
find . -name "*test_compress*" -type f 2>/dev/null | head -20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48

🏁 Script executed:

```shell
ls -la tests/gpu/torch/puzzletron/ 2>/dev/null || echo "Directory not found"
ls -la tests/ 2>/dev/null | head -30
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1010

🏁 Script executed:

```shell
head -100 tests/gpu/torch/puzzletron/test_puzzletron.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 3245

🏁 Script executed:

```shell
sed -n '95,105p' modelopt/torch/puzzletron/anymodel/README.md
```

Repository: NVIDIA/Model-Optimizer
Length of output: 443
Fix incorrect test file reference.

The documentation references a non-existent file path. Neither `test_compress_model.py` nor `test_compress.py` exists in `tests/gpu/torch/puzzletron/`. The actual test file containing compression examples in that directory is `test_puzzletron.py`. Update the reference to point to the correct file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/anymodel/README.md` around lines 97 - 99, Update
the README.md end-to-end example link which currently points to a non-existent
tests/gpu/torch/puzzletron/test_compress_model.py (or test_compress.py); change
the reference to the actual test file
tests/gpu/torch/puzzletron/test_puzzletron.py so the example link and
description in README.md correctly point to test_puzzletron.py.
@danielkorzekwa PTAL. We renamed from compress to puzzletron
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
if self.layer_descriptor.is_fused_experts:
    # Fused format: unbundle single tensor [num_experts, ...] into list of per-expert tensors
    orig_experts_weights = {}
    for name, fused_keys in orig_experts_keys.items():
        fused_tensor = parent_state_dict[fused_keys[0]]  # Single fused tensor
        orig_experts_weights[name] = [fused_tensor[i] for i in range(orig_num_experts)]

    new_experts_weights = {}
    for name, fused_keys in new_experts_keys.items():
        fused_tensor = new_state_dict[fused_keys[0]]  # Single fused tensor
        new_experts_weights[name] = [fused_tensor[i] for i in range(new_num_experts)]
Add guard for empty fused_keys list.
Accessing fused_keys[0] (lines 126, 131) will raise an IndexError if fused_expert_weights is empty. Consider adding validation.
🛡️ Proposed fix
  if self.layer_descriptor.is_fused_experts:
      # Fused format: unbundle single tensor [num_experts, ...] into list of per-expert tensors
+     if not self.layer_descriptor.fused_expert_weights:
+         raise ValueError("is_fused_experts=True but fused_expert_weights is empty")
      orig_experts_weights = {}
      for name, fused_keys in orig_experts_keys.items():

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/pruning/expert_removal_pruning_mixin.py` around
lines 122 - 132, The code in the fused-experts branch (guarded by
layer_descriptor.is_fused_experts) assumes fused_keys lists are non-empty and
directly indexes fused_keys[0], which can raise IndexError; update the logic in
the block that populates orig_experts_weights and new_experts_weights to
validate fused_keys before using fused_keys[0] (e.g., skip or raise a clear
error when fused_keys is empty), referencing the variables
orig_experts_keys/new_experts_keys, parent_state_dict/new_state_dict and the
dicts orig_experts_weights/new_experts_weights so the fix is applied where those
are constructed.
for mlp_key in mlp_keys:
    expanded_dim = 1 if self.layer_descriptor.down_proj_name in mlp_key else 0
    if mlp_key in new_state_dict.keys():
        mlp_module_weight, pruned_filters, projection_matrix = _init_mlp_module(
            mlp_init_mode,
            mlp_prefix,
            expanded_dim,
            layer_idx,
            new_state_dict[mlp_key],
            new_config,
            parent_state_dict[mlp_key],
            original_config,
            mlp_init_config,
            pruned_filters,
            projection_matrix,
        )
        layer_out_state_dict[mlp_key] = mlp_module_weight
Add validation for parent_state_dict key existence.
The code checks if mlp_key exists in new_state_dict (line 86) but doesn't verify it exists in parent_state_dict before accessing it (line 94). This could cause a KeyError if the parent model has different keys.
🛡️ Proposed fix
  for mlp_key in mlp_keys:
      expanded_dim = 1 if self.layer_descriptor.down_proj_name in mlp_key else 0
-     if mlp_key in new_state_dict.keys():
+     if mlp_key in new_state_dict.keys() and mlp_key in parent_state_dict.keys():
          mlp_module_weight, pruned_filters, projection_matrix = _init_mlp_module(

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for mlp_key in mlp_keys:
    expanded_dim = 1 if self.layer_descriptor.down_proj_name in mlp_key else 0
    if mlp_key in new_state_dict.keys() and mlp_key in parent_state_dict.keys():
        mlp_module_weight, pruned_filters, projection_matrix = _init_mlp_module(
            mlp_init_mode,
            mlp_prefix,
            expanded_dim,
            layer_idx,
            new_state_dict[mlp_key],
            new_config,
            parent_state_dict[mlp_key],
            original_config,
            mlp_init_config,
            pruned_filters,
            projection_matrix,
        )
        layer_out_state_dict[mlp_key] = mlp_module_weight
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/pruning/ffn_intermediate_pruning_mixin.py` around
lines 84 - 100, The code accesses parent_state_dict[mlp_key] without verifying
the key exists, risking a KeyError; update the loop so before calling
_init_mlp_module you retrieve parent_val = parent_state_dict.get(mlp_key) (or
handle missing key explicitly) and pass parent_val instead of
parent_state_dict[mlp_key]; adjust _init_mlp_module invocation in this block
(the call that assigns mlp_module_weight, pruned_filters, projection_matrix) to
accept a possibly None parent value and ensure layer_out_state_dict[mlp_key] is
set only after a successful init.
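The `.get()`-based variant this prompt describes can be sketched as follows (a minimal sketch assuming dict-like state dicts; `prune_layer_mlp` and `init_fn` are hypothetical stand-ins for the real `_init_mlp_module` plumbing, not code from this PR):

```python
def prune_layer_mlp(mlp_keys, new_state_dict, parent_state_dict, init_fn):
    """Initialize pruned MLP weights, skipping keys missing from either state dict."""
    layer_out_state_dict = {}
    for mlp_key in mlp_keys:
        if mlp_key not in new_state_dict:
            continue
        # .get() avoids a KeyError when the parent model lacks this key
        parent_val = parent_state_dict.get(mlp_key)
        if parent_val is None:
            continue  # alternatively raise, if a missing parent weight is a hard error
        # assign only after a successful init, per the suggestion above
        layer_out_state_dict[mlp_key] = init_fn(new_state_dict[mlp_key], parent_val)
    return layer_out_state_dict
```

For example, `prune_layer_mlp(["a", "b"], {"a": 1, "b": 2}, {"a": 10}, lambda n, p: n + p)` returns `{"a": 11}`: key `"b"` is skipped because the parent dict lacks it.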
index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
index_json = json_dumps(index)
_write_file_process_safe(index_json, index_path)

# Phase 4: Copy HF code
phase4_start_time = time.time()
copy_deci_lm_hf_code(checkpoint_dir)
phase4_time = time.time() - phase4_start_time
mprint(f"Phase 4 - Copy HF code: {phase4_time:.2f}s")

# Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
    lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
    state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
    weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
Fix tied-embedding handling before writing the safetensors index.
Lines 184-187 write the index before Lines 190-193 remove tied output embedding entries. That can leave stale entries in the persisted index. Also, Line 190 checks hardcoded "lm_head.weight" instead of descriptor-derived output embedding name.
Proposed fix
- # Write index
- index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
- index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
- index_json = json_dumps(index)
- _write_file_process_safe(index_json, index_path)
-
  # Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
- if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
-     lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
-     state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
-     weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
+ output_weight_name = f"{descriptor.output_embedding_name()}.weight"
+ if getattr(model_config, "tie_word_embeddings", False) and output_weight_name in state_dict:
+     state_dict = {k: v for k, v in state_dict.items() if k != output_weight_name}
+     weight_map = {k: v for k, v in weight_map.items() if k != output_weight_name}
+
+ # Write index after finalizing weight_map
+ index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
+ index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
+ index_json = json_dumps(index)
+ _write_file_process_safe(index_json, index_path)

🤖 Prompt for AI Agents
+ _write_file_process_safe(index_json, index_path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 184 -
193, The code writes the safetensors index (index_path / SAFE_WEIGHTS_INDEX_NAME
via _write_file_process_safe) before removing tied output-embedding entries, and
it checks the hardcoded "lm_head.weight" instead of using the descriptor; move
the tied-embedding handling block to run before building/writing the index_json
and index_path, and replace the hardcoded string with the descriptor-derived
name computed as lm_head_weight_name =
f"{descriptor.output_embedding_name()}.weight"; then filter both state_dict and
weight_map to drop that lm_head_weight_name so the persisted index and
weight_map are consistent with the saved tensors.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
# TODO commented for the duration of merging process from dkorzekwa/any_model to feature/puzzletron
# # Compress the model using a one-click approach
# puzzletron.puzzletron(
#     str(hydra_config_dir), hydra_config_subdir, str(puzzle_dir), str(dataset_path)
# )

#
# Check assertions
#
# assertions for the score_pruning_activations step 1
_assert_score_pruning_activations(puzzle_dir)
if rank == 0:
    # assertions for the pruning_ckpts step 2
    assert (puzzle_dir / "ckpts/ffn_256_attn_no_op").exists()
    # #
    # # Check assertions
    # #
    # if rank == 0:
    #     if has_moe_layers:
    #         # assertions for the score_pruning_activations step 1 (MoE models only)
    #         rank_filepath = (
    #             f"pruning/pruning_scores/expert_removal/10samples_diverse_mini/rank_{rank}.pth"
    #         )
    #         assert (puzzle_dir / rank_filepath).is_file(), f"Expected {rank_filepath} to exist"

    # assertions for the build_library_and_stats step 4
    # # assertions for the pruning_ckpts step 2
    # assert (puzzle_dir / "ckpts/num_experts_8").exists()

    assert (puzzle_dir / "replacement_library.json").is_file()
    assert (puzzle_dir / "subblock_stats.json").is_file()
    # # assertions for the mip_and_realize_models step 6
    # # Find the MIP solution directory dynamically (e.g., stats_num_local_experts_*)
    # mip_solutions_dir = puzzle_dir / "mip/puzzle_solutions"
    # solution_dirs = [
    #     d
    #     for d in mip_solutions_dir.iterdir()
    #     if d.is_dir() and d.name.startswith("stats_num_local_experts_")
    # ]
    # assert len(solution_dirs) == 1, (
    #     f"Expected exactly one stats_num_local_experts_* directory, found: {[d.name for d in solution_dirs]}"
    # )
    # solution_dir = solution_dirs[0]

    # assertions for the scoring step 5
    solution_0_filepath = (
        puzzle_dir / "single_sequence_replacement_solutions--validation/solution_0.json"
    )
    # solution_0_ckpt_config_path = (
    #     solution_dir / "solutions--checkpoints/solution_0/config.json"
    # )
    # assert solution_0_ckpt_config_path.exists()
    # assert (solution_dir / "solutions.json").exists()

    assert solution_0_filepath.exists()
    # # Validate lm_loss
    # _assert_lm_loss(puzzle_dir, hf_config_name)
    # else:
    #     # assertions for the score_pruning_activations step 1 (FFN pruning)
    #     _assert_score_pruning_activations(puzzle_dir, hf_config_name)

    # assertions for the mip_and_realize_models step 6
    solution_0_ckpt_config_path = (
        puzzle_dir
        / "mip/puzzle_solutions/target_memory_780000MiB/solutions--checkpoints/solution_0/config.json"
    )
    # # assertions for the pruning_ckpts step 2
    # assert (puzzle_dir / "ckpts/ffn_256_attn_no_op").exists()

    # # assertions for the mip_and_realize_models step 6
    # _assert_mip_solutions(puzzle_dir, hf_config_name)

    assert solution_0_ckpt_config_path.exists()
    assert (puzzle_dir / "mip/puzzle_solutions/target_memory_780000MiB/solutions.json").exists()
    # # assertions for the build_library_and_stats step 4
    # assert (puzzle_dir / "replacement_library.json").is_file()
    # assert (puzzle_dir / "subblock_stats.json").is_file()

    # # assertions for the scoring step 5
    # solution_0_filepath = (
    #     puzzle_dir / "single_sequence_replacement_solutions--validation/solution_0.json"
    # )
    # assert solution_0_filepath.exists()
This test no longer validates Puzzletron pipeline behavior.
Lines 122-182 comment out compression and all assertions, so the test mainly verifies conversion and can pass while pruning/MIP/realization regress.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 122 - 182, The
test disabled the core Puzzletron pipeline (puzzletron.puzzletron invocation)
and all downstream assertions, so restore the call to
puzzletron.puzzletron(str(hydra_config_dir), hydra_config_subdir,
str(puzzle_dir), str(dataset_path)) and re-enable the assertions block that
checks both MoE and non-MoE paths; ensure the branches referencing
has_moe_layers and the helper validators (_assert_lm_loss,
_assert_score_pruning_activations, _assert_mip_solutions) are active again, plus
the file/directory existence checks (replacement_library.json,
subblock_stats.json, solution_0.json and the mip solution directory checks) so
the test verifies pruning/MIP/realization behavior end-to-end.
expected = EXPECTED_PRUNING_VALUES[hf_config_name]
size = dist.size()

if expected is not None:
    # In multi-GPU: layers are distributed across ranks
    # Each rank processes len(expected) // size layers
    expected_layers_per_rank = len(expected) // size
    assert len(layer_names) == expected_layers_per_rank, (
        f"Expected {expected_layers_per_rank} FFN layers on rank {rank}/{size}, got {len(layer_names)}"
    )
    # Check each layer's values
    for i, layer_name in enumerate(layer_names):
        layer_data = pruning_scores[layer_name]
        # Calculate global layer index from rank and local index
        global_idx = rank * expected_layers_per_rank + i
        assert layer_data["score"][0].item() == expected[global_idx]["score"]
        assert (
            layer_data["channels_importance_ascending"][0].item()
            == expected[global_idx]["channels"]
        )
else:
    # Print values for new models - update EXPECTED_PRUNING_VALUES with these
Use .get() for expected pruning values before None fallback.
Line 248 uses direct indexing (EXPECTED_PRUNING_VALUES[hf_config_name]), which raises KeyError before the fallback branch on Line 251 can run.
Proposed fix
- expected = EXPECTED_PRUNING_VALUES[hf_config_name]
+ expected = EXPECTED_PRUNING_VALUES.get(hf_config_name)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
expected = EXPECTED_PRUNING_VALUES.get(hf_config_name)
size = dist.size()

if expected is not None:
    # In multi-GPU: layers are distributed across ranks
    # Each rank processes len(expected) // size layers
    expected_layers_per_rank = len(expected) // size
    assert len(layer_names) == expected_layers_per_rank, (
        f"Expected {expected_layers_per_rank} FFN layers on rank {rank}/{size}, got {len(layer_names)}"
    )
    # Check each layer's values
    for i, layer_name in enumerate(layer_names):
        layer_data = pruning_scores[layer_name]
        # Calculate global layer index from rank and local index
        global_idx = rank * expected_layers_per_rank + i
        assert layer_data["score"][0].item() == expected[global_idx]["score"]
        assert (
            layer_data["channels_importance_ascending"][0].item()
            == expected[global_idx]["channels"]
        )
else:
    # Print values for new models - update EXPECTED_PRUNING_VALUES with these
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 248 - 269, The
test currently indexes EXPECTED_PRUNING_VALUES with hf_config_name which raises
KeyError before the None branch can run; change the lookup to use
EXPECTED_PRUNING_VALUES.get(hf_config_name) (or .get(hf_config_name, None)) when
assigning expected so missing keys return None and the existing fallback branch
executes; update the reference where expected is set (the assignment using
EXPECTED_PRUNING_VALUES and hf_config_name) and leave the subsequent logic that
checks if expected is not None unchanged.
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py (1)
360-364: ⚠️ Potential issue | 🟠 Major

max_layer_workers can become 0 and crash executor creation. If owned_block_indexes is empty, Line 363 computes max_layer_workers = 0; ThreadPoolExecutor(max_workers=0) raises ValueError at Line 452.

Proposed fix

  if max_layer_workers is None:
      cpu_count = os.cpu_count() or 1
      num_layers = len(owned_block_indexes)
-     max_layer_workers = min(cpu_count, num_layers)
+     max_layer_workers = max(1, min(cpu_count, num_layers))
      mprint(
          f"Auto-calculated layer workers: min({cpu_count} CPUs, {num_layers} layers) = {max_layer_workers}"
      )

Also applies to: 452-453
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines 360 - 364, The code computes max_layer_workers from cpu_count and num_layers (len(owned_block_indexes)) and can set it to 0 when owned_block_indexes is empty; change the computation to ensure at least 1 (e.g., max_layer_workers = max(1, min(cpu_count, num_layers))) and also guard the ThreadPoolExecutor creation (ThreadPoolExecutor(max_workers=...)) to never be called with 0—either use the adjusted max_layer_workers or skip/avoid creating the executor when num_layers == 0. Update all occurrences where max_layer_workers is derived or passed into ThreadPoolExecutor to use this nonzero-safe logic (references: max_layer_workers, owned_block_indexes, ThreadPoolExecutor).
🧹 Nitpick comments (2)
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py (2)
71-72: descriptor is currently unused in per-layer processing.

Line 72 adds descriptor, but it is not consumed in the fallback path and not forwarded to prune_single_layer. Consider removing it from _process_single_layer or passing it through to keep the extension API consistent.

Also applies to: 96-112
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines 71 - 72, The parameter descriptor was added but never used or forwarded in per-layer processing; either remove descriptor from _process_single_layer's signature or pass it through into prune_single_layer and the fallback path so extensions receive it. Update _process_single_layer (and its call sites) to either drop descriptor entirely or include descriptor when invoking prune_single_layer and any fallback branches (e.g., the branch handling pruning_mixin) to keep the extension API consistent with the new parameter.
900-902: owned_block_indexes is typed optional but used as required.

Line 901 defaults to None, yet Line 968 iterates it directly. Either remove optionality from the signature or initialize a default inside the function for standalone calls.

Also applies to: 968-968
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines 900 - 902, The function signature declares owned_block_indexes: Optional[list[int]] = None but the function (in child_init.py, the function that returns dict[str, torch.Tensor]) iterates owned_block_indexes at runtime; fix by either removing Optional and making the parameter required (owned_block_indexes: list[int]) or keep the optional but initialize a safe default before use (e.g., inside the function set owned_block_indexes = owned_block_indexes or [] / if owned_block_indexes is None: owned_block_indexes = []) so the later iteration over owned_block_indexes at the iteration site (where it's used) will never fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`:
- Around line 549-550: The function accepts mlp_init_config: Optional[dict] but
the score-based expert pruning branch uses it unguarded and will raise on None
when concatenating routed experts; add a guard in the pruning path (where
score-based expert pruning is executed) to validate mlp_init_config is not None
before using it, and either (a) skip/disable the score-based pruning logic when
mlp_init_config is None, or (b) raise a clear ValueError indicating
mlp_init_config is required, or (c) supply a safe default dict; update the code
around the score-based pruning block (the logic referencing mlp_init_config and
the routed experts concatenation) accordingly so no NoneType access occurs.
- Around line 130-137: The code filters attn_keys to only those present in
new_state_dict then immediately re-checks those same filtered keys, which always
makes the branch true and prevents the linear-attn fallback; fix by preserving
the original list of attention candidate keys (e.g., orig_attn_keys = attn_keys
before filtering) and change the second conditional to test against
orig_attn_keys (or explicitly check for the presence of q/k/v/o names in
new_state_dict) instead of the already-filtered attn_keys so the fallback can be
selected when q/k/v/o are absent; make the same correction for the analogous
logic around is_student_and_teacher_have_same_attention_implementation in the
second block (lines ~177-183) and update references to keys, keys_to_remove,
ignored_keys, new_state_dict, and
is_student_and_teacher_have_same_attention_implementation accordingly.
- Around line 41-55: The call site incorrectly invokes
_load_expert_scores(mlp_init_config)[layer_idx] even though _load_expert_scores
requires (mlp_init_config, layer_idx); replace that pattern with the canonical
helper _select_expert_indices(mlp_init_config, layer_idx) which is already
imported and encapsulates the correct expert selection logic (use the same
layer_idx variable in the current context where the faulty call occurs).
---
Outside diff comments:
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`:
- Around line 360-364: The code computes max_layer_workers from cpu_count and
num_layers (len(owned_block_indexes)) and can set it to 0 when
owned_block_indexes is empty; change the computation to ensure at least 1 (e.g.,
max_layer_workers = max(1, min(cpu_count, num_layers))) and also guard the
ThreadPoolExecutor creation (ThreadPoolExecutor(max_workers=...)) to never be
called with 0—either use the adjusted max_layer_workers or skip/avoid creating
the executor when num_layers == 0. Update all occurrences where
max_layer_workers is derived or passed into ThreadPoolExecutor to use this
nonzero-safe logic (references: max_layer_workers, owned_block_indexes,
ThreadPoolExecutor).
---
Nitpick comments:
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py`:
- Around line 71-72: The parameter descriptor was added but never used or
forwarded in per-layer processing; either remove descriptor from
_process_single_layer's signature or pass it through into prune_single_layer and
the fallback path so extensions receive it. Update _process_single_layer (and
its call sites) to either drop descriptor entirely or include descriptor when
invoking prune_single_layer and any fallback branches (e.g., the branch handling
pruning_mixin) to keep the extension API consistent with the new parameter.
- Around line 900-902: The function signature declares owned_block_indexes:
Optional[list[int]] = None but the function (in child_init.py, the function that
returns dict[str, torch.Tensor]) iterates owned_block_indexes at runtime; fix by
either removing Optional and making the parameter required (owned_block_indexes:
list[int]) or keep the optional but initialize a safe default before use (e.g.,
inside the function set owned_block_indexes = owned_block_indexes or [] / if
owned_block_indexes is None: owned_block_indexes = []) so the later iteration
over owned_block_indexes at the iteration site (where it's used) will never
fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a6e21cfa-5a6c-477b-be0e-01128b089f6f
📒 Files selected for processing (1)
modelopt/torch/puzzletron/tools/bypassed_training/child_init.py
  attn_keys = [key for key in attn_keys if key in new_state_dict.keys()]
  if len(attn_keys) > 0 and all(key in keys for key in attn_keys):
      for key in attn_keys:
          keys_to_remove[key] = keys[key]
      if all(key not in ignored_keys for key in attn_keys):
          is_student_and_teacher_have_same_attention_implementation = all(
-             key in new_state_dict for key in attn_keys
+             key in new_state_dict.keys() for key in attn_keys
          )
Linear-attention initialization path is effectively bypassed.
Line 130 filters attn_keys to keys present in new_state_dict, then Line 136 checks the same condition again, so that branch is always true for non-empty attn_keys. This prevents the linear_attn fallback path from being selected when q/k/v/o are absent.
Proposed fix
- attn_keys = [q_key, k_key, v_key, o_key]
- # Drop attn keys that don't exist and required to be in the new state_dict
- attn_keys = [key for key in attn_keys if key in new_state_dict.keys()]
- if len(attn_keys) > 0 and all(key in keys for key in attn_keys):
+ full_attn_keys = [q_key, k_key, v_key, o_key]
+ has_qkvo_in_student = all(key in new_state_dict for key in full_attn_keys)
+ if has_qkvo_in_student and all(key in keys for key in full_attn_keys):
- for key in attn_keys:
+ for key in full_attn_keys:
keys_to_remove[key] = keys[key]
- if all(key not in ignored_keys for key in attn_keys):
- is_student_and_teacher_have_same_attention_implementation = all(
- key in new_state_dict.keys() for key in attn_keys
- )
- if is_student_and_teacher_have_same_attention_implementation:
+ if all(key not in ignored_keys for key in full_attn_keys):
+ if has_qkvo_in_student:
...
else:
...
+ else:
+ linear_attn_key = f"{attn_prefix}.linear_attn.weight"
+ if linear_attn_key in new_state_dict:
+ ...

Also applies to: 177-183
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines
130 - 137, The code filters attn_keys to only those present in new_state_dict
then immediately re-checks those same filtered keys, which always makes the
branch true and prevents the linear-attn fallback; fix by preserving the
original list of attention candidate keys (e.g., orig_attn_keys = attn_keys
before filtering) and change the second conditional to test against
orig_attn_keys (or explicitly check for the presence of q/k/v/o names in
new_state_dict) instead of the already-filtered attn_keys so the fallback can be
selected when q/k/v/o are absent; make the same correction for the analogous
logic around is_student_and_teacher_have_same_attention_implementation in the
second block (lines ~177-183) and update references to keys, keys_to_remove,
ignored_keys, new_state_dict, and
is_student_and_teacher_have_same_attention_implementation accordingly.
mlp_init_config: Optional[dict],
hidden_size: int,
mlp_init_config is optional now, but score-based pruning path assumes it is populated.
Line 549 allows None, but Line 595 enters score-based expert pruning without validating config availability. This can fail with NoneType access when concatenating routed experts.
Proposed fix
else:
print(
f"Concatenating {num_concatenated_routed_experts} routed experts to the shared expert in layer {layer_idx}"
)
+ if mlp_init_config is None:
+ raise ValueError(
+ "mlp_init_config is required for expert-score-based concatenation."
+ )
router_key, orig_experts_keys = _generate_moe_keys(
layer_idx, parent_moe_config.num_local_experts
)Also applies to: 595-601
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/tools/bypassed_training/child_init.py` around lines
549 - 550, The function accepts mlp_init_config: Optional[dict] but the
score-based expert pruning branch uses it unguarded and will raise on None when
concatenating routed experts; add a guard in the pruning path (where score-based
expert pruning is executed) to validate mlp_init_config is not None before using
it, and either (a) skip/disable the score-based pruning logic when
mlp_init_config is None, or (b) raise a clear ValueError indicating
mlp_init_config is required, or (c) supply a safe default dict; update the code
around the score-based pruning block (the logic referencing mlp_init_config and
the routed experts concatenation) accordingly so no NoneType access occurs.
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
modelopt/torch/puzzletron/pruning/pruning_utils.py (1)
195-200: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Harden torch.load usage for activation logs.

At Line 199, torch.load(p) deserializes pickle content without weights_only=True or an inline safety justification. This is still a security risk for .pth ingestion.

🔒 Proposed fix

  ACTIVATIONS_LOG.update(
      {
          module_name: module_log
          for p in Path(activations_log_dir).glob("rank_*.pth")
-         for module_name, module_log in torch.load(p).items()
+         # Safe: rank_*.pth files are internally generated by hook dumps, not user-supplied.
+         for module_name, module_log in torch.load(p, weights_only=True).items()
      }
  )

#!/bin/bash
rg -nP 'torch\.load\(' modelopt/torch/puzzletron/pruning/pruning_utils.py -C2

As per coding guidelines: "torch.load(..., weights_only=False) with no inline comment justifying why it is safe (e.g. confirming the file is internally-generated and not user-supplied)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 195 - 200, The ACTIVATIONS_LOG population uses torch.load(p) unsafely; change the call inside the dict comprehension that iterates Path(activations_log_dir).glob("rank_*.pth") to use torch.load(p, weights_only=True) to avoid pickle deserialization, updating the code that builds module_name/module_log entries; if weights_only=True is infeasible, add an inline comment justifying why full deserialization is safe and instead validate the file is internal/trusted before calling torch.load, or replace the load with a safe parser that only reads tensors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Around line 556-559: The loop assumes orig_router_weights[name] always exists
which causes KeyError for router keys present only in new_router_weights; update
the code in the loop over new_router_weights to first try
orig_router_weights.get(name) and, if missing, attempt a parent-key fallback
(e.g., parent_name = name.rsplit('.', 1)[0] and use
orig_router_weights.get(parent_name)); if neither exists, handle gracefully
(skip, use an empty/default list, or raise a clear error). Ensure you reference
and use the same variables: new_router_weights, orig_router_weights,
result_router_weights, name, and tensor_to_slice so the selection expression
remains result_router_weights[name] = [tensor_to_slice[selected_experts] for
tensor_to_slice in chosen_orig_list].
- Around line 599-603: selected_experts currently treats NaN scores as
float("inf") which makes NaNs rank as best when
mlp_init_config.get("higher_is_better", True) is True; fix by computing
higher_is_better = mlp_init_config.get("higher_is_better", True) before the sort
and change the key so NaNs map to (-inf if higher_is_better else +inf) instead
of always +inf; update the lambda used in the sorted call that references
expert_scores, orig_num_experts, and new_num_experts to return math.nan-safe
sentinel values based on higher_is_better so NaNs are treated as worst rather
than best.
- Around line 189-201: _cache_activations_log currently never refreshes
ACTIVATIONS_LOG after the first call, causing stale activations to be reused;
update the function to detect a new activations_log_dir and reload by clearing
or replacing ACTIVATIONS_LOG (and optionally store the last-loaded dir in a
module-level variable) before repopulating it via the
Path(...).glob("rank_*.pth") + torch.load loop; reference the function name
_cache_activations_log and the ACTIVATIONS_LOG global and ensure you either
clear ACTIVATIONS_LOG.update(...) or reassign ACTIVATIONS_LOG so subsequent
calls with a different activations_log_dir actually reload the files.
---
Duplicate comments:
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py`:
- Around line 195-200: The ACTIVATIONS_LOG population uses torch.load(p)
unsafely; change the call inside the dict comprehension that iterates
Path(activations_log_dir).glob("rank_*.pth") to use torch.load(p,
weights_only=True) to avoid pickle deserialization, updating the code that
builds module_name/module_log entries; if weights_only=True is infeasible, add
an inline comment justifying why full deserialization is safe and instead
validate the file is internal/trusted before calling torch.load, or replace the
load with a safe parser that only reads tensors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 08e6808f-b3bc-4046-8904-f7a5deaa4014
📒 Files selected for processing (1)
modelopt/torch/puzzletron/pruning/pruning_utils.py
def _cache_activations_log(mlp_init_config: dict[str, Any]) -> None:
    if len(ACTIVATIONS_LOG) == 0:
        assert "activations_log_dir" in mlp_init_config
        activations_log_dir = mlp_init_config["activations_log_dir"]
        print(f"Loading activations_log from {activations_log_dir}")
        # Only load rank_*.pth files to avoid loading hook_states_*.pth checkpoint files
        ACTIVATIONS_LOG.update(
            {
                module_name: module_log
                for p in Path(activations_log_dir).glob("rank_*.pth")
                for module_name, module_log in torch.load(p).items()
            }
        )
Cache invalidation bug: activation logs never refresh for a new directory.
After first load, _cache_activations_log ignores later activations_log_dir changes, so pruning can silently use stale scores.
♻️ Proposed fix
ACTIVATIONS_LOG = dict()
+ACTIVATIONS_LOG_DIR: Optional[str] = None
def _cache_activations_log(mlp_init_config: dict[str, Any]) -> None:
- if len(ACTIVATIONS_LOG) == 0:
- assert "activations_log_dir" in mlp_init_config
- activations_log_dir = mlp_init_config["activations_log_dir"]
+ global ACTIVATIONS_LOG_DIR
+ assert "activations_log_dir" in mlp_init_config
+ activations_log_dir = mlp_init_config["activations_log_dir"]
+ if len(ACTIVATIONS_LOG) == 0 or ACTIVATIONS_LOG_DIR != activations_log_dir:
+ ACTIVATIONS_LOG.clear()
+ ACTIVATIONS_LOG_DIR = activations_log_dir
print(f"Loading activations_log from {activations_log_dir}")
# Only load rank_*.pth files to avoid loading hook_states_*.pth checkpoint files
ACTIVATIONS_LOG.update(
{
module_name: module_log
for p in Path(activations_log_dir).glob("rank_*.pth")
for module_name, module_log in torch.load(p).items()
}
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 189 - 201,
_cache_activations_log currently never refreshes ACTIVATIONS_LOG after the first
call, causing stale activations to be reused; update the function to detect a
new activations_log_dir and reload by clearing or replacing ACTIVATIONS_LOG (and
optionally store the last-loaded dir in a module-level variable) before
repopulating it via the Path(...).glob("rank_*.pth") + torch.load loop;
reference the function name _cache_activations_log and the ACTIVATIONS_LOG
global and ensure you either clear ACTIVATIONS_LOG.update(...) or reassign
ACTIVATIONS_LOG so subsequent calls with a different activations_log_dir
actually reload the files.
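To make the cache-invalidation fix concrete, here is a minimal, self-contained sketch of the pattern the review suggests. The names (`load_logs`, `_CACHE`, a plain-text loader) are hypothetical stand-ins; the real code loads tensors with `torch.load` from `rank_*.pth` files.

```python
# Hypothetical sketch: refresh a module-level cache when the source dir changes.
from pathlib import Path

_CACHE: dict = {}
_CACHE_DIR = None  # last directory the cache was loaded from

def load_logs(log_dir: str) -> dict:
    """Reload the module-level cache whenever log_dir changes."""
    global _CACHE_DIR
    if not _CACHE or _CACHE_DIR != log_dir:
        _CACHE.clear()  # drop stale entries from the previous directory
        _CACHE_DIR = log_dir
        for p in sorted(Path(log_dir).glob("rank_*.pth")):
            _CACHE[p.stem] = p.read_text()  # stand-in for torch.load(p)
    return _CACHE
```

Tracking the last-loaded directory in a module-level variable keeps the fast path (same directory, cache populated) free of file-system access.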
for name, new_list in new_router_weights.items():
    result_router_weights[name] = [
        tensor_to_slice[selected_experts] for tensor_to_slice in orig_router_weights[name]
    ]
Router fallback is not implemented and can crash with missing parent key.
At Line 558, orig_router_weights[name] is always used, despite the comment saying child fallback is supported. If a router key exists only in new_router_weights, this raises KeyError.
🐛 Proposed fix
result_router_weights: dict[str, list[torch.Tensor]] = {}
for name, new_list in new_router_weights.items():
- result_router_weights[name] = [
- tensor_to_slice[selected_experts] for tensor_to_slice in orig_router_weights[name]
- ]
+ if name in orig_router_weights:
+ result_router_weights[name] = [
+ tensor_to_slice[selected_experts] for tensor_to_slice in orig_router_weights[name]
+ ]
+ else:
+ # Child-only tensors (e.g., newly introduced bias) keep child initialization.
+        result_router_weights[name] = new_list
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 556 - 559,
The loop assumes orig_router_weights[name] always exists which causes KeyError
for router keys present only in new_router_weights; update the code in the loop
over new_router_weights to first try orig_router_weights.get(name) and, if
missing, attempt a parent-key fallback (e.g., parent_name = name.rsplit('.',
1)[0] and use orig_router_weights.get(parent_name)); if neither exists, handle
gracefully (skip, use an empty/default list, or raise a clear error). Ensure you
reference and use the same variables: new_router_weights, orig_router_weights,
result_router_weights, name, and tensor_to_slice so the selection expression
remains result_router_weights[name] = [tensor_to_slice[selected_experts] for
tensor_to_slice in chosen_orig_list].
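As a pure-Python illustration of the fallback chain the prompt describes (exact key, then parent key, then child initialization): lists stand in for tensors here, so the fancy indexing `tensor[selected_experts]` becomes an explicit comprehension. All names are hypothetical stand-ins for the real router-weight structures.

```python
# Hypothetical sketch: slice parent router weights with a parent-key fallback.
def slice_router_weights(new_router_weights, orig_router_weights, selected_experts):
    result = {}
    for name, new_list in new_router_weights.items():
        chosen_orig_list = orig_router_weights.get(name)
        if chosen_orig_list is None:
            parent_name = name.rsplit(".", 1)[0]  # e.g. "router.bias" -> "router"
            chosen_orig_list = orig_router_weights.get(parent_name)
        if chosen_orig_list is None:
            result[name] = new_list  # child-only tensor: keep child initialization
        else:
            # Real code would use chosen[selected_experts] tensor indexing.
            result[name] = [
                [t[i] for i in selected_experts] for t in chosen_orig_list
            ]
    return result
```

The key point is that a lookup miss degrades to the child's own initialization instead of raising `KeyError`.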
selected_experts = sorted(
    range(orig_num_experts),
    key=lambda i: expert_scores[i] if not math.isnan(expert_scores[i]) else float("inf"),
    reverse=mlp_init_config.get("higher_is_better", True),
)[:new_num_experts]
NaN ranking currently selects NaNs as “best” experts by default.
At Line 601, NaN is mapped to inf; with higher_is_better=True (Line 602), reverse sort puts NaNs first.
✅ Proposed fix
- selected_experts = sorted(
+ higher_is_better = mlp_init_config.get("higher_is_better", True)
+ nan_sentinel = float("-inf") if higher_is_better else float("inf")
+ selected_experts = sorted(
range(orig_num_experts),
- key=lambda i: expert_scores[i] if not math.isnan(expert_scores[i]) else float("inf"),
- reverse=mlp_init_config.get("higher_is_better", True),
+ key=lambda i: expert_scores[i] if not math.isnan(expert_scores[i]) else nan_sentinel,
+ reverse=higher_is_better,
)[:new_num_experts]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/pruning/pruning_utils.py` around lines 599 - 603,
selected_experts currently treats NaN scores as float("inf") which makes NaNs
rank as best when mlp_init_config.get("higher_is_better", True) is True; fix by
computing higher_is_better = mlp_init_config.get("higher_is_better", True)
before the sort and change the key so NaNs map to (-inf if higher_is_better else
+inf) instead of always +inf; update the lambda used in the sorted call that
references expert_scores, orig_num_experts, and new_num_experts to return
math.nan-safe sentinel values based on higher_is_better so NaNs are treated as
worst rather than best.
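The direction-aware sentinel can be demonstrated in isolation. This sketch mirrors the proposed fix with a hypothetical helper name (`select_top_experts`); the real code works on `expert_scores` inside the pruning path.

```python
import math

def select_top_experts(expert_scores, new_num_experts, higher_is_better=True):
    # NaNs must rank as *worst* regardless of sort direction, so the sentinel
    # flips with higher_is_better: -inf for a descending sort, +inf for ascending.
    nan_sentinel = float("-inf") if higher_is_better else float("inf")
    return sorted(
        range(len(expert_scores)),
        key=lambda i: expert_scores[i]
        if not math.isnan(expert_scores[i])
        else nan_sentinel,
        reverse=higher_is_better,
    )[:new_num_experts]
```

With a fixed `float("inf")` sentinel, a NaN-scored expert would be selected first whenever `higher_is_better=True`; the flipped sentinel pushes it to the end in both directions.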
@danielkorzekwa this bug fix seems important. PTAL
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Actionable comments posted: 2
♻️ Duplicate comments (3)
tests/gpu/torch/puzzletron/test_puzzletron.py (2)
248-251: ⚠️ Potential issue | 🟡 Minor

Use `.get()` to preserve your fallback path.

Line 248 uses direct indexing, which can raise `KeyError` before Line 251's `None` check executes.

Proposed fix

-    expected = EXPECTED_PRUNING_VALUES[hf_config_name]
+    expected = EXPECTED_PRUNING_VALUES.get(hf_config_name)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 248 - 251, Replace the direct dictionary access EXPECTED_PRUNING_VALUES[hf_config_name] with a safe lookup using EXPECTED_PRUNING_VALUES.get(hf_config_name) so that the variable expected can be None and the subsequent if expected is not None check remains reachable; update the assignment to use .get(...) for hf_config_name when computing expected in the test_puzzletron.py block.
122-182: ⚠️ Potential issue | 🟠 Major

Re-enable the core Puzzletron flow (currently a false-green test).

Line 122 through Line 182 comment out the compression call and all end-to-end assertions, so this test can pass while pruning/MIP/realization behavior regresses. Please restore this block (or explicitly `pytest.skip`/`xfail` the test until it is restored).

Proposed fix

+from modelopt.torch import puzzletron
 ...
-    # TODO commented for the duration of merging process from dkorzekwa/any_model to feature/puzzletron
-    # # Compress the model using a one-click approach
-    # puzzletron.puzzletron(
-    #     str(hydra_config_dir), hydra_config_subdir, str(puzzle_dir), str(dataset_path)
-    # )
+    puzzletron.puzzletron(
+        str(hydra_config_dir), hydra_config_subdir, str(puzzle_dir), str(dataset_path)
+    )
 ...
-    # if rank == 0:
-    #     ...
+    if rank == 0:
+        ...

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 122 - 182, Restore the commented-out end-to-end Puzzletron flow by re-enabling the call to puzzletron.puzzletron(...) and the subsequent rank==0 assertions (including checks that reference _assert_lm_loss, _assert_score_pruning_activations, _assert_mip_solutions, and file existence assertions for replacement_library.json, subblock_stats.json, and solution_0.json); if the full flow cannot be reliably executed in CI, replace the restored block with an explicit pytest.skip or pytest.xfail that documents why the E2E flow is disabled so the test suite fails loudly until the pruning/MIP/realize implementation is fixed.
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py (1)
200-210: ⚠️ Potential issue | 🟠 Major

Write the index after tied-embedding pruning, and use the descriptor-derived key consistently.

Line 204 persists `weight_map` before Lines 206-210 prune tied output embedding entries. Also, Line 207 checks the hardcoded `"lm_head.weight"` instead of the descriptor-derived output key, which can miss non-LLaMA naming.

🧩 Proposed fix

-    # Write index
-    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
-    index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
-    index_json = json_dumps(index)
-    _write_file_process_safe(index_json, index_path)
-    # Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
-    if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
-        lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
+    lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
+    if getattr(model_config, "tie_word_embeddings", False) and lm_head_weight_name in state_dict:
         state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
         weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
+
+    # Write index after finalizing weight_map
+    index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
+    index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
+    index_json = json_dumps(index)
+    _write_file_process_safe(index_json, index_path)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 200 - 210, The index is written before tied-embedding pruning and the code checks the hardcoded "lm_head.weight" key; update checkpoint_utils_hf.py so you first perform the tie_word_embeddings pruning using model_config.tie_word_embeddings and the descriptor-derived key lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight" to remove entries from state_dict and weight_map, and only then build and write the index (index = {"metadata": {"format": "pt"}, "weight_map": weight_map}, json_dumps(index), _write_file_process_safe) using SAFE_WEIGHTS_INDEX_NAME; ensure you also stop checking the literal "lm_head.weight" and use the descriptor-derived lm_head_weight_name consistently.
🧹 Nitpick comments (2)
tests/gpu/torch/puzzletron/test_puzzletron.py (2)
295-299: Fail fast for unknown `EXPECTED_LM_LOSS` entries.

Like the pruning helper, this print-only fallback lets tests pass without validating LM loss for newly added models.

Proposed fix

-    else:
-        # Print value for new models - update EXPECTED_LM_LOSS with this
-        print(f"\n=== LM_LOSS for {hf_config_name} ===")
-        print(f'"{hf_config_name}": {actual_lm_loss},')
-        print("===")
+    else:
+        pytest.fail(
+            f"Missing EXPECTED_LM_LOSS baseline for '{hf_config_name}'. "
+            f"Observed lm_loss={actual_lm_loss}. Add baseline before enabling this config."
+        )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 295 - 299, The current else branch that only prints LM loss for unknown hf_config_name lets tests pass without validating against EXPECTED_LM_LOSS; replace the print-only fallback with a hard fail: when hf_config_name is not in EXPECTED_LM_LOSS, raise an AssertionError (or use pytest.fail) that includes hf_config_name and the observed actual_lm_loss and instructs the maintainer to add an entry to EXPECTED_LM_LOSS; update the code paths around EXPECTED_LM_LOSS lookup (where hf_config_name and actual_lm_loss are used) so unknown configs do not silently pass.
269-278: Avoid a silent pass when pruning baselines are missing.

This branch only prints values and allows the test to pass, which weakens regression detection for new models. Prefer failing with a clear message so baseline updates are explicit.

Proposed fix

-    else:
-        # Print values for new models - update EXPECTED_PRUNING_VALUES with these
-        print(f"\n=== PRUNING VALUES for {hf_config_name} (num_layers={len(layer_names)}) ===")
-        print(f'"{hf_config_name}": [')
-        for layer_name in layer_names:
-            layer_data = pruning_scores[layer_name]
-            score = layer_data["score"][0].item()
-            channels = layer_data["channels_importance_ascending"][0].item()
-            print(f'    {{"score": {score}, "channels": {channels}}},')
-        print("],")
-        print("===")
+    else:
+        pytest.fail(
+            f"Missing EXPECTED_PRUNING_VALUES baseline for '{hf_config_name}'. "
+            "Add expected score/channel values before enabling this config."
+        )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/puzzletron/test_puzzletron.py` around lines 269 - 278, Replace the silent-print-only branch that prints pruning values with an explicit check against the expected baselines: verify that the current hf_config_name exists in EXPECTED_PRUNING_VALUES and that each layer's reported pruning score/channels in pruning_scores matches the stored baseline (or raise an AssertionError with a clear message if the baseline is missing or mismatched). Locate where pruning_scores and layer_names are used (the print block) and change it to assert hf_config_name in EXPECTED_PRUNING_VALUES, then iterate layer_names comparing pruning_scores[layer_name]["score"] and ["channels_importance_ascending"] to the EXPECTED_PRUNING_VALUES entry; if absent or different, raise an error instructing the reviewer to update EXPECTED_PRUNING_VALUES. Ensure the error message includes hf_config_name and which layer failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 471-477: The function save_model_config currently mutates
model_config.block_configs in place which alters the live model.config; instead,
create a deep copy of the PretrainedConfig (use copy.deepcopy) and perform the
dataclass-to-dict conversion on the copy’s block_configs, then call
save_pretrained on the copied config (reference: save_model_config,
model_config.block_configs, and model_config.save_pretrained). This preserves
the original model_config and avoids side effects for callers.
- Around line 108-117: The function force_cache_dynamic_modules currently always
loads remote dynamic modules via get_class_from_dynamic_module; update its
signature to accept a trust_remote_code: bool parameter, then add an early
return when trust_remote_code is False to avoid invoking
get_class_from_dynamic_module; update the caller in load_model_config to pass
through its trust_remote_code argument so dynamic-module caching only runs when
trust_remote_code is True.
---
Duplicate comments:
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 200-210: The index is written before tied-embedding pruning and
the code checks the hardcoded "lm_head.weight" key; update
checkpoint_utils_hf.py so you first perform the tie_word_embeddings pruning
using model_config.tie_word_embeddings and the descriptor-derived key
lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight" to remove
entries from state_dict and weight_map, and only then build and write the index
(index = {"metadata": {"format": "pt"}, "weight_map": weight_map},
json_dumps(index), _write_file_process_safe) using SAFE_WEIGHTS_INDEX_NAME;
ensure you also stop checking the literal "lm_head.weight" and use the
descriptor-derived lm_head_weight_name consistently.
In `@tests/gpu/torch/puzzletron/test_puzzletron.py`:
- Around line 248-251: Replace the direct dictionary access
EXPECTED_PRUNING_VALUES[hf_config_name] with a safe lookup using
EXPECTED_PRUNING_VALUES.get(hf_config_name) so that the variable expected can be
None and the subsequent if expected is not None check remains reachable; update
the assignment to use .get(...) for hf_config_name when computing expected in
the test_puzzletron.py block.
- Around line 122-182: Restore the commented-out end-to-end Puzzletron flow by
re-enabling the call to puzzletron.puzzletron(...) and the subsequent rank==0
assertions (including checks that reference _assert_lm_loss,
_assert_score_pruning_activations, _assert_mip_solutions, and file existence
assertions for replacement_library.json, subblock_stats.json, and
solution_0.json); if the full flow cannot be reliably executed in CI, replace
the restored block with an explicit pytest.skip or pytest.xfail that documents
why the E2E flow is disabled so the test suite fails loudly until the
pruning/MIP/realize implementation is fixed.
---
Nitpick comments:
In `@tests/gpu/torch/puzzletron/test_puzzletron.py`:
- Around line 295-299: The current else branch that only prints LM loss for
unknown hf_config_name lets tests pass without validating against
EXPECTED_LM_LOSS; replace the print-only fallback with a hard fail: when
hf_config_name is not in EXPECTED_LM_LOSS, raise an AssertionError (or use
pytest.fail) that includes hf_config_name and the observed actual_lm_loss and
instructs the maintainer to add an entry to EXPECTED_LM_LOSS; update the code
paths around EXPECTED_LM_LOSS lookup (where hf_config_name and actual_lm_loss
are used) so unknown configs do not silently pass.
- Around line 269-278: Replace the silent-print-only branch that prints pruning
values with an explicit check against the expected baselines: verify that the
current hf_config_name exists in EXPECTED_PRUNING_VALUES and that each layer's
reported pruning score/channels in pruning_scores matches the stored baseline
(or raise an AssertionError with a clear message if the baseline is missing or
mismatched). Locate where pruning_scores and layer_names are used (the print
block) and change it to assert hf_config_name in EXPECTED_PRUNING_VALUES, then
iterate layer_names comparing pruning_scores[layer_name]["score"] and
["channels_importance_ascending"] to the EXPECTED_PRUNING_VALUES entry; if
absent or different, raise an error instructing the reviewer to update
EXPECTED_PRUNING_VALUES. Ensure the error message includes hf_config_name and
which layer failed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0d963718-efe6-4ef4-9905-984c62cc8508
📒 Files selected for processing (2)
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
tests/gpu/torch/puzzletron/test_puzzletron.py
def save_model_config(model_config: PretrainedConfig, checkpoint_dir: Path | str) -> None:
    if hasattr(model_config, "block_configs"):
        model_config.block_configs = [
            dataclasses.asdict(conf) if dataclasses.is_dataclass(conf) else conf
            for conf in model_config.block_configs
        ]
    model_config.save_pretrained(checkpoint_dir)
Avoid mutating model.config while preparing it for serialization.
save_model_config rewrites model_config.block_configs in place. Since _save_checkpoint passes model.config, this mutates the live model config after save and can break callers that expect dataclass entries.
♻️ Proposed fix
+import copy
@@
def save_model_config(model_config: PretrainedConfig, checkpoint_dir: Path | str) -> None:
- if hasattr(model_config, "block_configs"):
- model_config.block_configs = [
+ config_to_save = copy.deepcopy(model_config)
+ if hasattr(config_to_save, "block_configs"):
+ config_to_save.block_configs = [
dataclasses.asdict(conf) if dataclasses.is_dataclass(conf) else conf
- for conf in model_config.block_configs
+ for conf in config_to_save.block_configs
]
- model_config.save_pretrained(checkpoint_dir)
+ config_to_save.save_pretrained(checkpoint_dir)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 471 -
477, The function save_model_config currently mutates model_config.block_configs
in place which alters the live model.config; instead, create a deep copy of the
PretrainedConfig (use copy.deepcopy) and perform the dataclass-to-dict
conversion on the copy’s block_configs, then call save_pretrained on the copied
config (reference: save_model_config, model_config.block_configs, and
model_config.save_pretrained). This preserves the original model_config and
avoids side effects for callers.
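The no-mutation version of the fix can be sketched end to end with a stand-in config object. `BlockConfig` and `config_for_saving` here are hypothetical; the real code operates on a transformers `PretrainedConfig` and calls `save_pretrained` on the copy.

```python
import copy
import dataclasses
from types import SimpleNamespace

@dataclasses.dataclass
class BlockConfig:  # hypothetical stand-in for the real block-config dataclass
    hidden_size: int

def config_for_saving(model_config):
    """Convert dataclass block configs to dicts on a deep copy, leaving the
    caller's live config object untouched."""
    config_to_save = copy.deepcopy(model_config)
    if hasattr(config_to_save, "block_configs"):
        config_to_save.block_configs = [
            dataclasses.asdict(c) if dataclasses.is_dataclass(c) else c
            for c in config_to_save.block_configs
        ]
    return config_to_save
```

`copy.deepcopy` matters here: a shallow copy would share the `block_configs` list, and rebinding is safe only because the comprehension builds a new list on the copy.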
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged) Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/gpu/torch/puzzletron/nas/plugins/test_nas_search.py (1)
30-31: Avoid an unconditional long-lived skip for this integration test.

Line 30 disables coverage entirely; this can hide regressions until re-enabling is forgotten. Prefer `skipif` with a concrete gate (or `xfail`) plus a tracking reference.

Proposed tweak

-@pytest.mark.skip(reason="Temporarily disabled")
+@pytest.mark.skipif(
+    True,  # TODO: replace with real gate (e.g., feature flag/env check)
+    reason="Temporarily disabled; re-enable after AnyModel merge (track in issue `#XXXX`)",
+)
 def test_nas_search(project_root_path: Path, tmp_path: Path):

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/puzzletron/nas/plugins/test_nas_search.py` around lines 30 - 31, The unconditional decorator `@pytest.mark.skip` on the test_nas_search function should be replaced with a conditional gate (e.g., pytest.mark.skipif) or pytest.mark.xfail so the test is only skipped when a concrete condition is met; update the decorator on test_nas_search to use a clear runtime condition (for example an environment flag like os.getenv("RUN_INTEGRATION_TESTS") or a pytest config marker) and include a concise reason string that references a tracking ticket/PR ID (e.g., "skip unless RUN_INTEGRATION_TESTS=true — see ISSUE-1234") so the skip is explicit, gated, and traceable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/llama_3_1_8b_instruct-attn-pruning.yaml`:
- Line 80: The target_memory value and inline comment disagree: keep them
consistent by updating the inline comment or the numeric literal in
llama_3_1_8b_instruct-attn-pruning.yaml so both represent the same amount; e.g.,
if the intended memory is seven hundred eighty thousand, change the comment next
to target_memory: 780_000 to "# 780_000", or if the intended value is
seventy‑eight thousand, change the numeric literal to target_memory: 78_000;
ensure underscore grouping matches the chosen value.
---
Nitpick comments:
In `@tests/gpu/torch/puzzletron/nas/plugins/test_nas_search.py`:
- Around line 30-31: The unconditional decorator `@pytest.mark.skip` on the
test_nas_search function should be replaced with a conditional gate (e.g.,
pytest.mark.skipif) or pytest.mark.xfail so the test is only skipped when a
concrete condition is met; update the decorator on test_nas_search to use a
clear runtime condition (for example an environment flag like
os.getenv("RUN_INTEGRATION_TESTS") or a pytest config marker) and include a
concise reason string that references a tracking ticket/PR ID (e.g., "skip
unless RUN_INTEGRATION_TESTS=true — see ISSUE-1234") so the skip is explicit,
gated, and traceable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4b8f4094-9fc4-4ad2-998a-624d29b70d8a
📒 Files selected for processing (4)
modelopt/torch/puzzletron/nas/plugins/puzzletron_nas_plugin.py
tests/gpu/torch/puzzletron/nas/plugins/test_nas_convert.py
tests/gpu/torch/puzzletron/nas/plugins/test_nas_search.py
tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/llama_3_1_8b_instruct-attn-pruning.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/gpu/torch/puzzletron/nas/plugins/test_nas_convert.py
```yaml
  - stats.attention_num_params

human_constraints:
  target_memory: 780_000 # 78_000
```
Resolve the target_memory value/comment mismatch.
Line 80 has 780_000 but the inline comment says 78_000. This 10x discrepancy can materially alter solution feasibility and test behavior.
Proposed change (pick one, value or comment):

```diff
- target_memory: 780_000 # 78_000
+ target_memory: 78_000 # 78_000
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
target_memory: 78_000 # 78_000
```
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
```python
from torch import nn

import modelopt.torch.utils.distributed as dist
from modelopt.torch.puzzletron.decilm.deci_lm_hf_code.block_config import BlockConfig  # noqa: TC001
```
Is the BlockConfig the only thing from deci_lm_hf_code we still use? Or are there other things also?
```
TODO: Implement get_router_logits_and_routed_experts based on GPT-OSS MoE forward pass.
This is a placeholder implementation that allows the framework to run.
```
Will this be added in follow-up PRs or still not supported?
```python
if not pretrained:
    raise ValueError("pretrained must be provided")
```
If `pretrained` is a required argument, why don't we just remove `| None` from the type annotation?
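A minimal sketch of that suggestion: with a non-optional annotation the guard only needs to reject empty values. The function and parameter names here are illustrative, not the codebase's actual signature.

```python
def load_model(pretrained: str) -> str:
    """Annotation is non-optional, so callers cannot legally pass None;
    the runtime guard still rejects empty strings."""
    if not pretrained:
        raise ValueError("pretrained must be provided")
    return pretrained
```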
```python
block_config = BlockConfig(
    attention=AttentionConfig(no_op=False, num_key_value_heads=config.num_key_value_heads),
    ffn=FFNConfig(no_op=False, intermediate_size=config.intermediate_size),
).to_dict()

block_configs = [block_config] * num_hidden_layers
return block_configs
```
Can you add a one line docstring what puzzformer module does?
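The hunk above replicates one uniform block config across all layers; a self-contained sketch with stand-in dataclasses (hypothetical, mirroring the names in the diff):

```python
from dataclasses import asdict, dataclass


# Stand-ins for the AttentionConfig/FFNConfig/BlockConfig used in the diff.
@dataclass
class AttentionConfig:
    no_op: bool
    num_key_value_heads: int


@dataclass
class FFNConfig:
    no_op: bool
    intermediate_size: int


@dataclass
class BlockConfig:
    attention: AttentionConfig
    ffn: FFNConfig

    def to_dict(self) -> dict:
        return asdict(self)  # recurses into nested dataclasses


def build_block_configs(num_key_value_heads: int, intermediate_size: int,
                        num_hidden_layers: int) -> list[dict]:
    """Build one block config and repeat it for every hidden layer."""
    block_config = BlockConfig(
        attention=AttentionConfig(no_op=False, num_key_value_heads=num_key_value_heads),
        ffn=FFNConfig(no_op=False, intermediate_size=intermediate_size),
    ).to_dict()
    return [block_config] * num_hidden_layers
```

Note that the list holds `num_hidden_layers` references to the same dict, which is fine as long as downstream code treats the configs as read-only.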
```python
index = {"metadata": {"format": "pt"}, "weight_map": weight_map}
index_path = checkpoint_dir / SAFE_WEIGHTS_INDEX_NAME
index_json = json_dumps(index)
_write_file_process_safe(index_json, index_path)

# Phase 4: Copy HF code
phase4_start_time = time.time()
copy_deci_lm_hf_code(checkpoint_dir)
phase4_time = time.time() - phase4_start_time
mprint(f"Phase 4 - Copy HF code: {phase4_time:.2f}s")

# Handle tie_word_embeddings - don't save lm_head.weight if it's tied to embed_tokens
if getattr(model_config, "tie_word_embeddings", False) and "lm_head.weight" in state_dict:
    lm_head_weight_name = f"{descriptor.output_embedding_name()}.weight"
    state_dict = {k: v for k, v in state_dict.items() if k != lm_head_weight_name}
    weight_map = {k: v for k, v in weight_map.items() if k != lm_head_weight_name}
```
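The tied-embeddings handling in the hunk above amounts to filtering one key out of the state dict and the weight map; a standalone sketch with a hypothetical helper name:

```python
def drop_tied_lm_head(state_dict: dict, weight_map: dict, tied: bool,
                      lm_head_key: str = "lm_head.weight") -> tuple[dict, dict]:
    """When word embeddings are tied, skip saving lm_head.weight since it
    duplicates the input embedding tensor."""
    if tied and lm_head_key in state_dict:
        state_dict = {k: v for k, v in state_dict.items() if k != lm_head_key}
        weight_map = {k: v for k, v in weight_map.items() if k != lm_head_key}
    return state_dict, weight_map
```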
```diff
 ) -> tuple[Path, Path, Path]:
     """
-    Setup the test model and data for the puzzletron NAS search.
+    Setup the test model and data for the compress NAS search.
```
Please revert this name change from "compress" back to "puzzletron".
```python
# Create and save the Llama model
model = LlamaForCausalLM(llama_config)

# VL models have nested configs (text_config, vision_config)
if hf_config_name == "qwen3-vl-30b-a3b-instruct":
```
What if we instead do `if hasattr(config, "text_config") and hasattr(config, "vision_config")`? That would be more robust.
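The structural check suggested above can be sketched like this (config objects mocked with `SimpleNamespace`; the helper name is illustrative):

```python
from types import SimpleNamespace


def is_vl_config(config) -> bool:
    """Detect VL-style configs structurally: they nest text/vision sub-configs,
    so no hard-coded model-name comparison is needed."""
    return hasattr(config, "text_config") and hasattr(config, "vision_config")


# Mock configs standing in for HF config objects.
vl_config = SimpleNamespace(text_config=object(), vision_config=object())
plain_config = SimpleNamespace(hidden_size=4096)
```

Unlike matching on `hf_config_name`, this keeps working when new VL checkpoints with the same nested-config shape are added.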
```yaml
layer_descriptor:
  _target_: modelopt.torch.puzzletron.anymodel.models.llama.llama_model_descriptor.LlamaFFNIntermediateLayerDescriptor
```
Future improvement: infer layer descriptor automatically based on ckpt config
Is this file for ffn_pruning? Can you rename it so that's clear? Perhaps we can move the common attn/ffn pruning components into a separate yaml instead of copying them in two places.
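The "infer layer descriptor from the checkpoint config" improvement mentioned above could look roughly like a registry keyed on `model_type`; all names below are hypothetical, not existing APIs:

```python
# Hypothetical registry mapping a checkpoint config's model_type to a
# layer-descriptor target path, instead of hard-coding it in each yaml.
LAYER_DESCRIPTORS = {
    "llama": (
        "modelopt.torch.puzzletron.anymodel.models.llama."
        "llama_model_descriptor.LlamaFFNIntermediateLayerDescriptor"
    ),
}


def infer_layer_descriptor(config: dict) -> str:
    """Look up the layer-descriptor target for a checkpoint config."""
    model_type = config.get("model_type")
    try:
        return LAYER_DESCRIPTORS[model_type]
    except KeyError:
        raise ValueError(f"No layer descriptor registered for model_type={model_type!r}")
```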