Conversation
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Multi-node GPU counting and FSDP argument logic in the speculative decoding training script were updated; a registry-based HF model patch system was added and integrated, moving Kimi-K2 special-casing into model-specific patches and adding a decoder-forward compatibility wrapper.
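As a rough sketch of what such a registry-based patch system might look like: model-specific fix-ups are registered per model type and applied by a single entry point. The registry and decorator names here are illustrative, not ModelOpt's actual internals; only `apply_model_patch` is named elsewhere in this review.

```python
# Hypothetical registry-based model patch system (names are illustrative).
_PATCH_REGISTRY = {}

def register_patch(model_type):
    """Register a patch function for one model type."""
    def decorator(fn):
        _PATCH_REGISTRY[model_type] = fn
        return fn
    return decorator

@register_patch("kimi_k2")
def patch_kimi_k2(model):
    # Model-specific special-casing lives in its own patch function.
    model.patched = True
    return model

def apply_model_patch(model):
    """Apply the registered patch for this model's type, if any."""
    patch = _PATCH_REGISTRY.get(getattr(model, "model_type", None))
    return patch(model) if patch else model

class FakeModel:
    # Minimal stand-in for an HF model object.
    model_type = "kimi_k2"
    patched = False
```

Models with no registered patch pass through unchanged, so the dispatcher can be called unconditionally at load time.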
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (3 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/speculative_decoding/main.py (2)
173-176: ⚠️ Potential issue | 🔴 Critical

`trust_remote_code=True` is hardcoded.

Per coding guidelines, `trust_remote_code=True` should not be hardcoded when loading transformers models. This flag tells Transformers to execute arbitrary Python shipped with a checkpoint, which is an RCE vector if the model source is untrusted. The flag should be exposed as a caller-configurable parameter defaulting to `False`. This same issue appears at multiple locations in this file (lines 174, 176, 185, 188, 192, 195).
Suggested approach

Add a command-line argument to control this behavior:

```diff
 @dataclass
 class ModelArguments:
     model_name_or_path: str | None = field(default="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+    trust_remote_code: bool = field(
+        default=False,
+        metadata={"help": "Whether to trust remote code when loading models/tokenizers."},
+    )
```

Then use `model_args.trust_remote_code` instead of hardcoding `True`. As per coding guidelines: "trust_remote_code=True hardcoded for transformers model or tokenizer loading" is flagged as a CRITICAL security issue.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/main.py` around lines 173 - 176, The code currently hardcodes trust_remote_code=True when calling load_vlm_or_llm_with_kwargs and transformers.AutoTokenizer.from_pretrained; add a command-line or function argument (e.g., model_args.trust_remote_code defaulting to False) and use that variable instead of True in all places in this file (references: load_vlm_or_llm_with_kwargs, transformers.AutoTokenizer.from_pretrained); update the CLI parser or function signature to expose trust_remote_code to callers and replace every hardcoded True occurrence noted in the review with model_args.trust_remote_code.
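The suggested change can be sketched end to end as follows. This is a minimal illustration of the flag flowing from the dataclass into the loader, not the actual `main.py` code; `load_tokenizer` and `fake_from_pretrained` are stand-ins invented here.

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class ModelArguments:
    # Mirrors the shape of the dataclass in main.py; exact fields there may differ.
    model_name_or_path: str | None = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    trust_remote_code: bool = field(
        default=False,
        metadata={"help": "Whether to trust remote code when loading models/tokenizers."},
    )

def load_tokenizer(model_args, from_pretrained):
    # The caller-controlled flag is passed through instead of a hardcoded True.
    return from_pretrained(
        model_args.model_name_or_path,
        trust_remote_code=model_args.trust_remote_code,
    )

# Stub standing in for transformers.AutoTokenizer.from_pretrained.
def fake_from_pretrained(name, trust_remote_code=False):
    return {"name": name, "trust_remote_code": trust_remote_code}
```

With this shape, an untrusted checkpoint is rejected by default and callers must opt in explicitly with `--trust_remote_code`.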
148-155: ⚠️ Potential issue | 🟠 Major

Potential `AttributeError` when `cp_size > 1` on older transformers versions.

If `TRANSFORMERS_VERSION < 5.0` and `cp_size > 1`, `training_args.parallelism_config` is never assigned, but line 155 attempts to access `training_args.parallelism_config.sp_backend`, which will raise an `AttributeError`.

Suggested fix

```diff
 if Version("5.0") <= TRANSFORMERS_VERSION:
     training_args.parallelism_config = ParallelismConfig(
         cp_size=training_args.cp_size, dp_shard_size=training_args.dp_shard_size
     )
-if training_args.cp_size > 1:
-    patch_ring_attention_for_ttt()
-    # Specific patch to accelerate 1.12.0. Removable after move to 1.13.0
-    training_args.parallelism_config.sp_backend = None
+    if training_args.cp_size > 1:
+        patch_ring_attention_for_ttt()
+        # Specific patch to accelerate 1.12.0. Removable after move to 1.13.0
+        training_args.parallelism_config.sp_backend = None
```

Or handle the case where `parallelism_config` is not available for older versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/main.py` around lines 148 - 155, If TRANSFORMERS_VERSION < 5.0 and training_args.cp_size > 1, training_args.parallelism_config may be unset and accessing training_args.parallelism_config.sp_backend will raise AttributeError; fix by ensuring training_args.parallelism_config is created before use (e.g., instantiate ParallelismConfig and assign to training_args.parallelism_config when cp_size>1 or guard access), then call patch_ring_attention_for_ttt() and only modify training_args.parallelism_config.sp_backend if the attribute exists; update the block around TRANSFORMERS_VERSION, ParallelismConfig, training_args.cp_size, patch_ring_attention_for_ttt, and sp_backend accordingly.
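The guarded flow can be sketched with stubs as below. This is an illustration of the control flow the fix suggests, not the actual `main.py` code; the function name and the explicit `RuntimeError` for the unsupported combination are assumptions.

```python
from types import SimpleNamespace

def configure_parallelism(training_args, transformers_version, make_config, patch_ring_attention):
    """Only touch parallelism_config when it was actually created for
    this transformers version; fail loudly otherwise."""
    if transformers_version >= (5, 0):
        training_args.parallelism_config = make_config(
            cp_size=training_args.cp_size,
            dp_shard_size=training_args.dp_shard_size,
        )
        if training_args.cp_size > 1:
            patch_ring_attention()
            # Specific patch to accelerate 1.12.0. Removable after move to 1.13.0.
            training_args.parallelism_config.sp_backend = None
    elif training_args.cp_size > 1:
        # Old transformers: parallelism_config does not exist, so cp_size > 1
        # cannot be honored; raising beats a confusing AttributeError later.
        raise RuntimeError("cp_size > 1 requires transformers >= 5.0")
```

The key point is that the `sp_backend` assignment is nested under the version check, so it can never run against an unset `parallelism_config`.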
🧹 Nitpick comments (1)
modelopt/torch/speculative/utils.py (1)
444-453: Consider removing `past_key_values` from kwargs after mapping to avoid potential conflicts.

The patch adds `past_key_value` but doesn't remove `past_key_values` from kwargs. If the original forward method has logic that checks for or uses `past_key_values`, this could cause unexpected behavior or duplicate key issues.

Suggested fix

```diff
 def patched_decoder_layer_fwd(self, *args, **kwargs):
-    kwargs["past_key_value"] = kwargs.get("past_key_values")
+    kwargs["past_key_value"] = kwargs.pop("past_key_values", None)
     return original_decoder_layer_forward(self, *args, **kwargs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/speculative/utils.py` around lines 444 - 453, The patched_forward currently copies kwargs["past_key_values"] to kwargs["past_key_value"] but leaves the original key present, which can cause duplicate/conflicting behavior; in patched_decoder_layer_fwd (which replaces kimi_k2_module.DeepseekV3DecoderLayer.forward and calls original_decoder_layer_forward), after mapping set kwargs["past_key_value"] = kwargs.get("past_key_values") and then remove the original by popping kwargs.pop("past_key_values", None) before calling original_decoder_layer_forward(self, *args, **kwargs) so only the expected key remains.
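The wrapper with the suggested `.pop()` fix can be demonstrated in isolation. `make_patched_forward` and `DummyLayer` are stand-ins invented for this sketch; the real patch targets `DeepseekV3DecoderLayer.forward`.

```python
def make_patched_forward(original_forward):
    """Remap transformers' new `past_key_values` kwarg to the legacy
    `past_key_value` name, popping the original key so only one remains."""
    def patched_forward(self, *args, **kwargs):
        kwargs["past_key_value"] = kwargs.pop("past_key_values", None)
        return original_forward(self, *args, **kwargs)
    return patched_forward

class DummyLayer:
    # Stand-in for a decoder layer whose forward expects the legacy kwarg.
    def forward(self, hidden_states, **kwargs):
        return sorted(kwargs)

DummyLayer.forward = make_patched_forward(DummyLayer.forward)
```

Because the old key is popped rather than copied, the wrapped forward never sees both spellings at once.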
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/speculative/plugins/hf_model_patches.py`:
- Line 28: Module defines a module-level export list as all =
["apply_model_patch"] which is a typo; rename it to __all__ =
["apply_model_patch"] so Python's import mechanism recognizes the intended
exported symbol (apply_model_patch). Update the variable name in the module
where apply_model_patch is defined to use double-underscore __all__ to control
from module import * exports.
- Around line 74-76: The patched _patched_compute_ttt_attention_mask calls
.repeat() on original_func(...) but original_func (the underlying
_compute_ttt_attention_mask) can return a BlockMask (from flex_attention) which
has no repeat method; update _patched_compute_ttt_attention_mask to detect the
return type (e.g., isinstance(tensor_mask, BlockMask) or hasattr(tensor_mask,
"repeat")) and handle both cases: if it's a torch.Tensor call
.repeat(batch_size,1,1,1) as now, and if it's a BlockMask return it unchanged
(or apply the BlockMask-appropriate transformation) so you avoid calling .repeat
on BlockMask; reference the symbols _patched_compute_ttt_attention_mask,
original_func, and BlockMask when making the change.
- Around line 64-67: The code incorrectly accesses a nested attribute on a
CompressedTensorsConfig instance; locate where quant_config is retrieved from
module.config (variable name quant_config) and replace the incorrect access
quant_config.quantization_config.ignore with the direct attribute
quant_config.ignore so it appends "re:.*eagle_module.*" to the
CompressedTensorsConfig.ignore list (ensure you only modify the line that
mutates the ignore list and keep the isinstance check for
CompressedTensorsConfig).
---
Outside diff comments:
In `@examples/speculative_decoding/main.py`:
- Around line 173-176: The code currently hardcodes trust_remote_code=True when
calling load_vlm_or_llm_with_kwargs and
transformers.AutoTokenizer.from_pretrained; add a command-line or function
argument (e.g., model_args.trust_remote_code defaulting to False) and use that
variable instead of True in all places in this file (references:
load_vlm_or_llm_with_kwargs, transformers.AutoTokenizer.from_pretrained); update
the CLI parser or function signature to expose trust_remote_code to callers and
replace every hardcoded True occurrence noted in the review with
model_args.trust_remote_code.
- Around line 148-155: If TRANSFORMERS_VERSION < 5.0 and training_args.cp_size >
1, training_args.parallelism_config may be unset and accessing
training_args.parallelism_config.sp_backend will raise AttributeError; fix by
ensuring training_args.parallelism_config is created before use (e.g.,
instantiate ParallelismConfig and assign to training_args.parallelism_config
when cp_size>1 or guard access), then call patch_ring_attention_for_ttt() and
only modify training_args.parallelism_config.sp_backend if the attribute exists;
update the block around TRANSFORMERS_VERSION, ParallelismConfig,
training_args.cp_size, patch_ring_attention_for_ttt, and sp_backend accordingly.
---
Nitpick comments:
In `@modelopt/torch/speculative/utils.py`:
- Around line 444-453: The patched_forward currently copies
kwargs["past_key_values"] to kwargs["past_key_value"] but leaves the original
key present, which can cause duplicate/conflicting behavior; in
patched_decoder_layer_fwd (which replaces
kimi_k2_module.DeepseekV3DecoderLayer.forward and calls
original_decoder_layer_forward), after mapping set kwargs["past_key_value"] =
kwargs.get("past_key_values") and then remove the original by popping
kwargs.pop("past_key_values", None) before calling
original_decoder_layer_forward(self, *args, **kwargs) so only the expected key
remains.
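The BlockMask handling called out above can be sketched with stubs. `FakeTensorMask` and `FakeBlockMask` are stand-ins invented here for `torch.Tensor` and flex_attention's `BlockMask`; the real patch wraps `_compute_ttt_attention_mask`.

```python
class FakeTensorMask:
    """Stand-in for a torch.Tensor attention mask (has .repeat)."""
    def __init__(self, shape):
        self.shape = tuple(shape)
    def repeat(self, *reps):
        # Mimic torch's repeat: multiply each dim by its repeat count.
        return FakeTensorMask(r * s for r, s in zip(reps, self.shape))

class FakeBlockMask:
    """Stand-in for flex_attention's BlockMask, which has no .repeat."""

def patched_compute_ttt_attention_mask(original_func, batch_size, *args, **kwargs):
    # Type-dispatch sketch of the suggested fix: plain tensors are repeated
    # across the batch; BlockMask-like objects pass through unchanged.
    mask = original_func(*args, **kwargs)
    if hasattr(mask, "repeat"):
        return mask.repeat(batch_size, 1, 1, 1)
    return mask
```

Dispatching on `hasattr(mask, "repeat")` (or `isinstance(mask, BlockMask)` in the real code) keeps both return types working without changing the tensor path.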
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
examples/speculative_decoding/launch_train.sh
examples/speculative_decoding/main.py
modelopt/torch/speculative/plugins/hf_model_patches.py
modelopt/torch/speculative/plugins/transformers.py
modelopt/torch/speculative/utils.py
```python
from packaging.version import Version
from transformers.utils.quantization_config import CompressedTensorsConfig

all = ["apply_model_patch"]
```
Typo: `all` should be `__all__`.

The module-level `all` variable has no special meaning in Python. It should be `__all__` (with double underscores) to properly control what gets exported when using `from module import *`.
Fix

```diff
-all = ["apply_model_patch"]
+__all__ = ["apply_model_patch"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/speculative/plugins/hf_model_patches.py` at line 28, Module
defines a module-level export list as all = ["apply_model_patch"] which is a
typo; rename it to __all__ = ["apply_model_patch"] so Python's import mechanism
recognizes the intended exported symbol (apply_model_patch). Update the variable
name in the module where apply_model_patch is defined to use double-underscore
__all__ to control from module import * exports.
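The effect of `__all__` can be shown with a throwaway module. The `patch_demo` module and `helper` function are invented for this illustration; note that a plain `all` list is just an ordinary attribute, while `__all__` is what `from module import *` consults.

```python
import sys
import types

# Build an in-memory module with an explicit __all__.
mod = types.ModuleType("patch_demo")
exec(
    "__all__ = ['apply_model_patch']\n"
    "def apply_model_patch():\n"
    "    return 'patched'\n"
    "def helper():\n"
    "    return 'not exported'\n",
    mod.__dict__,
)
sys.modules["patch_demo"] = mod

# Star-import into a fresh namespace: only names listed in __all__ appear,
# even though `helper` has no leading underscore.
ns = {}
exec("from patch_demo import *", ns)
```

With the `all` typo, Python falls back to the default rule (every non-underscore name is exported, including the stray `all` list itself), so the explicit export control is silently lost.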
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
What does this PR do?
Type of change: Refactor
Usage

```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary by CodeRabbit
Bug Fixes
New Features
Compatibility