Fine-grained learning rate control for LoRA and full-finetune training. Instead of one flat LR for all parameters, each trainable parameter gets:
```
lr = component_lr[type] * depth_fn(block_idx, num_blocks)
```
This creates a 2D grid: component type (`self_attn`, `cross_attn`, `ffn`, `norm`) x block depth (0 to N-1).
Status: Implemented in Serenity. Works with all supported model families. No changes to Stagehand, Activation Stagehand, or the training loop itself -- only optimizer param group construction.
Standard LoRA training uses one learning rate for every parameter. But:
- Shallow blocks see more general features (edges, colors). They benefit from higher LR to adapt quickly.
- Deep blocks see more abstract/semantic features. They're more sensitive and can destabilize with high LR.
- Self-attention (Q/K/V projections) directly controls what the model attends to. Usually needs a higher LR.
- Cross-attention controls how the model processes text conditioning. Often benefits from a lower LR to avoid destroying text alignment.
- FFN layers are the "memory" of each block. Moderate LR works well.
- Norm layers are rarely LoRA-targeted, but when they are, they need very low LR.
The per-layer LR system lets you express all of this in one config block.
Add a `per_layer_lr` block to your training JSON config:
```json
{
  "learning_rate": 0.0004,
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.0004,
      "cross_attn": 0.0002,
      "ffn": 0.0003,
      "norm": 0.00005
    },
    "depth_strategy": "linear",
    "min_depth_factor": 0.3,
    "peak_position": 0.5,
    "default_lr": 0.0003
  }
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Master switch |
| `component_lr` | dict | see above | Base LR per component type, before depth scaling |
| `depth_strategy` | string | `"linear"` | Depth function: `linear`, `cosine`, `inverted_u`, `flat` |
| `min_depth_factor` | float | `0.3` | Minimum depth scaling factor (0.0-1.0) |
| `peak_position` | float | `0.5` | Peak position for the `inverted_u` strategy (0.0-1.0) |
| `default_lr` | float | `0.0003` | LR for unmatched parameters, and the optimizer's base LR |
| `log_grad_norms` | bool | `false` | Reserved for future gradient norm logging |
When `per_layer_lr.enabled` is false or absent, training uses the flat `learning_rate` as before. No behavior change.
All strategies produce a factor between `min_depth_factor` and 1.0. The final LR is `component_lr[type] * factor`.

**`linear`** -- Linearly decays from 1.0 at block 0 to `min_depth_factor` at block N-1.

```
factor(idx) = 1.0 - (idx / (N-1)) * (1.0 - min_depth_factor)

Block:  0     5     10    15    19
Factor: 1.00  0.82  0.63  0.45  0.30
```

Best for: general-purpose training where shallow blocks should learn faster.

**`cosine`** -- Cosine decay from 1.0 to `min_depth_factor`. More gradual at the start, steeper in the middle.

```
factor(idx) = min_factor + (1 - min_factor) * 0.5 * (1 + cos(pi * idx/(N-1)))

Block:  0     5     10    15    19
Factor: 1.00  0.90  0.65  0.40  0.30
```

Best for: smoother transitions when you don't want a sharp LR cliff.

**`inverted_u`** -- Quadratic peak at `peak_position`, decaying to `min_depth_factor` at both ends.

```
Block:  0     5     10    15    19
Factor: 0.30  0.75  1.00  0.75  0.30   (peak_position=0.5)
```

Best for: when middle blocks are most important (common in image models, where mid-level features matter most for style).

**`flat`** -- No depth variation. The factor is always 1.0; only component-level LR differences apply.

```
Block:  0     5     10    15    19
Factor: 1.00  1.00  1.00  1.00  1.00
```

Best for: when you only want per-component LR without depth scaling.
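The depth strategies above can be sketched in a few lines of Python. This is an illustrative reimplementation of the documented formulas, not Serenity's actual code, and the exact quadratic shape assumed for `inverted_u` is an interpretation:

```python
import math

def depth_factor(idx: int, num_blocks: int, strategy: str = "linear",
                 min_factor: float = 0.3, peak_position: float = 0.5) -> float:
    """Return a depth scaling factor in [min_factor, 1.0] for block idx."""
    if num_blocks <= 1 or strategy == "flat":
        return 1.0
    pos = idx / (num_blocks - 1)  # 0.0 at the first block, 1.0 at the last
    if strategy == "linear":
        return 1.0 - pos * (1.0 - min_factor)
    if strategy == "cosine":
        return min_factor + (1.0 - min_factor) * 0.5 * (1.0 + math.cos(math.pi * pos))
    if strategy == "inverted_u":
        # Assumed quadratic peak: 1.0 at peak_position, min_factor at the farther end.
        dist = abs(pos - peak_position) / max(peak_position, 1.0 - peak_position)
        return 1.0 - (1.0 - min_factor) * dist ** 2
    raise ValueError(f"unknown strategy: {strategy}")

# Final LR = component base LR times the depth factor, e.g. block 10 of 20:
lr = 0.0004 * depth_factor(10, 20, "linear")  # 0.0004 * 0.6316 ≈ 0.000253
```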
Each trainable parameter is classified by matching its name against these patterns:
| Component | Patterns | Typical Role |
|---|---|---|
| `self_attn` | `attn.to_q/k/v/out`, `attn1.to_*`, `attention.to_*`, `to_q.lora`/`to_k.lora`/`to_v.lora` | Self-attention Q/K/V projections |
| `cross_attn` | `attn.add_q/k/v_proj`, `attn.to_add_out`, `attn2.to_*`, `ff_context.*` | Cross-attention with text stream |
| `ffn` | `ff.linear_in/out`, `ff.net.*`, `ffn.net.*`, `mlp.*`, `feed_forward.*` | Feed-forward network |
| `norm` | `norm` | Layer/group normalization |
Parameters that don't match any pattern get `default_lr`.
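A minimal classifier in the spirit of the pattern table might look like the sketch below. The pattern lists are illustrative and abridged, not the exact matcher Serenity ships:

```python
# Ordered so the more specific cross-attention patterns are checked before the
# generic self-attention ones; the short "norm" substring is tested last.
COMPONENT_PATTERNS = [
    ("cross_attn", ["attn.add_q", "attn.add_k", "attn.add_v", "attn.to_add_out",
                    "attn2.", "ff_context."]),
    ("self_attn",  ["attn.to_q", "attn.to_k", "attn.to_v", "attn.to_out",
                    "attn1.", "attention.to_"]),
    ("ffn",        ["ff.linear_in", "ff.linear_out", "ff.net.", "ffn.net.",
                    "mlp.", "feed_forward."]),
    ("norm",       ["norm"]),
]

def classify_param(name: str) -> str:
    """Map a parameter name to a component type; unmatched names fall back to default_lr."""
    for component, patterns in COMPONENT_PATTERNS:
        if any(p in name for p in patterns):
            return component
    return "default"
```

For example, `classify_param("blocks.3.attn.to_q.lora_A.weight")` returns `self_attn`, while a name like `time_embedding.linear_1.weight` matches nothing and falls back to `default`.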
Not all models have all component types:
| Model Family | self_attn | cross_attn | ffn | Notes |
|---|---|---|---|---|
| Flux 2 / Klein | `attn.to_q/k/v/out` | `attn.add_q/k/v_proj`, `ff_context.*` | `ff.linear_in/out` | Double-stream: self_attn = image, cross_attn = text |
| SD 1.5 / SDXL | `attn1.to_q/k/v` | `attn2.to_q/k/v` | `ff.net.*` | Classic U-Net cross-attention |
| SD3 / SD3.5 | `attn.to_q/k/v` | -- | `ff.net.*` | Joint attention (no separate cross) |
| WAN 2.x | `attn1.to_q/k/v` | `attn2.to_q/k/v` | `ffn.net.*` | Video model with self + cross |
| LTX-2 | `to_q/k/v` (bare) | -- | `ff.net.*` | Bare names, no `attn.` prefix |
| Qwen | `attn.to_q/k/v` | -- | `mlp.gate/up/down_proj` | Gated MLP |
| ZImage | `attention.to_q/k/v` | -- | `feed_forward.*` | Uses `attention.` prefix |
| Chroma / HunyuanVideo / PixArt / Sana | `attn.to_q/k/v` | -- | `ff.net.*` | Standard naming |
Small model, fast iteration. Emphasize self-attention for style capture, keep the FFN LR moderate, and hold cross-attention low to preserve prompt following.
```json
{
  "model_type": "flux_2_klein_4b",
  "learning_rate": 0.0004,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.0005,
      "cross_attn": 0.00015,
      "ffn": 0.0003,
      "norm": 0.00005
    },
    "depth_strategy": "inverted_u",
    "min_depth_factor": 0.4,
    "peak_position": 0.5,
    "default_lr": 0.0003
  }
}
```

Why `inverted_u`: Klein 4B has fewer blocks, and the mid-level blocks capture the most style information. A peak at 0.5 gives the highest LR to the middle of the network.
Larger model, where subject fidelity matters. Use a higher self-attention LR and preserve text alignment with a low cross-attention LR.
```json
{
  "model_type": "flux_2_klein_9b",
  "learning_rate": 0.0004,
  "adapter": {
    "type": "lora",
    "rank": 32,
    "alpha": 32
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.0004,
      "cross_attn": 0.0001,
      "ffn": 0.00025,
      "norm": 0.00003
    },
    "depth_strategy": "linear",
    "min_depth_factor": 0.3,
    "default_lr": 0.00025
  }
}
```

Why `linear`: for a subject LoRA, shallow blocks learn identity features (face structure, body shape), while deep blocks are more semantic and should change less.
When training all parameters (no LoRA), per-layer LR helps prevent catastrophic forgetting in deep blocks.
```json
{
  "model_type": "flux_2_dev",
  "training_method": "fine_tune",
  "learning_rate": 0.00001,
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.000015,
      "cross_attn": 0.000005,
      "ffn": 0.00001,
      "norm": 0.000002
    },
    "depth_strategy": "cosine",
    "min_depth_factor": 0.2,
    "default_lr": 0.00001
  }
}
```

Why `cosine`: the smoother decay prevents abrupt LR changes between adjacent blocks. The 0.2 min factor means deep blocks still learn at one fifth the rate of shallow blocks.
Classic model with separate self-attention and cross-attention blocks.
```json
{
  "model_type": "sd15",
  "learning_rate": 0.0001,
  "adapter": {
    "type": "lora",
    "rank": 8,
    "alpha": 8
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.00015,
      "cross_attn": 0.00005,
      "ffn": 0.0001,
      "norm": 0.00002
    },
    "depth_strategy": "inverted_u",
    "min_depth_factor": 0.3,
    "peak_position": 0.4,
    "default_lr": 0.0001
  }
}
```

Why `peak_position=0.4`: the U-Net architecture has down/mid/up blocks, and the most expressive features sit in the early-mid range. Shifting the peak earlier captures this.
```json
{
  "model_type": "sdxl",
  "learning_rate": 0.0001,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.00012,
      "cross_attn": 0.00004,
      "ffn": 0.00008,
      "norm": 0.00001
    },
    "depth_strategy": "linear",
    "min_depth_factor": 0.35,
    "default_lr": 0.00008
  }
}
```

SD3 uses joint attention (no separate cross-attention). All text/image interaction happens through self-attention.
```json
{
  "model_type": "sd35",
  "learning_rate": 0.0002,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.00025,
      "ffn": 0.00015,
      "norm": 0.00003
    },
    "depth_strategy": "cosine",
    "min_depth_factor": 0.3,
    "default_lr": 0.00015
  }
}
```

Note: no `cross_attn` entry is needed -- SD3 has no separate cross-attention.
LTX-2 has 48 transformer blocks. With Stagehand streaming blocks through the GPU one at a time, per-layer LR adds zero VRAM overhead (it only affects optimizer param group construction).
```json
{
  "model_type": "ltx2",
  "learning_rate": 0.0003,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.0004,
      "ffn": 0.00025,
      "norm": 0.00003
    },
    "depth_strategy": "linear",
    "min_depth_factor": 0.2,
    "default_lr": 0.00025
  },
  "memory": {
    "stagehand": {
      "enabled": true,
      "pinned_pool_mb": 8192,
      "prefetch_window": 2
    }
  }
}
```

Why `min_depth_factor=0.2`: with 48 blocks, a 0.3 min factor means block 47 trains at `lr * 0.3`. At 0.2, the deepest blocks change even less -- good for video temporal consistency.
WAN has both self-attention (`attn1`) and cross-attention (`attn2`). The dual-stage path (WAN 2.2) also supports per-layer LR.
```json
{
  "model_type": "wan",
  "learning_rate": 0.0003,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.0004,
      "cross_attn": 0.00015,
      "ffn": 0.00025,
      "norm": 0.00003
    },
    "depth_strategy": "cosine",
    "min_depth_factor": 0.25,
    "default_lr": 0.00025
  }
}
```

Qwen uses a gated MLP (`gate_proj`, `up_proj`, `down_proj`) instead of a standard FFN.
```json
{
  "model_type": "qwen",
  "learning_rate": 0.0002,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.00025,
      "ffn": 0.00015,
      "norm": 0.00002
    },
    "depth_strategy": "linear",
    "min_depth_factor": 0.3,
    "default_lr": 0.00015
  }
}
```

ZImage uses `attention.to_q/k/v` and `feed_forward.*` naming.
```json
{
  "model_type": "zimage",
  "learning_rate": 0.0003,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.0004,
      "ffn": 0.00025,
      "norm": 0.00003
    },
    "depth_strategy": "inverted_u",
    "min_depth_factor": 0.35,
    "peak_position": 0.45,
    "default_lr": 0.00025
  }
}
```

```json
{
  "model_type": "chroma_1",
  "learning_rate": 0.0003,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.0004,
      "ffn": 0.0002,
      "norm": 0.00003
    },
    "depth_strategy": "cosine",
    "min_depth_factor": 0.3,
    "default_lr": 0.0002
  }
}
```

```json
{
  "model_type": "hunyuan_video",
  "learning_rate": 0.0003,
  "adapter": {
    "type": "lora",
    "rank": 16,
    "alpha": 16
  },
  "per_layer_lr": {
    "enabled": true,
    "component_lr": {
      "self_attn": 0.00035,
      "ffn": 0.0002,
      "norm": 0.00003
    },
    "depth_strategy": "linear",
    "min_depth_factor": 0.25,
    "default_lr": 0.0002
  }
}
```

Mid-level features matter most. Use `inverted_u` with a peak at 0.4-0.6.
```json
"depth_strategy": "inverted_u",
"min_depth_factor": 0.3,
"peak_position": 0.5,
"component_lr": {
  "self_attn": 0.0005,
  "cross_attn": 0.0002,
  "ffn": 0.0003
}
```

Shallow blocks learn identity. Deep blocks should change less. Use `linear`.

```json
"depth_strategy": "linear",
"min_depth_factor": 0.3,
"component_lr": {
  "self_attn": 0.0004,
  "cross_attn": 0.0001,
  "ffn": 0.00025
}
```

Cross-attention is where text-to-image binding lives. Give it a moderate LR, and use `cosine` for a smooth decay.

```json
"depth_strategy": "cosine",
"min_depth_factor": 0.35,
"component_lr": {
  "self_attn": 0.0003,
  "cross_attn": 0.00025,
  "ffn": 0.0002
}
```

Much lower LRs. Cosine decay with an aggressive min factor protects the deep semantic blocks.

```json
"depth_strategy": "cosine",
"min_depth_factor": 0.15,
"component_lr": {
  "self_attn": 0.000015,
  "cross_attn": 0.000005,
  "ffn": 0.00001
}
```

When you only want different LRs per component, not per depth:

```json
"depth_strategy": "flat",
"component_lr": {
  "self_attn": 0.0004,
  "cross_attn": 0.0001,
  "ffn": 0.0003
}
```

Small depth variation, mostly flat. A good starting point if unsure.

```json
"depth_strategy": "linear",
"min_depth_factor": 0.7,
"component_lr": {
  "self_attn": 0.0004,
  "cross_attn": 0.0003,
  "ffn": 0.00035
}
```

When enabled, the full 2D LR grid is printed at startup:
```
[per_layer_lr] model_type=flux_2_klein_4b strategy=linear min_depth_factor=0.3 groups=24 total_params=192
  transformer_blocks.000.cross_attn        lr=0.000200  params=8
  transformer_blocks.000.ffn               lr=0.000300  params=4
  transformer_blocks.000.self_attn         lr=0.000400  params=8
  transformer_blocks.001.cross_attn        lr=0.000177  params=8
  transformer_blocks.001.ffn               lr=0.000267  params=4
  transformer_blocks.001.self_attn         lr=0.000356  params=8
  ...
  transformer_blocks.011.cross_attn        lr=0.000060  params=8
  transformer_blocks.011.ffn               lr=0.000090  params=4
  transformer_blocks.011.self_attn         lr=0.000120  params=8
  single_transformer_blocks.000.ffn        lr=0.000300  params=4
  single_transformer_blocks.000.self_attn  lr=0.000400  params=8
  ...
```
This makes it easy to verify the schedule is what you intended before training starts.
Per-layer LR is orthogonal to Stagehand. It only affects optimizer param group construction at startup. During training:
- Stagehand streams blocks through the GPU as usual
- The optimizer already has per-group LRs assigned
- No additional VRAM overhead
- No additional compute overhead
- Works with both block mode and layer mode Stagehand
The LR scheduler (cosine, linear, constant, etc.) applies its multiplicative factor to each param group's LR independently, so warmup and decay work correctly with per-layer LR.
PyTorch's LambdaLR scheduler multiplies each param group's base LR by a factor. With per-layer LR, each group has a different base LR, so:
- Warmup ramps all groups proportionally
- Cosine/linear decay scales all groups proportionally
- The relative ratios between groups are preserved throughout training
For example, if self_attn starts at 0.0004 and ffn starts at 0.0003, after a cosine decay to 50%:
- self_attn = 0.0002
- ffn = 0.00015
The 4:3 ratio is maintained.
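The ratio preservation can be checked with a few lines of plain Python. This is a toy model of a multiplicative (LambdaLR-style) scheduler applied to per-group base LRs, using the numbers from the example above; the step counts are arbitrary:

```python
import math

# Base LRs per param group, as constructed at startup.
groups = {"self_attn": 0.0004, "ffn": 0.0003}

def cosine_factor(step: int, total_steps: int) -> float:
    """Multiplicative factor a LambdaLR-style scheduler applies to every group."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Halfway through training the factor is 0.5, so every group's LR halves:
halfway = {name: lr * cosine_factor(500, 1000) for name, lr in groups.items()}
# self_attn ≈ 0.0002, ffn ≈ 0.00015 -- the 4:3 ratio between groups is unchanged.
```

Because the same factor multiplies every group, warmup and decay reshape the overall LR curve without disturbing the relative per-layer schedule.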