From 561b30430104bca1ddef6efb6c8a1ac44e18b015 Mon Sep 17 00:00:00 2001
From: unnawut <unnawut@unnawut.com>
Date: Fri, 16 Jan 2026 16:58:53 +0700
Subject: [PATCH 1/3] feat: add consensus testing skill to code-tester agent

---
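Reviewer note: the skill file added below leans heavily on the 2/3 boundary ("exactly 2/3 -> should justify", "one less than 2/3 -> should NOT justify"). The following is a minimal sketch of that arithmetic, assuming the threshold is inclusive as the skill text states and that every validator carries equal weight. The helper names `meets_supermajority` and `min_attesters` are hypothetical and do not come from leanSpec.

```python
def meets_supermajority(attesting: int, total: int) -> bool:
    """True when attesting/total >= 2/3, using integer math to avoid float rounding."""
    return 3 * attesting >= 2 * total


def min_attesters(total: int) -> int:
    """Smallest attester count that reaches 2/3 of `total`, i.e. ceil(2 * total / 3)."""
    return (2 * total + 2) // 3


# Boundary counts for a few validator set sizes (hypothetical helper usage).
for total in (4, 8, 9):
    at = min_attesters(total)
    print(total, at, meets_supermajority(at, total), meets_supermajority(at - 1, total))
```

With 8 validators the smallest passing count is 6 and 5 must fail; with 9 validators, 6 attesters is exactly 2/3, which is why a multiple of 3 is handy for testing the exact boundary.
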
 .claude/agents/code-tester.md       |  16 ++-
 .claude/skills/consensus-testing.md | 152 ++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+), 1 deletion(-)
 create mode 100644 .claude/skills/consensus-testing.md

diff --git a/.claude/agents/code-tester.md b/.claude/agents/code-tester.md
index a0a6570a..9bef1770 100644
--- a/.claude/agents/code-tester.md
+++ b/.claude/agents/code-tester.md
@@ -11,6 +11,18 @@ You are SpecForge, an elite Test Engineer specializing in the Lean Ethereum Cons
 
 Generate rigorous, comprehensive unit tests and spec test fillers for the leanSpec repository. Your tests verify spec compliance and ensure cross-client interoperability across all modules.
 
+## Auto-Invoke Skills
+
+### Consensus Testing
+
+When writing tests for consensus-related code, invoke the `/consensus-testing` skill first to load specialized multi-validator testing patterns.
+
+**Triggers to invoke the skill:**
+- The test file is in `tests/consensus/`
+- The code under test includes functions like `process_block`, `on_block`, or `on_attestation`
+- The code involves validators, attestations, or justification/finalization
+- The scenario covers fork choice or state transitions with multiple validators
+
 ## Workflow (Follow This Order)
 
 ### 1. Explore First
@@ -20,7 +32,8 @@ Generate rigorous, comprehensive unit tests and spec test fillers for the leanSp
 - Map out exception types and when they're raised
 
 ### 2. Check Existing Tests
-- Search `tests/lean_spec/` for related test files
+- Search `tests/lean_spec/` for related unit test files
+- Search `tests/consensus/` for related spec test filler files
 - Match the established style and naming conventions
 - Avoid duplicating existing test coverage
 - Identify gaps in current coverage
@@ -37,6 +50,7 @@ Generate rigorous, comprehensive unit tests and spec test fillers for the leanSp
 
 ### 5. Verify
 - Run `uv run pytest <test_file>` to ensure tests pass
+- Run `uv run fill --clean --fork=devnet <test_file>` to ensure test fillers pass
 - Run `uv run ruff check <test_file>` for linting
 - Run `uv run ruff format <test_file>` for formatting
 - Fix any issues before presenting results
diff --git a/.claude/skills/consensus-testing.md b/.claude/skills/consensus-testing.md
new file mode 100644
index 00000000..fd2c4589
--- /dev/null
+++ b/.claude/skills/consensus-testing.md
@@ -0,0 +1,152 @@
+---
+name: consensus-testing
+description: "Specialized patterns for testing consensus and fork choice code with multiple validators. Use when writing tests in tests/consensus/, or when testing functions involving validators, attestations, justification, or finalization."
+---
+
+# Consensus & Fork Choice Testing Patterns
+
+Testing consensus logic requires understanding how validators interact. Single-validator tests miss dynamics such as competing heads, split votes, and threshold boundaries.
+
+## Multi-Validator Test Design
+
+**Minimum validator counts by scenario:**
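+- Basic consensus: 4 validators (tolerates 1 Byzantine validator; the 3 honest validators still exceed 2/3)
+- Justification threshold: 8+ validators (6 of 8 is the smallest count reaching 2/3; a multiple of 3, such as 9, makes exactly 2/3 a whole number)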
+
+**Always vary the validator set composition:**
+- All validators honest and online
+- Supermajority honest (one validator above the 2/3 threshold)
+- At justification threshold (exactly 2/3)
+- Below threshold (one validator short of 2/3, should fail to justify)
+- Mixed online/offline validators
+
+## Validator Relationship Scenarios
+
+Test how validators interact, not just individual behavior:
+
+**Attestation patterns:**
+- All validators attest to same head (happy path)
+- Validators split between two competing heads
+- Staggered attestations across slots
+- Late attestations arriving after new blocks
+- Missing attestations from subset of validators
+
+**Proposer/attester dynamics:**
+- Proposer includes own attestation
+- Proposer excludes valid attestations (censorship)
+- Attestations reference the parent of the proposer's block (not the block itself)
+- Multiple blocks proposed for same slot (equivocation)
+
+**Committee behavior:**
+- Full committee participation
+- Partial committee (threshold edge cases)
+- Empty committee attestations
+- Cross-committee attestation conflicts
+
+## Fork Choice Scenarios
+
+Fork choice tests must exercise competing chain heads:
+
+**Branch competition:**
+```
+               +-- B2a <- B3a (3 attestations)
+genesis <- B1 -+
+               +-- B2b <- B3b (4 attestations)  <- winner
+```
+- Test that head follows attestation weight
+- Verify re-org when new attestations shift weight
+- Check tie-breaking rules when weights are equal
+
+**Critical scenarios to cover:**
+1. **Weight transitions**: Head changes as attestations arrive
+2. **Deep re-orgs**: New branch overtakes after multiple slots
+3. **Equivocation handling**: Same validator attests to conflicting heads
+4. **Checkpoint boundaries**: Behavior at epoch transitions
+5. **Finalization effects**: Finalized blocks cannot be re-orged
+
+## Justification & Finalization
+
+The 2/3 supermajority threshold is critical:
+
+**Justification tests:**
+- Exactly 2/3 participation -> should justify
+- One validator fewer than 2/3 -> should NOT justify
+- Validators with different effective balances (weighted voting)
+- Justification with gaps (skip epochs)
+
+**Finalization tests:**
+- Two consecutive justified epochs -> finalization
+- Justified but not finalized (gap in justification)
+- Finalization with varying participation rates
+- Cannot finalize without prior justification
+
+## Timing & Ordering
+
+Consensus is sensitive to when events occur:
+
+**Test event orderings:**
+- Attestation before vs after block arrival
+- Multiple attestations in same slot vs spread across slots
+- Block arrives late (after attestation deadline)
+- Out-of-order block delivery (child before parent)
+
+**Slot boundary behavior:**
+- Actions at slot start vs slot end
+- Crossing epoch boundaries
+- Genesis slot special cases
+
+## Spec Filler Patterns for Fork Choice
+
+```python
+def test_competing_branches(fork_choice_test: ForkChoiceTestFiller) -> None:
+    """Fork choice selects branch with higher attestation weight."""
+    fork_choice_test(
+        anchor_state=genesis_state,
+        anchor_block=genesis_block,
+        steps=[
+            # Build competing branches (step names are illustrative; the reorg tests use BlockStep)
+            OnBlock(block=block_2a),
+            OnBlock(block=block_2b),
+            # Two attestations for branch b, one for branch a
+            OnAttestation(attestation=att_for_2b_validator_0),
+            OnAttestation(attestation=att_for_2b_validator_1),
+            OnAttestation(attestation=att_for_2a_validator_2),
+            # Verify head follows weight
+            Checks(head=block_2b.hash_tree_root()),
+        ],
+    )
+```
+
+## State Transition with Multiple Validators
+
+```python
+def test_justification_threshold(state_transition_test: StateTransitionTestFiller) -> None:
+    """State justifies checkpoint when at least 2/3 of validators attest."""
+    # Create state with 8 validators
+    state = generate_pre_state(num_validators=8)
+
+    # Block with attestations from 6 of 8 validators (75%), the smallest count reaching 2/3
+    block = create_block_with_attestations(
+        state=state,
+        attesting_validators=[0, 1, 2, 3, 4, 5],  # 6 of 8
+    )
+
+    state_transition_test(
+        pre=state,
+        blocks=[block],
+        post=StateExpectation(
+            current_justified_checkpoint=expected_checkpoint,
+        ),
+    )
+```
+
+## Common Pitfalls
+
+Avoid these testing mistakes:
+
+1. **Single validator tests** - Miss consensus dynamics entirely
+2. **Always-honest scenarios** - Never test byzantine behavior
+3. **Ignoring weights** - Validators may have different balances
+4. **Fixed ordering** - Real networks have non-deterministic message arrival
+5. **Skipping threshold edges** - The 2/3 boundary is where bugs hide
+6. **Testing implementation** - Test spec behavior, not internal state

From f7c84c5b9a270348194937e8b11abac1f1a084cd Mon Sep 17 00:00:00 2001
From: unnawut <unnawut@unnawut.com>
Date: Fri, 16 Jan 2026 16:59:12 +0700
Subject: [PATCH 2/3] fix: fork choice tests with too few validators

---
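Reviewer note: the docstrings added in this patch describe proposer selection as `slot % 6`. The sketch below illustrates why the validator count matters for slots 1 through 5, assuming plain round-robin selection. The function names are hypothetical, and the 4-validator set is only a hypothetical smaller configuration used for comparison.

```python
def proposer_index(slot: int, num_validators: int) -> int:
    """Round-robin proposer selection, mirroring the `slot % 6` wording in the docstrings."""
    return slot % num_validators


def proposers_are_unique(slots, num_validators: int) -> bool:
    """True when no two slots in `slots` map to the same proposer."""
    proposers = [proposer_index(slot, num_validators) for slot in slots]
    return len(set(proposers)) == len(proposers)


slots = range(1, 6)  # slots 1..5, the range the deep reorg timeline covers
print(proposers_are_unique(slots, 6))
print(proposers_are_unique(slots, 4))
```

With 6 validators each of the five slots gets its own proposer, while with 4 validators slots 1 and 5 both map to validator 1.
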
 .../devnet/fc/test_fork_choice_reorgs.py      | 39 ++++++++++++-------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/tests/consensus/devnet/fc/test_fork_choice_reorgs.py b/tests/consensus/devnet/fc/test_fork_choice_reorgs.py
index 78c79b19..b7c25b68 100644
--- a/tests/consensus/devnet/fc/test_fork_choice_reorgs.py
+++ b/tests/consensus/devnet/fc/test_fork_choice_reorgs.py
@@ -210,10 +210,10 @@ def test_three_block_deep_reorg(
     - Slots 2-5: Fork B slowly builds, then surpasses with 4 blocks
 
     Timeline:
-        Slot 2: Fork A leads (1 vs 0)
-        Slot 3: Fork A leads (2 vs 1)
-        Slot 4: Fork A leads (3 vs 2)
-        Slot 5: Fork B overtakes (4 vs 3) → 3-block deep reorg
+        Slot 2: Fork A leads (depth 1 vs 0)
+        Slot 3: Fork A leads (depth 2 vs 1)
+        Slot 4: Fork A leads (depth 3 vs 2)
+        Slot 5: Fork B overtakes (depth 4 vs 3) → 3-block deep reorg
 
     Expected Behavior
     -----------------
@@ -226,7 +226,12 @@ def test_three_block_deep_reorg(
     Reorg Details:
         - **Depth**: 3 blocks (deepest in this test suite)
         - **Trigger**: Alternative fork becomes longer
-        - **Weight advantage**: 4 proposer attestations vs 3
+
+    Validator Configuration
+    -----------------------
+    Uses 6 validators to ensure each slot has a unique proposer (slot % 6).
+    Competing blocks at the same slot share a proposer, but the later block's
+    attestation overwrites the earlier one. Fork B wins via depth advantage.
 
     Why This Matters
     ----------------
@@ -245,6 +250,7 @@ def test_three_block_deep_reorg(
     about chain history, ensuring safety and liveness even in adversarial scenarios.
     """
     fork_choice_test(
+        anchor_state=generate_pre_state(num_validators=6),
         steps=[
             # Common base
             BlockStep(
@@ -656,13 +662,13 @@ def test_back_and_forth_reorg_oscillation(
     tests fork choice correctness under extreme conditions.
 
     Oscillation Pattern:
-        Slot 2: Fork A leads (1 block) ← head
-        Slot 3: Fork B catches up (1 block each) → tie
-        Slot 4: Fork B extends (2 vs 1) ← head switches to B
-        Slot 5: Fork A extends (2 vs 2) → tie
-        Slot 6: Fork A extends (3 vs 2) ← head switches to A
-        Slot 7: Fork B extends (3 vs 3) → tie
-        Slot 8: Fork B extends (4 vs 3) ← head switches to B
+        Slot 2: Fork A leads (depth 1 vs 0) ← head
+        Slot 2: Fork B created (depth 1 vs 1) → tie, A maintains
+        Slot 3: Fork B extends (depth 2 vs 1) ← head switches to B (REORG #1)
+        Slot 3: Fork A extends (depth 2 vs 2) → tie, B maintains
+        Slot 4: Fork A extends (depth 3 vs 2) ← head switches to A (REORG #2)
+        Slot 4: Fork B extends (depth 3 vs 3) → tie, A maintains
+        Slot 5: Fork B extends (depth 4 vs 3) ← head switches to B (REORG #3)
 
     Expected Behavior
     -----------------
@@ -671,7 +677,13 @@ def test_back_and_forth_reorg_oscillation(
     3. All reorgs are 1-2 blocks deep
     4. Fork choice remains consistent and correct throughout
 
-    Reorg Count: 3 reorgs in 6 slots (very high rate)
+    Reorg Count: 3 reorgs in 4 slots (very high rate)
+
+    Validator Configuration
+    -----------------------
+    Uses 6 validators to ensure each slot has a unique proposer (slot % 6).
+    Competing blocks at the same slot share a proposer, but the later block's
+    attestation overwrites the earlier one. Fork B ultimately wins via depth.
 
     Why This Matters
     ----------------
@@ -694,6 +706,7 @@ def test_back_and_forth_reorg_oscillation(
     convergence.
     """
     fork_choice_test(
+        anchor_state=generate_pre_state(num_validators=6),
         steps=[
             # Common base
             BlockStep(

From 7da2e388c9682457c3ae2e14c06eef4eb5a2fc81 Mon Sep 17 00:00:00 2001
From: unnawut <unnawut@unnawut.com>
Date: Fri, 16 Jan 2026 17:02:35 +0700
Subject: [PATCH 3/3] fix: remove unnecessary additional comments

---
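Reviewer note: the oscillation timeline kept by this patch ("3 reorgs in 4 slots") can be sanity-checked with a toy model. The sketch below assumes the head moves only when the other fork becomes strictly deeper and that ties keep the current head, as the timeline states. It models the docstring only, not the spec's fork choice rule, and `simulate_heads` is a hypothetical helper.

```python
def simulate_heads(events):
    """Replay (slot, fork) block arrivals and return (final_head, reorg_count)."""
    depth = {"A": 0, "B": 0}
    head = None
    reorgs = 0
    for _slot, fork in events:
        depth[fork] += 1
        if head is None:
            # First block on either fork becomes the initial head.
            head = fork
        elif fork != head and depth[fork] > depth[head]:
            # The competing fork is now strictly deeper: switch heads.
            head = fork
            reorgs += 1
    return head, reorgs


# Block arrivals in the order listed in the oscillation docstring.
timeline = [(2, "A"), (2, "B"), (3, "B"), (3, "A"), (4, "A"), (4, "B"), (5, "B")]
print(simulate_heads(timeline))
```

Under these assumptions the replay ends with fork B as head after three head switches, matching the reorg count in the docstring.
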
 .../devnet/fc/test_fork_choice_reorgs.py      | 34 ++++++-------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/tests/consensus/devnet/fc/test_fork_choice_reorgs.py b/tests/consensus/devnet/fc/test_fork_choice_reorgs.py
index b7c25b68..67726257 100644
--- a/tests/consensus/devnet/fc/test_fork_choice_reorgs.py
+++ b/tests/consensus/devnet/fc/test_fork_choice_reorgs.py
@@ -210,10 +210,10 @@ def test_three_block_deep_reorg(
     - Slots 2-5: Fork B slowly builds, then surpasses with 4 blocks
 
     Timeline:
-        Slot 2: Fork A leads (depth 1 vs 0)
-        Slot 3: Fork A leads (depth 2 vs 1)
-        Slot 4: Fork A leads (depth 3 vs 2)
-        Slot 5: Fork B overtakes (depth 4 vs 3) → 3-block deep reorg
+        Slot 2: Fork A leads (1 vs 0)
+        Slot 3: Fork A leads (2 vs 1)
+        Slot 4: Fork A leads (3 vs 2)
+        Slot 5: Fork B overtakes (4 vs 3) → 3-block deep reorg
 
     Expected Behavior
     -----------------
@@ -227,12 +227,6 @@ def test_three_block_deep_reorg(
         - **Depth**: 3 blocks (deepest in this test suite)
         - **Trigger**: Alternative fork becomes longer
 
-    Validator Configuration
-    -----------------------
-    Uses 6 validators to ensure each slot has a unique proposer (slot % 6).
-    Competing blocks at the same slot share a proposer, but the later block's
-    attestation overwrites the earlier one. Fork B wins via depth advantage.
-
     Why This Matters
     ----------------
     Deep reorgs (3+ blocks) are rare in healthy networks but can happen:
@@ -662,13 +656,13 @@ def test_back_and_forth_reorg_oscillation(
     tests fork choice correctness under extreme conditions.
 
     Oscillation Pattern:
-        Slot 2: Fork A leads (depth 1 vs 0) ← head
-        Slot 2: Fork B created (depth 1 vs 1) → tie, A maintains
-        Slot 3: Fork B extends (depth 2 vs 1) ← head switches to B (REORG #1)
-        Slot 3: Fork A extends (depth 2 vs 2) → tie, B maintains
-        Slot 4: Fork A extends (depth 3 vs 2) ← head switches to A (REORG #2)
-        Slot 4: Fork B extends (depth 3 vs 3) → tie, A maintains
-        Slot 5: Fork B extends (depth 4 vs 3) ← head switches to B (REORG #3)
+        Slot 2: Fork A leads (1 vs 0) ← head
+        Slot 2: Fork B created (1 vs 1) → tie, A maintains
+        Slot 3: Fork B extends (2 vs 1) ← head switches to B (REORG #1)
+        Slot 3: Fork A extends (2 vs 2) → tie, B maintains
+        Slot 4: Fork A extends (3 vs 2) ← head switches to A (REORG #2)
+        Slot 4: Fork B extends (3 vs 3) → tie, A maintains
+        Slot 5: Fork B extends (4 vs 3) ← head switches to B (REORG #3)
 
     Expected Behavior
     -----------------
@@ -679,12 +673,6 @@ def test_back_and_forth_reorg_oscillation(
 
     Reorg Count: 3 reorgs in 4 slots (very high rate)
 
-    Validator Configuration
-    -----------------------
-    Uses 6 validators to ensure each slot has a unique proposer (slot % 6).
-    Competing blocks at the same slot share a proposer, but the later block's
-    attestation overwrites the earlier one. Fork B ultimately wins via depth.
-
     Why This Matters
     ----------------
     While extremely rare, this scenario can theoretically occur: