Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@vercel
Copy link

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 27, 2025 0:58am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 27, 2025 0:58am
rivet-inspector Ignored Ignored Preview Nov 27, 2025 0:58am
rivet-site Ignored Ignored Preview Nov 27, 2025 0:58am

Copy link
Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 18, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3483

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3483

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3483

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3483

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3483

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3483

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3483

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3483

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3483

commit: fa56065

@claude
Copy link

claude bot commented Nov 18, 2025

Pull Request Review

Overview

This PR simplifies the runner workflow by removing the command/event handling logic from the runner workflow and moving event forwarding directly to actor workflows. This is a significant refactoring that reduces ~500 lines of code by eliminating intermediate state management.

Key Changes

  1. Event Forwarding Architecture Change

    • Before: Events were sent to the runner workflow, stored in state, then forwarded to actor workflows
    • After: Events are forwarded directly from ws_to_tunnel_task.rs to actor workflows using the actor_id tag
    • This eliminates the need for the runner workflow to track last_event_idx, last_event_ack_idx, and event storage
  2. Removed State Management (runner2.rs:29-40)

    • Removed last_event_idx, last_command_idx, and commands vector from State
    • Simplified LifecycleState by removing event tracking fields
    • This reduces memory overhead and workflow complexity
  3. Removed Activities

    • InsertDb: Database initialization logic removed (~160 lines)
    • ProcessInit: Init packet processing removed (~80 lines)
    • InsertEvents: Event storage removed (~25 lines)
    • InsertCommands: Command storage and batching removed (~35 lines)
  4. Signal Changes

    • Removed Command signal entirely
    • Kept Forward, CheckQueue, and Stop signals
    • Forward now only handles ToServerStopping and rejects other message types

Code Quality & Best Practices

Good:

  • Follows error handling conventions with anyhow and custom RivetError types
  • Proper use of structured logging (tracing::warn!(?actor_id, ...))
  • Consistent naming conventions (snake_case, past tense for timestamps)
  • Good use of workspace dependencies
  • Clear separation of concerns by moving event handling closer to where it's needed

⚠️ Concerns:

  1. Missing Actor ID in Event Forwarding (ws_to_tunnel_task.rs:341-356)

    protocol::ToServer::ToServerEvents(events) => {
        let res = ctx.signal(pegboard::workflows::runner2::Forward {
            inner: protocol::ToServer::try_from(msg)
                .context("failed to convert message for workflow forwarding")?,
        })
        .tag("actor_id", actor_id)  // ❌ ERROR: 'actor_id' not defined

    Issue: The variable actor_id is not defined in this scope. You need to extract the actor_id from the events first.

    Fix: Extract actor_id from the events before using it:

    protocol::ToServer::ToServerEvents(events) => {
        for event in &events.events {
            let actor_id = crate::utils::event_actor_id(&event.inner).to_string();
            let res = ctx.signal(pegboard::workflows::actor::Event {
                inner: event.inner.clone(),
            })
            .tag("actor_id", &actor_id)
            .graceful_not_found()
            .send()
            .await?;
            if res.is_none() {
                tracing::warn!(?actor_id, "failed to send signal to actor workflow, likely already stopped");
            }
        }
    }
  2. TODO Comment Without Implementation Plan (runner2.rs:79)

    // TODO: Ack events

    Question: How will events be acknowledged now that the runner workflow no longer tracks them? This could lead to unbounded event queues or lost events if not handled properly.

  3. Potential Lost Init Handling (runner2.rs:98-108)
    The workflow now rejects ToServerInit messages, but there's no clear indication of where init logic has moved to. The init activity only writes to RunnerByKeyKey, which is much simpler than before.

    Question: Was the metadata and prepopulated actor names handling removed intentionally, or does it need to be reimplemented elsewhere?

  4. Variable Naming Improvement (conn.rs:79,89)
    ✅ Good change renaming packet to init_packet for clarity

Potential Bugs

CRITICAL: Compilation Error

  • Line 346 in ws_to_tunnel_task.rs: actor_id is undefined

⚠️ Possible Logic Issues:

  1. Event Acknowledgment: Without event tracking in the runner workflow, how are events acknowledged back to the client? This could cause the client to resend events unnecessarily.

  2. Command Flow: The Command signal was removed entirely. Need to verify that actor start commands are still being sent through an alternative path.

  3. Race Condition Handling: The old code had logic to prevent scheduling actors to draining runners (removed code around line 110-140 of the diff). Verify this is now handled elsewhere.

Performance Considerations

Improvements:

  • Reduced workflow state size (no more storing commands/events in memory)
  • Eliminated intermediate event storage and forwarding overhead
  • Simpler signal processing loop should have better throughput

⚠️ Potential Concerns:

  • Events are now forwarded individually rather than batched - could increase signal overhead if many events arrive simultaneously
  • Consider implementing batching in the new architecture if event volume is high

Security Concerns

✅ No new security issues introduced

  • Existing validation logic for actor ownership is preserved in KV request handling
  • Error messages don't leak sensitive information

Test Coverage

⚠️ Recommendations:

  1. Add integration tests for the new event forwarding path to ensure events reach actor workflows correctly
  2. Test the timeout/expiry behavior to ensure actors are properly marked as lost when runners disconnect
  3. Verify the draining behavior works correctly without the intermediate command handling
  4. Test the init flow to ensure runners can still connect and initialize properly

Additional Questions

  1. Was the InsertDb activity's functionality moved elsewhere, or is runner initialization now handled differently?
  2. How are commands being sent to runners now that the Command signal and command batching logic are removed?
  3. What's the plan for event acknowledgment mentioned in the TODO comment?

Summary

This is a valuable simplification that removes significant complexity from the runner workflow. However, there's a critical compilation error that needs to be fixed before this can be merged. Additionally, the missing event acknowledgment logic needs to be clarified or implemented.

Recommendation: Request changes to fix the compilation error and clarify the event acknowledgment and command flow architecture.


Review generated following CLAUDE.md conventions for error handling, logging patterns, and code style.

@claude
Copy link

claude bot commented Nov 19, 2025

Code Review

Overview

This PR significantly simplifies the runner workflow by removing event and command tracking logic. The changes reduce the workflow state and delegate more responsibility to actor workflows.

Key Changes Analysis

1. Removed Event/Command State Management

The PR removes:

  • Event tracking and acknowledgment logic (last_event_idx, last_event_ack_idx)
  • Command buffering and indexing (last_command_idx, commands vector)
  • Associated activities: ProcessInit, InsertEvents, InsertCommands, InsertDb

Positive: This is a good simplification that reduces state size and complexity.

2. Event Forwarding Moved to WebSocket Handler

In ws_to_tunnel_task.rs:347-362:

protocol::ToServer::ToServerEvents(events) => {
    let res = ctx.signal(pegboard::workflows::runner2::Forward {
        inner: protocol::ToServer::try_from(msg)
            .context("failed to convert message for workflow forwarding")?,
    })
    .tag("actor_id", actor_id)
    .graceful_not_found()
    // ...
}

Issue Found: This code references actor_id on lines 352 and 357, but actor_id is not defined in scope. This will cause a compilation error.

Looking at the protocol, ToServerEvents likely contains a list of events, and you need to extract the actor_id from each event. The original code had crate::utils::event_actor_id(&event.inner).

Recommendation: You need to iterate over the events and extract actor_id properly:

protocol::ToServer::ToServerEvents(events) => {
    for event in &events.events {
        let actor_id = crate::utils::event_actor_id(&event.inner);
        let res = ctx.signal(pegboard::workflows::actor::Event {
            inner: event.inner.clone(),
        })
        .to_workflow::<pegboard::workflows::actor::Workflow>()
        .tag("actor_id", &actor_id.to_string())
        .graceful_not_found()
        .send()
        .await
        .with_context(|| {
            format!("failed to forward signal to actor workflow: {}", actor_id)
        })?;
        if res.is_none() {
            tracing::warn!(?actor_id, "failed to send signal to actor workflow, likely already stopped");
        }
    }
}

3. Removed Init Message Handling 🔍

The runner workflow no longer processes ToServerInit packets, which previously:

  • Populated actor names
  • Stored runner metadata
  • Inserted runner into database indexes
  • Sent ToClientInit response with configuration

Concerns:

  1. Where does database insertion happen now? The InsertDb activity that wrote critical runner state (namespace_id, name, key, version, slots, indexes) has been removed. The Init activity only writes to RunnerByKeyKey.
  2. Runner allocation index: The removed code populated RunnerAllocIdxKey which is used by AllocatePendingActors. Without this, new runners may not be discoverable for actor allocation.
  3. Missing client response: Runners expect a ToClientInit response. Where is this sent now?

Recommendation: Verify that runner initialization is complete elsewhere, or this may break runner connectivity and actor allocation.

4. Simplified Runner Workflow State

LifecycleState now only tracks:

struct LifecycleState {
    draining: bool,
}

Positive: Much cleaner state management.

5. Variable Naming Improvement

In conn.rs:67-77: Renamed packet to init_packet for clarity. Good improvement.

6. Removed Signal Type

Removed Command signal variant since commands are no longer buffered by the runner workflow.

Potential Issues

  1. 🔴 CRITICAL: Compilation Error

    • Line 352 & 357 in ws_to_tunnel_task.rs: actor_id is undefined
    • This will fail to compile
  2. 🟡 WARNING: Missing Runner Initialization

    • Database insertion logic removed but unclear where it is replaced
    • Runner allocation index population missing
    • ToClientInit response not sent
  3. 🟡 WARNING: Event Acknowledgment

    • Line 79 has // TODO: Ack events comment
    • Without acking, runners may resend events indefinitely
  4. 🟢 INFO: Event Batch Size Constant

    • EVENT_ACK_BATCH_SIZE constant on line 16 is now unused (can be removed)

Testing Recommendations

  1. Compile test: Verify the code compiles (currently blocked by undefined actor_id)
  2. Integration test: Test full runner connection flow:
    • Runner connects and sends ToServerInit
    • Verify runner appears in allocation indexes
    • Verify actors can be allocated to the runner
  3. Event forwarding test: Verify actor events are correctly forwarded to actor workflows
  4. Graceful degradation: Test what happens when actor workflow does not exist

Security Considerations

No security issues identified. The changes maintain proper actor ownership validation in KV operations.

Performance Considerations

Positive: Removing state buffering reduces memory usage and workflow replay time. The simplified state should improve workflow performance.

Code Quality

  • Follow CLAUDE.md conventions ✅
  • Proper error handling with anyhow ✅
  • Structured logging with tracing ✅
  • Lowercase log messages ✅

Summary

This is a good architectural simplification, but has critical compilation errors that must be fixed before merging. The undefined actor_id variable needs to be resolved, and the initialization flow needs verification to ensure runners are properly registered in the database.

Recommendation: Fix the compilation error, verify the init flow is complete, and add tests before merging.

@claude
Copy link

claude bot commented Nov 19, 2025

Critical Bug Found: The variable actor_id is referenced on line 352 of ws_to_tunnel_task.rs but is never defined. This will cause a compilation error in the ToServerEvents match arm. The events need to be iterated individually and each event actor_id extracted to forward to the correct actor workflow.

@claude
Copy link

claude bot commented Nov 19, 2025

PR Review Summary - Part 1:

This PR significantly simplifies the runner workflow by removing ~500 lines of code. Major changes include:

MUST FIX:

  1. Compilation error with undefined actor_id variable in ws_to_tunnel_task.rs:352
  2. Remove unused EVENT_ACK_BATCH_SIZE constant (line 15-16 of runner2.rs)
  3. Remove unused CommandRow struct (line 43-48 of runner2.rs)

CODE QUALITY ISSUES:

  • Missing event acknowledgment implementation (TODO at line 79)
  • Events not being acknowledged to runners could cause message buildup
  • Loss of backpressure mechanisms

ARCHITECTURAL CONCERNS:

  • Removed tracking of event/command indices (last_event_idx, last_event_ack_idx, last_command_idx)
  • Removed storage of commands and events in workflow state
  • Removed InsertDb, ProcessInit, InsertEvents, InsertCommands activities (300+ lines)
  • Removed Command signal from Main join_signal

Questions to address:

  • Where is event/command state now tracked?
  • How do runners recover from disconnections?
  • How is ordering guaranteed?
  • Is this a breaking change?

@claude
Copy link

claude bot commented Nov 19, 2025

PR Review Summary - Part 2:

PERFORMANCE CONSIDERATIONS:

Positive:

  • Significantly reduced workflow state size
  • Fewer database transactions per message
  • Better separation of concerns (events route directly to actors)

Concerns:

  • Event forwarding will need to iterate events individually
  • Missing batch acknowledgment logic

TEST COVERAGE:
No test changes included. For a refactoring this large, should add:

  1. Integration tests for new event routing path
  2. Runner reconnection scenario tests
  3. Command delivery verification
  4. Concurrent event handling tests

POSITIVE CHANGES:

  • Good naming improvement: packet renamed to init_packet (conn.rs:67)
  • Cleaner separation between runner and actor workflows
  • Simplified workflow logic

Overall: This is a valuable simplification but needs the compilation fix and testing before merge. The architectural changes need documentation to help understand the new event flow and migration path for existing runners.

@MasterPtato MasterPtato force-pushed the 11-18-fix_pb_simplify_runner_wf branch from fefb6f0 to c088d42 Compare November 20, 2025 03:07
@MasterPtato MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 4294775 to 5458e50 Compare November 20, 2025 03:07
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 20, 2025 03:07 Inactive
@claude
Copy link

claude bot commented Nov 20, 2025

Code Review

I've reviewed this PR which simplifies the runner workflow. Here's my detailed feedback:

🎯 Overall Assessment

This is a substantial refactoring that introduces a new protocol version (v4/mk2) and simplifies the runner workflow architecture. The changes are generally well-structured, but there are several areas that need attention.


Strengths

  1. Good Protocol Versioning Strategy: The separation between mk1 and mk2 protocols using is_mk2() helper is clean and allows for graceful migration.

  2. Event Batching: The new ActorEventDemuxer properly batches events (up to 1024) which should improve performance - good optimization.

  3. Proper Shutdown Handling: The shutdown() method on ActorEventDemuxer properly awaits all spawned tasks, preventing potential data loss.


🔴 Critical Issues

1. Compilation Errors in tunnel_to_ws_task.rs

The refactored code has critical issues that will prevent compilation:

Lines ~45-65: The recv_msg() and handle_message_mk2() functions reference undefined variables:

  • tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx are not in scope
  • ctx, conn, init_tx, event_demuxer are not in scope in the refactored functions
  • Function signatures don't match their usage

The functions appear to have been extracted but weren't properly parameterized. Example:

async fn recv_msg() -> Result<std::result::Result<Option<()>, LifecycleResult>> {
    let tunnel_msg = tokio::select! {
        res = tunnel_sub.next() => { // ❌ tunnel_sub not in scope

Fix: Add proper parameters to these functions or inline them back into the main task loop.

2. Missing Signal Type in conn.rs:149

ctx.signal(Init);  // ❌ What is Init?

This signal type doesn't appear to be defined anywhere in the visible code. This will fail to compile.

3. Undefined Helper Function in lib.rs

Line 6 references protocol::is_new() but the protocol module only exports is_mk2(). This function doesn't exist:

let workflow_id = if protocol::is_new(protocol_version) {  // ❌ Undefined function

Should probably be protocol::is_mk2().

4. Incorrect Return Type in tunnel_to_ws_task.rs

async fn recv_msg() -> Result<std::result::Result<Option<()>, LifecycleResult>> {
    // ...
    Ok(Some(data))  // ❌ Returns data but signature says Option<()>
}

The function signature says it returns Option<()> but it's actually returning data (a Vec<u8>).


⚠️ High Priority Issues

5. Potential Memory Leak in ActorEventDemuxer

File: actor_event_demuxer.rs:36-67

The last_seen timestamp is set when channels are created but never updated when events are ingested. This means:

  • Active channels will be garbage collected after 30 seconds even if they're receiving events
  • The TODO comment on line 77 about aborting being safe suggests uncertainty

Fix:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get_mut(&actor_id) {
        channel.last_seen = Instant::now();  // ← Add this
        let _ = channel.tx.send(event);

6. Silent Error Handling

File: actor_event_demuxer.rs:38

let _ = channel.tx.send(event);  // ❌ Silently ignores send errors

If the channel is full or closed, events are silently dropped. This could lead to hard-to-debug issues.

Recommendation: At minimum, log when sends fail:

if let Err(err) = channel.tx.send(event) {
    tracing::warn!(?actor_id, ?err, "failed to send event to actor channel");
}

7. Missing Protocol Version in runner2.rs InsertDbInput

File: runner2.rs:289-299

The InsertDbInput struct includes protocol_version field (line 295), but when this struct is instantiated in the workflow (lines 74-82), the protocol_version field is not provided. This will cause a compilation error.

Fix: Add protocol_version: input.protocol_version to the struct initialization.

8. Incorrect Variable Reference in ping_task.rs

File: ping_task.rs:38

let wf = if protocol::is_mk2(conn.protocol_version) {  // ✓ Correct
    // ...
} else {
    // ...
};

// Later on line 75:
if protocol::is_mk2(conn.protocol_version) {  // ❌ Should be consistent

The check uses conn.protocol_version but the function definition suggests it should use the workflow's stored protocol version. Verify this is intentional.


📋 Code Quality Issues

9. Inconsistent Error Handling Pattern

The error handling varies between graceful degradation and hard failures without clear rationale:

  • actor_event_demuxer.rs:119: Logs warning but continues
  • tunnel_to_ws_task.rs: Returns error immediately on parse failure (after fix)

Recommendation: Document the error handling strategy - when to fail fast vs. gracefully degrade.

10. Missing Documentation

The new ActorEventDemuxer is a core component but lacks module-level documentation explaining:

  • Why batching is beneficial
  • The GC strategy and its implications
  • Thread safety considerations

🔍 Potential Bugs

11. Race Condition in Initialization

File: ws_to_tunnel_task.rs:46-55 and tunnel_to_ws_task.rs:23-32

For mk2 protocol, tunnel_to_ws task waits for init_rx signal before processing, while ws_to_tunnel sends it. However:

  • What happens if the init signal is sent before the tunnel_to_ws task starts waiting?
  • watch::channel(()) doesn't buffer, so early sends might be lost

Recommendation: Verify the initialization sequence is deterministic or use a different synchronization primitive.

12. Incomplete Migration Path

The code supports both mk1 (v3) and mk2 (v4) protocols but:

  • No migration strategy documented
  • No metrics/logging to track which protocol versions are in use
  • No deprecation timeline for mk1

Recommendation: Add telemetry to track protocol version usage.


🧪 Testing Concerns

  • Missing Test Coverage: This is a major refactor with no apparent test additions
  • Protocol Compatibility: No evidence of tests validating mk1/mk2 compatibility
  • Error Paths: Edge cases like GC during active event processing aren't tested

Recommendation: Add integration tests covering:

  1. Protocol version negotiation
  2. ActorEventDemuxer GC behavior
  3. Graceful shutdown sequences

📝 Minor Issues

  1. Typo in Function Name: is_mk2 could be more descriptive as is_protocol_v4_or_later
  2. Magic Numbers: EVENT_ACK_BATCH_SIZE = 500 and batch size 1024 in ActorEventDemuxer should be constants with explanatory comments
  3. Inconsistent Naming: mk2 vs v4 - pick one naming convention

🎬 Action Items

Before merging:

  1. ✅ Fix all compilation errors (issues [SVC-2555] Set up issue templates #1-4)
  2. ✅ Fix the memory leak in ActorEventDemuxer ([SVC-2358] Enable Redis memory overcommit #5)
  3. ✅ Add protocol_version to InsertDbInput instantiation ([SVC-2404] Improve internal documentation #7)
  4. ⚠️ Improve error handling for silent failures (Remove servers not in salt #6)
  5. ⚠️ Add tests for new functionality
  6. ⚠️ Verify initialization race condition ([SOC-15] Fix error when running user-delete #11)

Future improvements:

  • Add documentation for ActorEventDemuxer
  • Add protocol version metrics
  • Create migration guide for mk1 → mk2

📊 Overall Recommendation

Do not merge yet - there are critical compilation errors that need to be fixed first. Once the compilation issues are resolved and the memory leak is addressed, this refactor looks promising. The architecture simplification is a good direction, but the implementation needs polish.

Please address the critical issues and consider adding tests before merging.

@claude
Copy link

claude bot commented Nov 20, 2025

PR Review: fix(pb): simplify runner wf

I've reviewed this PR which introduces a new protocol version (v4/mk2) for the pegboard runner workflow and adds an event demuxer for actor events.

Critical Issues

1. Compilation Errors in tunnel_to_ws_task.rs

  • Line 21: conn.protocol should be conn.protocol_version
  • Lines 36-44, 62-76: recv_msg() and handle_message_mk2() are called without required parameters
  • Line 51: Function signature declares no parameters but references tunnel_sub, eviction_sub, and tunnel_to_ws_abort_rx
  • Line 79: Function signature declares no parameters but uses tunnel_msg, ctx, and conn

These helper functions need to either accept the necessary parameters in their signatures OR be moved inside the main function as closures.

2. Compilation Errors in ws_to_tunnel_task.rs
Similar scoping issues with recv_msg() and handle_message_mk2() not receiving proper parameters.

3. Missing Signal Handling in conn.rs
Line 177 uses ctx.signal(Init) but Init is not defined, imported, or given a workflow target.

High Priority Issues

4. Potential Memory Leak in ActorEventDemuxer
The last_seen field is never updated after initial creation (line 64), so GC will incorrectly remove channels after 30 seconds. Use get_mut to update timestamp on each ingest.

5. Error Handling Silently Drops Send Failures
Line 38 in actor_event_demuxer.rs silently drops events if send fails. Should log warnings.

6. Protocol Version Checking Inconsistencies
conn.rs:150 uses protocol::is_new() which is not defined. Should be is_mk2?

Medium Priority Issues

7. Missing Protocol Version Field
InsertDbInput struct defines protocol_version field but never sets it when created (lines 74-82 of runner2.rs).

8. Incorrect Message Type Usage
Line 185 of runner2.rs uses protocol::ToRunner::ToRunnerClose but for mk2 should be protocol::mk2::ToRunner::ToRunnerClose.

Code Quality

Good:

  • Event batching up to 1024 events is excellent
  • Graceful shutdown waits for in-flight events
  • Proper structured logging

Suggestions:

  • Add metrics for demuxer channel count
  • Add tests for protocol version branching

Summary

Cannot merge as-is due to compilation errors. Priority fixes:

  1. Fix function signatures in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs
  2. Fix or define Init signal in conn.rs
  3. Update last_seen timestamp in demuxer
  4. Add missing protocol_version field to InsertDbInput
  5. Define or remove protocol::is_new()

Once addressed, this is a solid simplification!

@MasterPtato MasterPtato force-pushed the 11-18-fix_pb_simplify_runner_wf branch from c088d42 to 657b813 Compare November 21, 2025 19:57
@MasterPtato MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 5458e50 to 8431b53 Compare November 21, 2025 19:57
@claude
Copy link

claude bot commented Nov 22, 2025

PR Review: fix(pb): simplify runner wf

This PR introduces significant architectural changes to the Pegboard runner workflow, including a new protocol version ("mk2") and refactoring of the connection handling and event dispatching. Here's my detailed review:


🔴 Critical Issues

1. Compilation Errors - Missing Variables in Scope

Several functions reference variables that are not in scope, indicating the code is incomplete or won't compile:

tunnel_to_ws_task.rs:

  • recv_msg() (line 52-78) references tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx but these are not passed as parameters
  • handle_message_mk2() (line 80) and handle_message_mk1() (line 166) reference tunnel_msg, ctx, conn without parameters
  • Line 201 in handle_message_mk1() uses continue outside a loop context

ws_to_tunnel_task.rs:

  • recv_msg() (line 79-117) references ws_rx, eviction_sub2, ws_to_tunnel_abort_rx without them being parameters
  • handle_message_mk2() (line 120-125) has incorrect parameter msg: () type but uses data variable inside
  • handle_message_mk1() (line 467) references data which is not a parameter
  • Line 395-403: handle_message_mk2 references undefined input, last_command_idx, prepopulate_actor_names, metadata, runner_lost_threshold
  • Lines 415-434: Uses res_msg and res_msg_serialized instead of msg and msg_serialized

2. Incomplete Implementation

  • ack_commands() function (line 761-768 in ws_to_tunnel_task.rs) has empty body with commented-out code
  • Multiple commented-out // TODO: Send message to tunnel blocks in actor/mod.rs and actor/runtime.rs indicate unfinished migration

3. Missing Signal Struct

  • conn.rs line 177-178: ctx.signal(Init); - The Init signal is referenced but the struct import/definition is not shown. Should likely be ctx.signal(pegboard::workflows::runner2::Init {}).

🟡 Potential Bugs

1. ActorEventDemuxer - Event Loss on First Ingest

In actor_event_demuxer.rs line 36-67, when a new actor channel is created, the initial event is never sent to the new channel:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get(&actor_id) {
        let _ = channel.tx.send(event);  // Existing channel - event sent
    } else {
        // New channel created... but `event` is never sent!
        let (tx, mut rx) = mpsc::unbounded_channel();
        // ...
        self.channels.insert(actor_id, Channel { tx, handle, last_seen: Instant::now() });
    }
    // ... gc code
}

The first event for each actor will be dropped.

2. GC Never Updates last_seen

The last_seen field in Channel struct is set at creation time but never updated when events are received. This means active channels may be incorrectly garbage collected after 30 seconds.

3. Missing ToRunnerClose Rename

In pegboard-serverless/src/lib.rs line 523, the code was changed to use ToRunnerClose but this appears to be a protocol rename. Need to verify the protocol definition matches.


🟠 Code Quality Issues

1. Massive Code Duplication

handle_message_mk1 and handle_message_mk2 in ws_to_tunnel_task.rs share ~90% identical code (KV operations, error handling). Consider extracting common logic into shared functions.

2. Inconsistent Error Message Context

In handle_message_mk2 lines 415-434, error contexts say "failed to serialize KV delete response" for non-KV-delete operations:

let msg_serialized = res_msg
    .serialize(conn.protocol_version)
    .context("failed to serialize KV delete response")?;  // Wrong context!

3. Protocol Version Checking Functions

Multiple functions like is_mk2(), is_new() are used inconsistently throughout the codebase. Consider documenting the protocol version scheme more clearly.


🔵 Architecture Observations

  1. Protocol Versioning Strategy: The PR introduces a parallel protocol path (mk1 vs mk2). This is a reasonable approach for backwards compatibility, but ensure there's a clear deprecation timeline.

  2. Event Demuxer Pattern: The ActorEventDemuxer with batched event dispatch is a good pattern for reducing signal overhead.

  3. Init Synchronization: Using watch::channel for init synchronization between tasks is appropriate.


📋 Recommendations

  1. Fix all compilation errors before merging - the functions need proper parameter passing
  2. Complete the ack_commands() implementation or remove it if not needed
  3. Fix the ActorEventDemuxer to send the first event and update last_seen on each ingest
  4. Add tests for the new protocol version handling and event demuxer
  5. Extract shared KV handling code to reduce duplication
  6. Address all TODO comments for sending messages to tunnel

Summary

This PR has significant architectural value in simplifying the runner workflow, but is not ready to merge due to compilation errors and incomplete implementations. The code structure is good, but needs completion of the migration work.

🤖 Generated with Claude Code

@claude
Copy link

claude bot commented Nov 22, 2025

PR Review: fix(pb): simplify runner wf

Summary

This PR introduces significant changes to the Pegboard runner system, introducing a new protocol version (v4/mk2) alongside maintaining backwards compatibility with v1-v3 (mk1). The main goal appears to be simplifying the runner workflow by moving more logic into the pegboard-runner service rather than the workflow itself.

Critical Issues

1. Compilation Errors - Missing Variables in Closures

Several functions in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs reference variables like ctx, conn, tunnel_msg, data that are not passed as parameters or captured:

tunnel_to_ws_task.rs:52-77 - recv_msg() references tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx without them being passed:

async fn recv_msg() -> Result<std::result::Result<(), LifecycleResult>> {
    let tunnel_msg = tokio::select! {
        res = tunnel_sub.next() => { // tunnel_sub not in scope

tunnel_to_ws_task.rs:80-163 - handle_message_mk2() references ctx, conn, tunnel_msg without parameters:

async fn handle_message_mk2() -> Result<()> {
    // Parse message
    let msg = match versioned::ToRunner2::deserialize_with_embedded_version(&tunnel_msg.payload) {
        // tunnel_msg not in scope, ctx not in scope

ws_to_tunnel_task.rs:79-117 - Same issue with recv_msg() referencing ws_rx, eviction_sub2, ws_to_tunnel_abort_rx

ws_to_tunnel_task.rs:119-464 - handle_message_mk2() has function signature msg: () but references data, conn, ctx, init_tx, etc.

ws_to_tunnel_task.rs:467-759 - handle_message_mk1() similarly references undefined variables

2. Incomplete Function Body

ws_to_tunnel_task.rs:761-768 - ack_commands() has an empty/incomplete body:

async fn ack_commands(ctx: &StandaloneCtx) -> Result<()> {
    // ctx.udb()?.run(|| {
    //     let last_ack = ;
    //     let stream = .read_ranges_keyvalues({
    //         limit:
    //     });
    // }).await?;
}

This function is called but does nothing and doesn't return properly.

3. Missing Signal Definition

conn.rs:177-178 - Init signal is referenced but appears undefined:

if protocol::is_new(protocol_version) {
    ctx.signal(Init); // Init is not defined anywhere visible

4. Use of continue in Non-Loop Context

tunnel_to_ws_task.rs:201 - continue statement inside a function, not a loop:

protocol::ToRunner::ToRunnerPing(ping) => {
    // ...
    continue; // This is inside handle_message_mk1, not a loop
}

Code Quality Issues

5. Missing last_seen Update in ActorEventDemuxer

actor_event_demuxer.rs:36-67 - When an event is ingested for an existing channel, last_seen is never updated, which could cause premature GC:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get(&actor_id) {
        let _ = channel.tx.send(event);
        // Missing: channel.last_seen = Instant::now();
    }

6. Event Not Sent on First Ingest

actor_event_demuxer.rs:36-67 - When a new channel is created, the initial event is never sent to it. The code creates the channel but doesn't send the first event:

} else {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // ... spawn task ...
    self.channels.insert(actor_id, Channel { tx, handle, last_seen: Instant::now() });
    // Missing: tx.send(event) for the initial event
}

7. Return Type Mismatch

tunnel_to_ws_task.rs:52 - recv_msg() returns Result<std::result::Result<(), LifecycleResult>> but line 77 returns Ok(Ok(tunnel_msg)) where tunnel_msg is not ().

ws_to_tunnel_task.rs:79 - Similar issue - returns Result<std::result::Result<Option<()>, LifecycleResult>> but line 106 returns Ok(Some(data)) where data is Bytes.

8. Inconsistent Protocol Function Naming

lib.rs:14-16 uses is_mk2() but conn.rs:130 uses is_new():

// lib.rs
pub fn is_mk2(protocol_version: u16) -> bool { ... }

// conn.rs 
if protocol::is_new(protocol_version) { // is_new doesn't exist

Missing Return Statements

ws_to_tunnel_task.rs:758-759 - handle_message_mk1 doesn't return anything despite being declared as returning Result<()>

Protocol Design Concerns

9. Missing ToClientAckEvents and ToClientKvResponse in mk2

The v4 schema includes these in ToClient, but handle_message_mk2 in tunnel_to_ws_task.rs only handles ToClientCommands and ToClientTunnelMessage. The mk2 protocol may need handlers for ack events and KV responses.

Recommendations

  1. Fix compilation errors - The extracted helper functions need their required context passed as parameters
  2. Complete ack_commands() implementation or remove the call
  3. Add the Init signal definition or import
  4. Fix last_seen updates in the demuxer to prevent premature garbage collection
  5. Send initial event when creating a new channel in the demuxer
  6. Fix return types to match actual return values
  7. Consider adding tests - This is a significant refactor with new protocol version; tests would help verify behavior

Overall Assessment

This PR introduces important architectural changes but appears to be in an incomplete state. The code has significant compilation issues that would prevent it from building. I recommend addressing the critical issues before merging.


🤖 Generated with Claude Code

@claude
Copy link

claude bot commented Nov 25, 2025

PR Review: fix(pb): simplify runner wf

Summary

This PR refactors the Pegboard runner workflow to simplify its architecture, adding protocol v4 (mk2) support while maintaining backward compatibility with v3 (mk1). The changes introduce a new ActorEventDemuxer for batched event processing and restructure message handling flows.

Critical Issues

1. Compilation Errors in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs

Location: engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:79-163, engine/packages/pegboard-runner/src/ws_to_tunnel_task.rs:78-467

Issue: Multiple function signatures and implementations have critical errors:

  • recv_msg() functions are missing required parameters (closures over tunnel_sub, eviction_sub, ws_rx, etc. but don't capture them)
  • handle_message_mk2() and handle_message_mk1() have incorrect signatures:
    • Missing required parameters like ctx, conn, tunnel_msg/data
    • Wrong parameter types (e.g., msg: () instead of msg: Vec<u8>)
    • These functions reference variables that aren't in scope

Example from tunnel_to_ws_task.rs:52:

async fn recv_msg() -> Result<std::result::Result<(), LifecycleResult>> {
    let tunnel_msg = tokio::select! {
        res = tunnel_sub.next() => { // ERROR: tunnel_sub not in scope

Example from tunnel_to_ws_task.rs:79:

async fn handle_message_mk2() -> Result<()> {  // ERROR: Missing parameters
    let msg = match versioned::ToRunner2::deserialize_with_embedded_version(&tunnel_msg.payload) {
        // ERROR: tunnel_msg not in scope

Similar issues in ws_to_tunnel_task.rs:120:

async fn handle_message_mk2(
    ctx: &StandaloneCtx,
    conn: &Arc<Conn>,
    init_tx: &watch::Sender<()>,
    event_demuxer: &mut ActorEventDemuxer,
    msg: (),  // ERROR: Should be Vec<u8> or similar
) -> Result<()> {
    let msg = match versioned::ToServerMk2::deserialize(&data, conn.protocol_version) {
        // ERROR: data not in scope

Impact: The code will not compile. This appears to be an incomplete refactoring.

Recommendation: These helper functions need proper parameter passing or should be refactored back into the main loop with proper closures.

2. Missing Signal Handler in conn.rs

Location: engine/packages/pegboard-runner/src/conn.rs:177-179

Issue:

if protocol::is_new(protocol_version) {
    ctx.signal(Init);  // ERROR: Init is not defined
} else {

The Init signal type is referenced but never defined or imported. This will cause a compilation error.

Recommendation: Define the Init signal type or use the correct signal from the workflow module.

3. Incorrect Protocol Version Check Function Names

Location: Multiple files use protocol::is_new() and protocol::is_mk2()

Issue: The protocol library at engine/sdks/rust/runner-protocol/src/lib.rs:14-16 only exports is_mk2():

pub fn is_mk2(protocol_version: u16) -> bool {
    protocol_version > PROTOCOL_MK1_VERSION
}

But conn.rs:130 and conn.rs:177 reference protocol::is_new() which doesn't exist.

Recommendation: Use consistent naming - either all is_mk2() or add the is_new() helper.

Major Issues

4. ActorEventDemuxer - Potential Memory Leak

Location: engine/packages/pegboard-runner/src/actor_event_demuxer.rs:36-67

Issue: When a new event arrives for an actor, a new channel is created (line 40), but the last_seen timestamp is never updated for existing channels. This means:

  • Channels are created but last_seen stays at their creation time
  • GC will never clean up active channels since their last_seen is never updated
  • Memory leak as channels accumulate

Code:

pub fn ingest(&mut self, actor_id: Id, event: protocol::Event) {
    if let Some(channel) = self.channels.get(&actor_id) {
        let _ = channel.tx.send(event);  // last_seen not updated here!
    } else {
        // Create new channel...
    }

Recommendation: Update last_seen on every ingest:

if let Some(channel) = self.channels.get_mut(&actor_id) {
    let _ = channel.tx.send(event);
    channel.last_seen = Instant::now();
}

5. ActorEventDemuxer - Unsafe Task Abort

Location: engine/packages/pegboard-runner/src/actor_event_demuxer.rs:76-79

Issue: There's a TODO comment acknowledging this is unsafe:

if !keep {
    // TODO: Verify aborting is safe here
    channel.handle.abort();
}

Aborting tasks can leave workflows in inconsistent states, especially if a signal is mid-flight when the abort happens.

Recommendation: Drop the sender and wait for graceful shutdown instead of aborting, similar to what's done in the shutdown() method.

6. Missing Init Signal Send in conn.rs

Location: engine/packages/pegboard-runner/src/conn.rs:177-179

Issue: For mk2 protocol, the code attempts to send an Init signal but:

  • The signal handler is never awaited (no .send().await?)
  • The signal construction is incomplete
  • Should be sending to a specific workflow

Current code:

if protocol::is_new(protocol_version) {
    ctx.signal(Init);  // No .to_workflow_id().send().await?
}

Recommendation: Complete the signal send chain:

if protocol::is_mk2(protocol_version) {
    ctx.signal(pegboard::workflows::runner2::Init {})
        .to_workflow_id(workflow_id)
        .send()
        .await?;
}

7. Race Condition in tunnel_to_ws_task.rs

Location: engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:21-33

Issue: The mk2 protocol path waits for an init signal before processing messages, but there's a potential race:

  • If the init signal arrives before init_rx.changed() is called, it will be missed
  • The watch channel doesn't queue historical changes

Recommendation: Consider using a different synchronization primitive (like tokio::sync::Notify or oneshot) that guarantees the signal is received even if sent before the receiver is waiting.

Minor Issues & Suggestions

8. Inconsistent Error Handling in ActorEventDemuxer

Location: engine/packages/pegboard-runner/src/actor_event_demuxer.rs:38

Issue: Failed sends are silently ignored:

let _ = channel.tx.send(event);

While this might be intentional (actor workflow stopped), it would be better to log a warning since events are being dropped.

Recommendation:

if channel.tx.send(event).is_err() {
    tracing::warn!(?actor_id, "failed to send event to actor channel, likely stopped");
}

9. Hardcoded Constants

Location: engine/packages/pegboard-runner/src/actor_event_demuxer.rs:10-11

Issue:

const GC_INTERVAL: Duration = Duration::from_secs(30);
const MAX_LAST_SEEN: Duration = Duration::from_secs(30);

These values should potentially be configurable via the config system used elsewhere in the codebase.

10. Missing Protocol Version Field in InsertDbInput

Location: engine/packages/pegboard/src/workflows/runner2.rs:289-299

Issue: The InsertDbInput struct has a protocol_version field (line 295) but it's never populated when the activity is called (lines 74-83). This means the value will be uninitialized.

Code:

ctx.activity(InsertDbInput {
    runner_id: input.runner_id,
    namespace_id: input.namespace_id,
    // ... other fields
    // Missing: protocol_version
})

Recommendation: Add protocol_version: input.protocol_version to the activity call.

11. Logging Style Issues

Several logging statements don't follow the codebase style guide (CLAUDE.md):

  • engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:71: Should use structured logging instead of string interpolation
  • Missing ? prefix for error types in several places

12. Duplicate Code in tunnel_to_ws_task.rs

Location: Lines 79-163 (mk2) and 166+ (mk1)

Issue: The gateway pong handling logic is duplicated between mk1 and mk2 handlers. This violates DRY principles and makes maintenance harder.

Recommendation: Extract common gateway pong logic into a separate helper function.

Positive Aspects

  1. Good separation of concerns: The ActorEventDemuxer is a clean abstraction for batching actor events
  2. Backward compatibility: The dual-path approach (mk1/mk2) maintains compatibility while adding new features
  3. Proper shutdown handling: The demuxer shutdown logic ensures in-flight events complete before closing
  4. Clear protocol versioning: The is_mk2() helper makes protocol version checks readable

Test Coverage

The PR doesn't include any tests for:

  • ActorEventDemuxer batching behavior
  • Protocol version negotiation
  • Race conditions in initialization
  • GC behavior of the demuxer

Recommendation: Add unit tests for ActorEventDemuxer and integration tests for protocol version handling.

Summary

This PR has the right architectural direction but requires significant fixes before it can be merged:

Blockers:

  1. Fix compilation errors in function signatures (tunnel_to_ws_task.rs, ws_to_tunnel_task.rs)
  2. Define/import the Init signal type
  3. Fix protocol::is_new() vs protocol::is_mk2() inconsistency
  4. Fix memory leak in ActorEventDemuxer by updating last_seen
  5. Complete the Init signal send chain

Should Fix:
6. Replace unsafe task abort with graceful shutdown
7. Fix race condition in init signal handling
8. Add missing protocol_version field to activity call

Nice to Have:
9. Improve error handling and logging
10. Make constants configurable
11. Add test coverage
12. Deduplicate gateway pong logic

Please address the critical issues before this can be reviewed for merge. Happy to re-review once these are fixed!

@claude
Copy link

claude bot commented Nov 25, 2025

Code Review

This PR simplifies the runner workflow by introducing protocol v4 (mk2) and a new event demuxer pattern. Overall the refactoring looks solid, but there are several critical bugs that need to be fixed before merging.

Critical Issues

1. Compilation Errors in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs

Multiple functions are missing required parameters. The recv_msg() and handle_message functions reference variables not in scope (tunnel_sub, eviction_sub, ctx, conn, event_demuxer, init_tx, etc).

2. Missing Signal Definition (conn.rs:178)

The code references Init signal but it is not defined. Should use pegboard::workflows::runner2::Init.

3. Missing last_seen Update in ActorEventDemuxer (actor_event_demuxer.rs:36-67)

The last_seen timestamp is never updated after initial insertion. All channels will be garbage collected after 30 seconds even if actively receiving events. Use get_mut() and update last_seen on each ingest.

High Priority Issues

4. Protocol Version Check Inconsistency

Uses both is_mk2() and is_new() but is_new() is not defined. Use is_mk2() consistently.

5. Missing Protocol Version Field

InsertDbInput has protocol_version field but it is never populated when called (lines 74-82).

6. Unreachable continue Statement

In handle_message_mk1() there is a continue inside a function (not a loop). Should be return Ok(()).

Medium Priority

7. Unsafe Task Abort - Using abort() can leave inconsistent state
8. Ignoring Send Errors - Events lost without logging
9. Missing Error Context - Signal sends need graceful_not_found()

Low Priority

10. Nested Result Types - Hard to read, consider custom enum
11. Magic Numbers - Add constants for batch size
12. Logging - Include actor_id in error logs

Recommendation

Fix critical issues 1-6 first, then address medium-priority issues before merging.

@MasterPtato MasterPtato force-pushed the 11-18-fix_pb_simplify_runner_wf branch from 6f07705 to 9da2f02 Compare November 26, 2025 02:16
@MasterPtato MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 8a61739 to a5ec483 Compare November 26, 2025 02:16
@claude
Copy link

claude bot commented Nov 26, 2025

Code Review: PR #3483 - fix(pb): simplify runner wf

I've reviewed this PR which simplifies the runner workflow and introduces protocol version 4. Here are my findings:

Critical Issues 🔴

1. Compilation Errors in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs

Both refactored task files have multiple critical compilation errors:

  • tunnel_to_ws_task.rs:22: References undefined conn.protocol instead of conn.protocol_version
  • tunnel_to_ws_task.rs:37-38,44-45: recv_msg() is called without required parameters (needs tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx)
  • tunnel_to_ws_task.rs:52-78: Function signature doesn't include the variables it references (tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx)
  • tunnel_to_ws_task.rs:80-146: handle_message_mk2() and handle_message_mk1() both reference undefined variables (tunnel_msg, ctx, conn) that aren't in their function signatures
  • ws_to_tunnel_task.rs:62-76,79-109: Similar issues - recv_msg() and handle_message_mk2() reference undefined variables

These errors suggest the refactoring was incomplete. The functions need proper parameter passing.

2. ActorEventDemuxer: Potential Memory Leak

In actor_event_demuxer.rs:37-38:

if let Some(channel) = self.channels.get(&actor_id) {
    let _ = channel.tx.send(event);

The last_seen timestamp is never updated when events are ingested for existing channels. This means:

  • Channels receiving events regularly will still be garbage collected after 30 seconds
  • This defeats the purpose of the GC mechanism
  • Could cause active actor event channels to be aborted unexpectedly

Fix: Update last_seen when sending events:

if let Some(channel) = self.channels.get_mut(&actor_id) {
    channel.last_seen = Instant::now();
    let _ = channel.tx.send(event);

3. Silently Dropped Events

In actor_event_demuxer.rs:38 and throughout the new code, errors from channel.tx.send(event) are silently ignored with let _ =. If the channel is full or closed, events are dropped without logging.

Recommendation: Log failed sends at minimum:

if channel.tx.send(event).is_err() {
    tracing::warn!(?actor_id, "failed to send event to actor channel");
}

4. Protocol Version Helper Function Inconsistency

In lib.rs:14-16:

pub fn is_mk2(protocol_version: u16) -> bool {
    protocol_version > PROTOCOL_MK1_VERSION
}

But in conn.rs:129 and ping_task.rs:39, there's also a reference to protocol::is_new() which isn't defined in the diff. This inconsistency could cause issues.

High Priority Issues 🟠

5. Missing Signal Type in conn.rs

conn.rs:147: ctx.signal(Init) - The Init signal type is not imported or defined anywhere visible in the diff.

6. Unsafe Task Abort with TODO

actor_event_demuxer.rs:76-78:

// TODO: Verify aborting is safe here
channel.handle.abort();

Aborting tasks can leave actors in inconsistent states. This needs verification before merge, or implement graceful shutdown instead.

7. Incorrect Logging Convention

actor_event_demuxer.rs:89: Uses channels=?self.channels.len() which logs with debug formatting. According to CLAUDE.md, structured logging should use channels = self.channels.len() (no ? prefix for simple values).

Medium Priority Issues 🟡

8. Missing Error Context in Protocol Detection

In conn.rs:129-149, the code branches on protocol::is_new(protocol_version) but doesn't handle the case where protocol version is invalid/unsupported. Consider adding validation.

9. Unbounded Channel Risk

actor_event_demuxer.rs:40: Uses mpsc::unbounded_channel(). If event production outpaces consumption, this could lead to unbounded memory growth per actor.

10. Inconsistent Import Patterns

ws_to_tunnel_task.rs:1: Violates CLAUDE.md guideline - imports should be at the top of the file, not inline. The code uses:

use anyhow::Context;

Following CLAUDE.md: "Always add imports at the top of the file inside of inline within the function."

Though the current import is at the top, ensure new code doesn't add inline imports.

11. Potential Integer Overflow

ping_task.rs:25: rtt = now.saturating_sub(pong.ts) then cast to u32. If RTT exceeds u32::MAX, it's set to u32::MAX. Consider logging this edge case.

Positive Aspects ✅

  1. Good separation of concerns - The ActorEventDemuxer provides clean event routing
  2. Batching optimization - Event batching (1024 at a time) should improve performance
  3. Proper cleanup - The demuxer shutdown logic properly awaits all tasks
  4. Protocol versioning - Good backward compatibility structure with separate Mk1/Mk2 types
  5. Graceful workflow not found handling - Uses .graceful_not_found() pattern correctly

Security Considerations 🔒

No major security issues identified. The actor validation in KV requests (checking actor_belongs) is good.

Testing Recommendations 🧪

  1. Add tests for ActorEventDemuxer GC behavior
  2. Test protocol version switching between mk1 and mk2
  3. Test event batching under high load
  4. Test graceful shutdown with in-flight events
  5. Add integration tests for the new runner2 workflow

Summary

This PR cannot be merged in its current state due to critical compilation errors. The refactoring needs to be completed to properly pass parameters to the extracted functions. The ActorEventDemuxer also needs the last_seen update fix to function correctly.

Blocking issues before merge:

  • Fix all compilation errors in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs
  • Fix ActorEventDemuxer last_seen not being updated
  • Define/import the Init signal type
  • Resolve is_new vs is_mk2 naming inconsistency

Would be happy to review again once these are addressed!

@claude
Copy link

claude bot commented Nov 26, 2025

Pull Request Review

I've reviewed this PR and found several critical issues that prevent compilation:

Critical Compilation Errors

tunnel_to_ws_task.rs:

  • Lines 52-78: recv_msg() missing parameters (tunnel_sub, eviction_sub, tunnel_to_ws_abort_rx) that are used in body
  • Lines 80-164: handle_message_mk2() missing ctx, conn, tunnel_msg, data parameters
  • Line 166-252: handle_message_mk1() has same parameter issues
  • Line 201: Invalid continue statement outside loop
  • Line 22: Wrong field conn.protocol should be conn.protocol_version

ws_to_tunnel_task.rs:

  • Lines 79-117: recv_msg() missing parameters
  • Lines 119-464: handle_message_mk2() missing data parameter, wrong msg type
  • Lines 396-439: Contains misplaced workflow code with undefined variables (input, last_command_idx, etc)
  • Lines 415, 428: Uses res_msg variable but should be msg
  • Lines 467-759: handle_message_mk1() missing parameters
  • Lines 378, 738, 771: Calls to non-existent handle_tunnel_message function
  • Line 761-768: Empty ack_commands() stub
  • Line 60: Wrong field conn.protocol_version

actor_event_demuxer.rs:

  • Lines 36-67: First event lost when creating new channel - only sends in Some branch
  • Lines 73-83: GC broken - last_seen never updated so all channels removed after 30s
  • Line 38: Silently drops events on send failure
  • Line 77-78: TODO suggests uncertainty about task abort safety

Other Issues

  • Significant code duplication in KV handling between mk1 and mk2
  • Multiple TODO comments about exposing errors to clients (security concern)
  • No tests mentioned despite protocol version complexity
  • Empty PR description

Recommendations

  1. Fix all compilation errors - code will not build
  2. Fix event demuxer bugs (lost events, broken GC)
  3. Remove misplaced workflow code
  4. Extract duplicate KV code
  5. Add error logging
  6. Add tests for both protocol versions
  7. Make last_seen updatable (use Mutex or refactor)

This needs significant work before merging. The errors suggest it wasn't tested after refactoring.

@MasterPtato MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 9893845 to 7dad582 Compare November 27, 2025 00:58
@MasterPtato MasterPtato force-pushed the 11-18-fix_pb_simplify_runner_wf branch from 5d9024f to 246cd19 Compare November 27, 2025 00:58
@claude
Copy link

claude bot commented Nov 27, 2025

Code Review - PR #3483: Simplify Runner Workflow

Summary

This PR refactors the pegboard runner workflow system, introducing a new protocol version (v4/mk2). The changes include a new ActorEventDemuxer for batching actor events and significant restructuring of WebSocket message handling.


Critical Issues (Must Fix Before Merge)

1. Compilation Errors in tunnel_to_ws_task.rs

Location: engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:22-252

  • Line 22: conn.protocol should be conn.protocol_version
  • Lines 52-77: recv_msg() doesn't have access to required variables
  • Lines 80-252: handle_message_mk2() and handle_message_mk1() missing required parameters and reference undefined variables
  • Line 201: continue statement outside a loop

2. Compilation Errors in ws_to_tunnel_task.rs

Location: engine/packages/pegboard-runner/src/ws_to_tunnel_task.rs:79-489

  • Lines 79-116: recv_msg() missing parameters
  • Line 125: msg: () should be data: Vec<u8>
  • Lines 137, 289, 337, 370, 408, 437: Wrong enum variant

3. Missing Protocol Version Field

Location: engine/packages/pegboard/src/workflows/runner2.rs:26,295

The Input struct declares protocol_version field but InsertDbInput doesn't include it, yet tries to use it at line 432.

Fix: Add protocol_version: u16 to InsertDbInput struct.

4. last_seen Field Never Updated

Location: engine/packages/pegboard-runner/src/actor_event_demuxer.rs:36-67

The last_seen timestamp is never updated when reusing existing channels. All channels will be GC'd after 30 seconds even if active.

Fix: Use get_mut and update channel.last_seen = Instant::now().


High Priority Issues

5. Event Batching Without Backpressure

Location: engine/packages/pegboard-runner/src/actor_event_demuxer.rs:40

Unbounded channels could lead to memory exhaustion. Use bounded channels.

6. Silent Channel Send Failure

Location: engine/packages/pegboard-runner/src/actor_event_demuxer.rs:38

Failed sends are silently ignored. Add logging for debugging.

7. Error Message Information Disclosure

Location: ws_to_tunnel_task.rs (multiple locations)

Raw error messages are returned to clients via err.to_string(), leaking internal details. Return generic errors to clients and log details server-side.


Positive Observations

  1. Good separation of concerns with ActorEventDemuxer
  2. Proper instrumentation with tracing
  3. Graceful shutdown handling
  4. Clean protocol versioning (mk1/mk2 separation)
  5. Efficient 1024-event batching
  6. Good structured logging

Test Coverage

Missing tests for:

  • ActorEventDemuxer garbage collection
  • Protocol version switching
  • Init signal synchronization
  • Event batching and dispatch

Action Items

Must Fix (Blocking):

  1. Fix compilation errors in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs
  2. Add protocol_version to InsertDbInput
  3. Update last_seen timestamp for existing channels

Should Fix:
4. Add backpressure to event channels
5. Handle channel send failures with logging
6. Fix error message information disclosure

Nice to Have:
7. Add tests for new functionality
8. Standardize protocol version checking helpers


Summary

This is a solid architectural improvement, but has critical compilation errors that must be fixed:

  • Missing function parameters
  • Incorrect field access
  • Missing struct field
  • last_seen never updated

Recommendation: Address compilation errors and critical issues before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants