
Conversation

@NathanFlurry

No description provided.

@vercel

vercel bot commented Nov 13, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivetkit-serverless | Error | Error | | Nov 20, 2025 8:52pm |

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | | Nov 20, 2025 8:52pm |
| rivet-inspector | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm |
| rivet-site | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm |


NathanFlurry commented Nov 13, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 13, 2025

PR Review: Hibernating WebSocket Protocol Implementation

Overview

This PR implements a new hibernating WebSocket protocol in RivetKit. The changes refactor WebSocket lifecycle management and message acknowledgment handling. Overall, the code quality is good, with clear separation of concerns, but there are several issues to address before merging.

Critical Issues

1. Incomplete WebSocket Adapter Constructor Call (tunnel.ts:139-193)

The #createWebSocket method has a critical bug. The constructor signature expects 9 parameters but only receives 6:

// tunnel.ts:148-186
const adapter = new WebSocketTunnelAdapter(
    requestIdStr,
    hibernatable,
    request,
    (data: ArrayBuffer | string, isBinary: boolean) => { /* ... */ },
    (code?: number, reason?: string, retry: boolean = false) => { /* ... */ },
);

But the constructor signature (websocket-tunnel-adapter.ts:49-73) requires:

constructor(
    tunnel: Tunnel,          // MISSING
    actorId: string,         // MISSING  
    requestId: string,       // PROVIDED
    hibernatable: boolean,   // PROVIDED
    request: Request,        // PROVIDED
    messageIndex: number,    // MISSING
    persistHibernatableWebSocketMessageIndex: (messageIndex: number) => void, // MISSING
    sendCallback: (data: ArrayBuffer | string, isBinary: boolean) => void,    // PROVIDED
    closeCallback: (code?: number, reason?: string, retry?: boolean) => void, // PROVIDED
)

This will cause a runtime error. The parameters are misaligned - tunnel, actorId, messageIndex, and the persist callback are missing.
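For reference, a corrected call shaped to the 9-parameter signature above might look like the following sketch. It assumes this code runs inside Tunnel's #createWebSocket, so this is the tunnel, and persistIndex stands in for whatever persistence hook the runner config actually exposes (that name is illustrative):

const adapter = new WebSocketTunnelAdapter(
    this,            // tunnel
    actorId,         // actorId
    requestIdStr,    // requestId
    hibernatable,    // hibernatable
    request,         // request
    messageIndex,    // 0 for new sockets, restored value for hibernated ones
    (idx: number) => persistIndex(actorId, requestId, idx), // persist callback (hypothetical)
    (data: ArrayBuffer | string, isBinary: boolean) => { /* send over tunnel */ },
    (code?: number, reason?: string, retry: boolean = false) => { /* close handling */ },
);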

2. Loop Variable Declaration Error (tunnel.ts:116)

for (const i = 0; i < requestIds.length; i++) {

Should be:

for (let i = 0; i < requestIds.length; i++) {

const cannot be used with reassignable loop counters.

3. Inconsistent Property Access (websocket-tunnel-adapter.ts:97)

The adapter references ws.canHibernate in tunnel.ts:97 (old code path) but the new implementation changed this to ws.hibernatable. Ensure all references are updated consistently.

4. Missing Request Parameter (tunnel.ts:120-126)

When calling #createWebSocket during restoration:

this.#createWebSocket(
    actorId,
    requestId,
    idToStr(requestId),
    true,
    messageIndex,
);  // Missing 'request' parameter

The method signature requires a Request parameter but it's not provided here. You'll need to reconstruct the request from stored metadata or adjust the restoration flow.

High Priority Issues

5. Breaking API Change Without Migration Path (mod.ts:60-66)

The websocket handler changed from async returning Promise<void> to synchronous returning void:

// Before
websocket?: (runner, actorId, ws, requestId, request) => Promise<void>;

// After  
websocket: (runner, actorId, ws, requestId, request) => void;

This is a breaking change that will affect all existing consumers. Consider:

  • Adding deprecation warnings
  • Providing a migration guide
  • Supporting both signatures temporarily with adapter logic
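For the last option, a thin shim that tolerates both return types is enough. This is a sketch, not the actual RivetKit API, and the parameter types are deliberately loose:

type WebSocketHandler = (
    runner: unknown,
    actorId: string,
    ws: unknown,
    requestId: unknown,
    request: Request,
) => void | Promise<void>;

// Invokes either the old async or the new sync handler shape. If the
// handler returns a promise, rejections are surfaced instead of dropped.
function invokeWebSocketHandler(
    handler: WebSocketHandler,
    ...args: Parameters<WebSocketHandler>
): void {
    const result = handler(...args);
    if (result instanceof Promise) {
        result.catch((err) => console.warn("websocket handler rejected", err));
    }
}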

6. Unused Symbols Imported (tunnel.ts:12-16)

import {
    SET_CAN_HIBERNATE_SYMBOL,
    SET_REQUEST_SYMBOL,
    WebSocketTunnelAdapter,
} from "./websocket-tunnel-adapter";

SET_CAN_HIBERNATE_SYMBOL and SET_REQUEST_SYMBOL are imported but never used in the file. Either use them or remove the imports.

7. Missing Validation (tunnel.ts:104-134)

The restoreHibernatingRequests method should validate:

  • That actorId exists and is valid
  • That requestIds array is not empty
  • That message indices are within valid range (0-65535 per tunnel.ts:771)
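A minimal guard covering these checks might look like the sketch below, assuming the same u16 bound applies here and that request IDs arrive as ArrayBuffers:

function validateRestoreArgs(
    actorId: string,
    requestIds: ArrayBuffer[],
    messageIndices: number[],
): void {
    if (!actorId) throw new Error("actorId is required");
    if (requestIds.length === 0) throw new Error("requestIds must not be empty");
    if (requestIds.length !== messageIndices.length)
        throw new Error("requestIds and messageIndices must have the same length");
    for (const idx of messageIndices) {
        if (!Number.isInteger(idx) || idx < 0 || idx > 65535)
            throw new Error(`message index out of u16 range: ${idx}`);
    }
}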

8. Incomplete TODO Comments (tunnel.ts:192, websocket-tunnel-adapter.ts:517)

Several critical TODOs are left in the code:

  • // TODO: assign request to ws so we can pass this through (tunnel.ts:192)
  • // TODO: empty comment (websocket-tunnel-adapter.ts:517 in dispatchEvent)

These should either be implemented or documented with follow-up issue numbers.

Medium Priority Issues

9. Error Handling in Message Sequence Logic (websocket-tunnel-adapter.ts:103-157)

The message sequencing logic is well-designed, but consider edge cases:

  • What happens if messageIndex wraps around after 65535?
  • Should there be a maximum gap threshold before closing?
  • The gap calculation on line 150 could overflow for extreme cases
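On the wraparound point: the usual fix is a wrapping comparison that treats an index as "at or before" the reference when the forward distance is less than half the u16 range. A sketch in the spirit of the wrappingLteU16 helper mentioned later in this thread (not necessarily its exact implementation):

const U16_RANGE = 0x10000; // 65536
const U16_HALF = 0x8000; // 32768

// True if `a` is at or before `b` in wrapping u16 order. With this
// ordering, 65535 -> 0 is a normal increment, not a huge backwards jump.
function wrappingLteU16(a: number, b: number): boolean {
    return ((b - a + U16_RANGE) % U16_RANGE) < U16_HALF;
}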

10. Race Condition Potential (tunnel.ts:662-674)

There's a check for duplicate WebSockets, but in a concurrent scenario:

const existingAdapter = this.#actorWebSockets.get(requestIdStr);
if (existingAdapter) {
    existingAdapter.__closeWithoutCallback(1000, "ws.duplicate_open");
}

If two open messages arrive simultaneously, both could pass the check before either is added to the map. Consider using a lock or atomic operation.

11. Memory Leak Risk (tunnel.ts:97)

The check if (!ws.canHibernate) means hibernatable WebSockets are NOT closed on shutdown:

if (!ws.canHibernate) {
    ws.__closeWithoutCallback(1000, "ws.tunnel_shutdown");
}

But they're immediately cleared from the map on line 101. This means the WebSocket objects remain in memory without being properly cleaned up. Should hibernatable WebSockets be persisted to storage here?

Minor Issues / Code Quality

12. Logging Inconsistencies

  • Some log messages use lowercase (tunnel.ts:129: "restored hibernatable requests")
  • Others use mixed case, which conflicts with the CLAUDE.md convention
  • Consider standardizing: log messages should be lowercase unless mentioning code symbols

13. Magic Numbers (websocket-tunnel-adapter.ts:154)

this.close(1008, closeReason);

Consider defining WebSocket close codes as named constants:

const WS_CLOSE_POLICY_VIOLATION = 1008;

14. Type Safety (tunnel.ts:145-146)

Parameters hibernatable and messageIndex are passed positionally. With 6+ parameters, consider using an options object:

#createWebSocket(options: {
    actorId: string;
    requestId: RequestId;
    requestIdStr: string;
    hibernatable: boolean;
    messageIndex: number;
    request: Request;
}): WebSocketTunnelAdapter

15. Commented Out Code (actor-driver.ts:81-87)

Large blocks of commented code should be removed:

// // WebSocket message acknowledgment debouncing for hibernatable websockets
// #hibernatableWebSocketAckQueue: Map<...> = new Map();
// #wsAckFlushInterval?: NodeJS.Timeout;

If this code might be needed, reference the commit SHA in a comment instead.

16. OpenAPI Changes Not Reviewed (rivetkit-openapi/openapi.json)

The PR removes 277 lines from the OpenAPI spec. Ensure:

  • These are intentional removals, not accidental
  • Any deprecated endpoints are documented
  • Client SDKs are updated accordingly

Performance Considerations

17. UUID String Conversion (tunnel.ts:807-809)

idToStr() is called frequently and performs string conversion. Consider:

  • Caching string representations where possible
  • Using the ArrayBuffer directly as map keys (with custom comparison)
  • Profiling to see if this is a bottleneck

18. Missing Index Validation Before Persistence (websocket-tunnel-adapter.ts:160-163)

Before persisting the message index, validate it's in the acceptable range:

if (messageIndex < 0 || messageIndex > 65535) {
    throw new Error(`Invalid message index: ${messageIndex}`);
}
this.messageIndex = messageIndex;

Testing Recommendations

  1. Add test for WebSocket restoration flow - Verify that restoreHibernatingRequests correctly recreates WebSocket state
  2. Test message sequence validation - Cover duplicate, skipped, and out-of-order messages
  3. Test concurrent open messages - Verify race condition handling
  4. Test error paths - Ensure errors in config.websocket handler properly close connections
  5. Load testing - Verify performance with many hibernating WebSockets
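For point 2, the test shape could look like the sketch below. It models the sequencing rule with a small local function so the example is self-contained (vitest-style assertions assumed; swap in the repo's actual test runner). A real test would exercise WebSocketTunnelAdapter once its constructor settles (see Issue #1):

import { describe, expect, it } from "vitest";

// Local model of the adapter's sequencing rule, for illustration only.
function makeSequencer(start: number) {
    let last = start;
    return (index: number): "ok" | "duplicate" | "gap" => {
        if (index === last) return "duplicate";
        if (index !== (last + 1) % 0x10000) return "gap";
        last = index;
        return "ok";
    };
}

describe("hibernatable message sequencing", () => {
    it("accepts in-order, rejects duplicate and skipped indices", () => {
        const next = makeSequencer(4);
        expect(next(5)).toBe("ok");
        expect(next(5)).toBe("duplicate");
        expect(next(7)).toBe("gap"); // index 6 was skipped
    });

    it("wraps cleanly at the u16 boundary", () => {
        const next = makeSequencer(65535);
        expect(next(0)).toBe("ok");
    });
});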

Security Considerations

19. Message Index Validation

The message index sequence validation is good, but ensure:

  • Indices cannot be manipulated to bypass sequence checks
  • The range validation at tunnel.ts:771-772 is consistently applied
  • Consider rate limiting if sequence violations occur repeatedly

Documentation

20. Missing JSDoc

Key methods lack documentation:

  • restoreHibernatingRequests - Should explain the restoration contract
  • #createWebSocket - Should document when each parameter is used
  • persistHibernatableWebSocketMessageIndex callback - Needs contract documentation

Positive Aspects

✅ Good separation of concerns between tunnel and adapter
✅ Clear lifecycle management with explicit state transitions
✅ Comprehensive logging for debugging
✅ Proper error handling in most paths
✅ Message sequencing logic is thorough and well-commented
✅ Cleanup logic on shutdown is mostly correct

Recommendations

Before merging:

  1. MUST FIX: Constructor parameter mismatch (Issue #1)
  2. MUST FIX: Loop variable declaration (Issue #2)
  3. MUST FIX: Missing request parameter in restoration (Issue #4)
  4. 🔴 SHOULD FIX: Breaking API changes need migration path (Issue #5)
  5. 🔴 SHOULD FIX: Complete or document TODOs (Issue #8)

After merging:

  • Create follow-up issues for hibernatable WebSocket persistence strategy
  • Add comprehensive integration tests
  • Update documentation with new protocol details

Summary

The refactoring improves the WebSocket lifecycle management significantly, but critical bugs prevent this from being merged as-is. The constructor parameter mismatch (Issue #1) will cause immediate runtime failures. After addressing the critical issues, this will be a solid improvement to the codebase.

@NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from f45c96f to 769c3f8 on November 13, 2025 21:39
@NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from dbe19ca to 426a752 on November 13, 2025 21:39
@claude

claude bot commented Nov 13, 2025

Code Review: Hibernating WebSocket Protocol Implementation

This PR implements a new hibernating WebSocket protocol that significantly refactors how WebSocket connections are managed, particularly around hibernation and message acknowledgment. Here's my comprehensive review:

🎯 Overall Assessment

Strengths:

  • Clear separation of concerns with improved architecture
  • Better handling of hibernatable WebSocket lifecycle
  • More robust message index tracking and validation
  • Removal of event buffering complexity

Areas of Concern:

  • Several incomplete TODOs and commented code
  • Missing required callback parameters
  • Potential compilation/runtime errors in TypeScript
  • Some logic gaps in the new implementation

🐛 Critical Issues

1. Compilation Error in tunnel.ts:116-126

for (const i = 0; i < requestIds.length; i++) {

This should be let i, not const i. As written, this will cause a compilation error.

Location: engine/sdks/typescript/runner/src/tunnel.ts:116

2. Missing Parameters in #createWebSocket

The #createWebSocket method is called on line 120-126 with 5 arguments, but the function signature requires 6 parameters (including request: Request).

// Line 120-126 - Missing 'request' parameter
this.#createWebSocket(
    actorId,
    requestId,
    idToStr(requestId),
    true,
    messageIndex,
);

Location: engine/sdks/typescript/runner/src/tunnel.ts:120-126
Fix: Either provide a placeholder Request object or make the request parameter optional in the function signature.
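If the first route is taken, the HibernatingWebSocketMetadata interface quoted later in this thread (path, headers, messageIndex) carries enough to rebuild a synthetic Request. A sketch, with the origin being a placeholder since the original authority is not part of that metadata:

interface HibernatingWebSocketMetadata {
    path: string;
    headers: Record<string, string>;
    messageIndex: number;
}

// Rebuilds a Request for a restored hibernating WebSocket. Only the path
// and headers are faithful to the pre-hibernation request; the host is
// a stand-in.
function requestFromHibernationMetadata(
    meta: HibernatingWebSocketMetadata,
): Request {
    return new Request(`http://actor.local${meta.path}`, {
        headers: meta.headers,
    });
}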

3. API Signature Changes Without Migration

The RunnerConfig.websocket callback changed from async returning Promise<void> to synchronous returning void. This is a breaking change that may affect existing implementations.

Old: websocket?: (runner, actorId, ws, requestId, request) => Promise<void>
New: websocket: (runner, actorId, ws, requestId, request) => void

Location: engine/sdks/typescript/runner/src/mod.ts:60-66
Impact: Any existing code awaiting this callback will break.

4. Incomplete WebSocketTunnelAdapter Constructor

The constructor signature in websocket-tunnel-adapter.ts appears to be missing required parameters based on usage. The constructor expects parameters like tunnel, actorId, requestId, etc., but #createWebSocket in tunnel.ts doesn't provide all of them.

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:49-79

5. Deleted Assert Function Not Replaced

In runner-protocol/src/index.ts, the local assert function was removed in favor of importing from node:assert, but this may not work in all environments (browsers, Deno without Node compatibility).

Location: engine/sdks/typescript/runner-protocol/src/index.ts:1


⚠️ Major Concerns

1. Missing Implementation: restoreHibernativeRequests

The new required callback restoreHibernativeRequests in RunnerConfig is not implemented in the diff. This will cause runtime errors.

restoreHibernativeRequests(
    actorId: string,
    requestId: protocol.RequestId[],
): number[];

Location: engine/sdks/typescript/runner/src/mod.ts:79-82
Impact: This is a required callback but no implementation is shown in the actor-driver.ts changes.

2. Inconsistent API: getActorHibernationConfig → canWebSocketHibernate

The old getActorHibernationConfig returned a HibernationConfig object with both enabled and lastMsgIndex. The new canWebSocketHibernate only returns a boolean. However, the actor-driver.ts file still references the old API (lines 138-296 in the diff).

This indicates incomplete refactoring.

Location: Multiple files

3. Commented Out Code

The actor-driver.ts has large blocks of commented code that should be removed:

  • Lines 81-86: Commented hibernatable WebSocket ack queue
  • Lines 316-319: References to deleted #wsAckFlushInterval
  • Lines 338-349: Implementation of #flushHibernatableWebSocketAcks() that references deleted properties
  • Lines 645-672: Commented WebSocket message index update logic

Recommendation: Either complete the implementation or remove the dead code.


📝 Code Quality Issues

1. Type Safety: Unused Symbols

import {
    SET_CAN_HIBERNATE_SYMBOL,
    SET_REQUEST_SYMBOL,
    WebSocketTunnelAdapter,
} from "./websocket-tunnel-adapter";

These symbols are imported but never used in the file.

Location: engine/sdks/typescript/runner/src/tunnel.ts:13-15

2. Incomplete Error Handling

In tunnel.ts:538, there's a typo in the error log:

this.log?.debug({ msg: "catught runner shutdown error" });

Should be "caught" not "catught".

Location: engine/sdks/typescript/runner/src/tunnel.ts:538

3. Missing Validation in Message Index Tracking

The _handleMessage method guards the persistence call behind this.hibernatable:

if (this.hibernatable) {
    this.#persistHibernatableWebSocketMessageIndex(messageIndex);
}

The guard itself is correct, but the callback is never defined for non-hibernatable WebSockets, so any call path that bypasses the guard would fail at runtime.

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:161-163

4. Inconsistent Property Access

The #log getter in WebSocketTunnelAdapter accesses this.#tunnel.log but #tunnel is only stored after construction:

constructor(...) {
    this.#tunnel = tunnel;  // Set here
    ...
}

get #log(): Logger | undefined {
    return this.#tunnel.log;  // Used here - could be undefined during construction
}

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:74, 45-46


🔒 Security & Performance

1. Message Index Validation

Good: The code properly validates message indices for hibernatable WebSockets and closes connections on out-of-sequence messages.

if (messageIndex !== previousIndex + 1) {
    this.close(1008, closeReason);
    return true;
}

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:139-156

2. Memory Leak Prevention

The code properly cleans up event listeners and pending messages on shutdown. However, the garbage collection interval (60 seconds) might be too long for high-throughput scenarios.

Recommendation: Consider making GC_INTERVAL configurable.

Location: engine/sdks/typescript/runner/src/tunnel.ts:18
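Making it configurable can be as small as a defaulted option threaded into the tunnel; the field name below is illustrative, not part of the actual config:

const DEFAULT_GC_INTERVAL_MS = 60_000;

interface TunnelOptions {
    gcIntervalMs?: number; // hypothetical override, defaults to 60s
}

function startGc(opts: TunnelOptions, gc: () => void): NodeJS.Timeout {
    return setInterval(gc, opts.gcIntervalMs ?? DEFAULT_GC_INTERVAL_MS);
}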


✅ Positive Changes

  1. Improved Architecture: The separation of WebSocket creation logic into #createWebSocket is cleaner.

  2. Better Logging: Structured logging with proper context (actorId, requestId, messageIndex) throughout.

  3. Duplicate Prevention: Handling of duplicate WebSocket open events is robust.

  4. Synchronous WebSocket Handler: Making the websocket callback synchronous (non-async) is actually better for event listener setup, preventing race conditions.

  5. Message Index Tracking: The new hibernatable WebSocket message tracking is more explicit and easier to reason about.


📋 Recommendations

High Priority

  1. Fix the compilation error (const → let in for loop)
  2. Implement restoreHibernativeRequests callback or document where it's implemented
  3. Fix #createWebSocket parameter mismatch
  4. Remove all commented/dead code or complete the implementation
  5. Reconcile the API changes between getActorHibernationConfig and canWebSocketHibernate

Medium Priority

  1. Fix typo: "catught" → "caught"
  2. Remove unused imports (SET_CAN_HIBERNATE_SYMBOL, SET_REQUEST_SYMBOL)
  3. Add migration guide for breaking changes to websocket callback
  4. Consider adding tests for message index validation logic

Low Priority

  1. Make GC_INTERVAL configurable
  2. Add JSDoc comments for the new hibernation protocol
  3. Consider extracting magic numbers to constants (e.g., WebSocket close codes)

🧪 Test Coverage

Missing Test Coverage:

  • Message index validation for hibernatable WebSockets
  • Restoration of hibernated WebSockets with message indices
  • Duplicate message handling
  • Out-of-sequence message handling
  • WebSocket close during hibernation

Recommendation: Add integration tests for the hibernation flow.


📊 Summary

| Category | Rating | Notes |
| --- | --- | --- |
| Architecture | ⭐⭐⭐⭐ | Good separation of concerns |
| Code Quality | ⭐⭐⭐ | Some dead code and TODOs |
| Type Safety | ⭐⭐ | Missing parameters, incomplete refactoring |
| Security | ⭐⭐⭐⭐ | Good validation of message indices |
| Performance | ⭐⭐⭐⭐ | Efficient message handling |
| Test Coverage | ⭐⭐ | Missing tests for critical paths |

Overall: Needs Work ⚠️

This PR has good architectural improvements but requires fixing critical issues before merging:

  1. Compilation errors
  2. Missing required callback implementations
  3. Incomplete API refactoring
  4. Dead code cleanup

Once these issues are addressed, this will be a solid improvement to the WebSocket hibernation system.

@claude

claude bot commented Nov 13, 2025

PR Review: chore(rivetkit): implement new hibernating ws protocol

Overall, this is a solid refactor that improves the hibernating WebSocket protocol implementation. The changes simplify the architecture and improve reliability. Below are my detailed findings:

✅ Strengths

  1. Improved Lifecycle Management: The refactor centralizes WebSocket creation in #createWebSocket(), making it easier to manage both new and restored hibernating WebSockets consistently.

  2. Better Separation of Concerns: Moving WebSocket lifecycle logic out of event buffering improves code clarity. The removal of buffered events simplifies the adapter significantly.

  3. Enhanced Message Ordering: The message index validation in websocket-tunnel-adapter.ts:123-157 properly handles duplicate and out-of-sequence messages with clear error handling and logging.

  4. Cleaner API:

    • websocket handler is now synchronous (returns void instead of Promise<void>), which better reflects its actual behavior
    • Renamed sendWebsocketMessageAck → sendHibernatableWebSocketMessageAck for clarity
    • Changed getActorHibernationConfig → canWebSocketHibernate with simpler return type
  5. Consistent Structured Logging: Good use of structured logging throughout, following the project's conventions.

🐛 Potential Bugs

  1. Missing request parameter in #createWebSocket call (tunnel.ts:120-126)

    this.#createWebSocket(
        actorId,
        requestId,
        idToStr(requestId),
        true,
        messageIndex,
        // Missing request parameter!
    );

    The function signature at line 139 requires a request: Request parameter, but it's not being passed when restoring hibernating requests. This will cause a runtime error.

    Fix: You need to either:

    • Pass the request object (likely need to restore it from hibernation metadata)
    • Make the parameter optional with a default value
    • Restructure to have separate methods for creating new vs restored WebSockets
  2. Synchronous handler errors rely on the outer try-catch (tunnel.ts:703-709)
    The config.websocket handler is now synchronous and is invoked inside the outer try-catch, so a synchronous throw is caught correctly, but wrapping the call in its own try-catch would make that intent explicit.

  3. Index bounds check may be too restrictive (tunnel.ts:771-772)

    if (index < 0 || index > 65535)
        throw new Error("invalid websocket ack index");

    This assumes message indices fit in a u16 (0-65535), but long-lived WebSocket connections might exceed this. Consider if BigInt or larger range is needed, or document the limitation.

⚠️ Code Quality Issues

  1. Typo in log message (tunnel.ts:538)

    this.log?.debug({ msg: "catught runner shutdown error" });

    Should be "caught" not "catught".

  2. Incomplete TODO (tunnel.ts:192)

    // TODO: assign request to ws so we can pass this through

    This TODO at the end of #createWebSocket should be either completed or removed if no longer relevant.

  3. Unused imports (websocket-tunnel-adapter.ts:12-15)

    import {
        SET_CAN_HIBERNATE_SYMBOL,
        SET_REQUEST_SYMBOL,
        WebSocketTunnelAdapter,
    } from "./websocket-tunnel-adapter";

    The symbols SET_CAN_HIBERNATE_SYMBOL and SET_REQUEST_SYMBOL are imported in tunnel.ts but don't appear to be used anywhere. These should be removed if not needed.

  4. Incomplete dispatchEvent implementation (websocket-tunnel-adapter.ts:516-519)

    dispatchEvent(event: any): boolean {
        // TODO:
        return true;
    }

    Either implement this properly (see the sketch after this list) or add a comment explaining why it's acceptable to leave unimplemented.

  5. Constructor signature is complex (websocket-tunnel-adapter.ts:49-73)
    The WebSocket adapter constructor now takes 9 parameters, several marked @experimental. Consider:

    • Using an options object for better maintainability
    • Documenting what "experimental" means for these public properties
    • Clarifying the TODOs about making some properties internal-only
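On point 4 above: if dispatchEvent is implemented, standard EventTarget semantics are roughly "invoke the listeners registered for event.type, then report whether the default was prevented". A sketch of the method body, assuming the adapter keeps listeners in a Map keyed by event type (as the #eventListeners field discussed elsewhere in this thread suggests):

dispatchEvent(event: Event): boolean {
    const listeners = this.#eventListeners.get(event.type);
    if (listeners) {
        // Copy first so listeners that add/remove listeners don't
        // mutate the collection mid-dispatch.
        for (const listener of [...listeners]) {
            listener.call(this, event);
        }
    }
    return !event.defaultPrevented;
}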

🔒 Security Considerations

  1. No validation on messageIndex persistence callback
    The persistHibernatableWebSocketMessageIndex callback is called with user-controlled index values. Ensure the implementation properly validates and persists this securely to prevent manipulation.

  2. Path validation in buildRequestForWebSocket (tunnel.ts:828-830)
    Good that you validate the path starts with /, but consider additional validation:

    • Path traversal attempts (../)
    • Maximum path length
    • Invalid characters
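A slightly stricter check covering those cases might look like this sketch (the length cap is illustrative, not from the codebase):

const MAX_PATH_LENGTH = 2048; // illustrative cap

function validateWebSocketPath(path: string): void {
    if (!path.startsWith("/")) throw new Error("path must start with leading slash");
    if (path.length > MAX_PATH_LENGTH) throw new Error("path too long");
    if (path.includes("..")) throw new Error("path must not contain traversal segments");
    // Control characters and whitespace have no place in a request path.
    if (/[\x00-\x1f\x7f\s]/.test(path)) throw new Error("path contains invalid characters");
}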

🚀 Performance Considerations

  1. Removed event buffering - This is actually a performance improvement. The old implementation buffered events which could lead to memory issues with long-lived connections or slow consumers. The new approach is more direct and efficient.

  2. Message index bounds - The u16 limit on message indices (if that's the intent) means WebSockets will need to close/reopen after 65,536 messages. For high-throughput applications, this could cause connection churn. Document this limitation if it's intentional.

📝 Documentation

  1. API changes need documentation: The breaking changes to RunnerConfig (especially websocket, restoreHibernativeRequests, canWebSocketHibernate) should be documented in migration notes or changelog.

  2. HibernationConfig interface change: The structure changed from { enabled: boolean; lastMsgIndex: number | undefined } to just the config being optional. This is cleaner but needs documentation for users.

🧪 Test Coverage

The PR doesn't include test changes. Consider adding tests for:

  • Message index validation (duplicate, skipped, out-of-order)
  • WebSocket restoration from hibernation
  • Error handling paths in #createWebSocket
  • The new synchronous websocket handler behavior

Recommendations

High Priority:

  1. 🔴 Fix the missing request parameter in restoreHibernatingRequests - this will cause runtime errors
  2. 🟡 Fix typo: "catught" → "caught"
  3. 🟡 Remove unused imports or use them

Medium Priority:
4. Document the message index bounds and any limitations
5. Consider using an options object for WebSocketTunnelAdapter constructor
6. Add migration notes for API changes
7. Implement or document the incomplete dispatchEvent

Low Priority:
8. Consider path validation improvements
9. Add test coverage for new hibernation logic
10. Review and complete/remove TODOs

Great work on simplifying the architecture! Just needs the critical bug fix before merging.

@NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from 769c3f8 to fc8f1fc on November 13, 2025 22:27
@NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from 426a752 to 04be372 on November 13, 2025 22:27
@NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from 04be372 to 9d6c4c8 on November 13, 2025 22:35
@NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from fc8f1fc to e5ca471 on November 13, 2025 22:35
@claude

claude bot commented Nov 13, 2025

PR Review: Hibernating WebSocket Protocol Implementation

Critical Issues

1. Import Error in websocket-tunnel-adapter.ts (Line 7)

Severity: High - Compilation Error

import { wrappingLte } from "./utils";

This import is incorrect. The function wrappingLte does not exist in utils.ts. The correct function name is wrappingLteU16.

Line 124 uses this non-existent function:

if (wrappingLte(messageIndex, previousIndex)) {

Fix: Change the import to:

import { wrappingLteU16 } from "./utils";

And update line 124 to:

if (wrappingLteU16(messageIndex, previousIndex)) {

2. Missing request Parameter in tunnel.ts (Line 125-126)

Severity: High - Runtime Error

In tunnel.ts:125-126, #createWebSocket is called without the required request parameter:

this.#createWebSocket(
    actorId,
    requestId,
    idToStr(requestId),
    true,
    messageIndex,
);

But the function signature at line 139-146 requires a request parameter:

#createWebSocket(
    actorId: string,
    requestId: RequestId,
    requestIdStr: string,
    hibernatable: boolean,
    messageIndex: number,
    request: Request,  // <-- Missing in the call above
): WebSocketTunnelAdapter

Impact: This will cause runtime errors when trying to restore hibernating WebSockets.

Fix: You need to either:

  1. Store the request metadata during hibernation and restore it, or
  2. Make the request parameter optional for restoration scenarios and build a dummy request

3. Incomplete Migration in actor-driver.ts

Severity: Medium - Dead Code

The code contains several issues related to the protocol migration:

  1. Lines 81-86: Commented-out hibernatable WebSocket acknowledgment queue declarations, but the queue is still used throughout the file (lines 339-349, 649, 656, 770)

  2. Lines 138-296: The entire getActorHibernationConfig callback is defined but this API has been replaced with canWebSocketHibernate and restoreHibernativeRequests in the new protocol. This is dead code.

  3. Lines 316-319, 339-349, 691, 840: References to #wsAckFlushInterval and #flushHibernatableWebSocketAcks() but these properties are not declared in the class (they're in commented code).

  4. Line 345: Calls this.#runner.sendWebsocketMessageAck(requestId, index) but the runner API has been renamed to sendHibernatableWebSocketMessageAck (as seen in mod.ts:1444-1447).

Fix: Either:

  • Remove all hibernatable WebSocket code from actor-driver.ts if it's handled elsewhere, or
  • Properly implement the new protocol with restoreHibernativeRequests and canWebSocketHibernate callbacks

Code Quality Issues

4. Loop Variable Declaration Issue in tunnel.ts (Line 116)

Severity: Low - Style

for (const i = 0; i < requestIds.length; i++) {

This should use let instead of const since i is being reassigned:

for (let i = 0; i < requestIds.length; i++) {

5. Missing Implementation in actor-driver.ts

Severity: High - Incomplete Implementation

The new RunnerConfig interface requires two new callbacks:

  • restoreHibernativeRequests (line 23-26 in mod.ts)
  • canWebSocketHibernate (line 28-32 in mod.ts)

These are NOT implemented in the EngineActorDriver constructor (lines 113-297). The engine runner config only defines the old getActorHibernationConfig callback.

Fix: Remove the old callback and implement the new ones:

canWebSocketHibernate: (actorId: string, requestId: ArrayBuffer, request: Request): boolean => {
    // Implementation
},
restoreHibernativeRequests: (actorId: string, requestIds: protocol.RequestId[]): number[] => {
    // Implementation
},

6. Inconsistent Error Handling

Severity: Medium

In websocket-tunnel-adapter.ts:141-156, when a message index is out of sequence, the code:

  1. Logs a warning
  2. Closes the WebSocket
  3. But then still returns true (indicating the message was handled)

This seems inconsistent. If the connection is being closed due to an invalid state, should the return value indicate an error?


Performance Considerations

7. U16 Wrapping Functions

The new wrapping comparison functions in utils.ts (lines 74-109) look correct and have comprehensive test coverage. Good work on the test suite!

8. Event Handler Timing

The refactored WebSocket handling in tunnel.ts:700-705 now calls config.websocket() before sending the open confirmation. This is better for error handling but note that any exceptions in the user's websocket handler will prevent the connection from opening.


Test Coverage

The addition of tests/utils.test.ts with comprehensive tests for the wrapping comparison functions is excellent. The tests cover edge cases well including:

  • Boundary conditions at u16 max
  • Wrapping behavior
  • Transitivity across wrap boundaries

Summary

This PR cannot be merged in its current state due to critical compilation and runtime errors:

  1. Fix the import error (wrappingLte → wrappingLteU16)
  2. Fix the missing request parameter in restoreHibernatingRequests
  3. Complete the migration in actor-driver.ts by implementing the new callbacks and removing old code
  4. Fix the loop variable declaration

The conceptual changes look good - moving to a cleaner hibernation protocol with explicit restoration. Once the implementation is complete, this should be a solid improvement.

@claude

claude bot commented Nov 13, 2025

Code Review: Hibernating WebSocket Protocol Implementation

Summary

This PR implements a new hibernating WebSocket protocol for RivetKit. The changes refactor how WebSocket connections are managed, adding support for hibernatable connections that can survive actor restarts by persisting and restoring connection state.

Positive Observations

Code Quality

  • Well-structured refactoring: The separation of concerns between Tunnel and WebSocketTunnelAdapter is clean
  • Excellent test coverage: The wrapping arithmetic utilities have comprehensive unit tests with edge cases
  • Good documentation: Comments explain complex logic like message index sequencing and hibernation behavior
  • Proper error handling: The code handles duplicate connections, out-of-sequence messages, and WebSocket lifecycle edge cases

Implementation Strengths

  • Defensive programming: Check for duplicate WebSocket opens (tunnel.ts:665-674)
  • Message ordering: Proper sequence validation with wrapping arithmetic for u16 indices
  • Lifecycle management: Clean separation between normal close and close-with-retry semantics

Issues & Recommendations

Critical Issues

1. Bug: Missing import in tunnel.ts

Location: tunnel.ts:7

import { wrappingLte } from "./utils";

Issue: The import references wrappingLte but utils.ts only exports wrappingLteU16. This will cause a runtime error.
Fix: Change to import { wrappingLteU16 as wrappingLte } from "./utils"; or use wrappingLteU16 directly

2. Bug: Unused imports in tunnel.ts ⚠️

Location: tunnel.ts:12-15

import {
	SET_CAN_HIBERNATE_SYMBOL,
	SET_REQUEST_SYMBOL,
	WebSocketTunnelAdapter,
} from "./websocket-tunnel-adapter";

Issue: SET_CAN_HIBERNATE_SYMBOL and SET_REQUEST_SYMBOL are imported but never used. These symbols don't appear to be exported from websocket-tunnel-adapter.ts
Fix: Remove unused imports

3. Bug: Missing request parameter in restoreHibernatingRequests

Location: tunnel.ts:120-126

this.#createWebSocket(
	actorId,
	requestId,
	idToStr(requestId),
	true,
	messageIndex,
);

Issue: #createWebSocket requires 6 parameters including request: Request (line 145), but only 5 are provided. This will cause a TypeScript error.
Fix: Need to restore the request metadata (path, headers) from hibernation state and pass a reconstructed Request object

4. Bug: Constructor signature mismatch

Location: tunnel.ts:148-187 calls WebSocketTunnelAdapter constructor
The constructor call doesn't match the signature in websocket-tunnel-adapter.ts:50-74

  • Constructor expects: tunnel, actorId, requestId, hibernatable, request, messageIndex, persistCallback, sendCallback, closeCallback
  • Constructor is called with: requestIdStr, hibernatable, request, sendCallback, closeCallback

Fix: Update the constructor call to match the new signature

Moderate Issues

5. Missing persistence callback implementation ⚠️

Location: tunnel.ts:148-187
Issue: The #createWebSocket method doesn't implement the persistHibernatableWebSocketMessageIndex callback that the WebSocketTunnelAdapter requires. This callback is critical for hibernation to work.
Fix: Add a callback that persists the message index via the runner config

6. Incomplete HibernatingWebSocketMetadata usage ⚠️

Location: tunnel.ts:29-33

export interface HibernatingWebSocketMetadata {
	path: string;
	headers: Record<string, string>;
	messageIndex: number;
}

Issue: This interface is defined but never used in the code. It should be used to restore WebSocket state in restoreHibernatingRequests.

7. API breaking change without migration path ⚠️

Location: mod.ts:60-66

websocket: (
	runner: Runner,
	actorId: string,
	ws: any,
	requestId: protocol.RequestId,
	request: Request,
) => void;  // Changed from Promise<void>

Issue: The websocket handler changed from async to sync without backward compatibility. This will break existing implementations.
Recommendation: Document this breaking change or provide a migration guide

8. Typo in method name ⚠️

Location: mod.ts:79

restoreHibernativeRequests(

Issue: Should be restoreHibernatingRequests (not "Hibernative")
Fix: Rename to use consistent terminology

Minor Issues

9. Inconsistent error message 📝

Location: tunnel.ts:538

this.log?.debug({ msg: "catught runner shutdown error" });

Issue: Typo "catught" should be "caught"

10. TODO comment needs resolution 📝

Location: tunnel.ts:192

// TODO: assign request to ws so we can pass this through

Issue: This TODO suggests incomplete implementation
Recommendation: Either implement or create a tracking issue

11. Missing validation 📝

Location: websocket-tunnel-adapter.ts:104-204
Issue: The _handleMessage method doesn't validate that messageIndex is within u16 range (0-65535)
Fix: Add validation: if (messageIndex < 0 || messageIndex > 65535) throw new Error(...)

12. Wrapping arithmetic edge case 📝

Location: websocket-tunnel-adapter.ts:124

if (wrappingLte(messageIndex, previousIndex)) {

Issue: Uses undefined wrappingLte - should use wrappingLteU16
Impact: Runtime error if this code path is reached

Performance Considerations

13. Potential memory leak 🔍

Location: websocket-tunnel-adapter.ts:243-318
Issue: The #fireEvent method doesn't appear to have limits on #eventListeners Map growth
Recommendation: Consider adding cleanup logic for removed WebSockets

14. Efficient buffer operations

The SharedArrayBuffer handling in websocket-tunnel-adapter.ts:436-476 is well-implemented with proper slicing to avoid security issues.

Security Considerations

15. Path validation

Location: tunnel.ts:828-830

if (!path.startsWith("/")) {
	throw new Error("path must start with leading slash");
}

Good validation to prevent path traversal issues.

16. Index boundary validation

Location: tunnel.ts:771-772

if (index < 0 || index > 65535)
	throw new Error("invalid websocket ack index");

Proper u16 range validation prevents overflow attacks.

Test Coverage

17. Excellent wrapping arithmetic tests

Location: tests/utils.test.ts

  • Comprehensive edge case testing (boundaries, wrapping, transitivity)
  • Tests cover the critical u16 wrapping logic
  • Good use of property-based testing concepts

18. Missing integration tests ⚠️

Issue: No tests for the actual hibernation flow (persist → restore → message delivery)
Recommendation: Add integration tests that:

  • Simulate actor restart with hibernating WebSocket
  • Verify message indices are restored correctly
  • Test duplicate message handling after restart

Architecture Concerns

19. Circular dependency risk 🔍

Location: tunnel.ts imports from websocket-tunnel-adapter.ts, which imports from tunnel.ts
Issue: WebSocketTunnelAdapter has #tunnel: Tunnel but Tunnel creates WebSocketTunnelAdapter
Impact: Potential initialization issues
Recommendation: Consider using dependency injection or extracting shared interfaces
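A low-cost way to break the cycle is to have the adapter depend on a narrow interface rather than the concrete Tunnel class; the module and interface names below are illustrative:

// tunnel-types.ts (hypothetical shared module)
export interface TunnelHost {
    // Narrow surface the adapter actually needs from the tunnel.
    log?: { debug(entry: object): void; warn(entry: object): void };
    sendMessage(requestId: string, data: ArrayBuffer | string, isBinary: boolean): void;
}

Tunnel then implements TunnelHost, and websocket-tunnel-adapter.ts imports only the interface, removing the import cycle.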

20. Logging conventions 📝

Per CLAUDE.md, log messages should be lowercase unless mentioning code symbols. The code follows this correctly:

this.log?.info({ msg: "restored hibernatable requests", ... })

✅ Good adherence to conventions

Action Items

Must Fix Before Merge

  1. Fix wrappingLte import error (Issue #1)
  2. Fix missing request parameter in restoreHibernatingRequests (Issue #3)
  3. Fix WebSocketTunnelAdapter constructor call (Issue #4)
  4. Implement persistence callback (Issue #5)
  5. Fix typo in method name restoreHibernativeRequests (Issue #8)

Should Fix

  1. Remove unused imports (Issue #2)
  2. Implement or document HibernatingWebSocketMetadata usage (Issue #6)
  3. Document breaking API change for websocket handler (Issue #7)
  4. Fix typo "catught" (Issue #9)

Nice to Have

  1. Add integration tests for hibernation flow (Issue #18)
  2. Resolve or track TODO comments (Issue #10)
  3. Add message index validation in _handleMessage (Issue #11)

Conclusion

This is a well-architected feature with strong test coverage for core utilities. However, there are several critical bugs that will prevent the code from compiling or running correctly. The main issues stem from incomplete refactoring - particularly around the WebSocketTunnelAdapter constructor signature changes and the missing request parameter restoration logic.

Once the critical issues are addressed, this will be a solid implementation of the hibernating WebSocket protocol. The defensive programming, error handling, and edge case coverage demonstrate high code quality.

Recommendation: Address critical issues before merge, particularly the constructor signature mismatches and missing request restoration logic.

@NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from 9d6c4c8 to 449ac28 on November 13, 2025 22:42
@NathanFlurry force-pushed the 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive branch from e5ca471 to 5b5466a on November 13, 2025 22:42
@claude

claude bot commented Nov 13, 2025

PR Review: Implement New Hibernating WebSocket Protocol

I've reviewed the changes and found several critical bugs that will prevent compilation, along with some recommendations for improvement.

Critical Bugs (Must Fix)

1. Import Name Mismatch in websocket-tunnel-adapter.ts

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:7

import { wrappingLte } from "./utils";

Issue: The function is named wrappingLteU16 in utils.ts, not wrappingLte. This will cause a compilation error.

Fix: Change the import to:

import { wrappingLteU16 } from "./utils";

And update the usage on line 124:

if (wrappingLteU16(messageIndex, previousIndex)) {

2. Missing Request Parameter in tunnel.ts

Location: engine/sdks/typescript/runner/src/tunnel.ts:116-126

for (const i = 0; i < requestIds.length; i++) {
    const requestId = requestIds[i];
    const messageIndex = messageIndices[i];

    this.#createWebSocket(
        actorId,
        requestId,
        idToStr(requestId),
        true,
        messageIndex,
    );  // Missing 6th parameter: request
}

Issue: #createWebSocket requires 6 parameters (see line 139-145), but only 5 are provided. The request: Request parameter is missing.

Fix: You need to either:

  1. Store hibernating WebSocket metadata (path, headers) during hibernation and reconstruct the Request object when restoring, or
  2. Make the request parameter optional in #createWebSocket and handle the undefined case

3. Commented Out Field Declarations Still Being Used

Location: rivetkit-typescript/packages/rivetkit/src/drivers/engine/actor-driver.ts:81-86

// // WebSocket message acknowledgment debouncing for hibernatable websockets
// #hibernatableWebSocketAckQueue: Map<
// 	string,
// 	{ requestIdBuf: ArrayBuffer; messageIndex: number }
// > = new Map();
// #wsAckFlushInterval?: NodeJS.Timeout;

Issue: These fields are commented out but are being referenced throughout the file (lines 316, 339, 344, 348, 649, 653, 770, 834-836). This will cause compilation errors.

Fix: Uncomment these field declarations.

4. Invalid for Loop Declaration

Location: engine/sdks/typescript/runner/src/tunnel.ts:116

for (const i = 0; i < requestIds.length; i++) {

Issue: Using const in a traditional for loop is invalid - the loop variable must be mutable.

Fix: Change to:

for (let i = 0; i < requestIds.length; i++) {

5. Missing Implementation in actor-driver.ts

Location: rivetkit-typescript/packages/rivetkit/src/drivers/engine/actor-driver.ts

Issue: The PR changes RunnerConfig to require two new methods:

  • restoreHibernativeRequests(actorId: string, requestId: RequestId[]): number[]
  • canWebSocketHibernate(actorId: string, requestId: ArrayBuffer, request: Request): boolean

But the implementation in actor-driver.ts still uses the old API:

  • getActorHibernationConfig (lines 138-296)

Fix: The actor-driver.ts needs to be updated to implement the new API instead of the old one.

Code Quality Issues

6. Removed Return Statement

Location: engine/sdks/typescript/runner/src/tunnel.ts:193

// TODO: assign request to ws so we can pass this through

Issue: The #createWebSocket method is declared to return WebSocketTunnelAdapter but there's no return statement. This will cause compilation errors or undefined behavior.

Fix: Add a return statement:

// Store adapter
this.#actorWebSockets.set(requestIdStr, adapter);

return adapter;

7. WebSocket Constructor Signature Mismatch

Location: engine/sdks/typescript/runner/src/tunnel.ts:148-151

The old constructor signature appears to be:

new WebSocketTunnelAdapter(
    requestIdStr,
    hibernatable,
    request,
    // ... callbacks
)

But based on the new code in websocket-tunnel-adapter.ts (lines 50-74), the signature should be:

new WebSocketTunnelAdapter(
    tunnel,
    actorId,
    requestId,
    hibernatable,
    request,
    messageIndex,
    persistHibernatableWebSocketMessageIndex,
    sendCallback,
    closeCallback,
)

This needs to be verified and fixed.

Design Concerns

8. Message Index Persistence Strategy

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:28

The #persistHibernatableWebSocketMessageIndex callback is defined but never called or passed to the constructor in the current implementation. This is a critical gap in the hibernation protocol - without persisting message indices, you can't properly restore WebSocket state.

9. Async to Sync Change Without Clear Justification

Location: engine/sdks/typescript/runner/src/mod.ts:60-66

websocket: (
    runner: Runner,
    actorId: string,
    ws: any,
    requestId: protocol.RequestId,
    request: Request,
) => void;  // Changed from Promise<void> to void

Concern: This breaking change from async to sync might cause issues if implementations need to perform async operations during WebSocket setup. Consider whether this is intentional and document the reasoning.

Positive Aspects

  1. Excellent test coverage for the wrapping arithmetic functions in utils.test.ts - comprehensive edge case testing including wrap boundaries and transitivity
  2. Good error handling for message index validation (lines 122-142 in websocket-tunnel-adapter.ts)
  3. Clear documentation in comments explaining the hibernation protocol
  4. Proper separation of concerns between Tunnel and WebSocketTunnelAdapter

Recommendations

  1. Run TypeScript compilation before submitting - these bugs would be caught immediately
  2. Consider adding integration tests for the hibernation/restoration flow
  3. Document the migration path from the old getActorHibernationConfig API to the new canWebSocketHibernate/restoreHibernativeRequests API
  4. Add JSDoc comments to the new public methods explaining the hibernation protocol

Summary

This PR implements an important feature but has several compilation-blocking bugs that need to be fixed before it can be merged. The core logic appears sound, but the implementation is incomplete in several areas.

Recommendation: Address the 5 critical bugs before merging. The code quality issues should also be fixed but are less critical.

@claude

claude bot commented Nov 13, 2025

PR Review: chore(rivetkit): implement new hibernating ws protocol

Summary

This PR refactors the WebSocket tunnel implementation to support a new hibernating WebSocket (HWS) protocol. The changes affect the TypeScript runner SDK and RivetKit engine driver, implementing message sequencing, hibernation support, and improved lifecycle management.

Code Quality & Best Practices

✅ Strengths

  1. Clear separation of concerns: The refactoring properly separates WebSocket creation, message handling, and hibernation logic
  2. Better error handling: Improved handling of edge cases like duplicate opens, message index skipping, and out-of-sequence messages
  3. Type safety: Good use of TypeScript types and interfaces throughout
  4. Documentation: Added helpful JSDoc comments explaining complex behavior (e.g., restoreHibernativeRequests)

⚠️ Issues & Concerns

1. Typo in naming (Minor)

  • restoreHibernativeRequests should be restoreHibernatingRequests (line 78 in mod.ts)
  • Also appears in interface definition and implementation

2. Breaking API changes (Major)

The websocket handler signature changed from async to sync:

// Before
websocket?: (...) => Promise<void>;

// After  
websocket: (...) => void;

This is a breaking change that could cause issues for existing implementations. Consider:

  • Documenting this breaking change in release notes
  • Providing migration guidance for users
  • Or maintaining backward compatibility with a wrapper

3. Missing error handling in loop (Minor)

Line 116 in tunnel.ts:

for (const i = 0; i < requestIds.length; i++) {

This should use let instead of const for the loop variable.

4. Inconsistent function signature (Medium)

#createWebSocket is called with different numbers of parameters:

  • Line 118-125: Called with 5 params (during restore)
  • Line 691-698: Called with 6 params (during open)

The function definition expects 6 parameters, so the restore call is missing the request parameter. This would cause a runtime error.

Fix needed:

// Line 118-125 should build a request before calling createWebSocket
const request = buildRequestForWebSocket(???, ???); // Need path and headers
const adapter = this.#createWebSocket(
    actorId,
    requestId,
    requestIdStr,
    true,
    messageIndex,
    request, // Missing!
);

5. Removed validation (Medium)

The check for websocketHandler existence was removed from #handleWebSocketOpen. Now if config.websocket is undefined, it will throw at runtime instead of sending a graceful close message. Consider adding back the validation or documenting that websocket is now required.

6. Wrapping comparison function name mismatch (Critical)

Line 25 in websocket-tunnel-adapter.ts:

import { wrappingLte } from "./utils";

But utils.ts exports wrappingLteU16, not wrappingLte. This would cause an import error.

Fix needed:

import { wrappingLteU16 } from "./utils";
// And update usage on line 87:
if (wrappingLteU16(messageIndex, previousIndex)) {

7. Async function became sync (Medium)

#handleWebSocketMessage was changed from async to sync, but the message handling might still need to support async operations. Verify this won't cause issues with message processing.

Performance Considerations

✅ Good

  • Removed unnecessary async/await in message path (if intentional)
  • Better memory management with explicit cleanup of actor tracking

⚠️ Potential Issues

  • Message buffering removed: The previous event buffering system was removed entirely. This could cause events to be lost if listeners aren't attached immediately. Verify this is intentional and won't cause race conditions.

Security Concerns

✅ Good

  • Proper validation of message indices prevents replay attacks
  • Duplicate WebSocket detection prevents resource exhaustion
  • Close on out-of-sequence messages prevents state corruption

⚠️ Minor Concerns

  • Consider rate limiting on WebSocket open attempts per actor
  • The buildRequestForWebSocket creates a mock request - ensure this can't be exploited to bypass auth checks

Test Coverage

⚠️ Concerns

  • New test file added (utils.test.ts) for wrapping comparison functions - good!
  • No tests for the main changes: No new tests for hibernation logic, message sequencing, or the refactored WebSocket lifecycle
  • Missing edge case tests: Should test:
    • Message index wrapping at u16 boundary
    • Restoration with invalid message indices
    • Concurrent WebSocket open attempts
    • Hibernation/wake cycles

Additional Recommendations

  1. Add integration tests for the hibernating WebSocket protocol
  2. Document the protocol - add comments explaining the hibernation flow and message index semantics
  3. Error messages: Consider adding error codes to all close reasons for better debugging
  4. Metrics: Consider adding telemetry for hibernation events, message indices, and restore success/failure rates

Critical Fixes Required

  1. Fix import error: wrappingLte → wrappingLteU16 (would break at runtime)
  2. Fix function call: Add missing request parameter to #createWebSocket in restoreHibernatingRequests
  3. Fix loop variable: Change const i to let i in for loop

Summary

This is a significant refactoring that improves the hibernating WebSocket implementation, but it has several critical issues that would prevent it from working correctly:

  • Blocking: Import error and missing function parameter
  • Breaking: API changes need documentation
  • Testing: Needs comprehensive test coverage

The architecture and approach look solid, but please address the critical issues before merging.


Review generated with Claude Code

@NathanFlurry changed the base branch from 11-11-fix_guard_dont_send_open_msg_for_hibernating_ws_hibernating_ws_keepalive to graphite-base/3464 on November 13, 2025 22:48
@NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from 449ac28 to 7fadc23 on November 13, 2025 22:49
@claude

claude bot commented Nov 20, 2025

PR Review: Hibernatable WebSocket Protocol Implementation

Overview

This is a substantial and well-architected PR that implements a new hibernation protocol for WebSocket connections. The implementation spans both Rust (pegboard-gateway) and TypeScript (runner SDK, RivetKit) codebases, introducing the ability for WebSocket connections to persist across actor sleep/wake cycles.

Code Quality: ✅ Strong

Strengths

  1. Excellent Documentation

    • The HIBERNATABLE_CONNECTIONS.md document with Mermaid diagrams is outstanding and clearly explains the lifecycle
    • Well-commented code throughout, especially in complex areas
    • Good use of structured logging with context
  2. Comprehensive Test Coverage

    • New test suite in actor-hibernation.ts covers key scenarios:
      • Basic connection hibernation
      • State persistence through hibernation
      • Closing connections during hibernation
    • Tests verify both connection state and lifecycle hook invocations
    • Added utility functions with thorough unit tests (utils.test.ts)
  3. Well-Structured Architecture

    • Clear separation between ephemeral and hibernatable connections
    • State managers handle persistence logic cleanly
    • Connection lifecycle is well-defined with proper phases

Potential Issues & Concerns

🔴 Critical Issues

  1. Race Condition in Message Acknowledgment (shared_state.rs:241-260)

    let Some(mut in_flight) = self.in_flight_requests.get_async(&msg.request_id).await else {
        tracing::warn!(
            request_id=?Uuid::from_bytes(msg.request_id),
            message_id=?Uuid::from_bytes(msg.message_id),
            "in flight has already been disconnected, cannot ack message"
        );
        continue;
    };
    • The request could be removed between the check and the subsequent operations
    • Consider holding the lock or using a transaction-like pattern
  2. Unbounded Background Operations (tunnel.ts:131)

    const backgroundOperations: Promise<void>[] = [];
    • Background operations are tracked but errors are only logged
    • If many operations fail, this could silently cause issues
    • Consider adding a maximum retry count or circuit breaker
  3. Timeout Handling in Actor Start (mod.ts:949)

    // TODO: Add timeout to onActorStart
    • This TODO is concerning - if onActorStart hangs, it could block the entire actor
    • Should implement a timeout before going to production

🟡 High Priority Issues

  1. Error Handling in Restoration (tunnel.ts:187-201)

    • Errors during WebSocket restoration are logged but the connection is closed with code 1011
    • Consider if this is the appropriate error code and whether retry logic is needed
  2. Message Index Wrapping (utils.ts:74-98)

    • Good that you added wrapping arithmetic for u16 message indices
    • However, the wrapping behavior could cause issues if more than 32,767 unacknowledged messages accumulate
    • Consider documenting this limit or adding a check
  3. Memory Leak Potential (tunnel.ts:52)

    #requestToActor: Map<string, string> = new Map();
    • This map grows as requests are created
    • Verify that entries are properly cleaned up in #handleWebSocketClose and other cleanup paths
  4. Inconsistent Timeout Values

    • WEBSOCKET_OPEN_TIMEOUT: 15s vs TUNNEL_ACK_TIMEOUT: 5s (pegboard-gateway)
    • MESSAGE_ACK_TIMEOUT: 5s (tunnel.ts) vs MESSAGE_ACK_TIMEOUT: 30s (shared_state.rs)
    • Consider standardizing or documenting why these differ

🟢 Medium Priority Issues

  1. Logging Pattern Inconsistency

    • Some logs use ?err and some use just err
    • Some use lowercase messages, some use sentence case
    • The CLAUDE.md says to use lowercase, but verify consistency
  2. Structured Logging Missing Context (shared_state.rs:231)

    tracing::warn!(
        request_id=?Uuid::from_bytes(msg.request_id),
        message_id=?Uuid::from_bytes(msg.message_id),
        "in flight has already been disconnected, cannot ack message"
    );

    ✅ Good: This follows the structured logging pattern well

  3. Unnecessary Clone (shared_state.rs:271)

    let _ = in_flight.msg_tx.send(msg.message_kind.clone()).await;
    • msg.message_kind is cloned here but might not need to be
    • Consider if ownership transfer is possible
  4. TODO Comments (see grep results)

    • Several TODOs in the codebase, notably:
      • tunnel.ts:397: "TODO: Switch this with runner WS"
      • tunnel.ts:1038: "TODO: This will never retransmit the socket and the socket will close"
      • mod.ts:970: "TODO: Mark as crashed"
    • These should be addressed or tracked as follow-up issues

Performance Considerations

✅ Good Practices

  1. Throttled State Persistence

    • State saves are properly throttled with configurable intervals
    • Immediate save option available for critical updates
  2. Efficient Message Batching

    • Messages are buffered and acknowledged in batches (500KB threshold)
    • Reduces round trips to storage
  3. Garbage Collection

    • Regular GC intervals defined (GC_INTERVAL: 60s)
    • Cleans up stale connections and unacked messages

⚠️ Potential Optimizations

  1. Large Persistent Data Structures

    • pending_ws_msgs could grow large if messages aren't acknowledged
    • Consider adding circuit breaker if size exceeds reasonable limit
  2. Serialization Overhead

    • CBOR serialization happens frequently
    • Consider caching serialized messages if they're retransmitted

Security Concerns

✅ Good Security Practices

  1. Proper Authentication Context

    • Connections are authenticated before being established
    • Request context preserved through hibernation
  2. Resource Limits

    • HWS_MAX_PENDING_MSGS_SIZE_PER_REQ: 1MB
    • Prevents unbounded memory growth

⚠️ Recommendations

  1. Connection Limit

    • No visible limit on number of hibernatable connections per actor
    • Consider adding a per-actor connection limit (see the sketch after this list)
  2. Message Replay Protection

    • Verify that message indices prevent replay attacks after restoration
    • The wrapping index mechanism seems sound, but needs security review
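
To make the connection-limit suggestion in item 1 concrete, a minimal sketch assuming a simple in-memory count per actor; MAX_HIBERNATABLE_CONNS_PER_ACTOR is a hypothetical limit, not an existing config value.

const MAX_HIBERNATABLE_CONNS_PER_ACTOR = 1024;

function registerHibernatableConn(
	connCounts: Map<string, number>,
	actorId: string,
): void {
	const current = connCounts.get(actorId) ?? 0;
	// Reject the connection before it is persisted, so storage cannot be
	// exhausted by a single actor.
	if (current >= MAX_HIBERNATABLE_CONNS_PER_ACTOR) {
		throw new Error("hibernatable connection limit reached for actor");
	}
	connCounts.set(actorId, current + 1);
}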

Test Coverage Assessment

✅ Well Tested

  • Basic hibernation flow
  • State persistence
  • Connection lifecycle hooks
  • Wrapping arithmetic for message indices
  • Edge cases in actor destroy tests

🔴 Missing Tests

  1. Error Recovery

    • No tests for restoration failures
    • No tests for partial restoration (some connections succeed, others fail)
  2. Concurrent Operations

    • What happens if a message arrives during restoration?
    • What happens if actor is destroyed during hibernation?
  3. Message Ordering

    • Verify message order is preserved across hibernation
    • Test message index wrapping behavior with actual message flow
  4. Performance

    • No load tests for many hibernated connections
    • No tests for message buffer limits

Additional Recommendations

Code Style

  1. Import Organization

    • Imports are properly organized at the top of files (follows CLAUDE.md)
  2. Error Types

    • Good use of custom error types with metadata
    • Properly follows the derive macro pattern
  3. Naming Conventions

    • Timestamps use *_at with past tense (e.g., created_at)
    • State structures follow clear naming patterns

Documentation

  1. ✅ Excellent: HIBERNATABLE_CONNECTIONS.md provides clear lifecycle documentation
  2. ✅ Good: JSDoc comments explain hibernation requirements
  3. 🟡 Missing: Architecture decision record (ADR) for why this approach was chosen
  4. 🟡 Missing: Migration guide for existing users

Follow-up Work

Create issues for:

  1. Implement timeout for onActorStart (mod.ts:949)
  2. Address message retransmission TODO (tunnel.ts:1038)
  3. Add comprehensive error recovery tests
  4. Performance testing with many hibernated connections
  5. Review and standardize timeout values across codebase
  6. Add per-actor connection limits
  7. Security audit of message replay prevention

Summary

This is a high-quality implementation of a complex distributed systems feature. The code is well-structured, thoroughly documented, and includes good test coverage. The main concerns are:

  1. Some race conditions in the Rust code that need attention
  2. Several TODOs that should be addressed or tracked
  3. Missing error recovery test coverage
  4. Timeout handling needs to be implemented

Recommendation: Approve with minor revisions. The critical race condition issue should be addressed, and the onActorStart timeout should be implemented before merging. Other issues can be tracked as follow-up work.

Score: 8.5/10

Great work on this complex feature! 🎉

@NathanFlurry NathanFlurry force-pushed the 11-18-fix_actor-kv_fix_listing_keys branch from 6d04024 to d4f63d9 on November 20, 2025 22:32
@NathanFlurry NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from f50891c to 7410ff0 on November 20, 2025 22:32
@claude

claude bot commented Nov 20, 2025

PR Review: Hibernating WebSocket Protocol Implementation

Overview

This is a substantial PR implementing a new hibernating WebSocket protocol for RivetKit. The changes enable WebSocket connections to persist while actors sleep/hibernate, which is a significant architectural improvement. Overall, the implementation is well-structured with good documentation and comprehensive test coverage.

Strengths ✅

1. Excellent Documentation

  • The HIBERNATABLE_CONNECTIONS.md document provides clear lifecycle diagrams and explanations
  • Well-documented sequence diagrams for connection restoration and message persistence
  • Comprehensive JSDoc comments explaining complex hibernation requirements

2. Strong Test Coverage

  • New hibernation test suite (actor-hibernation.ts) covers key scenarios:
    • Basic connection hibernation
    • State persistence through hibernation cycles
    • Connection closure during hibernation
  • Tests verify lifecycle hooks are called correctly

3. Good Error Handling

  • Consolidated error messages (e.g., "Actor does not exist or was destroyed")
  • Proper cleanup of zombie connections (connected but not persisted)
  • Proper cleanup of stale metadata (persisted but not connected)

4. Architectural Improvements

  • Moved WebSocket-specific logic into dedicated router-websocket-endpoints.ts
  • Better separation of concerns between tunnel, runner, and actor layers
  • Per-actor tracking of pending requests and WebSockets instead of global state

Issues & Concerns 🔴

1. Critical: Race Condition Risk in Hibernation Flow

// engine/sdks/typescript/runner/src/tunnel.ts:397
// TODO: Switch this with runner WS

There's a TODO comment suggesting the current implementation has a known issue. This needs to be addressed before merging.

Location: engine/sdks/typescript/runner/src/tunnel.ts:397

2. Memory Leak Potential

// rivetkit-typescript/packages/rivetkit/src/actor/instance/connection-manager.ts

The new #requestToActor mapping is created but I don't see explicit cleanup when actors are removed. While shutdown() clears it, individual actor stops should also clean up their entries to prevent memory leaks in long-running processes.

Recommendation: Add cleanup in #removeActor or when actors stop.
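
As a sketch of what that cleanup could look like, assuming #requestToActor maps request-ID strings to actor IDs as described; the class and method names here are illustrative, not the SDK's API:

class RequestIndex {
	#requestToActor: Map<string, string> = new Map();

	track(requestId: string, actorId: string): void {
		this.#requestToActor.set(requestId, actorId);
	}

	/** Drop every request entry owned by a stopped actor. */
	removeActor(actorId: string): void {
		// Deleting during iteration is safe for JS Maps.
		for (const [requestId, owner] of this.#requestToActor) {
			if (owner === actorId) this.#requestToActor.delete(requestId);
		}
	}
}

A reverse index (actor ID to request IDs) would make the teardown O(1) per actor, at the cost of extra bookkeeping on every insert.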

3. Error Handling: Unhandled Promise Rejection

// engine/sdks/typescript/runner/src/mod.ts:915
async #handleCommandStartActor(commandWrapper: protocol.CommandWrapper) {
    // ...
    try {
        await this.#config.onActorStart(actorId, generation, actorConfig);
        await this.#tunnel.restoreHibernatingRequests(
            actorId,
            startCommand.hibernatingRequestIds,
        );
    } catch (err) {
        // error handling
    }
}

The previous implementation spawned a background promise, but now it's awaited. This is actually better, but it changes the behavior. Make sure this doesn't cause timeouts or block other command processing.

4. Timeout Constants Need Review

const WEBSOCKET_OPEN_TIMEOUT: Duration = Duration::from_secs(15);
const TUNNEL_ACK_TIMEOUT: Duration = Duration::from_secs(5);

The WebSocket open timeout increased from 2s to 15s. While this might be necessary for hibernation, it could impact user experience if connections fail. Consider:

  • Making this configurable
  • Adding metrics to track actual open times
  • Documenting why 15s is chosen

Location: engine/packages/pegboard-gateway/src/lib.rs:37-38

5. BARE Schema Breaking Change

# rivetkit-typescript/packages/rivetkit/schemas/actor-persist/v3.bare

The BARE schema changed from v2 to v3. Verify there's proper migration logic for existing persisted actors, or document that this is a breaking change requiring actor recreation.

6. Excessive Debug Logging

Multiple new debug logs were added throughout the codebase:

  • engine/packages/pegboard-gateway/src/shared_state.rs - Many new debug statements
  • engine/packages/pegboard-gateway/src/lib.rs - WebSocket ack logging

While helpful for development, consider:

  • Using trace level for very frequent operations (e.g., every message ack)
  • Ensuring these don't impact production performance
  • Having a plan to clean up verbose logging after stabilization

Code Quality Issues ⚠️

1. Inconsistent Error Messages

// engine/sdks/typescript/runner/src/tunnel.ts:1038
// TODO: This will never retransmit the socket and the socket will close

This TODO indicates incomplete functionality that should be resolved.

2. Formatting Issues

// engine/packages/actor-kv/tests/list_edge_cases.rs

The test file has formatting changes that seem to be from cargo fmt. According to CLAUDE.md, you should not run cargo fmt automatically. These changes should probably be reverted or committed separately.

3. Magic Numbers

const HWS_MAX_PENDING_MSGS_SIZE_PER_REQ: u64 = util::size::mebibytes(1);

Good use of named constant, but consider making limits configurable per-actor or documenting why 1MB is the right limit.

Performance Considerations 🔄

1. Batching Optimization

The message index persistence uses smart batching (500KB threshold or 5s debounce), which is good. However:

  • Document these values (why 500KB?)
  • Consider making them tunable
  • Add metrics to measure actual batch sizes

2. Map Lookups

Multiple nested map lookups on hot paths:

actor.pendingRequests.get(...)
actor.webSockets.get(...)
actor.pendingTunnelMessages.get(...)

This is fine for normal loads, but consider performance implications at scale.

Security Concerns 🔒

1. Request ID Validation

When restoring hibernating connections, ensure request IDs from the gateway are validated and can't be spoofed to access other actors' connections.

2. Message Size Limits

Good to see HWS_MAX_PENDING_MSGS_SIZE_PER_REQ limit. Verify this is enforced server-side and can't be bypassed.

Testing Gaps 📋

  1. Edge Cases to Consider:

    • What happens if an actor crashes during hibernation?
    • Behavior when message ack timeout expires during high load
    • Connection restoration when storage is unavailable
    • Concurrent hibernation/wake cycles
  2. Load Testing:

    • Test with many concurrent hibernating connections
    • Verify memory usage doesn't grow unbounded
    • Test rapid hibernate/wake cycles

Recommendations 📝

Before Merge (Critical):

  1. ✅ Resolve the TODO at tunnel.ts:397
  2. ✅ Add #requestToActor cleanup in actor stop flow
  3. ✅ Verify BARE schema migration or document breaking change
  4. ✅ Resolve the retransmit TODO at tunnel.ts:1038
  5. ✅ Remove or address formatting-only changes in list_edge_cases.rs

Soon After Merge:

  1. Add metrics for hibernation success/failure rates
  2. Add metrics for message ack latency
  3. Consider making timeout values configurable
  4. Add load tests for hibernation at scale
  5. Clean up verbose debug logging

Nice to Have:

  1. Document performance characteristics in the hibernation docs
  2. Add examples of proper hibernation configuration
  3. Consider adding a hibernation diagnostic tool

Summary

This is a well-architected feature with good documentation and test coverage. The hibernation mechanism is complex but appears to handle the key scenarios correctly. The main concerns are:

  1. Outstanding TODOs that indicate incomplete functionality
  2. Potential memory management issues
  3. Need for additional edge case testing

Recommendation: Address the critical TODOs and memory cleanup issue before merging. The other concerns can be addressed in follow-up PRs, but should be tracked.


Great work on tackling this complex feature! The sequence diagrams and documentation make it much easier to understand the intricate state management involved. 🚀

@MasterPtato MasterPtato force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from 7410ff0 to f50891c on November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

Pull Request Review: Hibernating WebSocket Protocol Implementation

This is a substantial PR implementing hibernating WebSocket support, a complex feature that allows WebSocket connections to persist while actors sleep. I've reviewed the changes across Rust backend, TypeScript runner SDK, and RivetKit packages. Overall, the implementation is solid with good test coverage, though I have some observations and recommendations.


Positive Highlights

1. Excellent Documentation

  • The HIBERNATABLE_CONNECTIONS.md file provides clear sequence diagrams explaining the lifecycle
  • Well-structured documentation of the three-phase restoration process
  • Clear differentiation between ephemeral and hibernatable connections

2. Comprehensive Test Coverage

  • New hibernation test suite in actor-hibernation.ts covers key scenarios:
    • Basic connection hibernation
    • State persistence through sleep/wake cycles
    • Connection closure during hibernation
    • Multiple connection scenarios
  • Edge case testing in list_edge_cases.rs for KV operations

3. Robust Error Handling

  • Proper handling of zombie connections (connected but not persisted)
  • Stale connection cleanup (persisted but not connected)
  • Message deduplication using wrapping u16 indices
  • Graceful degradation on restoration failures

4. Clean Architecture

  • Good separation of concerns between StateManager, ConnectionManager, and Tunnel
  • Clear distinction between ephemeral and hibernatable connection data
  • Use of symbols (HIBERNATABLE_SYMBOL, CONN_ACTOR_SYMBOL) for internal state

Code Quality Issues

1. Timeout Constants Need Consistency Review ⚠️

engine/packages/pegboard-gateway/src/lib.rs:37-38:

const WEBSOCKET_OPEN_TIMEOUT: Duration = Duration::from_secs(15);
const TUNNEL_ACK_TIMEOUT: Duration = Duration::from_secs(5);

engine/packages/pegboard-gateway/src/shared_state.rs:17-18:

const MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);
const HWS_MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);

Issue: WEBSOCKET_OPEN_TIMEOUT increased from 2s to 15s (7.5x increase), which seems aggressive. Meanwhile, MESSAGE_ACK_TIMEOUT and HWS_MESSAGE_ACK_TIMEOUT are identical (30s).

Recommendation:

  • Document why 15s is needed for WebSocket open
  • Consider if MESSAGE_ACK_TIMEOUT and HWS_MESSAGE_ACK_TIMEOUT should be different values, or consolidate them
  • Add comments explaining the timeout strategy

2. Excessive Logging May Impact Performance 📊

Multiple files have very verbose debug logging:

  • shared_state.rs:234-240 - logs every tunnel ack
  • shared_state.rs:243-261 - logs every message forward with nested debug statements
  • shared_state.rs:295-309 - logs every ack send result
  • shared_state.rs:398-407 - logs every websocket message ack

Issue: In high-throughput scenarios with many concurrent WebSocket connections, this could generate excessive log volume and impact performance.

Recommendation:

  • Consider using trace-level for message-level operations
  • Use debug-level only for connection lifecycle events
  • Add metrics/counters instead of logging every message
  • Consider sampling for high-frequency events

Example refactor:

// Instead of logging every ack
tracing::trace!(request_id=?Uuid::from_bytes(request_id), "received tunnel ack");

// Use metrics
metrics::counter!("tunnel.acks_received").increment(1);

3. Missing Error Context in GC Loop 🔍

shared_state.rs:502-508:

if now.duration_since(earliest_pending_msg.send_instant)
    <= MESSAGE_ACK_TIMEOUT
{
    break 'reason Some(MsgGcReason::MessageNotAcked{message_id:Uuid::from_bytes(earliest_pending_msg.message_id)});
}

Issue: The GC reason includes the message ID, but when this triggers cleanup, the log message doesn't include enough context about what was pending or why it timed out.

Recommendation:

  • Add structured logging when GC triggers to include message age, request context
  • Track how many messages were pending when timeout occurred

4. Race Condition Risk in Restoration ⚠️

engine/sdks/typescript/runner/src/tunnel.ts:134-158:

for (const [requestIdStr, requestId] of requestIdMap) {
    const meta = metaMap.get(requestIdStr);
    
    if (!meta) {
        // Connected but not loaded (not persisted) - close it
        this.log?.warn({
            msg: "closing websocket that is not persisted",
            requestId: requestIdStr,
        });
        
        this.#sendMessage(requestId, {
            tag: "ToServerWebSocketClose",
            val: {
                code: 1000,
                reason: "ws.meta_not_found_during_restore",
                hibernate: false,
            },
        });

Issue: The comment says "This may happen if" but the sentence is incomplete. This suggests there's an understood but undocumented race condition scenario.

Recommendation:

  • Complete the comment explaining when this occurs
  • Document if this is expected during normal operation or indicates a bug
  • Consider if immediate close is appropriate or if retry logic should be attempted

5. Potential Integer Overflow in Message Buffering 🐛

shared_state.rs:186-192:

if let (Some(hs), Some(ws_msg_index)) = (&mut req.hibernation_state, ws_msg_index) {
    hs.total_pending_ws_msgs_size += message_serialized.len() as u64;
    
    if hs.total_pending_ws_msgs_size > HWS_MAX_PENDING_MSGS_SIZE_PER_REQ
        || hs.pending_ws_msgs.len() >= u16::MAX as usize
    {
        return Err(WebsocketPendingLimitReached {}.build());

Issue: While overflow is unlikely (u64 max is huge), the check happens after the increment. If a single message is > u64::MAX - total_pending_ws_msgs_size, this could panic.

Recommendation:

// Check before adding
let new_size = hs.total_pending_ws_msgs_size.saturating_add(message_serialized.len() as u64);
if new_size > HWS_MAX_PENDING_MSGS_SIZE_PER_REQ
    || hs.pending_ws_msgs.len() >= u16::MAX as usize
{
    return Err(WebsocketPendingLimitReached {}.build());
}
hs.total_pending_ws_msgs_size = new_size;

6. Incomplete Error Handling in Background Tasks 🔧

shared_state.rs:276-279:

tokio::spawn(async move {
    match ups_clone
        .publish(

Issue: The spawned task logs errors but doesn't track failed acks. If acks consistently fail, the connection will eventually be GC'd without visibility into why.

Recommendation:

  • Add metrics for failed ack publishes
  • Consider alerting on high ack failure rates
  • Add structured context to the error logs

Security Considerations

1. Message Replay Protection

The implementation correctly handles message deduplication using wrapping u16 indices. The logic in websocket-tunnel-adapter.ts:136-159 properly detects and ignores duplicate messages.

Good: This protects against accidental replay after restoration failures.
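
For reference, the dedup reduces to a wrapping "strictly ahead" check. A sketch under the half-window convention (the adapter's actual helpers may differ):

// Deliver only indices strictly ahead of the last-seen index on the
// u16 ring; anything else is treated as a replay and dropped.
function shouldDeliver(incomingIndex: number, lastSeenIndex: number): boolean {
	const d = (incomingIndex - lastSeenIndex) & 0xffff;
	return d > 0 && d < 0x8000;
}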

2. Memory Exhaustion Protection

HWS_MAX_PENDING_MSGS_SIZE_PER_REQ (1 MiB) and u16::MAX message count limits provide protection against unbounded memory growth.

Consideration: Document what happens to client if they hit this limit. Does the WebSocket close? Is there backpressure?

3. Stale Metadata Cleanup

Good handling of stale/zombie connections during restoration prevents connection leaks.


Performance Considerations

1. State Persistence Strategy 📈

The debounced persistence strategy with 500KB threshold and 5000ms max wait (HIBERNATABLE_CONNECTIONS.md:147-153) is reasonable, but:

Question: Has this been load-tested with:

  • High message rate scenarios (100+ msg/sec)?
  • Many concurrent hibernatable WebSockets (1000+)?
  • Large connection state objects?

Recommendation: Add performance benchmarks to the test suite.
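
For context, the documented strategy reduces to a debounce with a size escape hatch. A sketch under the stated 500KB / 5000ms values, with persist standing in for the driver write (all names illustrative):

class DebouncedPersister {
	#pendingBytes = 0;
	#timer: ReturnType<typeof setTimeout> | undefined;

	constructor(
		private persist: () => Promise<void>,
		private maxBytes = 500 * 1024,
		private maxWaitMs = 5000,
	) {}

	record(messageBytes: number): void {
		this.#pendingBytes += messageBytes;
		if (this.#pendingBytes >= this.maxBytes) {
			// Size threshold crossed: flush immediately.
			void this.#flush();
		} else if (!this.#timer) {
			// Otherwise flush after the max wait, whichever comes first.
			this.#timer = setTimeout(() => void this.#flush(), this.maxWaitMs);
		}
	}

	async #flush(): Promise<void> {
		if (this.#timer) clearTimeout(this.#timer);
		this.#timer = undefined;
		this.#pendingBytes = 0;
		await this.persist();
	}
}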

2. Message Acknowledgment Overhead

Every hibernatable WebSocket message requires an ack round-trip. In high-frequency messaging scenarios, this could become a bottleneck.

Recommendation:

  • Document expected throughput limits for hibernatable WebSockets
  • Consider batching acks if multiple messages arrive within a short window
  • Add metrics to track ack latency

Potential Bugs & Edge Cases

1. Missing Index Wrap Around Documentation 📝

engine/sdks/typescript/runner/src/utils.ts:57-194 implements wrapping arithmetic for u16 indices:

export function wrappingAddU16(a: number, b: number): number {
	return (a + b) & 0xffff;
}

Issue: This is correct but has subtle edge cases around the wrap boundary (65535 -> 0). The tests in utils.test.ts are good, but there's no documentation explaining the wrapping semantics.

Recommendation:

  • Add JSDoc comments explaining the wrapping behavior
  • Document what happens at exactly 65536 messages (does connection need to restart?)
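
As a starting point for that documentation, a hedged sketch of the half-window convention such comparisons usually rely on; wrappingGtU16 here is an assumed shape, not necessarily the repo's implementation:

/** Forward distance from b to a on the u16 ring (0..65535). */
function wrappingDistanceU16(a: number, b: number): number {
	return (a - b) & 0xffff;
}

/**
 * True if a is "ahead of" b by less than half the ring (2^15). Ordering
 * is only meaningful while fewer than 32768 indices separate the values,
 * which is why unacked-message counts must stay below that bound.
 */
function wrappingGtU16(a: number, b: number): boolean {
	const d = wrappingDistanceU16(a, b);
	return d > 0 && d < 0x8000;
}

console.assert(wrappingGtU16(0, 0xffff)); // 0 comes "after" 65535
console.assert(!wrappingGtU16(0xffff, 0));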

2. Error Message Consistency

engine/artifacts/errors/actor.not_found.json:

"message": "The actor does not exist or was destroyed."

engine/packages/pegboard/src/errors.rs:

#[error("not_found", "The actor does not exist or was destroyed.")]
NotFound,

Good: These are now consistent. However, from a user perspective, "does not exist" vs "was destroyed" are very different states. Consider splitting into two error codes for better debugging.

3. Hibernation State Race Condition

shared_state.rs:121-124:

if entry.stopping {
    entry.hibernation_state = None;
    entry.stopping = false;
}

Question: What prevents a race where:

  1. Request starts stopping
  2. New message arrives
  3. Hibernation state is cleared
  4. Message is lost?

Recommendation: Add comments explaining the synchronization guarantees or add additional checks.


Testing Gaps

1. Missing Concurrent Test Scenarios

  • No tests for multiple actors hibernating simultaneously
  • No tests for actor restart while messages are in flight
  • No tests for network partition scenarios (gateway loses connection to runner)

2. Message Ordering Tests

While deduplication is tested, there are no explicit tests for:

  • Message ordering preservation across hibernation
  • Out-of-order message handling (if possible)

3. Performance/Load Tests

Consider adding tests for:

  • 1000 concurrent hibernatable WebSockets
  • High message rate (100 msg/sec) scenarios
  • Large connection state (near CBOR limits)

Minor Issues

1. Typo in Comment

websocket-tunnel-adapter.ts:25:

// mARK: - Internal State

Should be // MARK: - Internal State

2. Formatting Inconsistency

actor-kv/tests/list_edge_cases.rs has many formatting changes that appear to be from cargo fmt. While these improve consistency, they make the diff harder to review.

3. Removed Files

hono-websocket-adapter.ts was deleted (393 lines). Ensure this wasn't needed elsewhere or document why it's safe to remove.


Recommendations Summary

High Priority

  1. Fix potential integer overflow in message buffering (use saturating_add)
  2. Complete the incomplete comment in restoration race condition
  3. Review and document timeout constants strategy
  4. Add metrics for ack failures and high-frequency events

Medium Priority

  1. Reduce debug logging verbosity for high-frequency operations
  2. Add performance benchmarks for hibernatable WebSockets
  3. Document wrapping arithmetic edge cases
  4. Add concurrent/race condition tests

Low Priority

  1. Fix typo in MARK comment
  2. Consider splitting "not found" vs "destroyed" error codes
  3. Document expected throughput limits

Conclusion

This is a well-implemented feature with solid architecture, good test coverage, and thoughtful error handling. The hibernation mechanism is complex but the code is generally clean and well-structured. The main concerns are around performance observability (logging/metrics), timeout tuning, and edge case documentation.

Recommendation: Approve with minor revisions - address the high-priority items before merging.

Great work on implementing this complex feature! The sequence diagrams and documentation make it much easier to understand the lifecycle.

@NathanFlurry NathanFlurry force-pushed the 11-18-fix_actor-kv_fix_listing_keys branch from 6d04024 to d4f63d9 on November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from f50891c to 7410ff0 on November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

PR Review: Hibernating WebSocket Protocol Implementation

This is a substantial and well-architected PR that implements hibernatable WebSocket connections. I've reviewed the code across both Rust (engine/gateway) and TypeScript (runner/RivetKit) components. Overall, the implementation is solid with good attention to detail. Below are my findings:


✅ Strengths

Architecture & Design

  • Clean separation of concerns: The hibernation logic is well-encapsulated in dedicated components (StateManager, ConnectionManager, Tunnel)
  • Comprehensive documentation: The HIBERNATABLE_CONNECTIONS.md provides excellent sequence diagrams showing the lifecycle
  • Message sequencing: Proper handling of wrapping u16 message indices with validation to detect skipped messages
  • Two-phase GC: Smart approach to ensure cleanup doesn't prematurely remove connections before close messages are sent

Error Handling

  • Robust validation: Message index validation catches out-of-order and duplicate messages
  • Graceful degradation: Proper handling of zombie connections (connected but not persisted) and stale connections (persisted but not connected)
  • Error boundaries: Good use of try-catch blocks with appropriate logging and cleanup

Observability

  • Excellent logging: Structured logging with contextual information throughout (request IDs, message IDs, indices)
  • Debug traces: Helpful debug logs for troubleshooting hibernation flow

Test Coverage

  • Comprehensive tests: Good coverage of hibernation scenarios including state persistence, connection restoration, and edge cases
  • Lifecycle verification: Tests properly verify that onConnect/onDisconnect are NOT called during sleep/wake cycles

⚠️ Issues & Concerns

1. Critical: Logic Bug in Timeout Check (engine/packages/pegboard-gateway/src/shared_state.rs:502-505)

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

Bug: This checks if the message is WITHIN the timeout and flags it for GC. Should use > instead of <=.

Impact: This will cause messages to be incorrectly garbage collected immediately after being sent, breaking the entire hibernation protocol.

Fix:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{message_id:Uuid::from_bytes(earliest_pending_msg.message_id)});
}

Same issue on line 512-515 for WebSocket messages.

2. Critical: Inconsistent Actor Check Logic (engine/sdks/typescript/runner/src/tunnel.ts:830)

if (this.#runner.hasActor(actorId, generation)) {
    this.log?.warn({
        msg: "actor not loaded to send response, assuming gateway has closed request",
        // ...
    });
    return;
}

Bug: This returns early when the actor EXISTS (hasActor returns true), which is backwards.

Impact: Responses won't be sent for active actors.

Fix: Should be if (!this.#runner.hasActor(actorId, generation))

Same issue on line 879.

3. Race Condition Risk: Async Command Handling (engine/sdks/typescript/runner/src/mod.ts:754-872)

async #handleCommands(commands: protocol.ToClientCommands) {
    for (const commandWrapper of commands) {
        if (commandWrapper.inner.tag === "CommandStartActor") {
            this.#handleCommandStartActor(commandWrapper); // Spawns promise
        } else if (commandWrapper.inner.tag === "CommandStopActor") {
            await this.#handleCommandStopActor(commandWrapper);
        }
    }
}

Issue: StartActor spawns async without awaiting, while StopActor awaits. If commands arrive in sequence [Start, Stop], the Stop could execute before Start completes, causing race conditions.

Recommendation: Either await both or document why this is intentional. Consider using a command queue.
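
One shape such a queue could take is per-actor promise chaining, so commands for the same actor always execute in arrival order. Purely illustrative, not the runner's API:

class ActorCommandQueue {
	#tails: Map<string, Promise<void>> = new Map();

	enqueue(actorId: string, task: () => Promise<void>): Promise<void> {
		const prev = this.#tails.get(actorId) ?? Promise.resolve();
		// Swallow the previous command's error so one failure does not
		// wedge the queue; real code would log it.
		const next = prev.catch(() => {}).then(task);
		this.#tails.set(actorId, next);
		return next;
	}
}

// Usage sketch: both Start and Stop go through the queue, so Stop can
// never overtake Start for the same actor:
// queue.enqueue(actorId, () => this.#handleCommandStartActor(cmd));
// queue.enqueue(actorId, () => this.#handleCommandStopActor(cmd));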

4. Potential Memory Leak: Pending Save Promises (rivetkit-typescript/packages/rivetkit/src/actor/instance/state-manager.ts:208)

if (!this.#onPersistSavedPromise) {
    this.#onPersistSavedPromise = promiseWithResolvers();
}

Issue: The promise is created but never cleared after resolution. Multiple sequential saves could accumulate promises.

Fix: Clear the promise after resolving:

this.#onPersistSavedPromise?.resolve();
this.#onPersistSavedPromise = undefined;

5. Debug/Testing Code Left In Production (rivetkit-typescript/packages/rivetkit/src/actor/instance/state-manager.ts:479-521)

// Test: Check if KV data is immediately available after write
try {
    // Try kvListAll first
    if ("kvListAll" in this.#actorDriver && typeof this.#actorDriver.kvListAll === "function") {
        const kvEntries = await (this.#actorDriver as any).kvListAll(this.#actor.id);
        this.#actor.rLog.info({
            msg: "KV verification with kvListAll immediately after write",
            // ...
        });
    }
    // ...
}

Issue: This appears to be debugging/verification code that should be removed or gated behind a debug flag.

Recommendation: Remove or wrap in if (DEBUG_MODE) check.


🔍 Minor Issues

6. Timeout Configuration Inconsistency

  • WEBSOCKET_OPEN_TIMEOUT: 15s (lib.rs:37)
  • TUNNEL_ACK_TIMEOUT: 5s (lib.rs:38)
  • MESSAGE_ACK_TIMEOUT: 30s (shared_state.rs:17)
  • HWS_MESSAGE_ACK_TIMEOUT: 30s (shared_state.rs:18)
  • MESSAGE_ACK_TIMEOUT (tunnel.ts): 5s (line 21)

Observation: Different timeout values between Rust and TypeScript implementations. This might be intentional but should be documented.

7. Typo in Comment (engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:25)

// mARK: - Internal State

Should be // MARK: - Internal State

8. Error Message Clarity (engine/artifacts/errors/actor.not_found.json)

Changed from "The actor does not exist" to "The actor does not exist or was destroyed."

Suggestion: Consider separate error codes for not_found vs destroyed to help with debugging.

9. Commented Code (rivetkit-typescript/packages/rivetkit/src/actor/instance/state-manager.ts:524-540)

Large block of commented-out KV listing code. Either remove or add a TODO explaining why it's kept.


🎯 Recommendations

Performance

  1. Message batching: Consider batching WebSocket message acks to reduce network overhead
  2. Connection pooling: Verify reqwest client reuse (per CLAUDE.md guidelines) - looks good ✅

Security

  1. Rate limiting: No obvious rate limiting on hibernation/wake cycles - consider if this could be abused
  2. Max pending messages: Good limits in place (1MB per request, u16::MAX count) ✅

Testing

  1. Edge case coverage: Add tests for:
    • Message index wraparound at u16::MAX
    • Concurrent hibernation/wake cycles
    • Network partition scenarios
  2. Load testing: Test with high connection counts and rapid hibernation cycles

Documentation

  1. Inline comments: Add more comments explaining the two-phase GC mechanism
  2. Migration guide: Document any breaking changes to WebSocket API

📊 Code Quality Assessment

Category       | Rating | Notes
Architecture   | ⭐⭐⭐⭐⭐ | Excellent separation and design
Error Handling | ⭐⭐⭐⭐ | Robust, but has critical bugs to fix
Testing        | ⭐⭐⭐⭐ | Good coverage, could use more edge cases
Documentation  | ⭐⭐⭐⭐⭐ | Excellent sequence diagrams
Performance    | ⭐⭐⭐⭐ | Solid, watch for ack batching opportunity
Security       | ⭐⭐⭐⭐ | Good limits in place

🚨 Action Required Before Merge

  1. Fix critical timeout logic bug in shared_state.rs (Issue #1)
  2. Fix actor check logic in tunnel.ts (Issue #2)
  3. Review async command handling for race conditions (Issue #3)
  4. Fix memory leak in state-manager.ts (Issue #4)
  5. Remove or gate debug code in state-manager.ts (Issue #5)

✨ Overall Assessment

This is a high-quality PR with sophisticated distributed systems patterns. The core architecture is excellent, but there are critical bugs that must be fixed before merging. The hibernation protocol implementation is well-thought-out, and the test coverage demonstrates good understanding of the edge cases.

Recommendation: Request changes to fix critical issues, then approve.

Great work on tackling this complex feature! 🎉

@claude

claude bot commented Nov 21, 2025

Code Review: Hibernating WebSocket Protocol Implementation

Overview

This PR implements a new hibernating WebSocket protocol that allows WebSocket connections to persist while actors go to sleep. This is a substantial change affecting both the Rust backend (pegboard-gateway, pegboard-runner) and TypeScript SDK (runner, rivetkit). The implementation includes comprehensive tests and documentation.

Strengths

Architecture & Design

  • Well-documented: The new HIBERNATABLE_CONNECTIONS.md provides excellent lifecycle diagrams showing connection flow, restoration, and message persistence
  • Comprehensive test coverage: New hibernation tests in actor-hibernation.ts cover basic hibernation, state persistence, and edge cases
  • Clean separation: The actor state is now properly separated between ephemeral and hibernatable connections using discriminated unions
  • Strong typing: New TypeScript interfaces like HibernatingWebSocketMetadata, PendingRequest, and PendingTunnelMessage provide clear contracts

Implementation Quality

  • Proper timeout handling: Added distinct timeouts for different operations (WEBSOCKET_OPEN_TIMEOUT: 15s, TUNNEL_ACK_TIMEOUT: 5s, HWS_MESSAGE_ACK_TIMEOUT: 30s)
  • Wrapping arithmetic: The new utility functions (wrappingGtU16, wrappingLteU16, etc.) with comprehensive tests ensure correct u16 index wrapping for message sequencing
  • Enhanced logging: Added structured logging throughout with proper context (request IDs, message IDs, indices)
  • Proper error handling: Error messages are clear and actionable (e.g., "ws.meta_not_found_during_restore", "ws.stale_metadata")

Issues & Recommendations

Critical Issues

1. Missing Error Handling in Async Operations

  • Location: engine/sdks/typescript/runner/src/mod.ts:915-973
  • The #handleCommandStartActor method now properly awaits onActorStart, but the spawned background promise in #handleCommands may silently swallow errors
if (commandWrapper.inner.tag === "CommandStartActor") {
    // Spawn background promise
    this.#handleCommandStartActor(commandWrapper);
  • Recommendation: Add proper error handling/logging for the background promise or make the entire command handling sequential if order matters

2. Timeout Not Enforced

  • Location: engine/sdks/typescript/runner/src/mod.ts:949
// TODO: Add timeout to onActorStart
  • Recommendation: Implement the timeout to prevent hung actors from blocking the system. Use Promise.race with a timeout promise.
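
A minimal sketch of that recommendation; withTimeout and ACTOR_START_TIMEOUT_MS are hypothetical names, and the commented usage assumes the surrounding method shape quoted above:

const ACTOR_START_TIMEOUT_MS = 30_000;

function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
	let timer: ReturnType<typeof setTimeout> | undefined;
	const timeout = new Promise<never>((_, reject) => {
		timer = setTimeout(
			() => reject(new Error(`${label} timed out after ${ms}ms`)),
			ms,
		);
	});
	// Whichever settles first wins; always clear the timer.
	return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// await withTimeout(
// 	this.#config.onActorStart(actorId, generation, actorConfig),
// 	ACTOR_START_TIMEOUT_MS,
// 	"onActorStart",
// );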

3. Incomplete Crash Handling

  • Location: engine/sdks/typescript/runner/src/mod.ts:970
// TODO: Mark as crashed
// Send stopped state update if start failed
await this.forceStopActor(actorId, generation);
  • Recommendation: Implement proper crash state tracking so the system knows why an actor stopped (crashed vs intentional)

Performance Concerns

4. Unbounded Loop in Restoration

  • Location: engine/sdks/typescript/runner/src/tunnel.ts:136-222
  • The restoreHibernatingRequests method processes all connections sequentially with background operations
  • If an actor has many hibernated connections, this could cause startup delays
  • Recommendation: Consider batching operations or adding a concurrent limit with Promise.all on smaller chunks

5. Message Buffering Without Backpressure

  • Location: engine/packages/pegboard-gateway/src/shared_state.rs:189
if hs.total_pending_ws_msgs_size > HWS_MAX_PENDING_MSGS_SIZE_PER_REQ
    || hs.pending_ws_msgs.len() >= u16::MAX as usize
  • The system tracks pending message size but only checks the limit after adding the message
  • Recommendation: Check limit before accepting the message to prevent exceeding bounds

Code Quality Issues

6. Multiple Incomplete TODOs
Found several TODOs that should be addressed:

  • tunnel.ts:397: "TODO: Switch this with runner WS"
  • tunnel.ts:1038: "TODO: This will never retransmit the socket and the socket will close"
  • websocket-tunnel-adapter.ts:535: Empty TODO
  • Recommendation: Either implement these or create tracking issues if they're deferred work

7. HACK Comment Without Explanation

  • Location: rivetkit-typescript/packages/rivetkit/src/actor/router-websocket-endpoints.ts:264
// HACK: Close socket in order to fix bug with Cloudflare leaving WS in closing state
  • Recommendation: Add more context about the Cloudflare bug, when it occurs, and whether there's a better solution

8. Commented Out Code

  • Location: rivetkit-typescript/packages/rivetkit/src/actor/protocol/serde.ts:77
  • Multiple commented-out sections with TODOs
  • Recommendation: Remove if not needed or implement the cbor-x fix

Security & Correctness

9. Request ID Collision Risk

  • Location: engine/sdks/typescript/runner/src/tunnel.ts:52
/** Maps request IDs to actor IDs for lookup */
#requestToActor: Map<string, string> = new Map();
  • Request IDs are UUIDs converted to strings for map keys, which is fine
  • However, there's no validation that request IDs are unique across actor restarts
  • Recommendation: Add validation or documentation about how request ID uniqueness is guaranteed

10. Potential Race in State Persistence

  • Location: The documentation mentions debouncing message index persistence, but the implementation details are in driver-specific code
  • If an actor crashes between receiving a message and persisting the index, messages could be replayed
  • Recommendation: Document this as an expected behavior (at-least-once delivery) or implement additional safety mechanisms

Testing Gaps

11. Missing Edge Case Tests
While the tests are comprehensive, I'd suggest adding:

  • Tests for the GC behavior when messages timeout
  • Tests for concurrent restoration of multiple connections
  • Tests for what happens when hibernatableWebSocket.loadAll fails
  • Stress tests with the maximum pending message size/count

12. Test Timeout Values

  • Location: rivetkit-typescript/packages/rivetkit/src/driver-test-suite/tests/actor-hibernation.ts:28
await waitFor(driverTestConfig, HIBERNATION_SLEEP_TIMEOUT + 100);
  • Adding just 100ms buffer seems fragile for CI environments
  • Recommendation: Use a more generous buffer (e.g., + 500 or + 1000)

Documentation

13. Missing Migration Guide

  • This is a breaking change to the WebSocket protocol (evident from the openapi.json changes removing 277 lines)
  • Recommendation: Add migration notes for existing users about what changed and how to update their code

Minor Issues

  1. Formatting inconsistency: Some files have inconsistent spacing in object destructuring
  2. Import organization: Some files have all imports at the top (good) but a few have inline imports
  3. Error message consistency: Mix of capitalization in error messages ("Actor destroyed" vs "actor destroyed")

Positive Patterns to Note

  • ✅ Excellent use of discriminated unions for connection state (ConnData<CP, CS>)
  • ✅ Proper use of symbols for private/internal properties (HIBERNATABLE_SYMBOL, CONN_ACTOR_SYMBOL)
  • ✅ Good separation between protocol concerns and business logic
  • ✅ Comprehensive structured logging that will aid debugging

Overall Assessment

This is a well-architected and well-tested implementation of a complex feature. The hibernation protocol is thoughtfully designed with proper handling of restoration, message ordering, and connection lifecycle.

The main concerns are around:

  1. Completing the TODOs (especially timeout and crash handling)
  2. Ensuring all async operations have proper error handling
  3. Adding a migration guide for the breaking changes

Recommendation: Request changes to address the critical issues (#1, #2, #3) and TODOs before merging. The other issues can be addressed in follow-up PRs if time is tight.


Code Quality: 8/10
Test Coverage: 9/10
Documentation: 8/10
Security: 8/10

Great work on this complex feature! 🚀

@NathanFlurry NathanFlurry force-pushed the 11-18-fix_actor-kv_fix_listing_keys branch from 6d04024 to d4f63d9 on November 21, 2025 01:27
@NathanFlurry NathanFlurry force-pushed the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch from f50891c to 7410ff0 on November 21, 2025 01:27
@claude

claude bot commented Nov 21, 2025

Code Review: Hibernating WebSocket Protocol Implementation

This PR implements a new hibernating WebSocket protocol that allows WebSocket connections to persist while actors sleep. This is a substantial and well-architected change with generally high code quality. Below are my findings:


Strengths

1. Excellent Test Coverage

  • Added comprehensive hibernation tests in actor-hibernation.ts
  • Tests cover key scenarios: basic hibernation, state persistence, connection cleanup during hibernation
  • New fixture actor (hibernation.ts) with proper lifecycle tracking
  • Good use of test utilities and proper async handling

2. Well-Documented Architecture

  • Extensive documentation in HIBERNATABLE_CONNECTIONS.md (217 lines)
  • Clear inline comments explaining complex state transitions
  • Good JSDoc comments outlining requirements for hibernatable WebSocket handlers

3. Proper Error Handling Consolidation

  • Removed duplicate error definitions from guard/errors.rs
  • Consolidated to use pegboard::errors::Actor::NotFound
  • Improved error message: "The actor does not exist or was destroyed" (more accurate)

4. Thorough Logging

  • Added structured logging throughout using tracing on Rust side
  • Proper use of ?x format for structured fields
  • Good debug visibility for message acks, hibernation events, request lifecycle

⚠️ Issues & Concerns

1. Timeout Logic Inconsistency ⚠️

Location: engine/packages/pegboard-gateway/src/shared_state.rs:502-516

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

Problem: The logic is inverted. This checks if the message is within the timeout and flags it for GC. It should use >= MESSAGE_ACK_TIMEOUT or > MESSAGE_ACK_TIMEOUT.

Impact: Messages would be GC'd immediately instead of after timeout expires.

Same issue at line 512:

if now.duration_since(earliest_pending_ws_msg.send_instant) <= HWS_MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::WebSocketMessageNotAcked{...});
}

2. Missing Null Check in Runner 🔴

Location: engine/sdks/typescript/runner/src/mod.ts:915

async #handleCommandStartActor(commandWrapper: protocol.CommandWrapper) {
    if (!this.#tunnel) throw new Error("missing tunnel on actor start");

Problem: This function is called from #handleCommands, which is triggered by WebSocket messages. If the tunnel is missing, this will throw and potentially crash the process or leave actors in an inconsistent state.

Recommendation:

  • Add this check at the beginning of #handleCommands
  • Or ensure #tunnel is always initialized before any commands are processed
  • Consider graceful degradation rather than throwing

3. Potential Memory Leak in Pending Messages ⚠️

Location: engine/packages/pegboard-gateway/src/shared_state.rs:186-203

If HWS_MAX_PENDING_MSGS_SIZE_PER_REQ (1 MiB) or u16::MAX messages are reached, the connection is terminated with WebsocketPendingLimitReached. However:

  • No logging when approaching the limit (e.g., 80% threshold)
  • No metrics/monitoring for tracking this condition
  • Consider adding warning logs when crossing 50% and 75% thresholds

4. Sequential Command Handling Changed to Parallel 🤔

Location: engine/sdks/typescript/runner/src/mod.ts:864-873

if (commandWrapper.inner.tag === "CommandStartActor") {
    // Spawn background promise
    this.#handleCommandStartActor(commandWrapper);
} else if (commandWrapper.inner.tag === "CommandStopActor") {
    await this.#handleCommandStopActor(commandWrapper);
}

Concern: StartActor is now fire-and-forget (no await), while StopActor is awaited. This could create race conditions if:

  • Multiple start commands arrive in quick succession
  • Start and Stop commands for the same actor arrive close together

Recommendation: Either await both or document why this asymmetry is intentional.

5. WEBSOCKET_OPEN_TIMEOUT Increase ⚠️

Location: engine/packages/pegboard-gateway/src/lib.rs:37-38

Changed from TUNNEL_ACK_TIMEOUT: 2s to WEBSOCKET_OPEN_TIMEOUT: 15s and TUNNEL_ACK_TIMEOUT: 5s.

Concern: 15 seconds is quite long for a WebSocket open operation. Consider:

  • Is this accounting for cold starts? If so, document it
  • Could this mask underlying performance issues?
  • Consider making this configurable

6. Error Handling in Ack Spawned Task

Location: engine/packages/pegboard-gateway/src/shared_state.rs:294-323

The ack message sending is spawned in a background task with error logging. If the ack consistently fails:

  • The runner will keep retrying messages
  • Could lead to message duplication or indefinite retry loops

Recommendation: Consider adding retry limits or circuit breaker pattern.


🔍 Code Quality Observations

Good Practices

  • ✅ Proper use of TypeScript types and interfaces
  • ✅ Good separation of concerns (state manager, connection manager, event manager)
  • ✅ Consistent error handling patterns following CLAUDE.md guidelines
  • ✅ No workspace dependency violations

Minor Issues

  • Some long functions that could be refactored (e.g., restoreHibernatingRequests is ~150 lines)
  • Consider extracting magic numbers to named constants (e.g., hibernation thresholds)

🧪 Test Coverage Assessment

Well covered:

  • Basic hibernation flow
  • State persistence through sleep/wake cycles
  • Connection lifecycle hooks
  • Cleanup during hibernation

Could use additional tests:

  • Race conditions (multiple messages during wake)
  • Timeout scenarios (message ack timeout, WebSocket open timeout)
  • Error recovery paths
  • Memory limit edge cases (approaching 1 MiB pending messages)
  • Concurrent actor start/stop commands

🔒 Security Review

No critical security issues found, but note:

  1. ✅ Proper input validation on message indices
  2. ✅ Memory limits in place for pending messages
  3. ✅ No obvious injection vulnerabilities
  4. ✅ Proper UUID handling throughout

📊 Performance Considerations

  1. Message Buffering: The 1 MiB limit per request is reasonable, but consider:

    • Adding metrics to track buffer utilization
    • Making this configurable per actor type
  2. GC Interval: 60-second GC interval is conservative. Monitor in production to see if this can be optimized.

  3. State Persistence: Good use of debouncing for message index persistence. Consider adding metrics for persistence latency.


📝 Recommendations

Critical (Must Fix)

  1. 🔴 Fix timeout comparison logic in shared_state.rs:502-516
  2. 🔴 Add proper null checks for tunnel in command handlers

High Priority (Should Fix)

  1. ⚠️ Add warning logs when approaching pending message limits
  2. ⚠️ Document or fix the async/await asymmetry in command handling
  3. ⚠️ Consider reducing WEBSOCKET_OPEN_TIMEOUT or document why 15s is needed

Nice to Have

  1. Add metrics for hibernation events, buffer utilization, and timeout occurrences
  2. Add integration tests for timeout scenarios
  3. Consider adding circuit breaker for failed ack messages
  4. Extract magic numbers to named constants

Summary

This is a well-implemented feature with good architecture, documentation, and test coverage. The main concerns are around the timeout logic bug which would cause immediate GC instead of waiting for timeout, and some error handling edge cases that could lead to inconsistent state.

Once the critical timeout bug is fixed and error handling is strengthened, this should be good to merge.

Estimated risk level: Medium (due to timeout logic bug)
Recommendation: Request changes for critical issues, then approve

@claude

claude bot commented Nov 21, 2025

Code Review: Hibernating WebSocket Protocol Implementation

This PR implements a new hibernating WebSocket protocol that allows WebSocket connections to persist while actors go to sleep. This is a substantial change with ~4,100 additions and ~3,200 deletions across 69 files. Overall, the implementation is well-architected with good test coverage and documentation. Below are my findings:


✅ Strengths

1. Excellent Documentation

  • The new HIBERNATABLE_CONNECTIONS.md provides clear sequence diagrams showing the lifecycle of hibernating connections
  • The diagrams cover all critical phases: new connections, restoration, message persistence, and closure
  • Inline documentation in Runner.websocket callback clearly explains the three implementation requirements

2. Robust State Management

  • The implementation properly handles the three critical cases during restoration (sketched after this list):
    • Connected AND persisted → restore connection
    • Connected but NOT persisted → close zombie connection
    • Persisted but NOT connected → close stale metadata
  • State persistence is debounced intelligently with both time-based (stateSaveInterval) and size-based thresholds (500KB)
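
Condensed into TypeScript, that three-case decision table looks roughly like this (all names illustrative):

function planRestoration(
	connected: Set<string>, // live request IDs reported by the gateway
	persisted: Set<string>, // metadata loaded from storage
): { restore: string[]; closeZombie: string[]; dropStale: string[] } {
	const restore: string[] = [];
	const closeZombie: string[] = [];
	for (const id of connected) {
		// Connected AND persisted -> restore; connected only -> zombie.
		(persisted.has(id) ? restore : closeZombie).push(id);
	}
	// Persisted only -> stale metadata to discard.
	const dropStale = [...persisted].filter((id) => !connected.has(id));
	return { restore, closeZombie, dropStale };
}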

3. Comprehensive Test Coverage

  • New hibernation tests cover key scenarios:
    • Basic connection hibernation
    • State persistence through hibernation cycles
    • Connection closure during hibernation
  • Tests verify lifecycle hooks are called correctly (onWake/onSleep vs onConnect/onDisconnect)
  • New utility tests for wrapping u16 comparisons are thorough

4. Good Logging

  • Extensive structured logging throughout the implementation
  • All critical events are logged with proper context (requestId, actorId, messageIndex)
  • Log levels are appropriate (debug for normal flow, warn for edge cases, error for failures)

5. Proper Error Handling

  • Gateway properly handles message ACK timeouts with separate timeouts for different message types
  • Failed restoration attempts are caught and connections are properly closed
  • Runner shutdown now rejects pending requests with a specific RunnerShutdownError

⚠️ Issues & Concerns

1. Potential Race Condition in Message Acknowledgment ⚠️

In engine/packages/pegboard-gateway/src/shared_state.rs:231-243, there's a potential race where a message could be acked after the in-flight request is already removed:

let Some(mut in_flight) =
    self.in_flight_requests.get_async(&msg.request_id).await
else {
    tracing::warn!(
        request_id=?Uuid::from_bytes(msg.request_id),
        message_id=?Uuid::from_bytes(msg.message_id),
        "in flight has already been disconnected, cannot ack message"
    );
    continue;  // ⚠️ Silent failure
};

Impact: The message acknowledgment is silently dropped. This could cause the gateway to keep messages buffered unnecessarily.

Recommendation: Track whether this is expected behavior or if there should be additional cleanup.

2. Missing Timeout for onActorStart ⚠️

In engine/sdks/typescript/runner/src/mod.ts:931, there's a TODO comment:

// TODO: Add timeout to onActorStart
await this.#config.onActorStart(actorId, generation, actorConfig);

Impact: If onActorStart hangs, the actor will never start and the runner could be blocked indefinitely.

Recommendation: Add a timeout (e.g., 30 seconds) with proper error handling and actor cleanup on timeout.

3. Message Index 0 Is Skipped 🔍

In engine/packages/pegboard-gateway/src/shared_state.rs:154:

// TODO: This ends up skipping 0 as an index when initiated but whatever
msg.index = hs.last_ws_msg_index.wrapping_add(1);

Impact: Message index 0 is never used. While this doesn't break functionality, it could cause confusion during debugging.

Recommendation: Initialize last_ws_msg_index to u16::MAX so the first message gets index 0, or document why 0 is intentionally skipped.

4. Unbounded Background Operations 🔍

In engine/sdks/typescript/runner/src/tunnel.ts:131-207, background operations during restoration are tracked but could accumulate:

const backgroundOperations: Promise<void>[] = [];
// ... many operations added ...
await Promise.allSettled(backgroundOperations);

Impact: If there are thousands of connections, this could consume significant memory and create thundering herd issues.

Recommendation: Consider batching restoration operations (e.g., process 100 connections at a time).
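
A sketch of that batching, with CHUNK_SIZE and restoreOne as illustrative names rather than the SDK's API:

const CHUNK_SIZE = 100;

async function restoreInChunks<T>(
	items: T[],
	restoreOne: (item: T) => Promise<void>,
): Promise<void> {
	for (let i = 0; i < items.length; i += CHUNK_SIZE) {
		const chunk = items.slice(i, i + CHUNK_SIZE);
		// allSettled isolates failures to individual connections while
		// capping concurrency at the chunk size.
		const results = await Promise.allSettled(chunk.map(restoreOne));
		for (const r of results) {
			if (r.status === "rejected") console.warn("restore failed", r.reason);
		}
	}
}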

5. Message ACK Timeout Inconsistency 🔍

Different timeout values are used for similar operations:

  • MESSAGE_ACK_TIMEOUT = 5000ms (runner/tunnel.ts:21)
  • MESSAGE_ACK_TIMEOUT = 30000ms (pegboard-gateway/shared_state.rs:17)
  • HWS_MESSAGE_ACK_TIMEOUT = 30000ms (pegboard-gateway/shared_state.rs:18)

Recommendation: Document why these timeouts differ or unify them if they should be consistent.

6. Magic Numbers in Code 🔍

Several magic numbers could be constants:

  • 500KB threshold for immediate message persistence (only in comment)
  • u16::MAX (65535) for connection limits
  • 1000 close code used in multiple places

Recommendation: Extract to named constants for maintainability.


🔒 Security Considerations

Good Security Practices

  1. Proper cleanup of zombie connections: Connections not properly persisted are closed with appropriate reason codes
  2. Message size limits: HWS_MAX_PENDING_MSGS_SIZE_PER_REQ = 1 MiB prevents memory exhaustion
  3. Error messages don't leak sensitive info: Error messages are generic ("actor not found or was destroyed")

⚠️ Areas to Monitor

  1. DoS via hibernation: An attacker could create many hibernating connections to exhaust storage. Consider rate limiting or quotas.
  2. Message replay: If message indices aren't properly tracked, messages could be replayed. The wrapping u16 comparisons look correct, but extensive testing is needed.

🚀 Performance Considerations

Optimizations

  1. Debounced state persistence: Prevents excessive KV writes
  2. Parallel restoration: Operations run concurrently during restoration
  3. Message batching: Gateway buffers messages and acks in batches

⚠️ Potential Issues

  1. Large connection restoration: No backpressure if thousands of connections need restoration
  2. Unbounded pending messages: While there's a 1 MiB limit, this is per-request. With many concurrent requests, total memory could be high.
  3. GC interval: 15-second GC interval for shared state could lead to temporary memory buildup

📝 Code Quality

Follows Best Practices

  • Proper use of TypeScript types and interfaces
  • Good separation of concerns (StateManager, ConnectionManager, EventManager)
  • Consistent error handling patterns
  • Follows Rust conventions (structured logging, error types)

⚠️ Minor Issues

  1. TODO comments: 10+ TODO comments in the PR (see grep output above). While most are minor, they should be tracked.
  2. Typo in comment: "Remvoe" → "Remove" (line 6352 in diff)
  3. Commented-out code: Some sections have commented code that should be removed if no longer needed

🧪 Test Coverage

Well Tested

  • New hibernation test suite with 3 comprehensive tests
  • Tests verify state persistence, lifecycle hooks, and connection cleanup
  • Wrapping u16 comparison utilities have excellent test coverage (194 lines of tests)
  • Tests use proper async/await patterns and vitest utilities

⚠️ Gaps

  1. Edge cases: No tests for message index wrapping at u16::MAX
  2. Error scenarios: Limited tests for failures during restoration
  3. Performance: No tests for handling large numbers of hibernating connections
  4. Timeout scenarios: No tests for ACK timeout behavior

Recommendation: Add tests for:

  • Message index wrapping (indices near 65535); a test sketch follows this list
  • Connection restoration failures
  • Gateway message buffer limits
  • ACK timeout and retry behavior
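
A vitest sketch for the first of these, exercising indices at the u16 boundary; wrappingGtU16 assumes the half-window convention and may not match the repo's helper exactly:

import { describe, expect, it } from "vitest";

const wrappingAddU16 = (a: number, b: number) => (a + b) & 0xffff;
const wrappingGtU16 = (a: number, b: number) => {
	const d = (a - b) & 0xffff;
	return d > 0 && d < 0x8000;
};

describe("u16 index wraparound", () => {
	it("wraps 65535 + 1 to 0", () => {
		expect(wrappingAddU16(0xffff, 1)).toBe(0);
	});

	it("orders indices across the boundary", () => {
		expect(wrappingGtU16(0, 0xffff)).toBe(true); // 0 follows 65535
		expect(wrappingGtU16(0xffff, 0)).toBe(false);
	});
});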

📋 Summary

This is a high-quality implementation of a complex feature. The architecture is sound, documentation is excellent, and test coverage is good. The main concerns are:

  1. Add timeout to onActorStart (high priority)
  2. Review race condition in message ACK (medium priority)
  3. Add more edge case tests (medium priority)
  4. Consider batching during restoration (low priority)
  5. Clean up TODO comments (low priority)

Recommendation: ✅ Approve with minor changes

The implementation is production-ready with the caveat that the onActorStart timeout should be added before merge to prevent potential deadlocks.


Great work on this feature! The hibernating WebSocket protocol will significantly improve resource efficiency for long-lived connections. 🎉

@graphite-app
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-13-chore_rivetkit_implement_new_hibernating_ws_protocol branch November 21, 2025 02:43