Skip to content

feat(pb): actor v1 to v2 migration#4548

Draft
MasterPtato wants to merge 1 commit into04-01-feat_envoy-client_fully_flesh_out_tunnel_implfrom
04-02-feat_pb_actor_v1_to_v2_migration
Draft

feat(pb): actor v1 to v2 migration#4548
MasterPtato wants to merge 1 commit into04-01-feat_envoy-client_fully_flesh_out_tunnel_implfrom
04-02-feat_pb_actor_v1_to_v2_migration

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app bot commented Apr 3, 2026

🚅 Deployed to the rivet-pr-4548 environment in rivet-frontend

Service Status Web Updated (UTC)
frontend-cloud ❌ Build Failed (View Logs) Web Apr 3, 2026 at 3:42 am
website ❌ Build Failed (View Logs) Web Apr 3, 2026 at 3:42 am
mcp-hub ✅ Success (View Logs) Web Apr 3, 2026 at 3:42 am
frontend-inspector ❌ Build Failed (View Logs) Web Apr 3, 2026 at 3:42 am
ladle ❌ Build Failed (View Logs) Web Apr 3, 2026 at 3:41 am
kitchen-sink ❌ Build Failed (View Logs) Web Apr 3, 2026 at 3:41 am

@claude
Copy link
Copy Markdown

claude bot commented Apr 3, 2026

PR Review: feat(pb): actor v1 to v2 migration

This PR implements actor v1 to v2 migration with a clear separation of the Reallocating state from Sleeping, along with several correctness fixes. Overall the approach is solid. A few things to flag:

Bugs / Correctness

1. Typo in description (double "before")config-schema.json and config/pegboard.rs

The description reads: "How long to wait after starting to attempt to reallocate before before setting actor to sleep." The word before appears twice.

2. Reallocating to Sleeping transition logicactor2/mod.rs

The check state.retry_backoff_state.last_retry_ts > *since_ts + actor_retry_duration_threshold() checks whether the last retry occurred more than threshold ms after since_ts. But if last_retry_ts is stale from a previous Reallocating epoch, the comparison could behave unexpectedly. If last_retry_ts is 0/unset, this never triggers. The intent seems to be "have we been retrying for more than N ms since entering this state", which might be better expressed as now > *since_ts + threshold with now captured via a GetTs activity for determinism. Worth validating against actual last_retry_ts update semantics.

3. Potential race in gateway v1 migration pathpegboard_gateway.rs

When migrate_sub.next() fires and we delegate to handle_actor_v2, the v1 workflow has dispatched v2 and emitted MigratedToV2, but may still be in the process of returning. Worth confirming that v1's pegboard_actor returns Ok(()) immediately after dispatching v2 without any further envoy interaction, so there is no brief window of contention between both workflows.

Style / Minor

4. Inconsistent now sourcing in reschedule_actorruntime.rs

allocate now returns AllocateOutput { allocation, now }, and handle_stopped correctly uses allocate_res.now. But reschedule_actor then calls a fresh GetTs activity after calling allocate internally. Since allocate returns now in its output, it could be reused to avoid an extra activity replay step.

Positive changes worth noting

  • Determinism fix: Replacing all direct util::timestamp::now() calls inside workflow code with ctx.activity(runtime::GetTsInput {}) is the right approach for workflow replay determinism. Good catch throughout mod.rs and runtime.rs.
  • Reallocating state extraction: Splitting Sleeping { attempting_reallocation: bool } into distinct Sleeping and Reallocating { since_ts } variants is a meaningful clarity improvement.
  • handleShutdown idempotency guard: The shuttingDown flag cleanly prevents double-shutdown.
  • res bug fix in connection.ts: The !res && !errored guard correctly prevents overwriting a previously-set result when a WebSocket error breaks the loop.
  • Missing .tag("actor_id", ...): Adding the actor ID tag when dispatching the v2 workflow from v1 is critical for correct workflow routing.
  • API endpoint fix: /actors2 to /actors in the remote manager driver was clearly a bug.
  • Runner pool metadata namespace fix: keys::subspace() to namespace::keys::subspace() in runner_pool_metadata_poller.rs looks like an important correctness fix.

Checklist (draft PR)

  • PR description template is unfilled — tests, motivation, and checklist should be completed before marking ready for review.
  • scripts/release/update_version.ts should be verified since the OpenAPI version bumped from 2.2.0 to 2.2.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant