fix(atenet): retry transient upstream resets when routing to resumed actors by mccormickt · Pull Request #278 · agent-substrate/substrate

Tommy McCormick (mccormickt) · 2026-06-18T22:01:10Z

Adds a retry policy to the actor route so that transient upstream failures are retried until the actor is ready. These failures commonly surface as 503 errors during the brief window between a request hitting Envoy and the actor finishing its resume and being able to serve requests. The change also configures a retry budget on the cluster's circuit breaker: budget_percent is set to 20% of load (Envoy's default), and min_retry_concurrency is set to 20, an arbitrary floor above the default of 3. Open to suggestions on these values from experience, or we can simply observe and iterate from here.
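To make the interaction between the two budget values concrete, here is a minimal sketch of the arithmetic. It assumes the documented Envoy semantics (the percentage applies to active plus pending requests, and the floor is a lower bound on the limit). The function name and the rounding are illustrative assumptions, not Envoy's code.

```python
import math


def allowed_concurrent_retries(
    active_requests: int,
    budget_percent: float = 20.0,
    min_retry_concurrency: int = 20,
) -> int:
    """Approximate the retry-budget ceiling for a given load.

    active_requests stands for active plus pending requests on the cluster.
    Rounding down is an assumption made for this sketch.
    """
    if active_requests < 0:
        raise ValueError("active_requests must be non-negative")
    scaled = math.floor(active_requests * budget_percent / 100.0)
    # The floor keeps low-traffic bursts from exhausting the budget.
    return max(min_retry_concurrency, scaled)


if __name__ == "__main__":
    for load in (10, 100, 500):
        print(load, allowed_concurrent_retries(load))
```

Under these assumptions the floor of 20 dominates until load exceeds 100 in-flight requests, after which the 20% term takes over.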

Reproduced with Claude on a kind cluster with the multi-template demo to inform the changes.

Fixes #218.

Tests pass
Appropriate changes to documentation are included in the PR

…actors

When a request hits the router for a suspended actor, the ext_proc filter resumes the actor and rewrites :authority to the actor pod's IP:80, which the dynamic_forward_proxy cluster then connects to. In the brief window after resume returns but before the restored workload is accepting connections (or when a pooled connection to a just-suspended actor has gone stale), Envoy's upstream connection is reset before response headers. The actor route had no retry policy, so each such reset became an immediate 503 "upstream connect error or disconnect/reset before headers. reset reason: connection termination".

Add a retry policy to the actor route (retry_on "reset,connect-failure", 5 retries with 50ms-1s backoff) so these transient failures are retried once the listener is ready.

A retry policy alone is not enough: every actor is routed through the single dynamic_forward_proxy cluster, whose retry circuit breaker defaults to only 3 concurrent retries cluster-wide, so a burst of concurrent requests to a just-resumed actor overflows it and the excess fails with 503 (UO) instead of retrying. Rather than inflate the static max_retries (which exists to cap retry amplification during an outage), configure a retry budget on the cluster: budget_percent 20% (Envoy's default) scales the allowed retries with load, with min_retry_concurrency 20 as a low-traffic floor above the default of 3. Other circuit breakers keep their defaults.

Reproduced on a kind cluster with the multi-template demo: concurrent requests to an actor immediately after suspend produced intermittent 503s with envoy response flags UC (connection termination) and UO (overflow). With the fix deployed (retry policy and retry budget verified live in the envoy config dump) the same protocol produced 0 failures across 1600+ requests.

Fixes agent-substrate#218.
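For a sense of how long a request may wait under the retry policy described above, the following sketch computes per-retry backoff ceilings. It assumes Envoy's documented fully jittered exponential backoff, where retry N waits a random time in [0, (2^N - 1) * base) capped at the maximum interval, and it assumes "50ms-1s" means a 50ms base interval and a 1s maximum interval. The helper names are hypothetical.

```python
import random


def backoff_upper_bound_ms(retry_number: int, base_ms: int = 50, max_ms: int = 1000) -> int:
    """Upper bound of the jittered backoff window for the given retry (1-based)."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    return min(max_ms, (2 ** retry_number - 1) * base_ms)


def sample_backoff_ms(retry_number: int, base_ms: int = 50, max_ms: int = 1000) -> float:
    """Draw one jittered delay from [0, upper bound)."""
    return random.random() * backoff_upper_bound_ms(retry_number, base_ms, max_ms)


if __name__ == "__main__":
    bounds = [backoff_upper_bound_ms(n) for n in range(1, 6)]
    print(bounds)       # ceilings for retries 1 through 5
    print(sum(bounds))  # worst-case cumulative backoff in milliseconds
```

With these assumed parameters the ceilings are 50, 150, 350, 750 and 1000 ms, so five retries add at most about 2.3 seconds of backoff before the request fails, which bounds how long a slow-resuming actor can be masked by retries alone.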

Bowei Du (bowei) · 2026-06-25T05:49:55Z

I am wondering if we should be doing the retries here or add a ready notion to an Actor so we can tell if it can signal to the system that it's done resuming. One concern I have is that there are not going to be good ways to set the retry policy across a diverse set of Actors.

Tommy McCormick (mccormickt) · 2026-06-25T14:11:39Z

> I am wondering if we should be doing the retries here or add a ready notion to an Actor so we can tell if it can signal to the system that it's done resuming. One concern I have is that there are not going to be good ways to set the retry policy across a diverse set of Actors.

Fair point, this was really just a coarse default data-plane policy. So perhaps a more complete solution is:

- A STATUS_READY state in Actors since STATUS_RUNNING only means the restore/run operation has succeeded (could also just be a more strict transition rather than another status)
- A retry_policy per ActorTemplate for configuration per class of Actor. Should there be a default policy? If so, should it live in the template or in Envoy?
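One possible shape for these two ideas, purely as a sketch: every type, field, and function name below is hypothetical and does not come from the substrate codebase. The example assumes the template-level answer to the default-policy question (a template may omit its policy and fall back to a shared default), which is only one of the options raised.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActorStatus(Enum):
    RUNNING = "STATUS_RUNNING"  # restore/run succeeded, workload may not be listening yet
    READY = "STATUS_READY"      # hypothetical: actor signalled it can serve requests


@dataclass(frozen=True)
class RetryPolicy:
    # Defaults mirror the values used in this PR's route-level policy.
    retry_on: str = "reset,connect-failure"
    num_retries: int = 5
    base_interval_ms: int = 50
    max_interval_ms: int = 1000


DEFAULT_RETRY_POLICY = RetryPolicy()


@dataclass(frozen=True)
class ActorTemplate:
    name: str
    retry_policy: Optional[RetryPolicy] = None  # None means "use the default"


def effective_retry_policy(template: ActorTemplate) -> RetryPolicy:
    """Resolve the policy for a class of Actor, falling back to the default."""
    if template.retry_policy is not None:
        return template.retry_policy
    return DEFAULT_RETRY_POLICY


def is_routable(status: ActorStatus) -> bool:
    """With a ready notion, only READY actors would receive traffic."""
    return status is ActorStatus.READY
```

If readiness gating were in place, the retry policy would mostly cover stale pooled connections rather than the resume window itself.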

Bowei Du (bowei) self-assigned this Jun 18, 2026

Tommy McCormick (mccormickt) force-pushed the push-omsqxovpxsus branch from 245b264 to de17521 Compare June 19, 2026 17:45

Tommy McCormick (mccormickt) force-pushed the push-omsqxovpxsus branch from de17521 to d69feeb Compare June 23, 2026 19:39

Tommy McCormick (mccormickt) wants to merge 1 commit into agent-substrate:main from mccormickt:push-omsqxovpxsus
