fix(atenet): retry transient upstream resets when routing to resumed actors#278

Open
Tommy McCormick (mccormickt) wants to merge 1 commit into
agent-substrate:mainfrom
mccormickt:push-omsqxovpxsus
@mccormickt

Contributor

Adds a retry policy to the actor route so that transient failures are retried until the actor is ready. These failures commonly surface as 503 errors during the brief window between a request hitting Envoy and the actor finishing its resume and being able to serve requests. This change also configures a retry budget on the cluster: budget_percent stays at Envoy's default of 20% of load, and min_retry_concurrency is set to 20, an arbitrary floor above the default of 3. Open to suggestions on these values from experience, or we can simply observe and iterate from here.

Reproduced with Claude on a kind cluster with the multi-template demo to inform the changes.

Fixes #218.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

…actors

When a request hits the router for a suspended actor, the ext_proc filter resumes the actor and rewrites :authority to the actor pod's IP:80, which the dynamic_forward_proxy cluster then connects to. In the brief window after resume returns but before the restored workload is accepting connections (or when a pooled connection to a just-suspended actor has gone stale), Envoy's upstream connection is reset before response headers. The actor route had no retry policy, so each such reset became an immediate 503 "upstream connect error or disconnect/reset before headers. reset reason: connection termination".

Add a retry policy to the actor route (retry_on "reset,connect-failure", 5 retries with 50ms-1s backoff) so these transient failures are retried once the listener is ready. A retry policy alone is not enough: every actor is routed through the single dynamic_forward_proxy cluster, whose retry circuit breaker defaults to only 3 concurrent retries cluster-wide, so a burst of concurrent requests to a just-resumed actor overflows it and the excess fails with 503 (UO) instead of retrying. Rather than inflate the static max_retries (which exists to cap retry amplification during an outage), configure a retry budget on the cluster: budget_percent 20% (Envoy's default) scales the allowed retries with load, with min_retry_concurrency 20 as a low-traffic floor above the default of 3. Other circuit breakers keep their defaults.
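To make the numbers above concrete, here is a rough sketch of the arithmetic, not Envoy's actual implementation. It assumes the retry budget allows roughly `max(min_retry_concurrency, budget_percent * in-flight requests)` concurrent retries, and that the backoff window for retry N is bounded by `(2^N - 1) * base_interval`, capped at the max interval, as Envoy's documentation describes for its jittered exponential backoff. The function names are hypothetical and the rounding is an assumption.

```python
import math


def allowed_concurrent_retries(active_requests, budget_percent=20.0,
                               min_retry_concurrency=3):
    """Approximate cap on concurrent retries under a retry budget.

    Assumption: the cap is the larger of the floor (min_retry_concurrency)
    and budget_percent of the in-flight requests, rounded down.
    """
    scaled = math.floor(active_requests * budget_percent / 100.0)
    return max(min_retry_concurrency, scaled)


def backoff_upper_bound_ms(attempt, base_ms=50, max_ms=1000):
    """Upper bound of the jittered backoff window for the Nth retry (1-based).

    The actual delay is drawn at random below this bound.
    """
    return min(max_ms, (2 ** attempt - 1) * base_ms)


if __name__ == "__main__":
    for load in (10, 50, 200):
        default_floor = allowed_concurrent_retries(load)
        pr_floor = allowed_concurrent_retries(load, min_retry_concurrency=20)
        print(f"in-flight={load}: floor 3 -> {default_floor}, floor 20 -> {pr_floor}")
    print([backoff_upper_bound_ms(n) for n in range(1, 6)])
```

Under these assumptions, a burst of 10 concurrent requests to a just-resumed actor is allowed only 3 concurrent retries with the default floor but 20 with the floor from this change, while at 200 in-flight requests the 20% budget dominates either way.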

Reproduced on a kind cluster with the multi-template demo: concurrent requests to an actor immediately after suspend produced intermittent 503s with Envoy response flags UC (connection termination) and UO (overflow). With the fix deployed (retry policy and retry budget verified live in the Envoy config dump), the same protocol produced 0 failures across 1600+ requests.

Fixes agent-substrate#218.
@bowei

Collaborator

I am wondering if we should be doing the retries here or add a ready notion to an Actor so we can tell if it can signal to the system that it's done resuming. One concern I have is that there are not going to be good ways to set the retry policy across a diverse set of Actors.

@mccormickt

Contributor Author

> I am wondering if we should be doing the retries here or add a ready notion to an Actor so we can tell if it can signal to the system that it's done resuming. One concern I have is that there are not going to be good ways to set the retry policy across a diverse set of Actors.

Fair point, this was really just a coarse default data-plane policy. So perhaps a more complete solution is:

  • A STATUS_READY state for Actors, since STATUS_RUNNING only means the restore/run operation has succeeded (this could also just be a stricter transition rather than another status)
  • A retry_policy per ActorTemplate for configuration per class of Actor. Should there be a default policy? If so, should it live in the template or in Envoy?
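As a rough illustration of the first bullet, one possible readiness signal is simply "the restored workload accepts TCP connections". The sketch below polls for that; it is only an assumption about how a transition from STATUS_RUNNING to a ready state could be gated, and `wait_until_accepting` is a hypothetical helper, not part of this project.

```python
import socket
import time


def wait_until_accepting(host, port, timeout_s=5.0, interval_s=0.05):
    """Poll until host:port accepts a TCP connection or timeout_s elapses.

    Returns True once a connection succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means a listener is up; close it right away.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: the listener is not ready yet.
            time.sleep(interval_s)
    return False
```

A check like this only shows that a listener exists, not that the application behind it is healthy, so a per-template probe would likely still be needed for a diverse set of Actors.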

Development

Successfully merging this pull request may close these issues.

Intermittent 503 connection termination errors when resuming suspended actors