Increase e2e wait timeouts to reduce flakiness #4545
Not sure this PR will fix ALL flaky e2e tests, but let's see after merging.
Pull request overview
This PR aims to reduce flakiness in the Actions-hosted E2E workflow by increasing upper-bound wait limits in the E2E helper script, while keeping runs fast when conditions are met quickly.
Changes:
- Increase the scale set pod readiness wait timeout from 30s to 120s (env-overridable).
- Increase the autoscaling runner set cleanup wait timeout from 40s to 120s (env-overridable).
- Increase the workflow-run start wait budget from ~60s to ~180s by raising the retry ceiling (env-overridable).
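The pattern behind these changes (a ceiling that can be overridden from the environment, with a wait that returns as soon as its condition holds) can be sketched roughly as follows. This is a minimal illustration, not the code in the helper script: the function name `wait_until` and the variable `POD_READY_TIMEOUT` are hypothetical, and the real script may structure its loops differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; wait_until and POD_READY_TIMEOUT are illustrative
# names and are not taken from the helper script.

# Poll a command until it succeeds or the timeout elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local elapsed=0
  # Return immediately once the command succeeds, so a larger ceiling
  # costs nothing when the condition is met quickly.
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 0
}

# Default ceiling of 120 seconds, overridable from the environment.
POD_READY_TIMEOUT="${POD_READY_TIMEOUT:-120}"
```

A caller would then write something like `wait_until "$POD_READY_TIMEOUT" 1 some_readiness_check`, where `some_readiness_check` stands in for whatever command reports readiness. Only the slow cases ever get close to the ceiling.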
9d33f4a to 43b32ca
Hey @Okabe-Junya, we should start adding larger runner machines, but increasing test timeouts is still worthwhile if you ask me, so we should merge this PR anyway.

Thank you!
Yes, it would be nice to consider this! Can we use specialized machines (provided by the community or by GitHub) on github.com/actions? Or is it better to start with a large runner?
What
Raise wait timeouts in test/actions.github.com/helper.sh to reduce e2e flakiness. Each is overridable via an env var (defaults shown)
Why
(gha) E2E Tests is flaky on master (~half of recent push runs fail). The common cause is short, fixed timeouts on operations whose latency depends on the GitHub Actions service and runner startup. Each wait exits as soon as its condition is met, so raising the ceilings only helps slow cases and does not slow normal runs.
e.g.,