Add E2E test retry jobs for PRs to handle flakiness #56581

Draft
cipolleschi wants to merge 5 commits into main from cipolleschi/retry-e2e-in-ci

@cipolleschi
Contributor

Summary

  • Adds retry jobs (up to 2 retries) for all 4 E2E test jobs on PRs to handle flakiness
  • Original E2E jobs use continue-on-error on PRs so failures don't block the workflow; on main, behavior is unchanged and the existing rerun-failed-jobs mechanism continues to work
  • Each retry runs on a fresh runner (addressing environment-level flakes) and is triggered by the test step's outcome, which is exposed as a job output
  • Adds overwrite: true to artifact uploads in maestro composite actions so retry jobs don't conflict on artifact names
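
As an illustration of the artifact change, here is a minimal sketch of a composite action step that uploads results with `overwrite: true`. The action name, the `flavor` input, the artifact name, and the `results/` path are hypothetical and not taken from this PR; only the `overwrite` input of `actions/upload-artifact@v4` is the point of the example.

```yaml
# action.yml of a hypothetical composite action (names and paths are illustrative)
name: maestro-example
description: Illustrative composite action that uploads E2E test results
inputs:
  flavor:
    description: Build flavor used in the artifact name (assumed input)
    required: true
runs:
  using: composite
  steps:
    - name: Upload test results
      # Upload even when an earlier step in this action failed.
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: e2e-results-${{ inputs.flavor }}
        path: results/
        # Artifact names must be unique within a workflow run. A retry job in the
        # same run reuses the name, so let it replace the earlier upload
        # instead of failing.
        overwrite: true
```

One consequence of this choice is that the artifact from the failed attempt is replaced by the one from the retry, so only the latest attempt's results remain available for download.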

How it works

  • On PRs: E2E job fails → retry_1 triggers → if that fails → retry_2 triggers. All have continue-on-error so the workflow stays green.
  • On main: continue-on-error is false, retry jobs are skipped (PR-only), and rerun-failed-jobs handles retries as before.
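
The chain described above could look roughly like the following sketch. It is not the workflow from this PR: the job names, the `run_tests` step id, and the `./scripts/run-e2e.sh` script are placeholders, and the real jobs are matrix jobs with many more steps. It assumes that job outputs are still populated when the job itself fails.

```yaml
# Hypothetical sketch of the PR-only retry chain; all names are illustrative.
name: e2e-retry-sketch
on:
  pull_request:
  push:
    branches: [main]

jobs:
  e2e:
    runs-on: ubuntu-latest
    # On PRs a failure does not fail the workflow; on main it still does.
    continue-on-error: ${{ github.event_name == 'pull_request' }}
    outputs:
      # Step-level outcome ("success" or "failure") exposed as a job output.
      outcome: ${{ steps.run_tests.outcome }}
    steps:
      - uses: actions/checkout@v4
      - id: run_tests
        run: ./scripts/run-e2e.sh  # placeholder test command

  e2e_retry_1:
    needs: e2e
    # !cancelled() lets this job be evaluated even if the previous job failed.
    if: ${{ !cancelled() && github.event_name == 'pull_request' && needs.e2e.outputs.outcome == 'failure' }}
    runs-on: ubuntu-latest  # a separate job, so it gets a fresh runner
    continue-on-error: true
    outputs:
      outcome: ${{ steps.run_tests.outcome }}
    steps:
      - uses: actions/checkout@v4
      - id: run_tests
        run: ./scripts/run-e2e.sh

  e2e_retry_2:
    needs: e2e_retry_1
    # If retry_1 was skipped its output is empty, so this job is skipped too.
    if: ${{ !cancelled() && github.event_name == 'pull_request' && needs.e2e_retry_1.outputs.outcome == 'failure' }}
    runs-on: ubuntu-latest
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-e2e.sh
```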

Known limitation

Because these are matrix jobs (Debug/Release), the job output takes the value set by whichever matrix combination completes last. If only one flavor fails and the passing one finishes last, the retry may not trigger. Environment-level flakiness, the most common pattern, tends to affect both flavors, so the retry is expected to trigger in the typical case.

Changelog:

[Internal] - Add E2E test retry jobs for PRs

Test plan

  • CI will validate the workflow syntax
  • On a PR where E2E tests pass: retry jobs should be skipped
  • On a PR where E2E tests fail due to flakiness: retry jobs should trigger and (ideally) pass on a fresh runner
  • On pushes to main: existing rerun-failed-jobs behavior is preserved (retry jobs are skipped)

On PRs, E2E tests (iOS/Android, RNTester/TemplateApp) now retry up to
2 additional times on failure. Each retry runs on a fresh runner to
address environment-level flakiness.

- Original E2E jobs use `continue-on-error` on PRs so failures don't
  block the workflow
- Step-level outcome is captured as a job output to trigger retries
- Retry jobs only run on `pull_request` events
- On `main`, behavior is unchanged: `continue-on-error` is false and
  the existing `rerun-failed-jobs` mechanism handles retries
- Added `overwrite: true` to artifact uploads in maestro composite
  actions so retry jobs don't fail on duplicate artifact names
@meta-cla meta-cla Bot added the CLA Signed label Apr 23, 2026
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Move each E2E test job into its own reusable workflow with an internal
`report` job that reliably captures the test result across all matrix
combinations. This eliminates the matrix output race condition from the
previous approach and reduces test-all.yml by ~690 lines.

Each reusable workflow:
- Runs the matrix E2E tests in a `test` job
- Has a non-matrix `report` job that checks `needs.test.result` and
  exposes a `status` output (success/failure)

The callers in test-all.yml are now ~5 lines each instead of ~30-90.
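
A minimal sketch of the reusable-workflow shape this commit message describes, assuming a hypothetical file name `e2e-example.yml` and a placeholder test script. The `test` and `report` job names and the `status` output follow the description above; everything else is illustrative.

```yaml
# .github/workflows/e2e-example.yml (hypothetical file name)
on:
  workflow_call:
    outputs:
      status:
        description: "success or failure across all matrix combinations"
        value: ${{ jobs.report.outputs.status }}

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        flavor: [Debug, Release]
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-e2e.sh "${{ matrix.flavor }}"  # placeholder command

  report:
    needs: test
    # Run even when a matrix combination failed, so the status is always reported.
    if: ${{ !cancelled() }}
    runs-on: ubuntu-latest
    outputs:
      status: ${{ steps.check.outputs.status }}
    steps:
      - id: check
        env:
          # Aggregate result of the whole matrix, not of a single combination.
          TEST_RESULT: ${{ needs.test.result }}
        run: |
          if [ "$TEST_RESULT" = "success" ]; then
            echo "status=success" >> "$GITHUB_OUTPUT"
          else
            echo "status=failure" >> "$GITHUB_OUTPUT"
          fi
```

A caller could then chain a retry on that output. This is again a sketch under the same assumptions, and it does not show how the first attempt's failure is kept from failing the overall run:

```yaml
jobs:
  e2e:
    uses: ./.github/workflows/e2e-example.yml

  e2e_retry_1:
    needs: e2e
    if: ${{ !cancelled() && needs.e2e.outputs.status == 'failure' }}
    uses: ./.github/workflows/e2e-example.yml
```

Because `report` is a single non-matrix job, its output does not depend on which matrix combination finished last.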
Comment thread .github/workflows/e2e-android-rntester.yml Fixed
Comment thread .github/workflows/e2e-android-rntester.yml Fixed
Comment thread .github/workflows/e2e-android-templateapp.yml Fixed
Comment thread .github/workflows/e2e-android-templateapp.yml Fixed
Comment thread .github/workflows/e2e-ios-rntester.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
- Remove the PR-only guard from retry jobs so they also run on main
  and stable branches, providing consistent retry behavior everywhere
- Simplify rerun-failed-jobs to only handle Fantom tests, since E2E
  retries are now handled by the in-workflow retry_1/retry_2 jobs
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Comment thread .github/workflows/test-all.yml Fixed
Set minimal `contents: read` permissions to satisfy CodeQL security
analysis requirements.
Set top-level `contents: read` to satisfy CodeQL requirements.
The rerun-failed-jobs job gets a job-level override adding
`actions: write` since it needs to trigger retry-workflow.yml.
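
The permission layout described in these two commit messages might look like the sketch below. The job body is an assumption: the PR states only that `rerun-failed-jobs` needs `actions: write` to trigger `retry-workflow.yml`, not how it does so. The `gh workflow run` call is one possible way and assumes that workflow accepts a dispatch without inputs.

```yaml
# Top-level default for every job in the workflow.
permissions:
  contents: read

jobs:
  rerun-failed-jobs:
    runs-on: ubuntu-latest
    # A job-level block replaces the top-level one, and any scope not listed
    # here becomes "none", so contents: read is repeated explicitly.
    permissions:
      contents: read
      actions: write
    steps:
      - name: Trigger the retry workflow (illustrative)
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
        run: gh workflow run retry-workflow.yml
```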
Labels

CLA Signed, p: Facebook, Partner

2 participants