fix(ci): stabilize tracecontext job #5149

Open
MukundaKatta wants to merge 1 commit into open-telemetry:main from MukundaKatta:fix/tracecontext-ci-readiness

@MukundaKatta
Why

The tracecontext CI job is reported to flake intermittently (#5104). The integration script starts a Flask example server and then runs a hard-coded sleep 1 before starting the W3C tests. On slow runners, Flask may not yet be listening when the tests start, and pytest fails with connection errors against 127.0.0.1:5000 (per the failed run referenced in the issue: https://github.com/open-telemetry/opentelemetry-python/actions/runs/24451853608/job/71442431200).

The issue author suggested adding readiness logic; this PR does exactly that.

What

In scripts/tracecontext-integration-test.sh:

  • Drop the racy sleep 1.
  • Add a wait_for_server function that polls 127.0.0.1:5000 for up to 30s with a 0.5s interval.
  • Bail out early (and surface a clear error) if the example server process exits before becoming ready.
  • Use Python (already required by the script) for the socket probe so we don't depend on nc/curl being installed on the runner image.
  • Move the trap onshutdown EXIT registration before the readiness wait so that a failed readiness probe still cleans up the background server.
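
For readers who want to see the shape of such a probe, the steps listed above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual diff: the helper name probe_port, the argument order, the attempt count, the error messages, and the use of python3 on PATH are all assumptions made for the example.

```sh
#!/bin/sh
# Hypothetical sketch of a readiness probe; names, defaults, and messages
# are assumptions for illustration, not the code merged in this PR.

# probe_port HOST PORT
# Succeeds (exit 0) if a TCP connection to HOST:PORT can be opened.
# Uses Python so that nc/curl are not required on the runner image.
probe_port() {
    python3 -c '
import socket, sys
try:
    socket.create_connection((sys.argv[1], int(sys.argv[2])), timeout=1).close()
except OSError:
    sys.exit(1)
' "$1" "$2"
}

# wait_for_server PID HOST PORT [ATTEMPTS]
# Polls HOST:PORT every 0.5s, up to ATTEMPTS times (default 60, about 30s).
# Returns 1 early if process PID exits before the port accepts connections.
wait_for_server() {
    pid=$1
    host=$2
    port=$3
    attempts=${4:-60}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if probe_port "$host" "$port"; then
            return 0
        fi
        # kill -0 sends no signal; it only checks that the process still exists.
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "server (pid $pid) exited before listening on $host:$port" >&2
            return 1
        fi
        # Fractional sleep is an extension supported by GNU and BSD sleep.
        sleep 0.5
        i=$((i + 1))
    done
    echo "timed out waiting for $host:$port" >&2
    return 1
}

# Example wiring (commented out; the server command is a placeholder):
#   python3 path/to/example_server.py &
#   server_pid=$!
#   trap 'kill "$server_pid" 2>/dev/null' EXIT   # register cleanup before waiting
#   wait_for_server "$server_pid" 127.0.0.1 5000 || exit 1
```

Registering the trap before calling wait_for_server means the background server is still cleaned up when the probe fails or times out, which is the ordering the last bullet describes.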

No behavior change in the happy path; the only observable differences are faster startup (typically <1s) and deterministic failure messages instead of cryptic connection errors.

Tested

  • The diagnostic in the issue body links a CI run where the failure mode is exactly a connection error against the example server before it is bound to :5000. A port-readiness probe is the standard fix for this class of CI race.
  • Reviewed the example server (tests/w3c_tracecontext_validation_server.py) to confirm it serves on the default Flask port 127.0.0.1:5000.
  • Script remains POSIX /bin/sh compatible (no bashisms introduced).

Fixes #5104

Replace the racy 'sleep 1' before the W3C tracecontext tests with an active readiness probe against 127.0.0.1:5000. The fixed sleep was too short on slow CI runners and produced intermittent connection errors against the Flask example server.

Closes open-telemetry#5104
@linux-foundation-easycla
linux-foundation-easycla Bot commented Apr 26, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: MukundaKatta / name: Mukunda Rao Katta (df4ba43)

Member

@MikeGoldsmith MikeGoldsmith left a comment

Looks good, thanks @MukundaKatta 🚀

@github-project-automation github-project-automation Bot moved this to Approved PRs in Python PR digest Apr 27, 2026
@MikeGoldsmith MikeGoldsmith added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Apr 27, 2026
@MikeGoldsmith MikeGoldsmith changed the title from "fix(ci): stabilize tracecontext job (closes #5104)" to "fix(ci): stabilize tracecontext job" Apr 27, 2026

Labels

Skip Changelog PRs that do not require a CHANGELOG.md entry

Projects

Status: Approved PRs

Development

Successfully merging this pull request may close these issues.

tracecontext job randomly fails in CI

3 participants