
Retry the initial WebSocket connect in CI instead of hard-failing on a transient blip #157

Description

@veksen

This was generated by AI during triage.

Summary

In CI mode, the analyzer fails the check outright when the initial WebSocket connect hits a transient failure; there is no retry. A one-off handshake failure to the Site API turns the check red and requires a manual re-run.

Detail

runInCI calls ApiClient.connect(...) once (src/main.ts). A failed WebSocket handshake throws, the error propagates to the top-level try/catch, and process.exit(1) turns the check red. Observed in the wild: "Error: WebSocket connection failed." followed by "Process completed with exit code 1". The same run passed cleanly on a re-run with identical code (the prod relay was momentarily unreachable).

This is the same transient-vs-terminal distinction already handled for the baseline fetch (fetchPreviousRun retries timeouts/5xx/network errors before giving up) and for ingest failures, but the initial connect has no such handling.
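To make the transient-vs-terminal split concrete, here is a minimal sketch of a classifier for connect errors. Everything in it is an assumption for illustration: the name isTransientConnectError, the `code` and `status` fields, and the list of codes (which are standard Node.js network error codes) are not taken from the analyzer or from fetchPreviousRun.

```typescript
// Hypothetical classifier; not the analyzer's actual logic.
// Assumes connect errors may carry a Node-style `code` (network failures)
// or a numeric `status` (HTTP status returned on the upgrade request).
const TRANSIENT_CODES = new Set<string>([
  "ECONNRESET",
  "ECONNREFUSED",
  "ETIMEDOUT",
  "EAI_AGAIN",
]);

function isTransientConnectError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { code?: unknown; status?: unknown };
  // Network-level failures: worth another attempt.
  if (typeof e.code === "string" && TRANSIENT_CODES.has(e.code)) return true;
  // 5xx during the handshake: the server side may recover shortly.
  return typeof e.status === "number" && e.status >= 500 && e.status <= 599;
}
```

The issue does not say which fields the observed "WebSocket connection failed" error carries, so a real classifier would need to match however ApiClient.connect actually reports handshake failures.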

Proposal

Give ApiClient.connect a couple of retries with backoff on transient connect/handshake failures (mirror fetchPreviousRun's retry shape) before failing the run. A genuine, persistent connect failure should still fail; a single blip should not.
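As a rough illustration of the proposal, the sketch below wraps a connect function in a bounded retry loop with exponential backoff. It is not modeled on the real fetchPreviousRun, whose retry shape is not shown in this issue; connectWithRetry, RetryOptions, and all parameter names are hypothetical, and the attempt count and delays are left to the caller.

```typescript
// Hypothetical helper; names and option shape are assumptions.
type RetryOptions = {
  attempts: number; // total tries, including the first
  baseDelayMs: number; // delay before the second try; doubles after that
  isTransient: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
};

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function connectWithRetry<T>(
  connect: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown;
  let tries = 0;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    tries = attempt;
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Terminal errors and the final attempt fall through to the throw below.
      if (!opts.isTransient(err) || attempt === opts.attempts) break;
      await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
    }
  }
  const detail =
    lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`WebSocket connect failed after ${tries} attempt(s): ${detail}`);
}
```

In runInCI, the single call could then become something like connectWithRetry(() => ApiClient.connect(...), options), with the existing top-level try/catch still handling the final error. Passing the classifier and sleep function in as options keeps the loop testable without real delays.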

Acceptance criteria

  • A transient connect failure that succeeds on a retry within N attempts lets the run proceed normally.
  • A persistent connect failure still fails the run after retries are exhausted, with a clear message.
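Both criteria can be exercised without a real relay by using a connect stub that fails a fixed number of times. The helper below is a hypothetical test double, not part of the analyzer; flakyConnect and its return shape are made up for illustration, and the error message simply mirrors the one observed in the failing run.

```typescript
// Hypothetical test double: rejects the first `failures` calls, then resolves.
// Pass Infinity to simulate a persistent outage.
function flakyConnect<T>(failures: number, value: T) {
  let calls = 0;
  const connect = async (): Promise<T> => {
    calls++;
    if (calls <= failures) {
      throw new Error("WebSocket connection failed");
    }
    return value;
  };
  return { connect, callCount: () => calls };
}
```

With `failures` set below N, a retrying connect should resolve and callCount() should equal failures + 1 (first criterion). With `failures` set to Infinity, it should reject after exactly N calls with the final error message (second criterion).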

Provenance

Observed on a Query-Doctor/Site "Merge to prod" CI run; diagnosed as a transient prod relay blip unrelated to the change under test.

Metadata

Labels: bug
