ci(e2e-stand): real-DRBD Talos+QEMU job, label-gated #13
Adds an `e2e-stand` job alongside the existing kind-based `e2e`. Runs a 1 control-plane / 3 worker Talos cluster on the Oracle KVM-capable runner (`oracle-vm-24cpu-96gb-x86-64`), installs blockstor, provisions zfs+lvm-thin pools, and runs three happy-path scenarios sequentially: no-drbd, toggle-disk, tiebreaker.

Gated by the `e2e-stand` PR label so it stays opt-in until proven stable on the runner pool. Mirrors the kind e2e job's SSH breakpoint wiring with a distinct check-run-name. Always-on tear-down via `make down` in `if: always()`. 90-minute timeout, ~45-75 min expected wall time.

Caches Talos Image Factory artifacts (kernel + initrd + schematic id) across runs so warm boots stay short.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
First-run blocker:
The Oracle CNCF pool runner (oracle-vm-24cpu-96gb-x86-64) does not expose /dev/kvm, as confirmed by the preflight in the first job run. GitHub-hosted ubuntu-latest doesn't ship the DRBD kernel module. The only realistic real-DRBD+Talos CI runner is the existing blockstor dev stand, which has both.

Switch the e2e-stand job to runs-on: [self-hosted, e2e-stand] and document one-time runner registration in docs/CI-RUNNER-SETUP.md. After the stand is registered as a self-hosted runner, every PR with the `e2e-stand` label spins up a Talos+QEMU cluster and runs the no-drbd/toggle-disk/tiebreaker subset against it.

Workflow itself is unchanged; only the runner target moves.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Motivation
Today PR CI only runs the kind-based `e2e` job, which exercises blockstor against a mocked DRBD environment. Several recurring regressions (Bug 342, Bug 356, dispatcher witness-flip semantics) are only reachable on a real DRBD kernel module + ZFS / LVM-thin stack; kind cannot expose those. The dev-stand workflow (`make up` / `make blockstor` / `make pools` / `make e2e`) already brings up exactly that: a 1+3 Talos cluster on QEMU/KVM with the real DRBD layer wired through.

This PR wires that same stand into CI as a new `e2e-stand` job, gated by a PR label so it stays opt-in until the runner-pool wall-time / flake profile is well understood. Promote to always-on later.

Scope (happy-path slice)
- `oracle-vm-24cpu-96gb-x86-64` (the same CNCF Oracle pool the existing kind `e2e` job uses on `debug`-less PRs). Has `/dev/kvm` for nested virt; the job preflights this and fails fast if absent.
- `siderolabs/drbd` + `siderolabs/zfs` extensions.
- `TYPE=both` (zfs on /dev/sda, lvm-thin on /dev/sdb).
- `no-drbd`: single-replica STORAGE-only layer-stack, no DRBD render
- `toggle-disk`: diskful ↔ diskless flips
- `tiebreaker`: TIE_BREAKER witness lifecycle

Label-gated opt-in
The job is wired so it only runs when the PR carries the `e2e-stand` label. Unlabelled PRs see no extra cost (the job is skipped entirely). Labelled PRs run on the Oracle KVM pool.
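
A label gate of this kind is commonly expressed as a job-level `if:`. The fragment below is a minimal sketch, not the PR's actual YAML: the `contains(...)` expression over `github.event.pull_request.labels.*.name` is standard GitHub Actions syntax, while the step name, the checkout step, and the exact shape of the KVM preflight are assumptions made for illustration.

```yaml
# Hypothetical fragment of .github/workflows/pull-request.yml (sketch only).
jobs:
  e2e-stand:
    # Skip the job entirely unless the PR carries the `e2e-stand` label.
    if: contains(github.event.pull_request.labels.*.name, 'e2e-stand')
    runs-on: oracle-vm-24cpu-96gb-x86-64
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4
      # Assumed shape of the preflight: fail fast when /dev/kvm is absent.
      - name: Preflight KVM  # step name is an assumption
        run: |
          if [ ! -c /dev/kvm ]; then
            echo "::error::/dev/kvm is not available on this runner"
            exit 1
          fi
```

With a gate like this, the workflow's `pull_request` trigger would also need `labeled` in its `types` list if the job should start when the label is added to an already-open PR, since the default activity types are `opened`, `synchronize`, and `reopened`.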
Failure handling
- `kubectl cluster-info dump` + per-node `talosctl dmesg` / kernel logs / blockstor resources YAML get written under `.work/ci-e2e/diagnostics/` and uploaded as an `e2e-stand-diagnostics` artifact.
- `cozystack/breakpoint-action` wiring with a distinct `check-run-name` (`Breakpoint Open (e2e-stand)`) so a parallel breakpoint from the kind job doesn't clobber this one. Fires only when `vars.BREAKPOINT_ENDPOINT` is set (skipped silently on forks).
- `make down` always runs in an `if: always()` tear-down so qemu / dnsmasq don't leak.

Expected wall time
~45-75 min: Talos boot ~10 min cached / ~20 min cold, blockstor + pools install ~5 min, 3 scenarios ~10-20 min each. `timeout-minutes: 90` cap.

Files touched
- `.github/workflows/pull-request.yml`: new `e2e-stand` job appended; nothing else changed.
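
For reference, the diagnostics upload and tear-down described under Failure handling could be sketched roughly as follows. This is an illustrative fragment under stated assumptions, not the job as merged: the step names, the `if: failure()` conditions, the `cluster-info.txt` file name, and the use of `actions/upload-artifact@v4` are hypothetical; only the artifact name, the diagnostics directory, and `make down` under `if: always()` come from the description above.

```yaml
# Hypothetical tail of the e2e-stand job's steps (sketch only).
steps:
  # Cluster bring-up and scenario steps are omitted from this sketch.
  - name: Collect diagnostics  # step name is an assumption
    if: failure()              # assumed trigger; the description only says "failure handling"
    run: |
      mkdir -p .work/ci-e2e/diagnostics
      # Best-effort dump; do not mask the original failure if kubectl is unreachable.
      kubectl cluster-info dump > .work/ci-e2e/diagnostics/cluster-info.txt || true
  - name: Upload diagnostics
    if: failure()
    uses: actions/upload-artifact@v4
    with:
      name: e2e-stand-diagnostics
      path: .work/ci-e2e/diagnostics/
  - name: Tear down
    # Runs whether earlier steps passed, failed, or were cancelled,
    # so qemu / dnsmasq processes are not left behind on the runner.
    if: always()
    run: make down
```

Placing the tear-down last with `if: always()` means the diagnostics are captured while the cluster is still up.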