ci(e2e-stand): real-DRBD Talos+QEMU job, label-gated #13

Draft: kvaps wants to merge 2 commits into main from ci/e2e-stand-real-drbd


kvaps commented May 22, 2026

Motivation

Today PR CI only runs the kind-based e2e job, which exercises blockstor against a mocked DRBD environment. Several recurring regressions (Bug 342, Bug 356, dispatcher witness-flip semantics) are only reachable on a real DRBD kernel module + ZFS / LVM-thin stack — kind cannot expose those. The dev-stand workflow (make up / make blockstor / make pools / make e2e) already brings up exactly that: a 1+3 Talos cluster on QEMU/KVM with the real DRBD layer wired through.

This PR wires that same stand into CI as a new e2e-stand job, gated by a PR label so it stays opt-in until the runner pool's wall-time and flake profile are well understood. It can be promoted to always-on later.

Scope (happy-path slice)

  • Runner: oracle-vm-24cpu-96gb-x86-64 (the same CNCF Oracle pool the existing kind e2e job uses on debug-less PRs). Has /dev/kvm for nested virt — the job preflights this and fails fast if absent.
  • 1 control-plane + 3 workers, Talos v1.10.5 with siderolabs/drbd + siderolabs/zfs extensions.
  • 8 GB extra disk per worker (slimmed from dev-stand default 16 GB — only 3 scenarios run before tear-down).
  • Local Docker registry on the runner; blockstor / apiserver / satellite images built once and pushed there.
  • StoragePool provisioning runs TYPE=both (zfs on /dev/sda, lvm-thin on /dev/sdb).
  • Three e2e scenarios run sequentially:
    • no-drbd — single-replica STORAGE-only layer-stack, no DRBD render
    • toggle-disk — diskful ↔ diskless flips
    • tiebreaker — TIE_BREAKER witness lifecycle
  • Talos Image Factory artifacts (kernel + initrd + schematic id) are cached across runs so warm boots stay short.

Label-gated opt-in

The job is wired so it only runs when the PR carries the e2e-stand label:

runs-on: ${{ contains(github.event.pull_request.labels.*.name, 'e2e-stand') && 'oracle-vm-24cpu-96gb-x86-64' || 'ubuntu-latest' }}
if: contains(github.event.pull_request.labels.*.name, 'e2e-stand') && needs.detect-changes.outputs.code == 'true'

Unlabelled PRs see no extra cost (the job is skipped entirely). Labelled PRs run on the Oracle KVM pool.
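
The gating condition can be reproduced locally to reason about which PRs would trigger the job. The following is a minimal sketch, not part of the workflow: `should_run_e2e_stand` is a hypothetical helper, the newline-separated label list is an assumed input format, and the second argument stands in for `needs.detect-changes.outputs.code`. It matches whole label names case-insensitively, which is how GitHub documents `contains()` for arrays.

```shell
#!/bin/sh
# Hypothetical local model of the job's `if:` expression.
#   $1: newline-separated PR label names (assumed input format)
#   $2: "true" or "false", standing in for needs.detect-changes.outputs.code
should_run_e2e_stand() {
  local labels="$1"
  local code_changed="$2"
  # -F fixed string, -x whole-line match, -i case-insensitive, -q quiet.
  # A label such as "e2e-stand-extra" therefore does not match.
  printf '%s\n' "$labels" | grep -Fxiq 'e2e-stand' && [ "$code_changed" = "true" ]
}

# Example: a labelled PR that touches code.
if should_run_e2e_stand "bug
e2e-stand" true; then
  echo "e2e-stand job would run"
else
  echo "e2e-stand job would be skipped"
fi
```

Both conditions must hold, so a labelled PR that changes no code is still skipped.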

Failure handling

  • On step failure: kubectl cluster-info dump + per-node talosctl dmesg / kernel logs / blockstor resources YAML get written under .work/ci-e2e/diagnostics/ and uploaded as an e2e-stand-diagnostics artifact.
  • SSH breakpoint: mirrors the kind e2e job's cozystack/breakpoint-action wiring with a distinct check-run-name (Breakpoint Open (e2e-stand)) so a parallel breakpoint from the kind job doesn't clobber this one. Fires only when vars.BREAKPOINT_ENDPOINT is set (skipped silently on forks).
  • make down always runs in an if: always() tear-down step so qemu / dnsmasq processes don't leak.
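
The always-run tear-down has a straightforward equivalent when the stand is driven from a plain shell instead of GitHub Actions. This is a minimal sketch under that assumption: `run_with_teardown` is a hypothetical helper, not something the PR adds. It runs the given command, always runs the tear-down afterwards, and returns the command's exit status so a failing scenario is not masked by a successful clean-up.

```shell
#!/bin/sh
# Hypothetical helper mirroring `if: always()` semantics.
#   $1:   tear-down command line (word-split on purpose, e.g. "make down")
#   $2..: the command to run before tearing down
run_with_teardown() {
  local teardown="$1"
  shift
  local status=0
  # Capture the exit status instead of aborting, so tear-down still runs.
  "$@" || status=$?
  # Report a tear-down failure, but keep the original status.
  $teardown || echo "tear-down failed: $teardown" >&2
  return "$status"
}
```

With the dev-stand targets named above, a local run would look like `run_with_teardown "make down" make e2e`.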

Expected wall time

~45-75 min — Talos boot ~10 min cached / ~20 min cold, blockstor + pools install ~5 min, 3 scenarios ~10-20 min each. timeout-minutes: 90 cap.
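
As a sanity check on that estimate, the arithmetic can be spelled out from the approximate figures quoted above. The sketch below only sums those figures; the variable names are illustrative. It suggests that the ~45-75 min range corresponds to a cached boot, while a cold boot with the same per-step figures would land at roughly 55-85 min, still under the 90-minute cap.

```shell
#!/bin/sh
# Approximate per-step estimates from the PR description, in minutes.
boot_cached=10
boot_cold=20
install=5          # blockstor + pools install
scenarios=3
scenario_min=10
scenario_max=20
timeout_cap=90

warm_min=$((boot_cached + install + scenarios * scenario_min))
warm_max=$((boot_cached + install + scenarios * scenario_max))
cold_min=$((boot_cold + install + scenarios * scenario_min))
cold_max=$((boot_cold + install + scenarios * scenario_max))

echo "cached boot: ${warm_min}-${warm_max} min"
echo "cold boot:   ${cold_min}-${cold_max} min"
echo "worst-case headroom under the ${timeout_cap} min cap: $((timeout_cap - cold_max)) min"
```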

Files touched

  • .github/workflows/pull-request.yml — new e2e-stand job appended; nothing else changed.

Adds an `e2e-stand` job alongside the existing kind-based `e2e`. Runs a
1 control-plane / 3 worker Talos cluster on the Oracle KVM-capable
runner (`oracle-vm-24cpu-96gb-x86-64`), installs blockstor, provisions
zfs+lvm-thin pools, and runs three happy-path scenarios sequentially:
no-drbd, toggle-disk, tiebreaker.

Gated by the `e2e-stand` PR label so it stays opt-in until proven
stable on the runner pool. Mirrors the kind e2e job's SSH breakpoint
wiring with a distinct check-run-name. Always-on tear-down via
`make down` in `if: always()`. 90-minute timeout, ~45-75 min expected
wall time. Caches Talos Image Factory artifacts (kernel + initrd +
schematic id) across runs so warm boots stay short.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps added the e2e-stand label (Opt-in: run the real-DRBD Talos+QEMU stand e2e job) on May 22, 2026
@gemini-code-assist

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@coderabbitai

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

kvaps commented May 22, 2026

First-run blocker: /dev/kvm unavailable on oracle-vm-24cpu-96gb-x86-64

The e2e-stand job started on the Oracle runner but failed instantly at the Probe /dev/kvm preflight step (0-second runtime, exited non-zero on [[ ! -w /dev/kvm ]]). All downstream steps (qemu install, Talos boot, blockstor install, scenarios) were skipped.

Run: https://github.com/cozystack/blockstor/actions/runs/26316633554
Job: E2E (stand, real-DRBD) → step Probe /dev/kvm (failure at 23:26:07 UTC)
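
For readers who want to check a candidate runner by hand, the preflight amounts to a writability test on the device node. The following is a rough re-creation, not the exact step from the workflow: `probe_kvm` and its messages are hypothetical, and it uses the POSIX `[ ... ]` form instead of the `[[ ... ]]` test quoted above.

```shell
#!/bin/sh
# Hypothetical re-creation of the "Probe /dev/kvm" preflight.
#   $1: device path to probe (defaults to /dev/kvm)
probe_kvm() {
  local dev="${1:-/dev/kvm}"
  # -w is false both when the node is missing and when the current user
  # lacks write permission, so one test covers both failure modes.
  if [ ! -w "$dev" ]; then
    echo "ERROR: $dev is missing or not writable by $(id -un)" >&2
    return 1
  fi
  echo "OK: $dev is writable"
}
```

Running `probe_kvm` with no argument on the target host returns non-zero in the situation described here, which is what makes the job fail fast instead of falling back to TCG.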

What this means

The Oracle oracle-vm-24cpu-96gb-x86-64 pool does not expose /dev/kvm to the runner user — at least not in the configuration the existing kind-based e2e job uses. Real-DRBD Talos boot needs KVM nested-virt; soft-emulation (TCG) was explicitly ruled out by the design plan (a 4-VM Talos cluster on TCG would blow well past the 90-min budget — probably 3+ hours per boot).

Per the plan, stopping here for direction

If /dev/kvm is unavailable on Oracle runner: don't fall back to soft-emulation. Update the PR with a status comment explaining the blocker and STOP. User will decide next step.

The run has been cancelled so it doesn't sit in the breakpoint step's 20-minute pause-idle grace period burning runner minutes.

Options for next iteration (not actioned — awaiting decision)

  1. Switch runs-on to a different label that does expose /dev/kvm (e.g. a self-hosted runner where KVM is already wired — would mirror what the dev stand uses today).
  2. File a request with the CNCF/Oracle runner-pool admin to expose /dev/kvm on oracle-vm-24cpu-96gb-x86-64. Some pools gate KVM behind a runner-config flag.
  3. Use an ARC (Actions Runner Controller) pod on an existing cluster with --device=/dev/kvm mounted in.

The workflow change itself is fine — e2e-stand was correctly skipped on the unlabelled jobs and correctly attempted on this labelled PR. The preflight caught the missing capability cleanly. Only the runner image needs adjustment.

The Oracle CNCF pool runner (oracle-vm-24cpu-96gb-x86-64) does not
expose /dev/kvm — confirmed by the preflight in the first job run.
GitHub-hosted ubuntu-latest doesn't ship the DRBD kernel module. The
only realistic real-DRBD+Talos CI runner is the existing blockstor
dev stand, which has both.

Switch the e2e-stand job to runs-on: [self-hosted, e2e-stand] and
document one-time runner registration in docs/CI-RUNNER-SETUP.md.

After the stand is registered as a self-hosted runner, every PR with
the `e2e-stand` label spins up a Talos+QEMU cluster and runs the
no-drbd/toggle-disk/tiebreaker subset against it. Workflow itself is
unchanged — only the runner target moves.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>