ci(e2e-stand): real-DRBD Talos+QEMU job, label-gated by kvaps · Pull Request #13 · cozystack/blockstor

kvaps · 2026-05-22T23:20:58Z

Motivation

Today PR CI only runs the kind-based e2e job, which exercises blockstor against a mocked DRBD environment. Several recurring regressions (Bug 342, Bug 356, dispatcher witness-flip semantics) are only reachable on a real DRBD kernel module + ZFS / LVM-thin stack — kind cannot expose those. The dev-stand workflow (make up / make blockstor / make pools / make e2e) already brings up exactly that: a 1+3 Talos cluster on QEMU/KVM with the real DRBD layer wired through.

This PR wires that same stand into CI as a new e2e-stand job, gated by a PR label so it stays opt-in until the runner-pool wall-time / flake profile is well understood. Promote to always-on later.

Scope (happy-path slice)

- Runner: oracle-vm-24cpu-96gb-x86-64 (the same CNCF Oracle pool the existing kind e2e job uses on debug-less PRs). Has /dev/kvm for nested virt; the job preflights this and fails fast if absent.
- 1 control-plane + 3 workers, Talos v1.10.5 with siderolabs/drbd + siderolabs/zfs extensions.
- 8 GB extra disk per worker (slimmed from the dev-stand default of 16 GB, since only 3 scenarios run before tear-down).
- Local Docker registry on the runner; blockstor / apiserver / satellite images built once and pushed there.
- StoragePool provisioning runs TYPE=both (zfs on /dev/sda, lvm-thin on /dev/sdb).
- Three e2e scenarios run sequentially:
  - no-drbd: single-replica STORAGE-only layer-stack, no DRBD render
  - toggle-disk: diskful ↔ diskless flips
  - tiebreaker: TIE_BREAKER witness lifecycle
- Talos Image Factory artifacts (kernel + initrd + schematic id) are cached across runs so warm boots stay short.

Label-gated opt-in

The job is wired so it only runs when the PR carries the e2e-stand label:

runs-on: ${{ contains(github.event.pull_request.labels.*.name, 'e2e-stand') && 'oracle-vm-24cpu-96gb-x86-64' || 'ubuntu-latest' }}
if: contains(github.event.pull_request.labels.*.name, 'e2e-stand') && needs.detect-changes.outputs.code == 'true'

Unlabelled PRs see no extra cost (the job is skipped entirely). Labelled PRs run on the Oracle KVM pool.
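
The decision encoded in the two expressions above can be mirrored in a small shell sketch, which may help when reasoning about where a given PR lands. This is illustration only: the real evaluation happens inside GitHub Actions expressions, and the helper names `has_label` and `pick_runner` are hypothetical, not part of the workflow.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the `if:` gate in plain bash.
# has_label and pick_runner are hypothetical helper names.

# has_label WANTED LABEL... -> exit 0 if WANTED is among the remaining arguments.
has_label() {
  local wanted=$1 label
  shift
  for label in "$@"; do
    if [[ $label == "$wanted" ]]; then
      return 0
    fi
  done
  return 1
}

# pick_runner CODE_CHANGED LABEL... -> prints the runner label, or "skipped".
pick_runner() {
  local code_changed=$1
  shift
  if has_label e2e-stand "$@" && [[ $code_changed == true ]]; then
    echo oracle-vm-24cpu-96gb-x86-64
  else
    echo skipped
  fi
}

pick_runner true bug e2e-stand   # labelled PR with code changes
pick_runner true bug             # unlabelled PR
```

The first call prints the Oracle runner label and the second prints `skipped`. In the actual workflow the `runs-on` fallback to `ubuntu-latest` exists only so the expression resolves to a valid runner; the `if:` condition is what skips the job.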

Failure handling

- On step failure: kubectl cluster-info dump + per-node talosctl dmesg / kernel logs / blockstor resources YAML get written under .work/ci-e2e/diagnostics/ and uploaded as an e2e-stand-diagnostics artifact.
- SSH breakpoint: mirrors the kind e2e job's cozystack/breakpoint-action wiring with a distinct check-run-name (Breakpoint Open (e2e-stand)) so a parallel breakpoint from the kind job doesn't clobber this one. Fires only when vars.BREAKPOINT_ENDPOINT is set (skipped silently on forks).
- make down always runs in an if: always() tear-down step so qemu / dnsmasq don't leak.
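
The diagnostics step described above might look roughly like the following sketch. The output directory comes from the description; the function name `collect_diagnostics`, the per-node file names, and the node addresses are assumptions made for illustration, and the blockstor resources dump is left out because its resource names are not given here.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the diagnostics step. The function name, file names,
# and node list are hypothetical; only the output directory is from the PR.

# collect_diagnostics OUT_DIR NODE... -> writes a cluster dump plus per-node dmesg.
collect_diagnostics() {
  local out_dir=$1 node
  shift
  mkdir -p "$out_dir"
  # Keep going on individual failures: partial diagnostics beat none.
  kubectl cluster-info dump --output-directory="$out_dir/cluster-info" || true
  for node in "$@"; do
    talosctl dmesg --nodes "$node" > "$out_dir/dmesg-$node.log" 2>&1 || true
  done
}

# Example invocation (node addresses are placeholders):
# collect_diagnostics .work/ci-e2e/diagnostics 10.0.0.2 10.0.0.3
```

Each command is followed by `|| true` so that one unreachable node does not prevent the remaining logs from being collected.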

Expected wall time

~45-75 min — Talos boot ~10 min cached / ~20 min cold, blockstor + pools install ~5 min, 3 scenarios ~10-20 min each. timeout-minutes: 90 cap.
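
As a sanity check on these figures, the per-phase estimates can be summed directly. The sketch below assumes the phases run strictly one after another with no overhead in between, which is a simplification; `wall_time` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic using the per-phase estimates quoted above.
# All values are minutes.

# wall_time BOOT INSTALL SCENARIO_COUNT PER_SCENARIO -> prints total minutes.
wall_time() {
  local boot=$1 install=$2 count=$3 per_scenario=$4
  echo $(( boot + install + count * per_scenario ))
}

warm_best=$(wall_time 10 5 3 10)   # cached boot, fast scenarios
warm_worst=$(wall_time 10 5 3 20)  # cached boot, slow scenarios
cold_worst=$(wall_time 20 5 3 20)  # cold boot, slow scenarios

echo "warm best:  ${warm_best} min"
echo "warm worst: ${warm_worst} min"
echo "cold worst: ${cold_worst} min"
```

Under these assumptions the quoted ~45-75 min range corresponds to the cached-boot cases (45 and 75 min), while a cold boot with slow scenarios comes to 85 min, which still fits under the 90-minute cap but leaves little headroom.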

Files touched

.github/workflows/pull-request.yml — new e2e-stand job appended; nothing else changed.

Adds an `e2e-stand` job alongside the existing kind-based `e2e`. Runs a 1 control-plane / 3 worker Talos cluster on the Oracle KVM-capable runner (`oracle-vm-24cpu-96gb-x86-64`), installs blockstor, provisions zfs+lvm-thin pools, and runs three happy-path scenarios sequentially: no-drbd, toggle-disk, tiebreaker. Gated by the `e2e-stand` PR label so it stays opt-in until proven stable on the runner pool. Mirrors the kind e2e job's SSH breakpoint wiring with a distinct check-run-name. Always-on tear-down via `make down` in `if: always()`. 90-minute timeout, ~45-75 min expected wall time. Caches Talos Image Factory artifacts (kernel + initrd + schematic id) across runs so warm boots stay short.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

gemini-code-assist · 2026-05-22T23:21:03Z

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

coderabbitai · 2026-05-22T23:21:05Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

kvaps · 2026-05-22T23:30:35Z

First-run blocker: `/dev/kvm` unavailable on `oracle-vm-24cpu-96gb-x86-64`

The e2e-stand job started on the Oracle runner but failed instantly at the Probe /dev/kvm preflight step (0-second runtime, exited non-zero on [[ ! -w /dev/kvm ]]). All downstream steps (qemu install, Talos boot, blockstor install, scenarios) were skipped.

Run: https://github.com/cozystack/blockstor/actions/runs/26316633554
Job: E2E (stand, real-DRBD) → step Probe /dev/kvm (failure at 23:26:07 UTC)
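
For reference, a preflight built around the `[[ ! -w /dev/kvm ]]` test quoted above could be sketched as follows. This is not the workflow's actual step: the function name `probe_kvm` and the messages are hypothetical, and the extra existence check is an addition meant to distinguish a missing device from a permissions problem.

```shell
#!/usr/bin/env bash
# Illustrative preflight based on the `[[ ! -w /dev/kvm ]]` check.
# The function name and messages are hypothetical.

# probe_kvm [DEVICE] -> exit 0 if DEVICE is writable by the current user, else 1.
probe_kvm() {
  local dev=${1:-/dev/kvm}
  if [[ ! -e $dev ]]; then
    echo "preflight: $dev does not exist (KVM not exposed to this runner)" >&2
    return 1
  fi
  if [[ ! -w $dev ]]; then
    echo "preflight: $dev exists but is not writable by $(id -un)" >&2
    return 1
  fi
  echo "preflight: $dev is usable"
}

# In a workflow step one would typically fail fast:
# probe_kvm || exit 1
```

Reporting the two failure modes separately would make it clearer whether the pool lacks nested virtualization entirely or only withholds the device from the runner user.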

What this means

The Oracle oracle-vm-24cpu-96gb-x86-64 pool does not expose /dev/kvm to the runner user — at least not in the configuration the existing kind-based e2e job uses. Real-DRBD Talos boot needs KVM nested-virt; soft-emulation (TCG) was explicitly ruled out by the design plan (a 4-VM Talos cluster on TCG would blow well past the 90-min budget — probably 3+ hours per boot).

Per the plan, stopping here for direction

"If /dev/kvm is unavailable on Oracle runner: don't fall back to soft-emulation. Update the PR with a status comment explaining the blocker and STOP. User will decide next step."

The run has been cancelled so it doesn't sit in the breakpoint step's 20-minute pause-idle grace period burning runner minutes.

Options for next iteration (not actioned — awaiting decision)

- Switch runs-on to a different label that does expose /dev/kvm (e.g. a self-hosted runner where KVM is already wired; this would mirror what the dev stand uses today).
- File a request with the CNCF/Oracle runner-pool admin to expose /dev/kvm on oracle-vm-24cpu-96gb-x86-64. Some pools gate KVM behind a runner-config flag.
- Use an ARC (Actions Runner Controller) pod on an existing cluster with --device=/dev/kvm mounted in.

The workflow change itself is fine — e2e-stand was correctly skipped on the unlabelled jobs and correctly attempted on this labelled PR. The preflight caught the missing capability cleanly. Only the runner image needs adjustment.

The Oracle CNCF pool runner (oracle-vm-24cpu-96gb-x86-64) does not expose /dev/kvm, as confirmed by the preflight in the first job run. GitHub-hosted ubuntu-latest doesn't ship the DRBD kernel module. The only realistic real-DRBD+Talos CI runner is the existing blockstor dev stand, which has both. Switch the e2e-stand job to runs-on: [self-hosted, e2e-stand] and document one-time runner registration in docs/CI-RUNNER-SETUP.md. After the stand is registered as a self-hosted runner, every PR with the `e2e-stand` label spins up a Talos+QEMU cluster and runs the no-drbd/toggle-disk/tiebreaker subset against it. The workflow itself is unchanged; only the runner target moves.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

kvaps added the e2e-stand label (Opt-in: run the real-DRBD Talos+QEMU stand e2e job) · May 22, 2026

kvaps wants to merge 2 commits into main from ci/e2e-stand-real-drbd
