feat: end-to-end KIND-on-dind support — kube-proxy networking, dind hardening, debug tooling by luthermonson · Pull Request #68 · ephpm/ephemerd

luthermonson · 2026-05-08T06:02:28Z

Summary

What started as a 10-line cfg.VM.Linux.CPUs/MemoryMB wiring fix grew into the full set of changes needed to run ephpm's E2E test suite on a self-hosted ephemerd runner. The test job spins up a KIND cluster (kindest/node) via dind inside the Hyper-V Linux VM and asserts that an HTTP request from a test pod reaches the ephpm Service — every layer of that stack had something missing.

The root-cause fix was a single ~12 KB kernel module (xt_nat.ko); everything else was needed to even get to the point where its absence was the only thing left blocking. See ## How we got here below for the layered history if you want it.

What changed, grouped

VM config wiring (the original intent — 9593519, a923d5e)

Pass cfg.VM.Linux.CPUs/MemoryMB from parsed config through startContainerRuntime into vm.LinuxVMConfig on macOS and Windows. Previously parsed but never reached the VM.
Wire disk_size_gb config and grow the VHDX on boot.

dind subsystem upgrades for KIND (188fa49, 9f64f2e, a3e62de, 2a77f76, 1ff999d, aafad9d, bfc40d7, 987ac2f)

Mount tmpfs for the nested containerd inside kindest/node (kindest/node refuses to start on overlayfs).
Bind-mount host ext4 for buildkit to enable overlayfs.
Provision /etc/resolv.conf with public DNS in containers; mount r/w for the kindest/node entrypoint.
Three iterations on docker exec -i stdin handling — io.Pipe to avoid FIFO race → overlayfs file redirect → streaming-conn detection — needed because kubectl exec and docker exec -i use stdin streaming through the dind exec path.
KIND-specific glue: /etc/hostname bind-mount, container labels + filter parsing in /containers/json, per-container port-forward proxy via setns into the runner's net namespace.

VM kernel modules for kube-proxy (e80ab96, fbc1e68, 6d67149, 88ebc24, 52bbad5 ← the one that mattered)

The kindest/node cluster came up but kube-proxy couldn't install Service rules. Each missing module's absence caused the same symptom (error sending request from the test pod) but a different iptables-restore failure underneath:

iptable_filter, iptable_mangle — base tables.
br_netfilter + bridge-nf-call-iptables sysctl — needed for iptables NAT to fire on bridge traffic; added a retry loop because the modprobe + sysctl race could leave the sysctl path missing.
ipt_REJECT + nf_reject_ipv4 — kube-proxy's first rule rejects empty-endpoint Services.
xt_nat — registers the DNAT/SNAT/NETMAP iptables targets. Without it, /proc/net/ip_tables_targets lacks DNAT, iptables-restore aborts atomically on the first KUBE-SVC -j DNAT line, and the entire hundred-rule transaction is lost — every Service ClusterIP black-holes. nf_nat (the NAT engine) was already loaded, but the iptables-side glue lives in xt_nat.ko as a separate ~12 KB module. This was the last blocker.

Runtime + build (d3824a6, fb96345)

Bind-mount /etc/hosts so localhost resolves via files in runner containers.
Refactor Windows VM asset embedding: ephemerd-linux is now separated from the initrd and appended via runtime cpio at boot, so Linux code changes no longer invalidate the initrd cache.

Debug tooling (52eefca, plus the existing cmd/ctrdebug/)

cmd/ctrdebug connects to the in-VM containerd over TCP for list/exec of containers across namespaces. exec routes through the in-VM HTTP debug-exec server on containerd_port+2 because containerd's FIFO cio can't cross the Windows host ↔ Linux VM boundary (named pipes vs Linux FIFOs).
This commit adds /namespaces and ns=* to the /list endpoint so callers don't need to grep the ephemerd log for the per-job dind namespace (ephemerd-dind-<runner-name>).
All of this only exists because finding xt_nat required live inspection of /proc/net/ip_tables_targets and lsmod inside a running kindest/node — there was no other way.

Tests (2fd839b, 739f339, 2b37cf7)

Cover pkg/dind/portforward proxy + pkg/vm/initrd cpio append.
pkg/dind/registry_e2e_test.go (pre-existing TestPushHandlerEndToEnd) was failing in CI with content digest sha256:48a9...: not found (the synthetic layer). Two attempts:
- 739f339 — added containerd.io/gc.ref.content.* labels to the staged manifest so containerd's GC would keep config + layer reachable. Did not change the failure.
- 2b37cf7 — added a post-stage cs.Info(ctx, digest) diagnostic that confirms the layer is visible in the buildkit namespace right after staging. CI went green with this present, but with t.Logf suppressed on success we can't tell from one run whether the read serves as an implicit barrier that prevents the flake or whether the test just happened to pass. Worth watching for re-flakes; may still need a real fix (probably an explicit leases.Create around the staging block).

CI hygiene (22ae2b5)

9 errcheck issues in cmd/ephemerd/debugexec_linux.go (unchecked fmt.Fprintf/Fprintln on HTTP response writes) — wrapped via a writef closure that logs to debug on failure.
1 staticcheck SA9003 empty-branch in pkg/dind/portforward_test.go — converted into a proper t.Errorf for unexpected error types.

How we got here

Leaving this in because the next person to extend this is going to hit the same kind of cascading failure, and the debug pattern matters:

kindest/node started, but the in-cluster API server's Service ClusterIP was unreachable from any pod. CoreDNS readiness 503, e2e test pod error sending request for url.
kubectl exec into kindest/node was hard because containerd's FIFO cio can't bridge Windows ↔ Linux VM — built ctrdebug + an in-VM HTTP debug-exec server (/list, /exec, later /namespaces).
With live access, kube-proxy logs showed iptables-restore aborting on Extension REJECT revision 0 not supported. Added ipt_REJECT + nf_reject_ipv4. Symptom stayed identical at the pod level, but kube-proxy's error changed to Extension DNAT revision 0 not supported.
cat /proc/net/ip_tables_targets listed REJECT REDIRECT MARK MASQUERADE ERROR — no DNAT. lsmod | grep nat showed nf_nat (engine) loaded but no xt_nat (target registration). Added it. DNAT appeared, iptables-legacy -t nat -A ... -j DNAT worked manually, and kube-proxy installed the full KUBE-SVC-*/KUBE-SEP-* chain hierarchy.
ephpm E2E results flipped from "every suite fails with network errors" to "~14 suites pass cleanly; remaining failures are app-side: 429 rate-limit + Call to undefined function ephpm_kv_get()" — both outside ephemerd's scope.

Test plan

On Windows (Hyper-V): trigger ephpm E2E run on a self-hosted ephemerd runner; kindest/node spins up; kube-proxy installs full KUBE-SVC-*/KUBE-SEP-* chains with DNAT targets; ~14 e2e test suites pass cleanly. Remaining failures are isolated to ephpm app code.
Verified live in kindest/node: /proc/net/ip_tables_targets lists DNAT, lsmod shows xt_nat, manual iptables-legacy -t nat -A ... -j DNAT succeeds.
CI: lint clean; pkg/dind tests pass; Build (Linux), Build (Windows amd64), Build (macOS arm64) all green. Build (Linux arm64) is queued waiting for the macOS-host ephemerd's JIT poller to come online — its Vz Linux VM is aarch64 and pkg/scheduler/scheduler.go:1225-1230 advertises [self-hosted, linux, arm64] based on host goruntime.GOARCH, so once the macOS ephemerd is polling, it will claim the job.
On macOS: set [vm.linux] cpus = 6, memory_mb = 8192, confirm the boot log reflects configured values.
On Linux: mage build and smoke run — unaffected (VM-related params ignored).

Caveats / follow-ups

The TestPushHandlerEndToEnd Info-diagnostic fix may be flake-masking rather than fixing. If it re-flakes, the real fix is probably an explicit leases.Create in the test to hold staged blobs alive across the staging→push boundary.
Build (Linux arm64) depends on the macOS-host ephemerd being up to register a JIT runner with the matching labels. If that host is offline, the job stays queued.

The [vm.linux] cpus and memory_mb values in config.toml were parsed but never passed through startContainerRuntime to vm.LinuxVMConfig, so the VM always started with SetDefaults() values (1 CPU, 4096 MB). Also fixes discarded errors on dispatch client Close().

The kindest/node Dockerfile declares VOLUME /var, which Docker backs with an ext4 volume. Ephemerd skips anonymous volumes, leaving /var on overlayfs. Nested containerd inside the KIND node then fails with "filesystem not supported as upperdir" when trying overlayfs-on-overlayfs. Fix by mounting tmpfs at /var/lib/containerd in the kindest/node init wrapper, copying the pre-loaded images before bind-mounting over.

Thread DiskSizeGB from config through startContainerRuntime to the LinuxVMConfig on all platforms. On Windows, resize the existing VHDX via Resize-VHD when the configured size exceeds the current max. The init script now runs resize2fs after mount to grow the ext4 filesystem online, handling disk expansion without a schema bump.

kubectl create -f - (used by KIND to apply kindnet CNI) hung forever because bytes.NewReader + cio.WithStreams races: the containerd IO goroutine drains the reader and closes the FIFO write-end before the shim connects it to the exec process, so the process sees immediate EOF and blocks reading stdin indefinitely. Use io.Pipe instead — data is held back until after proc.Start() returns, when the shim's FIFO plumbing is fully connected. The pipe writer goroutine then delivers the buffered stdin and closes.

The io.Pipe approach from the previous commit still hangs: the containerd exec FIFO mechanism is fundamentally unreliable for stdin delivery regardless of write timing. Generalize the cp-stdin direct-write approach: for any exec with stdin data, write it to a temp file in the container's overlayfs upperdir and wrap the command with shell redirection:

    sh -c '"$@" < /.ephemerd-stdin-XXX; s=$?; rm -f ...; exit $s' _ <orig-cmd>

This completely bypasses containerd FIFOs for stdin. The process reads from the file via shell redirect, gets proper EOF, and the temp file is cleaned up after.
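A rough sketch of how such a wrapper argv could be assembled follows. `wrapWithStdinFile` and `shellQuote` are hypothetical names, the example path is made up, and the assumption that `rm -f` targets the same temp file is inferred from the description rather than taken from the code.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes so sh treats it as one literal word.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// wrapWithStdinFile (hypothetical) builds an argv that runs cmd with stdin
// redirected from stdinPath, removes the file, and preserves the exit code.
// Assumption: the elided rm -f target is the same temp file.
func wrapWithStdinFile(stdinPath string, cmd []string) []string {
	q := shellQuote(stdinPath)
	script := fmt.Sprintf(`"$@" < %s; s=$?; rm -f %s; exit $s`, q, q)
	// "_" fills $0 so the original command becomes "$@".
	argv := []string{"sh", "-c", script, "_"}
	return append(argv, cmd...)
}

func main() {
	argv := wrapWithStdinFile("/.ephemerd-stdin-example",
		[]string{"kubectl", "create", "-f", "-"})
	fmt.Printf("%q\n", argv)
}
```

Passing the original command as positional arguments avoids re-quoting each of its words inside the script.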

buildctl dial-stdio is a bidirectional streaming exec — buildx sends gRPC frames over the hijacked connection's stdin and reads responses on stdout. The file-redirect approach broke this by giving it only the initial 33-byte handshake then EOF. Detect streaming vs one-shot by checking whether the 5-second stdin read hit a timeout (streaming: client kept conn open) or EOF (one-shot: client closed write side). For streaming execs, use io.MultiReader(buffered, conn) as stdin so the initial bytes plus ongoing traffic flow through containerd's FIFOs. The FIFO race doesn't apply here since the reader never returns EOF until the client disconnects.

Buildkit inside a dind container failed to resolve registry-1.docker.io because the default resolv.conf pointed to [::1]:53 (no DNS server). Bind-mount a resolv.conf with 1.1.1.1 and 8.8.8.8 into every dind container, matching the /etc/hosts provisioning pattern.

The kindest/node entrypoint writes to /etc/resolv.conf at line 477. Read-only mount caused the container to exit immediately with error.

Buildkit's native snapshotter does full filesystem copies per layer, quickly exhausting the 50GB VM disk. By bind-mounting a directory from the host's real ext4 filesystem at /var/lib/buildkit, buildkit can detect and use overlayfs instead, storing only layer diffs.

… at boot

The Linux ephemerd binary used to be cpio'd into the initrd at build time (mage download:initrdx86). That meant plain `go build -o ephemerd.exe` silently produced a Windows binary running yesterday's Linux code, because nothing regenerated the initrd. Several iterations of dind fixes were shipped this way before I noticed: every redeploy ran the old binary.

Move ephemerd-linux into its own go:embed directive in pkg/vm, extract it to <DataDir>\vm\linux\ephemerd-linux on every service start, and append a tiny runtime-generated cpio (assets/ephemerd-linux) to the boot initrd before the VM boots. The kernel merges concatenated cpios into a single initramfs, so the init script's /assets/ephemerd-linux is the freshest binary every time. Initrd drops from ~190MB to ~7.6MB at build time.

…ward

Four gaps in our fake Docker daemon prevented kind create cluster + kind load image-archive + tilt ci from working end-to-end. Each one blocked the next.

1. /etc/hostname bind-mount. systemd inside kindest/node reads /etc/hostname at boot and sets the kernel hostname from it. Without our bind mount the image default ("debuerreotype") wins, and kubeadm generates apiserver certs with that name in the SAN list.
2. Labels in /containers/json. KIND uses `docker ps --filter label=io.x-k8s.kind.cluster=<name>` to find its node containers. We stored labels at create time but never returned them in the list response, so the client filter (server-side via the filters query param) matched nothing and `kind load image-archive` ended up running `docker inspect ''` and failing with "invalid container name or ID".
3. Server-side filter support for both Docker SDK shapes: {"label":{"k=v":true}} (current SDK) and {"label":["k=v"]} (older). Without parsing the filter, we returned the full container set and kind's `-q` output wasn't what it expected.
4. Userspace TCP port-forward proxy. KIND publishes the cluster API on 127.0.0.1:<random> in the runner's namespace via -p <host>:6443. Docker normally sets up DNAT for this; we stored PortBindings but never wired anything. iptables-nft inside the runner needs kernel modules our VM doesn't carry, and the runner image has no iptables-legacy binary, so we run a Go TCP proxy: lock an OS thread, setns into the runner netns, net.Listen on 127.0.0.1:hostPort, forward each Accept to <containerIP>:<containerPort>. Goroutine stays in the runner's ns for its lifetime; the dial side reaches the kindest/node directly via the CNI bridge they share. Stop functions live on containerEntry and fire from cleanupContainer before the kill so in-flight kubectl calls fail fast.
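Accepting both filter shapes from item 3 could be done along these lines. This is a sketch with hypothetical helper names (`parseLabelFilters`, `matchesLabels`), not the handler in pkg/dind; only the two JSON shapes are taken from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// parseLabelFilters (hypothetical) extracts label filters from the raw
// "filters" query value, accepting the map shape and the list shape.
func parseLabelFilters(raw string) ([]string, error) {
	if raw == "" {
		return nil, nil
	}
	var top map[string]json.RawMessage
	if err := json.Unmarshal([]byte(raw), &top); err != nil {
		return nil, err
	}
	msg, ok := top["label"]
	if !ok {
		return nil, nil
	}
	// Map shape: {"label":{"k=v":true}}
	var asMap map[string]bool
	if err := json.Unmarshal(msg, &asMap); err == nil {
		out := make([]string, 0, len(asMap))
		for k, enabled := range asMap {
			if enabled {
				out = append(out, k)
			}
		}
		sort.Strings(out)
		return out, nil
	}
	// List shape: {"label":["k=v"]}
	var asList []string
	if err := json.Unmarshal(msg, &asList); err != nil {
		return nil, fmt.Errorf("unsupported label filter shape: %w", err)
	}
	return asList, nil
}

// matchesLabels reports whether a container's labels satisfy every filter.
// "k=v" requires an exact value; a bare "k" only requires the key.
func matchesLabels(labels map[string]string, filters []string) bool {
	for _, f := range filters {
		key, want, hasValue := strings.Cut(f, "=")
		got, ok := labels[key]
		if !ok || (hasValue && got != want) {
			return false
		}
	}
	return true
}

func main() {
	filters, err := parseLabelFilters(`{"label":{"io.x-k8s.kind.cluster=kind":true}}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	node := map[string]string{"io.x-k8s.kind.cluster": "kind"}
	fmt.Println(filters, matchesLabels(node, filters))
}
```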

…n runner

Docker bind-mounts a per-container /etc/hosts at runtime; we only mounted resolv.conf. The actions-runner image's /etc/hosts is effectively empty, so Go programs that call net.Listen("tcp", "localhost:10350") (tilt does this for its CI dashboard) fall through to DNS and fail with "lookup localhost on 1.1.1.1:53: no such host". Add withHostsMount alongside withDNSMount with the same write-then-bind pattern, populated with the standard 127.0.0.1/::1 localhost block.

Also hand the runner's netns to the dind server (SetRunnerNetNS) once task.Pid() is known so port-forward proxies install into the right namespace.

Add unit coverage for the new pieces from the KIND-support and Windows binary-embed refactors. Also fixes errcheck on pre-existing stdinPW.Close calls in pkg/dind/exec.go that CI flagged on this branch.

pkg/vm/initrd_windows_test.go
- writeCPIOEntry header/name/body layout and 4-byte alignment
- writeCPIOEntry symlink mode emits link target as body
- buildBootInitrd roundtrip: appended gzipped cpio contains assets/ephemerd-linux with the binary bytes intact, ends with TRAILER!!!, and the base initrd content is preserved as a prefix
- error paths for missing base and missing binary

pkg/dind/portforward_test.go
- forwardConn bidirectional byte flow (write→target, target→write)
- forwardConn on unreachable target closes the client side cleanly
- startPortForwardProxy rejects empty + nonexistent netns paths
- startPortForwardProxy end-to-end against /proc/self/ns/net so the setns + listen + accept + forward path runs in CI (skips if the sandbox denies setns into its own ns)

pkg/dind/exec.go
- log Close errors instead of dropping them (errcheck fix)

The kindest/node container runs kube-proxy in iptables mode to translate Kubernetes Service ClusterIPs to pod IPs. kube-proxy needs iptable_filter and iptable_mangle to install its FORWARD + MARK chains; without those, every pod-to-Service request hangs forever (each reqwest call burns its ~20s timeout). The ephpm pod itself is healthy throughout — the failure is the e2e test pod's HTTP calls to http://ephpm:8080/* timing out because Service routing is silently dead inside the cluster. Add iptable_filter, iptable_mangle, plus the xt_REDIRECT / xt_statistic / xt_recent matches that kube-proxy emits for NodePort handling and multi-backend load balancing. Initrd module count goes from 43 to 48.

…proxy

The iptable_filter / iptable_mangle additions got kube-proxy started but its rules still didn't fire on pod traffic. Without br_netfilter, the sysctl /proc/sys/net/bridge/bridge-nf-call-iptables doesn't exist, so packets crossing the CNI bridge inside kindest/node bypass iptables NAT entirely. CoreDNS then can't reach 10.96.0.1:443 to bootstrap (readiness probe goes 503), every test pod's DNS lookup fails, and tilt times out at 30 min waiting on ephpm-e2e:runtime.

Add the br_netfilter module to initrdKernelModulesX86 and have the init script set bridge-nf-call-iptables=1 + bridge-nf-call-ip6tables=1 after the kernel modules load.

The first-pass insmod loop tried to load br_netfilter but the kernel warning "filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this" still showed at boot — likely a dependency-ordering race in the generic loop. Retry br_netfilter explicitly after the loop runs and check that /proc/sys/net/bridge/bridge-nf-call-iptables actually exists before echoing 1 into it. Log enabled/missing so the boot log surfaces this reliably instead of failing silently.

Confirmed via the new in-VM debug-exec endpoint by exec'ing into the live kindest/node and dumping kube-proxy's container logs:

    Warning: Extension REJECT revision 0 not supported, missing kernel module?
    iptables-restore: line 15 failed
    "Sync failed" ipFamily="IPv4" retryingTime="30s"

kube-proxy emits a REJECT rule for "Service with no endpoints" as the first rule of its sync batch. Without ipt_REJECT (which we never bundled), the very first rule fails, the entire iptables-restore aborts, and NO Service rules get installed; the KUBE-SERVICES chain remains empty.

Downstream effects (the entire test failure pattern):
- 10.96.0.1:443 has no DNAT target → CoreDNS can't sync zones → readiness probe returns 503
- Test pod's DNS lookup for "ephpm" never resolves → every HTTP request fails with reqwest "error sending request"
- tilt ci hits 30-min timeout on ephpm-e2e:runtime

Add ipt_REJECT + nf_reject_ipv4 to initrdKernelModulesX86.

Plus the debug-exec scaffolding that made this diagnosable:
- cmd/ephemerd/debugexec_linux.go: HTTP server in worker mode that runs cio-attached exec inside any container in any namespace, streams stdout/stderr back over TCP. Avoids containerd cio FIFO mismatch (Linux FIFOs vs Windows named pipes) by terminating IO inside the VM.
- cmd/ephemerd/debugexec_other.go: no-op stub for non-Linux builds.
- pkg/runtime: expose Client() for the debug-exec server.
- cmd/ctrdebug: replace cio.LogFile path-conversion bug with an HTTP client hitting containerd_port+2.

xt_nat.ko registers the DNAT, SNAT, and NETMAP iptables targets. Without it, /proc/net/ip_tables_targets lacks DNAT and kube-proxy's iptables-restore aborts at the first KUBE-SVC DNAT line with "Extension DNAT revision 0 not supported, missing kernel module". iptables-restore is atomic, so the whole transaction is lost and no Service ClusterIP rules ever install — every pod-to-Service request black-holes, CoreDNS readiness fails with 503, and the e2e test pod times out trying to reach the ephpm Service. nf_nat.ko (the NAT engine) was already loaded; xt_MASQUERADE.ko and xt_REDIRECT.ko register their respective targets in separate modules, which is why MASQUERADE worked but plain DNAT did not. Added to both the aarch64 (macOS Vz) and x86_64 (Windows Hyper-V) module lists. Verified live in kindest/node: /proc/net/ip_tables_targets now lists DNAT, manual `iptables-legacy -t nat -A ... -j DNAT` succeeds, and kube-proxy installs the full KUBE-SVC-*/KUBE-SEP-* chain hierarchy with DNAT rules pointing at backend pod IPs.
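The manual check against /proc/net/ip_tables_targets is easy to automate. The sketch below uses a hypothetical `missingTargets` helper and is fed the target list observed before the fix; which targets to require is an assumption chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// missingTargets (hypothetical) returns the required iptables targets that
// do not appear in the contents of /proc/net/ip_tables_targets.
func missingTargets(procContents string, required []string) []string {
	have := make(map[string]bool)
	for _, name := range strings.Fields(procContents) {
		have[name] = true
	}
	var missing []string
	for _, t := range required {
		if !have[t] {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// Target list as observed in kindest/node before xt_nat was added.
	before := "REJECT\nREDIRECT\nMARK\nMASQUERADE\nERROR\n"
	required := []string{"REJECT", "MASQUERADE", "DNAT"} // assumed set
	fmt.Println("missing targets:", missingTargets(before, required))
}
```

In a live check the first argument would come from reading the proc file inside the node rather than from a literal.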

When debugging live kindest/node containers in dind, the actual namespace name is constructed from the runner's display name (e.g. ephemerd-dind-ephemerd-github-ephpm-fast_shannon) and isn't known until the job dispatches. Probing /list?ns=ephemerd returned count=0 because the containers were in a per-job namespace, with no way to enumerate which namespaces existed. Add /namespaces returning ctrdClient.NamespaceService().List() output, and extend /list to accept ns=* (or empty ns) to iterate every namespace. This makes ctrdebug usable for live cluster inspection without needing to grep the ephemerd log for the runner name first. Also extends the /exec 404 error to report which container IDs the namespace lookup did see, so the next 404 surfaces enough context to distinguish a wrong-namespace bug from a missing-container bug.

errcheck flagged 9 unchecked fmt.Fprintf/Fprintln calls in cmd/ephemerd/debugexec_linux.go. Wrap them via a writef closure that logs at debug level if the write fails; a broken HTTP connection is the only realistic source of failure here, and there's nothing useful to do beyond logging. Also wrap the previously _ =-ignored procIO.Close per the no-discard-errors rule.

staticcheck SA9003 flagged an empty if branch in pkg/dind/portforward_test.go that was meant to verify the connection closed cleanly. Invert the guard to err != nil && !EOF && !ErrClosedPipe and add a t.Errorf for genuinely unexpected error types, which is what the original branch was clearly trying to express.

TestPushHandlerEndToEnd staged a synthetic image (layer + config + manifest) into containerd's content store via content.WriteBlob, but wrote the manifest blob with no labels. containerd's GC only walks references via blob labels of the form containerd.io/gc.ref.content.*, so the image record kept the manifest reachable but the manifest's config + layer were unreferenced and eligible for collection. When a GC pass fired between staging and the /push call, the layer blob was deleted and containerd's resolver reported "content digest sha256:48a9...f88a: not found" (the layer's digest) — passing locally where GC happened not to run in the ~0.3s window, failing in CI where the schedule landed differently. Add gc.ref.content.config and gc.ref.content.l.0 labels to the manifest write so the full image graph is reachable from the image record.
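For reference, the label set described here could be built like this. `gcRefLabels` is a hypothetical helper and the digests are placeholders; the resulting map would be attached to the manifest write (for example through containerd's content.WithLabels option, assuming the test uses that path).

```go
package main

import "fmt"

// gcRefLabels (hypothetical) builds the labels that let containerd's GC walk
// from a manifest blob to its config and layer blobs.
func gcRefLabels(configDigest string, layerDigests []string) map[string]string {
	labels := map[string]string{
		"containerd.io/gc.ref.content.config": configDigest,
	}
	for i, d := range layerDigests {
		labels[fmt.Sprintf("containerd.io/gc.ref.content.l.%d", i)] = d
	}
	return labels
}

func main() {
	// Placeholder digests; real values come from the staged blobs.
	labels := gcRefLabels("sha256:config-placeholder",
		[]string{"sha256:layer0-placeholder"})
	for k, v := range labels {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```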

The gc.ref label fix didn't change anything: the same digest still reports "not found" mid-push. Add a diagnostic that runs Info on the layer and manifest digests immediately after staging, before the test invokes the HTTP push handler. The output will tell us whether the layer is visible in the buildkit-namespace content bucket at the moment of staging:
- If Info returns NotFound: the WriteBlob path silently failed to register the digest in the namespace (commit went through but the blob-bucket entry wasn't created).
- If Info succeeds but push still fails: something between stage and push (GC, lease expiry, namespace context drift) is dropping the visibility.

Push the diagnostic to surface what CI actually sees, then revert once the root cause is clear.

luthermonson added 20 commits May 6, 2026 22:45

fix(dind): mount resolv.conf read-write for kindest/node entrypoint

aafad9d

The kindest/node entrypoint writes to /etc/resolv.conf at line 477. Read-only mount caused the container to exit immediately with error.

luthermonson changed the title ~~fix(vm): wire Linux VM cpus/memory_mb from config to runtime~~ feat: end-to-end KIND-on-dind support for self-hosted ephpm E2E runners May 13, 2026

luthermonson added 2 commits May 13, 2026 19:27

luthermonson changed the title ~~feat: end-to-end KIND-on-dind support for self-hosted ephpm E2E runners~~ feat: end-to-end KIND-on-dind support — kube-proxy networking, dind hardening, debug tooling May 14, 2026

luthermonson merged commit 7fd9fb1 into main May 14, 2026
4 checks passed

luthermonson deleted the fix/vm-config-wiring branch May 14, 2026 05:27

luthermonson merged 22 commits into main from fix/vm-config-wiring
