feat: end-to-end KIND-on-dind support — kube-proxy networking, dind hardening, debug tooling #68
Merged
The [vm.linux] cpus and memory_mb values in config.toml were parsed but never passed through startContainerRuntime to vm.LinuxVMConfig, so the VM always started with SetDefaults() values (1 CPU, 4096 MB). Also fixes discarded errors on dispatch client Close().
The kindest/node Dockerfile declares VOLUME /var, which Docker backs with an ext4 volume. Ephemerd skips anonymous volumes, leaving /var on overlayfs. Nested containerd inside the KIND node then fails with "filesystem not supported as upperdir" when trying overlayfs-on-overlayfs. Fix by mounting tmpfs at /var/lib/containerd in the kindest/node init wrapper, copying the pre-loaded images before bind-mounting over.
Thread DiskSizeGB from config through startContainerRuntime to the LinuxVMConfig on all platforms. On Windows, resize the existing VHDX via Resize-VHD when the configured size exceeds the current max. The init script now runs resize2fs after mount to grow the ext4 filesystem online, handling disk expansion without a schema bump.
kubectl create -f - (used by KIND to apply kindnet CNI) hung forever because bytes.NewReader + cio.WithStreams races: the containerd IO goroutine drains the reader and closes the FIFO write-end before the shim connects it to the exec process, so the process sees immediate EOF and blocks reading stdin indefinitely. Use io.Pipe instead — data is held back until after proc.Start() returns, when the shim's FIFO plumbing is fully connected. The pipe writer goroutine then delivers the buffered stdin and closes.
The io.Pipe approach from the previous commit still hangs — the containerd exec FIFO mechanism is fundamentally unreliable for stdin delivery regardless of write timing. Generalize the cp-stdin direct-write approach: for any exec with stdin data, write it to a temp file in the container's overlayfs upperdir and wrap the command with shell redirection: sh -c '"$@" < /.ephemerd-stdin-XXX; s=$?; rm -f ...; exit $s' _ <orig-cmd> This completely bypasses containerd FIFOs for stdin. The process reads from the file via shell redirect, gets proper EOF, and the temp file is cleaned up after.
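The command wrapping described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `wrapWithStdinFile` and `shellQuote` are hypothetical names, and the sketch assumes the stdin bytes have already been written to `stdinPath` inside the container.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes so sh treats it as one literal word.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// wrapWithStdinFile (hypothetical helper) returns an argv that runs cmd with
// stdin redirected from stdinPath, removes the file afterwards, and exits
// with the status of the original command.
func wrapWithStdinFile(cmd []string, stdinPath string) []string {
	p := shellQuote(stdinPath)
	script := fmt.Sprintf(`"$@" < %s; s=$?; rm -f %s; exit $s`, p, p)
	// "_" fills $0 so the original command lands in "$@".
	argv := []string{"sh", "-c", script, "_"}
	return append(argv, cmd...)
}

func main() {
	// The path below is a placeholder, not the real temp-file name.
	argv := wrapWithStdinFile([]string{"kubectl", "create", "-f", "-"}, "/.ephemerd-stdin-demo")
	fmt.Printf("%q\n", argv)
}
```

Because the original command is passed as positional parameters rather than spliced into the script text, its arguments need no extra quoting.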
buildctl dial-stdio is a bidirectional streaming exec — buildx sends gRPC frames over the hijacked connection's stdin and reads responses on stdout. The file-redirect approach broke this by giving it only the initial 33-byte handshake then EOF. Detect streaming vs one-shot by checking whether the 5-second stdin read hit a timeout (streaming: client kept conn open) or EOF (one-shot: client closed write side). For streaming execs, use io.MultiReader(buffered, conn) as stdin so the initial bytes plus ongoing traffic flow through containerd's FIFOs. The FIFO race doesn't apply here since the reader never returns EOF until the client disconnects.
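A minimal sketch of the timeout-versus-EOF classification described above, assuming the hijacked connection is a `net.Conn`. The names `classifyStdin`, `stdinReader`, and `demo` are hypothetical, and the demo uses a 200 ms wait and an in-memory `net.Pipe` instead of the 5-second read on a real connection.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

// classifyStdin reads until EOF or until wait elapses. A deadline error means
// the client kept the connection open (streaming); EOF means one-shot stdin.
func classifyStdin(conn net.Conn, wait time.Duration) ([]byte, bool, error) {
	if err := conn.SetReadDeadline(time.Now().Add(wait)); err != nil {
		return nil, false, err
	}
	buffered, err := io.ReadAll(conn)
	if errors.Is(err, os.ErrDeadlineExceeded) {
		// Clear the deadline so later reads block normally.
		return buffered, true, conn.SetReadDeadline(time.Time{})
	}
	return buffered, false, err
}

// stdinReader replays the buffered bytes and, for streaming execs, continues
// with whatever the client sends afterwards.
func stdinReader(conn net.Conn, buffered []byte, streaming bool) io.Reader {
	if streaming {
		return io.MultiReader(bytes.NewReader(buffered), conn)
	}
	return bytes.NewReader(buffered)
}

// demo simulates a client that either closes after writing or stays open.
func demo(closeAfterWrite bool) (string, bool) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	go func() {
		_, _ = client.Write([]byte("handshake"))
		if closeAfterWrite {
			_ = client.Close()
		}
	}()
	buffered, streaming, err := classifyStdin(server, 200*time.Millisecond)
	if err != nil {
		return "error: " + err.Error(), false
	}
	return string(buffered), streaming
}

func main() {
	for _, oneShot := range []bool{true, false} {
		data, streaming := demo(oneShot)
		fmt.Printf("client closes=%v buffered=%q streaming=%v\n", oneShot, data, streaming)
	}
}
```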
Buildkit inside a dind container failed to resolve registry-1.docker.io because the default resolv.conf pointed to [::1]:53 (no DNS server). Bind-mount a resolv.conf with 1.1.1.1 and 8.8.8.8 into every dind container, matching the /etc/hosts provisioning pattern.
The kindest/node entrypoint writes to /etc/resolv.conf at line 477, so the read-only mount caused the container to exit immediately with an error. Mount the file read-write instead.
Buildkit's native snapshotter does full filesystem copies per layer, quickly exhausting the 50GB VM disk. By bind-mounting a directory from the host's real ext4 filesystem at /var/lib/buildkit, buildkit can detect and use overlayfs instead, storing only layer diffs.
… at boot
The Linux ephemerd binary used to be cpio'd into the initrd at build time (mage download:initrdx86). That meant plain `go build -o ephemerd.exe` silently produced a Windows binary running yesterday's Linux code, because nothing regenerated the initrd. Several iterations of dind fixes were shipped this way before I noticed — every redeploy ran the old binary. Move ephemerd-linux into its own go:embed directive in pkg/vm, extract it to <DataDir>\vm\linux\ephemerd-linux on every service start, and append a tiny runtime-generated cpio (assets/ephemerd-linux) to the boot initrd before the VM boots. The kernel merges concatenated cpios into a single initramfs, so the init script's /assets/ephemerd-linux is the freshest binary every time. Initrd drops from ~190MB to ~7.6MB at build time.
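For readers unfamiliar with the cpio "newc" layout that the appended archive relies on, here is a rough sketch of writing one entry plus the trailer. It is an illustration under assumptions: `appendNewcEntry` and `buildArchive` are hypothetical names (not the PR's `writeCPIOEntry`), all metadata fields other than mode, size, and name length are zeroed, and the gzip step mentioned in the tests is omitted.

```go
package main

import (
	"bytes"
	"fmt"
)

// newcHeaderLen is the fixed header size: 6-byte magic + 13 fields of 8 hex digits.
const newcHeaderLen = 110

// pad4 returns how many NUL bytes are needed to reach a 4-byte boundary.
func pad4(n int) int { return (4 - n%4) % 4 }

// appendNewcEntry writes header, NUL-terminated name, and body, padding
// header+name and body to 4-byte boundaries.
func appendNewcEntry(buf *bytes.Buffer, name string, mode uint32, body []byte) {
	nameSize := len(name) + 1 // includes the trailing NUL
	fmt.Fprintf(buf, "070701%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X",
		0,         // inode
		mode,      // file type and permissions
		0, 0,      // uid, gid
		1,         // nlink
		0,         // mtime
		len(body), // file size
		0, 0,      // device major, minor
		0, 0,      // rdev major, minor
		nameSize,  // name length including NUL
		0)         // checksum (unused by newc)
	buf.WriteString(name)
	buf.WriteByte(0)
	buf.Write(make([]byte, pad4(newcHeaderLen+nameSize)))
	buf.Write(body)
	buf.Write(make([]byte, pad4(len(body))))
}

// buildArchive emits a single regular file followed by the TRAILER!!! record.
func buildArchive(name string, body []byte) []byte {
	var buf bytes.Buffer
	appendNewcEntry(&buf, name, 0o100755, body)
	appendNewcEntry(&buf, "TRAILER!!!", 0, nil)
	return buf.Bytes()
}

func main() {
	archive := buildArchive("assets/ephemerd-linux", []byte("abc"))
	fmt.Printf("archive is %d bytes, header starts with %q\n", len(archive), archive[:6])
}
```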
…ward
Four gaps in our fake Docker daemon prevented kind create cluster + kind
load image-archive + tilt ci from working end-to-end. Each one blocked
the next.
1. /etc/hostname bind-mount. systemd inside kindest/node reads /etc/hostname
at boot and sets the kernel hostname from it. Without our bind mount the
image default ("debuerreotype") wins, and kubeadm generates apiserver
certs with that name in the SAN list.
2. Labels in /containers/json. KIND uses `docker ps --filter label=
io.x-k8s.kind.cluster=<name>` to find its node containers. We stored
labels at create time but never returned them in the list response, so
the client's label filter (passed to the server via the filters query
param) matched nothing and `kind load image-archive` ended up running
`docker inspect ''` and failing with "invalid container name or ID".
3. Server-side filter support for both Docker SDK shapes —
{"label":{"k=v":true}} (current SDK) and {"label":["k=v"]} (older).
Without parsing the filter, we returned the full container set and
kind's `-q` output wasn't what it expected.
4. Userspace TCP port-forward proxy. KIND publishes the cluster API on
127.0.0.1:<random> in the runner's namespace via -p <host>:6443. Docker
normally sets up DNAT for this; we stored PortBindings but never wired
anything. iptables-nft inside the runner needs kernel modules our VM
doesn't carry, and the runner image has no iptables-legacy binary, so
we run a Go TCP proxy: lock an OS thread, setns into the runner netns,
net.Listen on 127.0.0.1:hostPort, forward each Accept to <containerIP>:
<containerPort>. Goroutine stays in the runner's ns for its lifetime;
the dial side reaches the kindest/node directly via the CNI bridge
they share. Stop functions live on containerEntry and fire from
cleanupContainer before the kill so in-flight kubectl calls fail fast.
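As an illustration of item 3, the two filter encodings can be normalized with a two-step JSON decode. This is a sketch under assumptions: `parseLabelFilters` and `matchesLabels` are hypothetical names, and only the `label` key is handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// parseLabelFilters accepts both {"label":{"k=v":true}} and {"label":["k=v"]}
// and returns the label expressions in sorted order.
func parseLabelFilters(raw string) ([]string, error) {
	if raw == "" {
		return nil, nil
	}
	var fields map[string]json.RawMessage
	if err := json.Unmarshal([]byte(raw), &fields); err != nil {
		return nil, fmt.Errorf("decode filters: %w", err)
	}
	msg, ok := fields["label"]
	if !ok {
		return nil, nil
	}
	var list []string
	if err := json.Unmarshal(msg, &list); err != nil {
		// Not the list shape, so try the map shape.
		var set map[string]bool
		if err := json.Unmarshal(msg, &set); err != nil {
			return nil, fmt.Errorf("decode label filter: %w", err)
		}
		list = nil
		for expr, on := range set {
			if on {
				list = append(list, expr)
			}
		}
	}
	sort.Strings(list)
	return list, nil
}

// matchesLabels reports whether labels satisfy every "key" or "key=value" filter.
func matchesLabels(labels map[string]string, filters []string) bool {
	for _, f := range filters {
		key, value, hasValue := strings.Cut(f, "=")
		got, ok := labels[key]
		if !ok || (hasValue && got != value) {
			return false
		}
	}
	return true
}

func main() {
	for _, raw := range []string{
		`{"label":{"io.x-k8s.kind.cluster=kind":true}}`,
		`{"label":["io.x-k8s.kind.cluster=kind"]}`,
	} {
		filters, err := parseLabelFilters(raw)
		fmt.Println(filters, err)
	}
}
```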
…n runner
Docker bind-mounts a per-container /etc/hosts at runtime; we only mounted
resolv.conf. The actions-runner image's /etc/hosts is effectively empty,
so Go programs that call net.Listen("tcp", "localhost:10350") (tilt does
this for its CI dashboard) fall through to DNS and fail with
"lookup localhost on 1.1.1.1:53: no such host".
Add withHostsMount alongside withDNSMount with the same write-then-bind
pattern, populated with the standard 127.0.0.1/::1 localhost block. Also
hand the runner's netns to the dind server (SetRunnerNetNS) once
task.Pid() is known so port-forward proxies install into the right
namespace.
Add unit coverage for the new pieces from the KIND-support and Windows
binary-embed refactors. Also fixes errcheck on pre-existing stdinPW.Close
calls in pkg/dind/exec.go that CI flagged on this branch.
pkg/vm/initrd_windows_test.go
- writeCPIOEntry header/name/body layout and 4-byte alignment
- writeCPIOEntry symlink mode emits link target as body
- buildBootInitrd roundtrip: appended gzipped cpio contains
assets/ephemerd-linux with the binary bytes intact, ends with
TRAILER!!!, and the base initrd content is preserved as a prefix
- error paths for missing base and missing binary
pkg/dind/portforward_test.go
- forwardConn bidirectional byte flow (write→target, target→write)
- forwardConn on unreachable target closes the client side cleanly
- startPortForwardProxy rejects empty + nonexistent netns paths
- startPortForwardProxy end-to-end against /proc/self/ns/net so the
setns + listen + accept + forward path runs in CI (skips if the
sandbox denies setns into its own ns)
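The bidirectional byte flow these tests exercise can be sketched as below. This is an assumption-laden illustration: the `forwardConn` here takes two already-connected `net.Conn` values (the real function's signature is not shown in this PR text), the setns and listen steps are left out, and the demo uses in-memory pipes with an echo target.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// forwardConn copies bytes in both directions and closes both connections as
// soon as either direction finishes, so the other copy unblocks too.
func forwardConn(a, b net.Conn) {
	done := make(chan struct{}, 2)
	pipe := func(dst, src net.Conn) {
		_, _ = io.Copy(dst, src)
		done <- struct{}{}
	}
	go pipe(a, b)
	go pipe(b, a)
	<-done
	_ = a.Close()
	_ = b.Close()
	<-done
}

// demoEcho sends msg through the forwarder to an echo target and returns
// what comes back on the client side.
func demoEcho(msg string) string {
	clientEnd, proxyClient := net.Pipe()
	proxyUpstream, targetEnd := net.Pipe()
	go forwardConn(proxyClient, proxyUpstream)
	go func() {
		// Echo target: write back everything it reads.
		defer targetEnd.Close()
		_, _ = io.Copy(targetEnd, targetEnd)
	}()
	defer clientEnd.Close()
	if _, err := clientEnd.Write([]byte(msg)); err != nil {
		return "write error: " + err.Error()
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(clientEnd, buf); err != nil {
		return "read error: " + err.Error()
	}
	return string(buf)
}

func main() {
	fmt.Println(demoEcho("ping"))
}
```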
pkg/dind/exec.go
- log Close errors instead of dropping them (errcheck fix)
The kindest/node container runs kube-proxy in iptables mode to translate Kubernetes Service ClusterIPs to pod IPs. kube-proxy needs iptable_filter and iptable_mangle to install its FORWARD + MARK chains; without those, every pod-to-Service request hangs forever (each reqwest call burns its ~20s timeout). The ephpm pod itself is healthy throughout — the failure is the e2e test pod's HTTP calls to http://ephpm:8080/* timing out because Service routing is silently dead inside the cluster. Add iptable_filter, iptable_mangle, plus the xt_REDIRECT / xt_statistic / xt_recent matches that kube-proxy emits for NodePort handling and multi-backend load balancing. Initrd module count goes from 43 to 48.
…proxy
The iptable_filter / iptable_mangle additions got kube-proxy started but its rules still didn't fire on pod traffic — without br_netfilter, the sysctl /proc/sys/net/bridge/bridge-nf-call-iptables doesn't exist, so packets crossing the CNI bridge inside kindest/node bypass iptables NAT entirely. CoreDNS then can't reach 10.96.0.1:443 to bootstrap (readiness probe goes 503), every test pod's DNS lookup fails, and tilt times out at 30 min waiting on ephpm-e2e:runtime. Add the br_netfilter module to initrdKernelModulesX86 and have the init script set bridge-nf-call-iptables=1 + bridge-nf-call-ip6tables=1 after the kernel modules load.
The first-pass insmod loop tried to load br_netfilter but the kernel warning "filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this" still showed at boot — likely a dependency-ordering race in the generic loop. Retry br_netfilter explicitly after the loop runs and check that /proc/sys/net/bridge/bridge-nf-call-iptables actually exists before echoing 1 into it. Log enabled/missing so the boot log surfaces this reliably instead of failing silently.
Confirmed via the new in-VM debug-exec endpoint by exec'ing into the
live kindest/node and dumping kube-proxy's container logs:
Warning: Extension REJECT revision 0 not supported, missing kernel module?
iptables-restore: line 15 failed
"Sync failed" ipFamily="IPv4" retryingTime="30s"
kube-proxy emits a REJECT rule for "Service with no endpoints" as the
first rule of its sync batch. Without xt_REJECT (which we never bundled),
the very first rule fails, the entire iptables-restore aborts, and NO
Service rules get installed — KUBE-SERVICES chain remains empty.
Downstream effects (the entire test failure pattern):
- 10.96.0.1:443 has no DNAT target → CoreDNS can't sync zones →
readiness probe returns 503
- Test pod's DNS lookup for "ephpm" never resolves → every
HTTP request fails with reqwest "error sending request"
- tilt ci hits 30-min timeout on ephpm-e2e:runtime
Add ipt_REJECT + nf_reject_ipv4 to initrdKernelModulesX86.
Plus the debug-exec scaffolding that made this diagnosable:
- cmd/ephemerd/debugexec_linux.go: HTTP server in worker mode that runs
cio-attached exec inside any container in any namespace, streams
stdout/stderr back over TCP. Avoids containerd cio FIFO mismatch
(Linux FIFOs vs Windows named pipes) by terminating IO inside the VM.
- cmd/ephemerd/debugexec_other.go: no-op stub for non-Linux builds.
- pkg/runtime: expose Client() for the debug-exec server.
- cmd/ctrdebug: replace cio.LogFile path-conversion bug with an HTTP
client hitting containerd_port+2.
xt_nat.ko registers the DNAT, SNAT, and NETMAP iptables targets. Without it, /proc/net/ip_tables_targets lacks DNAT and kube-proxy's iptables-restore aborts at the first KUBE-SVC DNAT line with "Extension DNAT revision 0 not supported, missing kernel module". iptables-restore is atomic, so the whole transaction is lost and no Service ClusterIP rules ever install — every pod-to-Service request black-holes, CoreDNS readiness fails with 503, and the e2e test pod times out trying to reach the ephpm Service. nf_nat.ko (the NAT engine) was already loaded; xt_MASQUERADE.ko and xt_REDIRECT.ko register their respective targets in separate modules, which is why MASQUERADE worked but plain DNAT did not. Added to both the aarch64 (macOS Vz) and x86_64 (Windows Hyper-V) module lists. Verified live in kindest/node: /proc/net/ip_tables_targets now lists DNAT, manual `iptables-legacy -t nat -A ... -j DNAT` succeeds, and kube-proxy installs the full KUBE-SVC-*/KUBE-SEP-* chain hierarchy with DNAT rules pointing at backend pod IPs.
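The live check against /proc/net/ip_tables_targets could be automated along these lines. This is a sketch, not part of the PR: `missingTargets` and `missingFromText` are hypothetical helpers, and the sample text mirrors the target names reported elsewhere in this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// missingTargets returns the wanted iptables targets that do not appear in r,
// where r has the whitespace-separated format of /proc/net/ip_tables_targets.
func missingTargets(r io.Reader, want ...string) ([]string, error) {
	have := map[string]bool{}
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		for _, name := range strings.Fields(sc.Text()) {
			have[name] = true
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	var missing []string
	for _, w := range want {
		if !have[w] {
			missing = append(missing, w)
		}
	}
	return missing, nil
}

// missingFromText is a convenience wrapper for in-memory input.
func missingFromText(text string, want ...string) []string {
	missing, _ := missingTargets(strings.NewReader(text), want...)
	return missing
}

func main() {
	f, err := os.Open("/proc/net/ip_tables_targets")
	if err != nil {
		// Not on Linux or iptables not loaded: fall back to a sample.
		sample := "REJECT\nREDIRECT\nMARK\nMASQUERADE\nERROR\n"
		fmt.Println("sample missing:", missingFromText(sample, "DNAT", "MASQUERADE", "REJECT"))
		return
	}
	defer f.Close()
	missing, err := missingTargets(f, "DNAT", "MASQUERADE", "REJECT")
	fmt.Println("missing:", missing, "err:", err)
}
```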
When debugging live kindest/node containers in dind, the actual namespace name is constructed from the runner's display name (e.g. ephemerd-dind-ephemerd-github-ephpm-fast_shannon) and isn't known until the job dispatches. Probing /list?ns=ephemerd returned count=0 because the containers were in a per-job namespace, with no way to enumerate which namespaces existed. Add /namespaces returning ctrdClient.NamespaceService().List() output, and extend /list to accept ns=* (or empty ns) to iterate every namespace. This makes ctrdebug usable for live cluster inspection without needing to grep the ephemerd log for the runner name first. Also extends the /exec 404 error to report which container IDs the namespace lookup did see, so the next 404 surfaces enough context to distinguish a wrong-namespace bug from a missing-container bug.
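A rough sketch of the `/list` behaviour with `ns=*`, using a small interface in place of the containerd client so the example stays self-contained. The `lister` interface, `fakeLister`, and the `ns/id` output format are assumptions made for illustration; the real endpoint's response shape is not shown in this PR text.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sort"
)

// lister abstracts the two lookups the handler needs (hypothetical interface).
type lister interface {
	Namespaces(ctx context.Context) ([]string, error)
	ContainerIDs(ctx context.Context, ns string) ([]string, error)
}

// listHandler prints "namespace/containerID" lines. An empty ns or ns=*
// iterates every namespace instead of a single one.
func listHandler(l lister) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ns := r.URL.Query().Get("ns")
		namespaces := []string{ns}
		if ns == "" || ns == "*" {
			all, err := l.Namespaces(r.Context())
			if err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			namespaces = all
		}
		for _, n := range namespaces {
			ids, err := l.ContainerIDs(r.Context(), n)
			if err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			for _, id := range ids {
				fmt.Fprintf(w, "%s/%s\n", n, id)
			}
		}
	}
}

// fakeLister is an in-memory stand-in keyed by namespace.
type fakeLister map[string][]string

func (f fakeLister) Namespaces(context.Context) ([]string, error) {
	names := make([]string, 0, len(f))
	for n := range f {
		names = append(names, n)
	}
	sort.Strings(names)
	return names, nil
}

func (f fakeLister) ContainerIDs(_ context.Context, ns string) ([]string, error) {
	return f[ns], nil
}

// demoList runs one request against the handler and returns the body.
func demoList(ns string) string {
	l := fakeLister{"job-a": {"c1"}, "job-b": {"c2", "c3"}}
	req := httptest.NewRequest(http.MethodGet, "/list?ns="+url.QueryEscape(ns), nil)
	rec := httptest.NewRecorder()
	listHandler(l).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Print(demoList("*"))
}
```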
errcheck flagged 9 unchecked fmt.Fprintf/Fprintln calls in cmd/ephemerd/debugexec_linux.go. Wrap them via a writef closure that logs at debug level if the write fails; a broken HTTP connection is the only realistic source of failure here, and there's nothing useful to do beyond logging. Also handle the error from procIO.Close, which was previously discarded with `_ =`, per the no-discard-errors rule. staticcheck SA9003 flagged an empty if branch in pkg/dind/portforward_test.go that was meant to verify the connection closed cleanly. Invert the guard to err != nil && !EOF && !ErrClosedPipe and add a t.Errorf for genuinely unexpected error types, which is what the original branch was clearly trying to express.
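The writef pattern might look roughly like this. It is a sketch: `newWritef`, `brokenWriter`, and `demoFailures` are hypothetical names, and the standard `log` package stands in for whatever debug logger the project uses.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"log"
	"os"
)

// newWritef returns a Fprintf-style closure bound to w. Write errors are
// reported through logf instead of being dropped.
func newWritef(w io.Writer, logf func(format string, args ...any)) func(format string, args ...any) {
	return func(format string, args ...any) {
		if _, err := fmt.Fprintf(w, format, args...); err != nil {
			logf("response write failed: %v", err)
		}
	}
}

// brokenWriter simulates a client that has already disconnected.
type brokenWriter struct{}

func (brokenWriter) Write([]byte) (int, error) {
	return 0, errors.New("broken pipe")
}

// demoFailures counts how many failed writes reach the logger.
func demoFailures() int {
	failures := 0
	writef := newWritef(brokenWriter{}, func(string, ...any) { failures++ })
	writef("count=%d\n", 0)
	writef("done\n")
	return failures
}

func main() {
	writef := newWritef(os.Stdout, log.Printf)
	writef("count=%d\n", 3)
	fmt.Println("failures logged on a broken writer:", demoFailures())
}
```

Keeping the logger as a parameter means each call site stays a one-liner while the error is still observed once, in one place.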
TestPushHandlerEndToEnd staged a synthetic image (layer + config + manifest) into containerd's content store via content.WriteBlob, but wrote the manifest blob with no labels. containerd's GC only walks references via blob labels of the form containerd.io/gc.ref.content.*, so the image record kept the manifest reachable but the manifest's config + layer were unreferenced and eligible for collection. When a GC pass fired between staging and the /push call, the layer blob was deleted and containerd's resolver reported "content digest sha256:48a9...f88a: not found" (the layer's digest) — passing locally where GC happened not to run in the ~0.3s window, failing in CI where the schedule landed differently. Add gc.ref.content.config and gc.ref.content.l.0 labels to the manifest write so the full image graph is reachable from the image record.
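The label set described above can be built with a small helper. This is a sketch with a hypothetical name (`manifestGCLabels`) and placeholder digests; handing the map to the manifest write through containerd's `content.WithLabels` option is an assumption about how the test wires it up.

```go
package main

import (
	"fmt"
	"sort"
)

// manifestGCLabels builds the labels that mark a manifest's config and layer
// blobs as referenced, using the containerd.io/gc.ref.content.* key scheme.
func manifestGCLabels(configDigest string, layerDigests []string) map[string]string {
	labels := map[string]string{
		"containerd.io/gc.ref.content.config": configDigest,
	}
	for i, d := range layerDigests {
		labels[fmt.Sprintf("containerd.io/gc.ref.content.l.%d", i)] = d
	}
	return labels
}

func main() {
	// Placeholder digests for illustration only.
	labels := manifestGCLabels("sha256:config-placeholder", []string{"sha256:layer0-placeholder"})
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, labels[k])
	}
}
```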
The gc.ref label fix didn't change anything — same digest still reports
"not found" mid-push. Add a diagnostic that runs Info on the layer and
manifest digests immediately after staging, before the test invokes the
HTTP push handler. The output will tell us whether the layer is visible
in the buildkit-namespace content bucket at the moment of staging:
- If Info returns NotFound: the WriteBlob path silently failed to
register the digest in the namespace (commit went through but the
blob-bucket entry wasn't created).
- If Info succeeds but push still fails: something between stage and
push (GC, lease expiry, namespace context drift) is dropping the
visibility.
Push the diagnostic to surface what CI actually sees, then revert once
the root cause is clear.
## Summary

What started as a 10-line `cfg.VM.Linux.CPUs`/`MemoryMB` wiring fix grew into the full set of changes needed to run ephpm's E2E test suite on a self-hosted ephemerd runner. The test job spins up a KIND cluster (kindest/node) via dind inside the Hyper-V Linux VM and asserts that an HTTP request from a test pod reaches the ephpm Service — every layer of that stack had something missing.

The root-cause fix was a single ~12 KB kernel module (`xt_nat.ko`); everything else was needed to even get to the point where its absence was the only thing left blocking. See `## How we got here` below for the layered history if you want it.

## What changed, grouped

### VM config wiring (the original intent — 9593519, a923d5e)

- Thread `cfg.VM.Linux.CPUs`/`MemoryMB` from parsed config through `startContainerRuntime` into `vm.LinuxVMConfig` on macOS and Windows. Previously parsed but never reached the VM.
- Add `disk_size_gb` config and grow the VHDX on boot.

### dind subsystem upgrades for KIND (188fa49, 9f64f2e, a3e62de, 2a77f76, 1ff999d, aafad9d, bfc40d7, 987ac2f)

- Mount tmpfs at /var/lib/containerd for `kindest/node` (kindest/node refuses to start on overlayfs).
- Bind-mount `/etc/resolv.conf` with public DNS in containers; mount r/w for the kindest/node entrypoint.
- `docker exec -i` stdin handling — io.Pipe to avoid FIFO race → overlayfs file redirect → streaming-conn detection — needed because `kubectl exec` and `docker exec -i` use stdin streaming through the dind exec path.
- `/etc/hostname` bind-mount, container labels + filter parsing in `/containers/json`, per-container port-forward proxy via setns into the runner's net namespace.

### VM kernel modules for kube-proxy (e80ab96, fbc1e68, 6d67149, 88ebc24, 52bbad5 ← the one that mattered)

The kindest/node cluster came up but kube-proxy couldn't install Service rules. Each missing module's absence caused the same symptom (`error sending request` from the test pod) but a different `iptables-restore` failure underneath:

- `iptable_filter`, `iptable_mangle` — base tables.
- `br_netfilter` + `bridge-nf-call-iptables` sysctl — needed for iptables NAT to fire on bridge traffic; added a retry loop because the modprobe + sysctl race could leave the sysctl path missing.
- `ipt_REJECT` + `nf_reject_ipv4` — kube-proxy's first rule rejects empty-endpoint Services.
- `xt_nat` — registers the DNAT/SNAT/NETMAP iptables targets. Without it, `/proc/net/ip_tables_targets` lacks DNAT, `iptables-restore` aborts atomically on the first `KUBE-SVC -j DNAT` line, and the entire hundred-rule transaction is lost — every Service ClusterIP black-holes. `nf_nat` (the NAT engine) was already loaded, but the iptables-side glue lives in `xt_nat.ko` as a separate ~12 KB module. This was the last blocker.

### Runtime + build (d3824a6, fb96345)

- Bind-mount `/etc/hosts` so `localhost` resolves via files in runner containers.
- `ephemerd-linux` is now separated from the initrd and appended via runtime cpio at boot, so Linux code changes no longer invalidate the initrd cache.

### Debug tooling (52eefca, plus the existing `cmd/ctrdebug/`)

- `cmd/ctrdebug` connects to the in-VM containerd over TCP for `list`/`exec` of containers across namespaces. `exec` routes through the in-VM HTTP debug-exec server on `containerd_port+2` because containerd's FIFO cio can't cross the Windows host ↔ Linux VM boundary (named pipes vs Linux FIFOs).
- Added `/namespaces` and `ns=*` to the `/list` endpoint so callers don't need to grep the ephemerd log for the per-job dind namespace (`ephemerd-dind-<runner-name>`).
- Finding `xt_nat` required live inspection of `/proc/net/ip_tables_targets` and `lsmod` inside a running `kindest/node` — there was no other way.

### Tests (2fd839b, 739f339, 2b37cf7)

- Unit coverage for the `pkg/dind/portforward` proxy + `pkg/vm/initrd` cpio append.
- `pkg/dind/registry_e2e_test.go` (pre-existing `TestPushHandlerEndToEnd`) was failing in CI with `content digest sha256:48a9...: not found` (the synthetic layer). Two attempts:
  - `739f339` — added `containerd.io/gc.ref.content.*` labels to the staged manifest so containerd's GC would keep config + layer reachable. Did not change the failure.
  - `2b37cf7` — added a post-stage `cs.Info(ctx, digest)` diagnostic that confirms the layer is visible in the buildkit namespace right after staging. CI went green with this present, but with `t.Logf` suppressed on success we can't tell from one run whether the read serves as an implicit barrier that prevents the flake or whether the test just happened to pass. Worth watching for re-flakes; may still need a real fix (probably an explicit `leases.Create` around the staging block).

### CI hygiene (22ae2b5)

- errcheck findings in `cmd/ephemerd/debugexec_linux.go` (unchecked `fmt.Fprintf`/`Fprintln` on HTTP response writes) — wrapped via a `writef` closure that logs to debug on failure.
- staticcheck SA9003 empty branch in `pkg/dind/portforward_test.go` — converted into a proper `t.Errorf` for unexpected error types.

## How we got here
Leaving this in because the next person to extend this is going to hit the same kind of cascading failure, and the debug pattern matters:
1. `kindest/node` started, but the in-cluster API server's Service ClusterIP was unreachable from any pod. CoreDNS readiness 503, e2e test pod `error sending request for url`.
2. `kubectl exec` into kindest/node was hard because containerd's FIFO cio can't bridge Windows ↔ Linux VM — built `ctrdebug` + an in-VM HTTP debug-exec server (`/list`, `/exec`, later `/namespaces`).
3. kube-proxy logs showed `iptables-restore` aborting on `Extension REJECT revision 0 not supported`. Added `ipt_REJECT` + `nf_reject_ipv4`. Symptom stayed identical at the pod level, but kube-proxy's error changed to `Extension DNAT revision 0 not supported`.
4. `cat /proc/net/ip_tables_targets` listed `REJECT REDIRECT MARK MASQUERADE ERROR` — no DNAT. `lsmod | grep nat` showed `nf_nat` (engine) loaded but no `xt_nat` (target registration). Added it. DNAT appeared, `iptables-legacy -t nat -A ... -j DNAT` worked manually, and kube-proxy installed the full `KUBE-SVC-*`/`KUBE-SEP-*` chain hierarchy.
5. … "Call to undefined function ephpm_kv_get()" — both outside ephemerd's scope.

## Test plan
- … `KUBE-SVC-*`/`KUBE-SEP-*` chains with DNAT targets; ~14 e2e test suites pass cleanly. Remaining failures are isolated to ephpm app code.
- Verified live in `kindest/node`: `/proc/net/ip_tables_targets` lists DNAT, `lsmod` shows `xt_nat`, manual `iptables-legacy -t nat -A ... -j DNAT` succeeds.
- `pkg/dind` tests pass; Build (Linux), Build (Windows amd64), Build (macOS arm64) all green. Build (Linux arm64) is queued waiting for the macOS-host ephemerd's JIT poller to come online — its Vz Linux VM is aarch64 and `pkg/scheduler/scheduler.go:1225-1230` advertises `[self-hosted, linux, arm64]` based on host `goruntime.GOARCH`, so once the macOS ephemerd is polling, it will claim the job.
- … `[vm.linux] cpus = 6, memory_mb = 8192`, confirm the boot log reflects configured values.
- … `mage build` and smoke run — unaffected (VM-related params ignored).

## Caveats / follow-ups
- The `TestPushHandlerEndToEnd` Info-diagnostic fix may be flake-masking rather than fixing. If it re-flakes, the real fix is probably an explicit `leases.Create` in the test to hold staged blobs alive across the staging→push boundary.
- `Build (Linux arm64)` depends on the macOS-host ephemerd being up to register a JIT runner with the matching labels. If that host is offline, the job stays queued.