
fix: pre-release validation fixes for kubeadm, networking, and CTK #716

Merged
ArangoGutierrez merged 19 commits into NVIDIA:main from ArangoGutierrez:release-validation
Mar 11, 2026

Conversation

@ArangoGutierrez
Collaborator

Summary

  • Fix kubeadm init health check failures on K8s v1.33+ by using private IP for controlPlaneEndpoint during init (public DNS is not routable from within the instance)
  • Fix HA cluster NLB chicken-and-egg: use local IP for init, NLB DNS for join
  • Fix inter-node and self-referencing security group rules for cluster and single-node networking
  • Fix cri-dockerd socket errors when Docker is the runtime (CTK restarts dockerd, crashing cri-dockerd)
  • Fix nvidia-container-runtime missing at /usr/bin when CTK is built from git source
  • Fix duplicate apiServer: block in kubeadm config when feature gates (DRA) are enabled
  • Fix tilde expansion in SSH key paths across all providers
  • Fix kubeconfig server URL rewrite to use public endpoint after cluster provisioning
  • Fix NLB target group name truncation to AWS 32-char limit

Test plan

  • All unit tests pass (go test ./... excluding e2e)
  • 20/22 e2e tests pass across all tiers:
    • Tier 1 (Core): Legacy, Default (Docker), Kernel
    • Tier 2 (K8s Sources): Git, KIND Git, Latest
    • Tier 3 (Advanced): DRA, CTK Git
    • Tier 4 (RPM): Rocky 9, AL2023, Fedora 42 (containerd + crio)
    • Tier 5 (Clusters): 3-GPU, 1CP+3GPU dedicated, Minimal
    • Tier 6 (Advanced Clusters): HA, RPM Rocky 9, RPM AL2023
  • 2 known failures are upstream NVIDIA repo issues on AL2023+Docker (not our bug)
  • 3 previously-failing tests (Default Docker, CTK Git, DRA) re-verified with fresh e2e runs

kubeadm v1.33+ validates API server health via the control-plane-endpoint
URL. When set to EC2 public DNS, this can time out because the public DNS
isn't routable from within the instance during init. Use the node's private
IP for init and include the public DNS in cert SANs for external access.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
kubeadm v1.33+ health checks reach the API server via the control-plane-
endpoint. With NLB, this creates a deadlock: NLB can't route until API
server is healthy, but kubeadm won't report healthy until NLB routes.
Fix: init with local private IP, include NLB DNS in cert SANs, then
update kubeadm-config and admin.conf to reference NLB after init.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
nvidia-container-toolkit provides nvidia-ctk but may not provide the
nvidia-container-runtime binary on RPM distros. Container runtimes
(containerd, CRI-O) require it. Create symlink if missing.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
AWS enforces a 32-character limit on target group names. Long holodeck
cluster names could exceed this, causing NLB creation to fail.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Without an explicit version, RPM tests pull the latest K8s release which
may be incompatible with the test infrastructure.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
determineControlPlaneEndpoint returned PrivateIP when no LoadBalancerDNS,
making kubeconfig unreachable from outside the VPC. Use PublicIP instead.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Existing VPC CIDR rules only cover specific ports. Self-referencing SG
rules allow all TCP/UDP/ICMP between instances in the same security group,
covering webhooks, NodePort, IPIP (Calico), and future K8s services.
Uses explicit protocols (not -1) for stricter compliance.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Consistent with cluster SG: adds TCP/UDP/ICMP self-referencing rules
so instances in the same SG can communicate freely.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Kubeconfig downloaded from remote nodes contains private IPs in the
server URL, making it unusable from outside the VPC. Add structured
YAML parsing to rewrite the server URL to the public IP or NLB DNS.

Update all GetKubeConfig callers with the new desiredServerURL parameter.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
os.ReadFile does not expand ~ in paths. ExpandPath replaces a leading
tilde with the user's home directory. Used for privateKey paths.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
All three SSH connection sites now expand ~ to the user's home
directory, allowing users to specify paths like ~/.ssh/my-key.pem.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
kubeadm v1.33+ validates the API server via control-plane-endpoint
during init. When this is set to a public IP or DNS name, the health
check times out because the endpoint isn't routable from within the
instance during bootstrap. Previously this was only fixed for HA mode
with NLB; now all cluster configurations use the local IP for init.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
For non-HA clusters, all nodes are in the same VPC so the private IP
is always routable. Using the public IP for kubeadm join fails because
intra-VPC traffic via public IPs goes through the IGW and may time out.
External access (kubeconfig) is handled by RewriteKubeConfigServer.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
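
The endpoint rules spread across the commits above (private IP for init, NLB DNS or private IP for join, public IP or NLB DNS for the kubeconfig) can be summarised in a short Go sketch. The `Node` and `Endpoints` types and `selectEndpoints` are hypothetical names used for illustration only, not the actual `determineControlPlaneEndpoint` code, and the addresses are placeholders.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a provisioned control-plane node.
type Node struct {
	PrivateIP string
	PublicIP  string
}

// Endpoints groups the address used at each stage of provisioning.
type Endpoints struct {
	Init       string // passed to kubeadm init on the first control-plane node
	Join       string // used by kubeadm join on the remaining nodes
	Kubeconfig string // written into the kubeconfig handed to the user
}

// selectEndpoints follows the rules described in the commit messages:
// init always uses the local private IP; with an NLB, join and kubeconfig
// use its DNS name; without one, join stays on the private IP and the
// kubeconfig is rewritten to the public IP.
func selectEndpoints(cp Node, loadBalancerDNS string) Endpoints {
	e := Endpoints{Init: cp.PrivateIP}
	if loadBalancerDNS != "" {
		e.Join = loadBalancerDNS
		e.Kubeconfig = loadBalancerDNS
		return e
	}
	e.Join = cp.PrivateIP
	e.Kubeconfig = cp.PublicIP
	return e
}

func main() {
	cp := Node{PrivateIP: "10.0.1.10", PublicIP: "203.0.113.7"}
	fmt.Printf("non-HA: %+v\n", selectEndpoints(cp, ""))
	fmt.Printf("HA:     %+v\n", selectEndpoints(cp, "example-nlb.elb.amazonaws.com"))
}
```

Keeping the three addresses in one value makes it harder for a caller to reuse the init endpoint where the external one is needed.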
When Docker is the container runtime, CTK installation restarts dockerd
between provisioning steps. cri-dockerd loses its Docker connection and
crashes. With systemd's StartLimitBurst=3 in 60s, it may not auto-recover
by the time kubeadm runs, resulting in "no such file or directory" errors
for /run/cri-dockerd.sock.

Add a systemctl reset-failed + restart for cri-docker.service before
kubeadm init when Docker is the runtime. Also fix hardcoded
cri-dockerd.sock references in diagnostics and kubeadm reset to use the
template's CriSocket variable, making them work with all runtimes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
nvidia-ctk runtime configure hardcodes /usr/bin/nvidia-container-runtime
in the container runtime config. When CTK is built from source, the binary
gets installed to /usr/local/bin, causing containerd to fail with
"fork/exec /usr/bin/nvidia-container-runtime: no such file or directory"
for all pods including control plane components.

Split the symlink logic into two steps:
1. Ensure nvidia-container-runtime is in PATH (symlink from nvidia-ctk)
2. Unconditionally ensure /usr/bin/nvidia-container-runtime exists

Applied to both git and latest CTK templates.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
… gates

When feature gates are enabled (e.g. DRA), the kubeadm config template
already contains an apiServer: block with extraArgs. The certSANs
injection via sed was unconditionally appending a new apiServer: block,
creating duplicate YAML keys. kubeadm uses the last occurrence, so the
certSANs were silently ignored.

Now detect whether apiServer: already exists in the config and inject
certSANs into it rather than creating a duplicate block.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings March 10, 2026 15:13

Copilot AI left a comment

Pull request overview

This PR updates Holodeck’s Kubernetes provisioning flow and related tooling to better support newer kubeadm behaviors (endpoint reachability during init), improve NVIDIA Container Toolkit runtime compatibility, and refresh test fixtures/config usage.

Changes:

  • Add kubeadm init endpoint/server URL handling (private-IP init + SANs, NLB DNS readiness, kubeconfig server rewriting).
  • Add ~ path expansion utility and apply it to CLI SSH key handling (with partial adoption in provisioner).
  • Tighten AWS resource-name handling to respect 32-char NLB/target-group limits and broaden intra-SG traffic rules for cluster functionality.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/data/test_rpm_rocky9_crio.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_rocky9_containerd.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_fedora42_crio.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_fedora42_containerd.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_al2023_docker.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_al2023_crio.yml | Pin Kubernetes install source to release and set v1.31.0. |
| tests/data/test_rpm_al2023_containerd.yml | Pin Kubernetes install source to release and set v1.31.0. |
| pkg/utils/path.go | Introduce ExpandPath helper for leading-tilde expansion. |
| pkg/utils/path_test.go | Add unit tests for ExpandPath. |
| pkg/utils/kubeconfig.go | Add kubeconfig server rewrite helper and extend GetKubeConfig to optionally rewrite server URL. |
| pkg/utils/kubeconfig_test.go | Add unit tests for kubeconfig server rewriting. |
| pkg/provisioner/templates/kubernetes.go | Improve kubeadm init robustness (cri-dockerd recovery, private-IP init endpoint, cert SAN injection, CRI socket usage in diagnostics/reset). |
| pkg/provisioner/templates/kubernetes_test.go | Update template assertion to match new --control-plane-endpoint behavior. |
| pkg/provisioner/templates/kubeadm_cluster.go | Add NLB DNS resolution wait, private-IP init endpoint + SANs, and post-init config updates for HA. |
| pkg/provisioner/templates/container-toolkit.go | Ensure nvidia-container-runtime exists via symlinks for newer toolkit layouts; apply in templates. |
| pkg/provisioner/templates/common.go | Extend toolkit verification to ensure runtime binary presence (now also creates a symlink). |
| pkg/provisioner/provisioner.go | Add inline ~ expansion for SSH key path in provisioner connection logic. |
| pkg/provisioner/cluster.go | Clarify/control control-plane endpoint selection for internal cluster comms vs external kubeconfig access. |
| pkg/provisioner/cluster_test.go | Adjust test data to include PublicIP while expecting private endpoint selection. |
| pkg/provider/aws/nlb.go | Truncate NLB and target group names to meet AWS 32-char limit. |
| pkg/provider/aws/create.go | Add self-referencing SG ingress to allow intra-SG traffic (TCP/UDP/ICMP). |
| pkg/provider/aws/cluster.go | Add self-referencing SG ingress for multinode cluster communications. |
| cmd/cli/status/status_test.go | Update instance ID fixtures (shorter/hex-like IDs). |
| cmd/cli/get/get.go | Update GetKubeConfig call signature (new parameter). |
| cmd/cli/dryrun/dryrun.go | Use utils.ExpandPath for SSH key path handling in dryrun SSH connect helper. |
| cmd/cli/delete/delete_test.go | Update instance ID fixtures across delete command tests. |
| cmd/cli/create/create.go | Update GetKubeConfig call signature (new parameter). |
| cmd/cli/common/host.go | Use utils.ExpandPath when establishing SSH connections. |
| cmd/action/ci/entrypoint.go | Update GetKubeConfig call signature (new parameter). |

Comment on lines +41 to +42
    // ~/foo/bar → /home/user/foo/bar
    return filepath.Join(home, path[2:]), nil
Copilot AI Mar 10, 2026

ExpandPath currently expands any string starting with "~" by joining home with path[2:], which only works for "~/...". For inputs like "~user/..." or "~foo" this produces an incorrect path (drops the second character) instead of leaving it unchanged or handling the username form explicitly. Consider restricting expansion to exactly "~" and "~/" (and maybe "~\" on Windows), and otherwise return the original path unchanged (or an error) to avoid surprising behavior.

Suggested change
-   // ~/foo/bar → /home/user/foo/bar
-   return filepath.Join(home, path[2:]), nil
+   // Only expand the current user's home directory for "~/" or "~\".
+   // Other forms like "~user" are returned unchanged.
+   if strings.HasPrefix(path, "~/") || strings.HasPrefix(path, "~\\") {
+       // ~/foo/bar → /home/user/foo/bar
+       return filepath.Join(home, path[2:]), nil
+   }
+   // Unknown tilde form; return unchanged to avoid corrupting the path.
+   return path, nil

Comment on lines +454 to +463
    if strings.HasPrefix(keyPath, "~") {
        home, homeErr := os.UserHomeDir()
        if homeErr != nil {
            return nil, fmt.Errorf("expanding key path: %w", homeErr)
        }
        if keyPath == "~" {
            keyPath = home
        } else {
            keyPath = filepath.Join(home, keyPath[2:])
        }
Copilot AI Mar 10, 2026

This tilde-expansion logic assumes any "~" prefix is the "~/" form and slices keyPath[2:], which breaks paths like "~user/..." (and will silently compute the wrong path). Since pkg/utils now provides ExpandPath(), it would be safer and less error-prone to call that helper here (and avoid duplicating subtly different implementations across the codebase).

Comment on lines +48 to +55
    // AWS load balancer names are limited to 32 characters.
    const nlbSuffix = "-nlb"
    maxNLBNameLen := 32 - len(nlbSuffix)
    nlbBaseName := p.ObjectMeta.Name
    if len(nlbBaseName) > maxNLBNameLen {
        nlbBaseName = nlbBaseName[:maxNLBNameLen]
    }
    lbName := nlbBaseName + nlbSuffix
Copilot AI Mar 10, 2026

Truncating the environment name to fit the 32-char AWS limit can cause name collisions for different environments that share a long common prefix, leading to CreateLoadBalancer/CreateTargetGroup failures that are hard to diagnose. Consider incorporating a short, deterministic suffix (e.g., a hash of the full env name) when truncation is needed so names remain unique while still meeting the length constraint.
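
One way to act on this suggestion is sketched below, assuming ASCII names and a short suffix such as "-nlb". `boundedName`, the six-character SHA-256 tag, and the example environment names are all hypothetical choices for illustration, not what the PR implements.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// awsNameLimit is the 32-character limit on load balancer and target group names.
const awsNameLimit = 32

// boundedName returns base+suffix if it fits within the limit. Otherwise it
// truncates base and inserts a short hash of the full base name so that two
// long names sharing a prefix still map to different results.
// Assumes ASCII input and len(suffix) well below awsNameLimit-7.
func boundedName(base, suffix string) string {
	full := base + suffix
	if len(full) <= awsNameLimit {
		return full
	}
	sum := sha256.Sum256([]byte(base))
	tag := "-" + hex.EncodeToString(sum[:])[:6] // 7 characters including the dash
	keep := awsNameLimit - len(suffix) - len(tag)
	return base[:keep] + tag + suffix
}

func main() {
	for _, env := range []string{
		"short",
		"holodeck-release-validation-cluster-east",
		"holodeck-release-validation-cluster-west",
	} {
		name := boundedName(env, "-nlb")
		fmt.Printf("%-42s -> %s (%d chars)\n", env, name, len(name))
	}
}
```

The hash is derived from the untruncated name, so the mapping stays deterministic across runs while avoiding collisions between environments with a long common prefix.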

Comment on lines +316 to +322
    if ! command -v nvidia-container-runtime &>/dev/null; then
        holodeck_log "WARN" "nvidia-container-toolkit" \
            "nvidia-container-runtime binary not found, creating symlink from nvidia-ctk"
        local ctk_path
        ctk_path=$(command -v nvidia-ctk)
        sudo ln -sf "$ctk_path" /usr/bin/nvidia-container-runtime
    fi
Copilot AI Mar 10, 2026

holodeck_verify_toolkit is a verification helper, but it now mutates system state by creating /usr/bin/nvidia-container-runtime. This makes failures harder to detect (verification may "fix" a broken install) and introduces side effects in what looks like a pure check. Since the install templates already handle creating the symlink, consider keeping verify_toolkit as a strict check (return non-zero if missing) and doing any remediation in the installation step instead.

Suggested change
-   if ! command -v nvidia-container-runtime &>/dev/null; then
-       holodeck_log "WARN" "nvidia-container-toolkit" \
-           "nvidia-container-runtime binary not found, creating symlink from nvidia-ctk"
-       local ctk_path
-       ctk_path=$(command -v nvidia-ctk)
-       sudo ln -sf "$ctk_path" /usr/bin/nvidia-container-runtime
-   fi
+   command -v nvidia-container-runtime &>/dev/null || return 1

- Fix goimports grouping in kubeconfig.go (third-party before internal)
- Suppress gosec G304 in test file (path from t.TempDir())
- Fix MD029 ordered list prefix in custom-templates.md
- Fix MD013 line length in examples/README.md

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@coveralls

Pull Request Test Coverage Report for Build 22914416195

Details

  • 44 of 113 (38.94%) changed or added relevant lines in 6 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 49.656%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/utils/path.go | 10 | 12 | 83.33% |
| pkg/provisioner/provisioner.go | 0 | 10 | 0.0% |
| pkg/utils/kubeconfig.go | 15 | 29 | 51.72% |
| pkg/provider/aws/nlb.go | 0 | 17 | 0.0% |
| pkg/provider/aws/cluster.go | 0 | 26 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/utils/kubeconfig.go | 1 | 22.73% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 22731419912 | -0.2% |
| Covered Lines | 2891 |
| Relevant Lines | 5822 |

💛 - Coveralls

The upstream NVIDIA container toolkit repository has intermittently
broken repomd.xml GPG signatures, causing dnf metadata download
failures on Amazon Linux 2023 and other RPM-based distros.

Disable repo-level GPG check (repo_gpgcheck=0) while keeping
individual RPM package GPG verification (gpgcheck=1) intact.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
On AL2023, both containerd and CRI-O sockets are present when using
Docker or CRI-O runtimes. The legacy kubeadm init path (k8s < v1.32)
did not pass --cri-socket, causing kubeadm to fail with "found
multiple CRI endpoints". The config-file path is unaffected because
the kubeadm config already specifies the socket.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez merged commit a7fbdd5 into NVIDIA:main Mar 11, 2026
23 checks passed

3 participants