Harden worker private-network wait and fix CCM cluster CIDR #3

Merged
roblen45 merged 60 commits into develop from fix/worker-private-net-wait-and-ccm-cidr
Jun 30, 2026

@roblen45

Problem

During the Hetzner setup, both workers failed with "Timeout when waiting for 10.0.1.2:6443". The master was healthy (the CCM initialized node 10.0.1.2), but the workers could not reach the master over the private network: the private interface enp7s0 did not (yet) have a 10.0.x.x address, and the wait_for ran before any interface check.

Changes

  • k3s_worker: A new task ensures, before the master wait, that enp7s0 has a cluster-subnet IP (best-effort ip link up + dhclient, up to 30 × 5 s). On failure, ip addr and ip route are printed as diagnostics. The silent 300 s wait_for is replaced by a 120 s wait with a descriptive error message.
  • k3s_ccm: CCM manifest cluster CIDR changed from 10.244.0.0/16 to 10.42.0.0/16 (the K3s default), which removes the "route CIDR ... not contained within cluster CIDR" warning.
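
The retry logic described for k3s_worker can be sketched in Python as follows. This is only an illustration of the idea, not the Ansible task from this PR: the function names, the injected `get_addresses` callable, and the subnet `10.0.0.0/16` (inferred from the "10.0.x.x" wording) are assumptions. Only the 30 × 5 s budget comes from the description.

```python
import ipaddress
import time
from typing import Callable, Iterable, Optional


def first_ip_in_subnet(addresses: Iterable[str], subnet: str) -> Optional[str]:
    """Return the first address that lies inside `subnet`, or None."""
    network = ipaddress.ip_network(subnet)
    for addr in addresses:
        if ipaddress.ip_address(addr) in network:
            return addr
    return None


def wait_for_subnet_ip(
    get_addresses: Callable[[], Iterable[str]],  # hypothetical: returns the IPs currently on enp7s0
    subnet: str = "10.0.0.0/16",                 # assumed cluster subnet
    retries: int = 30,                           # 30 attempts, as in the PR description
    delay: float = 5.0,                          # 5 s between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the interface has an address in `subnet`, else raise TimeoutError."""
    for attempt in range(1, retries + 1):
        found = first_ip_in_subnet(get_addresses(), subnet)
        if found is not None:
            return found
        if attempt < retries:
            sleep(delay)
    raise TimeoutError(f"no address in {subnet} after {retries} attempts")
```

Running this check before waiting on 10.0.1.2:6443 means a missing private address surfaces as its own error instead of as a generic connection timeout.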

🤖 Generated with Claude Code

process-engine-ci and others added 30 commits May 20, 2025 07:28
# Changelog v1.7.0 (20.05.2025)

This changelog covers the changes between the following versions: [v1.6.1 and v1.7.0](v1.6.1...v1.7.0).

Further notes can be found in the changelog of the previous version: [v1.6.1](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.6.1).

## Merged Pull Requests

- none

[skip ci]
# Changelog v1.8.0 (20.05.2025)

This changelog covers the changes between the following versions: [v1.7.0 and v1.8.0](v1.7.0...v1.8.0).

Further notes can be found in the changelog of the previous version: [v1.7.0](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.7.0).

## Merged Pull Requests

- none

[skip ci]
# Changelog v1.8.1 (20.05.2025)

This changelog covers the changes between the following versions: [v1.8.0 and v1.8.1](v1.8.0...v1.8.1).

Further notes can be found in the changelog of the previous version: [v1.8.0](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.8.0).

## Merged Pull Requests

- none

[skip ci]
# Changelog v1.8.2 (20.05.2025)

This changelog covers the changes between the following versions: [v1.8.1 and v1.8.2](v1.8.1...v1.8.2).

Further notes can be found in the changelog of the previous version: [v1.8.1](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.8.1).

## Merged Pull Requests

- none

[skip ci]
# Changelog v1.9.0 (04.07.2025)

This changelog covers the changes between the following versions: [v1.8.2 and v1.9.0](v1.8.2...v1.9.0).

Further notes can be found in the changelog of the previous version: [v1.8.2](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.8.2).

## Merged Pull Requests

- none

[skip ci]
# Changelog v1.9.1 (07.07.2025)

This changelog covers the changes between the following versions: [v1.9.0 and v1.9.1](v1.9.0...v1.9.1).

Further notes can be found in the changelog of the previous version: [v1.9.0](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.9.0).

## Merged Pull Requests

- none

[skip ci]
Only install the External Secrets Operator with 1Password Connect when
onepassword_credentials_json variable is set. Also fix hardcoded SSH key
paths in outputs.tf.
- Add processcube_api_key terraform variable
- Create install_cuby ansible role that sets up:
  - processcube namespace
  - regcred ImagePull secret for marketplace.processcube.io
  - processcube-api-key secret with the API key
- Add cuby_domain terraform variable for ingress configuration
- Add cuby-operator.yaml.j2 template with all K8s resources
- Deploy ServiceAccount, RBAC, ConfigMap, PVC, Deployment, Service, Ingress
- Wait for rollout to complete before finishing
- Remove Ingress resource from cuby-operator template
- Remove cuby_domain terraform variable
- Update info output with port-forward command
Add NODE_EXTRA_CA_CERTS env var pointing to the Kubernetes service
account CA certificate to trust the self-signed K3s API server cert.
- Get LoadBalancer IP from ingress-nginx-controller service
- Create Ingress with cuby.<ip>.nip.io domain
- Enable TLS with letsencrypt-prod cluster issuer
- Save cuby_domain to local file during Ansible run
- Add terraform output that reads the domain from file
- Add generated files to .gitignore
- Change cluster-issuer to letsencrypt-production
- Add become: false for local file copy task
Replace hardcoded INSTALL_K3S_VERSION with INSTALL_K3S_CHANNEL=stable
so K3s always installs the latest stable release. The old pinned version
v1.28.5+k3s1 was no longer available for download, causing worker node
setup to fail.
Ubuntu 22.04 (jammy) repositories are no longer available on Hetzner
mirrors, causing apt cache updates to fail on worker nodes. Also rename
netcat package to netcat-openbsd for Ubuntu 24.04 compatibility.
Hetzner mirrors over IPv6 are unreachable from the servers, causing apt
update to hang indefinitely. Adding ForceIPv4 apt config before package
installation resolves the connectivity issue.
The apt ForceIPv4 fix only affects apt. Other tools like Python urllib
(used by Ansible get_url) still try IPv6 first, causing downloads to
fail with "Network is unreachable". Setting gai.conf precedence makes
all applications prefer IPv4.
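
As a rough illustration of the gai.conf change, the sketch below adds the IPv4-preference line idempotently. The commit does not show the exact line it writes, so the `precedence ::ffff:0:0/96  100` entry (the setting that ships commented out in glibc's stock /etc/gai.conf) and the helper name `ensure_line` are assumptions. The function works on the file content as a string and leaves reading and writing the file to the caller.

```python
# Assumed setting: ranks IPv4-mapped addresses first so getaddrinfo() prefers IPv4.
GAI_IPV4_PRECEDENCE = "precedence ::ffff:0:0/96  100"


def ensure_line(text: str, line: str) -> str:
    """Return `text` with `line` appended unless an identical active line exists."""
    lines = text.splitlines()
    if any(existing.strip() == line for existing in lines):
        return text  # already present, so leave the content untouched
    lines.append(line)
    return "\n".join(lines) + "\n"
```

A commented-out copy of the line does not count as present, so the active line is still appended. Calling the function a second time returns its input unchanged.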
Robin Lenz and others added 28 commits March 5, 2026 08:35
The modprobe ansible module requires the community.general collection
which is not installed. Use shell modprobe command instead.
The sysctl and ufw ansible modules require the community.general
collection which is not installed. Replace with equivalent shell
commands using the ufw and sysctl CLI tools.
Migrates routing from Ingress resources (nginx) to HTTPRoutes with a
Gateway API setup using Traefik v3, including cert-manager upgrade to
v1.20.2 and corresponding Ansible playbook updates for Hetzner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move dnsPolicy under deployment section — top-level dnsPolicy is not
allowed in traefik chart v39.x schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use standard Traefik internal ports (8000/8443) instead of 80/443 —
the chart template validates that port 8000 is declared. Also remove
hostNetwork since it's not needed with a Hetzner LoadBalancer service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fresh cluster image pulls + DaemonSet rollout exceed the 5m limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove --disable traefik from K3s install. Install Gateway API CRDs and
write a HelmChartConfig in k3s_master so K3s manages Traefik lifecycle
automatically. Remove manual helm install from k3s_addons and update all
namespace references from traefik to kube-system.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Traefik's Helm chart installs the CRDs itself via the traefik-crd
release. Installing them manually beforehand causes Helm ownership
conflicts (missing app.kubernetes.io/managed-by label).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Set namespacePolicy: All on web/websecure listeners so HTTPRoutes
  from any namespace can attach to the Gateway
- Annotate Gateway with cert-manager cluster-issuer for auto TLS
- Fix ClusterIssuer solver to reference traefik Gateway in kube-system

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The chart template requires port to be explicit when overriding
listener config. Use entrypoint ports (8000/8443), not service
ports (80/443).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use raw Gateway API structure instead of namespacePolicy (not a valid
Traefik chart field). Add required protocol field to each listener.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- HelmChartConfig: enable kubernetesIngress, disable kubernetesGateway,
  remove all Gateway API config (gatewayClass, gateway, listeners)
- cert-manager ClusterIssuers: use ingress HTTP-01 solver (class: traefik)
  instead of gatewayHTTPRoute solver
- Remove shared Gateway resource creation from k3s_addons

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Traefik v3 Helm chart installs Gateway API CRDs via traefik-crd
subchart even when kubernetesGateway is disabled. Cuby detects the
CRDs and fails if no Gateway object exists. Delete them since we
only use the Ingress controller.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Traefik HelmChartConfig

Annotating the Traefik service after creation caused a CCM reconciliation
race condition (location annotation conflicts with an already-provisioned LB).
Moving annotations into the HelmChartConfig ensures the service starts with
the correct Hetzner annotations so the CCM can provision the LoadBalancer
correctly on first reconcile.

Also adds diagnostic output (CCM logs + kubectl describe) when the LB wait
times out, so failures are easier to debug.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
k3s-agent sometimes takes longer to stabilize after install. Replace direct
systemd start with a wait-loop (18x10s) that verifies `systemctl is-active`
and collects journalctl logs on failure, so flaky joins are retried and
failures are visible in the output.

Fix Verify Cluster to use awk for exact Ready-state matching instead of
grep -v NotReady, and add kubectl describe output for any NotReady nodes
to make failures self-diagnosing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
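
The exact-match idea behind the awk change can be shown with a small Python sketch. It assumes input shaped like `kubectl get nodes --no-headers` (node name in the first column, status in the second). The function name and the sample data are hypothetical and are not taken from the playbook.

```python
from typing import List


def nodes_not_ready(kubectl_output: str) -> List[str]:
    """Return names of nodes whose STATUS column is not exactly 'Ready'."""
    not_ready = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, status = fields[0], fields[1]
        if status != "Ready":  # exact comparison on the status column
            not_ready.append(name)
    return not_ready
```

A substring filter such as grep -v NotReady treats every line that lacks the text "NotReady" as healthy. Comparing the status column directly also flags values such as "Unknown" or "Ready,SchedulingDisabled".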
Hardcoded network_zone = "eu-central" does not always match the actual
zone of the configured location (e.g. hel1 may be in a different zone).
Use a hcloud_location data source to derive the correct network_zone from
var.location, ensuring the private network subnet always matches the LB
location and the CCM can attach the LoadBalancer to the network.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace deprecated `class: traefik` with `ingressClassName: traefik` in
both letsencrypt-staging and letsencrypt-production ClusterIssuers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bump default from v1.20.0 to latest stable v1.31.0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Delete the old ClusterRoleBinding system:hcloud-cloud-controller-manager
before applying the manifest, as v1.28.0 renamed it to the :restricted
suffix. The roleRef field is immutable, so kubectl apply would fail on
existing clusters without this cleanup step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without an explicit network annotation, the Hetzner CCM could attach the
LoadBalancer to the wrong network (e.g. an existing processcube-cluster
network) when multiple clusters share the same Hetzner project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On Ubuntu 24.04, multipathd is installed by default and intercepts new
SCSI devices before udev can create the /dev/disk/by-id/scsi-0HC_Volume_*
symlinks that the Hetzner CSI driver depends on. This causes FailedMount
errors with "The file does not exist" even when Hetzner reports the volume
as successfully attached.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Workers timed out for 300s on the master API (10.0.1.2:6443) when the
private interface enp7s0 had not yet received its 10.0.x.x address, since
the wait ran before any interface check. Add a pre-wait task that ensures
enp7s0 is up with a cluster-subnet IP (best-effort link up + dhclient),
emit interface/route diagnostics on failure, and replace the silent 300s
wait_for with a 120s one carrying an actionable message.

Also align the CCM manifest cluster CIDR (10.244.0.0/16) with the K3s
default (10.42.0.0/16) to remove the route-controller CIDR-mismatch
warning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DZPTZ6d76eLgCpw8FheVgq
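
The CIDR mismatch behind the route-controller warning can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch: the per-node range `10.42.1.0/24` is an illustrative value (K3s assigns /24 node ranges out of 10.42.0.0/16 by default), and the helper name is hypothetical.

```python
import ipaddress


def route_within_cluster(route_cidr: str, cluster_cidr: str) -> bool:
    """True if the node's route CIDR is fully contained in the cluster CIDR."""
    route = ipaddress.ip_network(route_cidr)
    cluster = ipaddress.ip_network(cluster_cidr)
    return route.subnet_of(cluster)


# Old CCM setting: the node range lies outside the configured cluster CIDR.
print(route_within_cluster("10.42.1.0/24", "10.244.0.0/16"))  # False
# New CCM setting, matching the K3s default.
print(route_within_cluster("10.42.1.0/24", "10.42.0.0/16"))   # True
```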
@roblen45 roblen45 merged commit fb99c53 into develop Jun 30, 2026
1 check failed
@roblen45 roblen45 deleted the fix/worker-private-net-wait-and-ccm-cidr branch June 30, 2026 09:03