Harden worker private-network wait and fix CCM cluster CIDR by roblen45 · Pull Request #3 · processcube-io/ProcessCube.Deployment

roblen45 · 2026-06-30T09:03:01Z

Problem

During the Hetzner setup, both workers aborted with "Timeout when waiting for 10.0.1.2:6443". The master was healthy (the CCM had initialized node 10.0.1.2), but the workers could not reach the master over the private network: the private interface enp7s0 did not (yet) have a 10.0.x.x address, and the wait_for ran before any interface check.

Changes

k3s_worker: A new task ensures, before the master wait, that enp7s0 has a cluster-subnet IP (best-effort ip link up + dhclient, up to 30× 5s). On failure, ip addr/ip route are printed as diagnostics. The silent 300s wait_for is replaced by a 120s wait with a descriptive error message.
k3s_ccm: CCM manifest cluster CIDR 10.244.0.0/16 → 10.42.0.0/16 (K3s default); this removes the "route CIDR ... not contained within cluster CIDR" warning.
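
The CIDR warning comes down to a containment check: a node's route CIDR has to lie inside the cluster CIDR configured for the CCM. The following is a minimal Python sketch of that check using only the standard-library ipaddress module. The per-node pod CIDR 10.42.1.0/24 is an assumed, illustrative value and does not come from this PR; the function name is hypothetical as well.

```python
import ipaddress


def route_within_cluster(route_cidr: str, cluster_cidr: str) -> bool:
    """Return True if route_cidr lies entirely inside cluster_cidr."""
    route = ipaddress.ip_network(route_cidr)
    cluster = ipaddress.ip_network(cluster_cidr)
    return route.subnet_of(cluster)


# Assumed per-node pod CIDR, for illustration only.
NODE_POD_CIDR = "10.42.1.0/24"

# Old CCM setting: the node route falls outside the cluster CIDR.
old_ok = route_within_cluster(NODE_POD_CIDR, "10.244.0.0/16")
# New CCM setting (K3s default): the node route is contained.
new_ok = route_within_cluster(NODE_POD_CIDR, "10.42.0.0/16")
print(old_ok, new_ok)
```

Under that assumption, the old setting fails the check and the K3s default passes it, which matches the direction of the fix described above.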

🤖 Generated with Claude Code

# Changelog v1.7.0 (20.05.2025) This changelog covers the changes between the following versions: [v1.6.1 and v1.7.0](v1.6.1...v1.7.0). Further notes can be found in the changelog of the previous version: [v1.6.1](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.6.1). ## Merged Pull Requests - none [skip ci]

# Changelog v1.8.0 (20.05.2025) This changelog covers the changes between the following versions: [v1.7.0 and v1.8.0](v1.7.0...v1.8.0). Further notes can be found in the changelog of the previous version: [v1.7.0](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.7.0). ## Merged Pull Requests - none [skip ci]

# Changelog v1.8.1 (20.05.2025) This changelog covers the changes between the following versions: [v1.8.0 and v1.8.1](v1.8.0...v1.8.1). Further notes can be found in the changelog of the previous version: [v1.8.0](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.8.0). ## Merged Pull Requests - none [skip ci]

# Changelog v1.8.2 (20.05.2025) This changelog covers the changes between the following versions: [v1.8.1 and v1.8.2](v1.8.1...v1.8.2). Further notes can be found in the changelog of the previous version: [v1.8.1](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.8.1). ## Merged Pull Requests - none [skip ci]

# Changelog v1.9.0 (04.07.2025) This changelog covers the changes between the following versions: [v1.8.2 and v1.9.0](v1.8.2...v1.9.0). Further notes can be found in the changelog of the previous version: [v1.8.2](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.8.2). ## Merged Pull Requests - none [skip ci]

# Changelog v1.9.1 (07.07.2025) This changelog covers the changes between the following versions: [v1.9.0 and v1.9.1](v1.9.0...v1.9.1). Further notes can be found in the changelog of the previous version: [v1.9.0](https://github.com/5minds/ProcessCube.Deployment/releases/tag/v1.9.0). ## Merged Pull Requests - none [skip ci]

- Add cuby_domain terraform variable for ingress configuration
- Add cuby-operator.yaml.j2 template with all K8s resources
- Deploy ServiceAccount, RBAC, ConfigMap, PVC, Deployment, Service, Ingress
- Wait for rollout to complete before finishing

The apt ForceIPv4 fix only affects apt. Other tools like Python urllib (used by Ansible get_url) still try IPv6 first, causing downloads to fail with "Network is unreachable". Setting gai.conf precedence makes all applications prefer IPv4.

Migrates routing from Ingress resources (nginx) to HTTPRoutes with a Gateway API setup using Traefik v3, including cert-manager upgrade to v1.20.2 and corresponding Ansible playbook updates for Hetzner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Use standard Traefik internal ports (8000/8443) instead of 80/443 — the chart template validates that port 8000 is declared. Also remove hostNetwork since it's not needed with a Hetzner LoadBalancer service. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove --disable traefik from K3s install. Install Gateway API CRDs and write a HelmChartConfig in k3s_master so K3s manages Traefik lifecycle automatically. Remove manual helm install from k3s_addons and update all namespace references from traefik to kube-system. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Traefik's Helm chart installs the CRDs itself via the traefik-crd release. Installing them manually beforehand causes Helm ownership conflicts (missing app.kubernetes.io/managed-by label). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Set namespacePolicy: All on web/websecure listeners so HTTPRoutes from any namespace can attach to the Gateway
- Annotate Gateway with cert-manager cluster-issuer for auto TLS
- Fix ClusterIssuer solver to reference traefik Gateway in kube-system
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- HelmChartConfig: enable kubernetesIngress, disable kubernetesGateway, remove all Gateway API config (gatewayClass, gateway, listeners)
- cert-manager ClusterIssuers: use ingress HTTP-01 solver (class: traefik) instead of gatewayHTTPRoute solver
- Remove shared Gateway resource creation from k3s_addons
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Traefik v3 Helm chart installs Gateway API CRDs via traefik-crd subchart even when kubernetesGateway is disabled. Cuby detects the CRDs and fails if no Gateway object exists. Delete them since we only use the Ingress controller. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…Traefik HelmChartConfig Annotating the Traefik service after creation caused a CCM reconciliation race condition (location annotation conflicts with an already-provisioned LB). Moving annotations into the HelmChartConfig ensures the service starts with the correct Hetzner annotations so the CCM can provision the LoadBalancer correctly on first reconcile. Also adds diagnostic output (CCM logs + kubectl describe) when the LB wait times out, so failures are easier to debug. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

k3s-agent sometimes takes longer to stabilize after install. Replace direct systemd start with a wait-loop (18x10s) that verifies `systemctl is-active` and collects journalctl logs on failure, so flaky joins are retried and failures are visible in the output. Fix Verify Cluster to use awk for exact Ready-state matching instead of grep -v NotReady, and add kubectl describe output for any NotReady nodes to make failures self-diagnosing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
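
To illustrate why exact Ready-state matching is stricter than filtering out NotReady, here is a small Python sketch of the same idea (the role itself uses awk; this is not its implementation). The sample table, node names and version strings are invented for illustration. Note that this sketch also flags cordoned nodes; whether the role does so is not stated in the commit.

```python
from typing import List


def not_ready_nodes(kubectl_output: str) -> List[str]:
    """Return names of nodes whose STATUS column is not exactly 'Ready'."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] != "Ready":
            bad.append(fields[0])
    return bad


# Invented sample in the shape of `kubectl get nodes` output.
SAMPLE = """\
NAME       STATUS                     ROLES                  AGE   VERSION
master     Ready                      control-plane,master   10m   v1.30.0+k3s1
worker-1   Ready,SchedulingDisabled   <none>                 9m    v1.30.0+k3s1
worker-2   Unknown                    <none>                 9m    v1.30.0+k3s1
"""

flagged = not_ready_nodes(SAMPLE)
# A substring filter in the spirit of `grep -v NotReady` keeps every row here,
# because none of them contains the text "NotReady".
grep_style_flagged = [l.split()[0] for l in SAMPLE.splitlines()[1:] if "NotReady" in l]
print(flagged, grep_style_flagged)
```

The exact comparison reports worker-1 and worker-2, while the substring filter reports nothing for the same input.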

Hardcoded network_zone = "eu-central" does not always match the actual zone of the configured location (e.g. hel1 may be in a different zone). Use a hcloud_location data source to derive the correct network_zone from var.location, ensuring the private network subnet always matches the LB location and the CCM can attach the LoadBalancer to the network. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Delete the old ClusterRoleBinding system:hcloud-cloud-controller-manager before applying the manifest, as v1.28.0 renamed it to the :restricted suffix. The roleRef field is immutable, so kubectl apply would fail on existing clusters without this cleanup step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without an explicit network annotation, the Hetzner CCM could attach the LoadBalancer to the wrong network (e.g. an existing processcube-cluster network) when multiple clusters share the same Hetzner project. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

On Ubuntu 24.04, multipathd is installed by default and intercepts new SCSI devices before udev can create the /dev/disk/by-id/scsi-0HC_Volume_* symlinks that the Hetzner CSI driver depends on. This causes FailedMount errors with "The file does not exist" even when Hetzner reports the volume as successfully attached. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Workers timed out for 300s on the master API (10.0.1.2:6443) when the private interface enp7s0 had not yet received its 10.0.x.x address, since the wait ran before any interface check. Add a pre-wait task that ensures enp7s0 is up with a cluster-subnet IP (best-effort link up + dhclient), emit interface/route diagnostics on failure, and replace the silent 300s wait_for with a 120s one carrying an actionable message. Also align the CCM manifest cluster CIDR (10.244.0.0/16) with the K3s default (10.42.0.0/16) to remove the route-controller CIDR-mismatch warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01DZPTZ6d76eLgCpw8FheVgq
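
The pre-wait logic described above (poll until the interface holds a cluster-subnet address, then fail with an actionable message) can be sketched in Python as follows. This is an illustration under assumptions, not the Ansible task from the role: the subnet 10.0.0.0/16 is inferred from "10.0.x.x", the function names are hypothetical, and the address reader is injected so the loop can be exercised without touching a real interface.

```python
import ipaddress
import time
from typing import Callable, Iterable, Optional


def first_cluster_ip(addresses: Iterable[str], subnet: str) -> Optional[str]:
    """Return the first address that falls inside subnet, else None."""
    net = ipaddress.ip_network(subnet)
    for addr in addresses:
        # ip_interface accepts both "10.0.1.3" and "10.0.1.3/32".
        if ipaddress.ip_interface(addr).ip in net:
            return addr
    return None


def wait_for_cluster_ip(
    read_addresses: Callable[[], Iterable[str]],
    subnet: str = "10.0.0.0/16",  # assumed from "10.0.x.x"
    retries: int = 30,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll read_addresses up to `retries` times, `delay` seconds apart."""
    for attempt in range(1, retries + 1):
        ip = first_cluster_ip(read_addresses(), subnet)
        if ip is not None:
            return ip
        if attempt < retries:
            sleep(delay)
    raise TimeoutError(
        f"no address in {subnet} after {retries} attempts; "
        "inspect `ip addr` and `ip route` on the worker"
    )


# Simulated polls: no address, public address only, then the private one.
polls = iter([[], ["203.0.113.7"], ["203.0.113.7", "10.0.1.3/32"]])
ip = wait_for_cluster_ip(lambda: next(polls), sleep=lambda _: None)
print(ip)
```

In a real run, read_addresses would wrap whatever reports the addresses on enp7s0; keeping it injectable is a design choice for testability, not something the commit prescribes.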

process-engine-ci and others added 30 commits May 20, 2025 07:28

bump version to 1.8.0

72fdea3

fix postgres tag

8226065

merge with develop

7b34ec1

Merge branch 'develop'

bb9f59e

update versions

f6e9af6

bump version to 1.9.0

332276c

fix lowcode dashboard auth

20b2afe

add hetzner deployment

ab7862d

Make External Secrets Operator installation optional

0b374c3

Only install the External Secrets Operator with 1Password Connect when onepassword_credentials_json variable is set. Also fix hardcoded SSH key paths in outputs.tf.

Add install_cuby role with ProcessCube marketplace secrets

719a77a

- Add processcube_api_key terraform variable
- Create install_cuby ansible role that sets up:
  - processcube namespace
  - regcred ImagePull secret for marketplace.processcube.io
  - processcube-api-key secret with the API key

use cuby 0.6.0-develop.4

6bff71e

Remove Cuby ingress, use port-forward instead

cf196da

- Remove Ingress resource from cuby-operator template
- Remove cuby_domain terraform variable
- Update info output with port-forward command

Fix Cuby K8s certificate issue with NODE_EXTRA_CA_CERTS

c7dfc5d

Add NODE_EXTRA_CA_CERTS env var pointing to the Kubernetes service account CA certificate to trust the self-signed K3s API server cert.

Add dynamic Ingress for Cuby using LoadBalancer IP

a3d5c6e

- Get LoadBalancer IP from ingress-nginx-controller service
- Create Ingress with cuby.<ip>.nip.io domain
- Enable TLS with letsencrypt-prod cluster issuer

Add cuby_url terraform output

284f60d

- Save cuby_domain to local file during Ansible run
- Add terraform output that reads the domain from file
- Add generated files to .gitignore

fix Save cuby_domain to local file

53adca3

Fix cluster-issuer name and become for local task

136a12b

- Change cluster-issuer to letsencrypt-production
- Add become: false for local file copy task

bump cuby image to 0.6.0-develop.6

001cf37

Fix ArgoCD installation annotation size limit error

a5099be

Use K3s stable channel instead of pinned version

df61edf

Replace hardcoded INSTALL_K3S_VERSION with INSTALL_K3S_CHANNEL=stable so K3s always installs the latest stable release. The old pinned version v1.28.5+k3s1 was no longer available for download, causing worker node setup to fail.

Upgrade server image to Ubuntu 24.04 LTS

bca9df4

Ubuntu 22.04 (jammy) repositories are no longer available on Hetzner mirrors, causing apt cache updates to fail on worker nodes. Also rename netcat package to netcat-openbsd for Ubuntu 24.04 compatibility.

Force apt to use IPv4 to fix hanging cache updates

e736e27

Hetzner mirrors over IPv6 are unreachable from the servers, causing apt update to hang indefinitely. Adding ForceIPv4 apt config before package installation resolves the connectivity issue.

Robin Lenz and others added 28 commits March 5, 2026 08:35

Replace modprobe module with shell command

bda9bb8

The modprobe ansible module requires the community.general collection which is not installed. Use shell modprobe command instead.

Replace community.general modules with shell commands

8c0293e

The sysctl and ufw ansible modules require the community.general collection which is not installed. Replace with equivalent shell commands using the ufw and sysctl CLI tools.

Remove Cuby and ArgoCD from Hetzner deployment

74e38ff

Fix Traefik Helm chart dnsPolicy schema error

bd438c3

Move dnsPolicy under deployment section — top-level dnsPolicy is not allowed in traefik chart v39.x schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Increase Traefik Helm install timeout to 10m

67e460c

Fresh cluster image pulls + DaemonSet rollout exceed the 5m limit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add port to Traefik gateway listeners in HelmChartConfig

7ec6915

The chart template requires port to be explicit when overriding listener config. Use entrypoint ports (8000/8443), not service ports (80/443). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix Traefik gateway listener config in HelmChartConfig

6025093

Use raw Gateway API structure instead of namespacePolicy (not a valid Traefik chart field). Add required protocol field to each listener. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add certificateRef to Traefik HTTPS gateway listener

bd9b38d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix trailing newline in SSH public key causing Terraform inconsistency

9f1cda0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Ignore known_hosts to prevent SSH timeout on server recreation

ec7713c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wait for SSH availability instead of static sleep before Ansible

048f1ec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Use ssh-keyscan for SSH readiness check instead of nc

99ec35d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix deprecated ingress class field in ClusterIssuer HTTP01 solvers

22569b5

Replace deprecated `class: traefik` with `ingressClassName: traefik` in both letsencrypt-staging and letsencrypt-production ClusterIssuers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Update hcloud-cloud-controller-manager to v1.31.0

6e49abb

Bump default from v1.20.0 to latest stable v1.31.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

roblen45 merged commit fb99c53 into develop Jun 30, 2026
1 check failed

roblen45 deleted the fix/worker-private-net-wait-and-ccm-cidr branch June 30, 2026 09:03

roblen45 merged 60 commits into develop from fix/worker-private-net-wait-and-ccm-cidr
