feat: add a 'no-dataplane' option which allows for valid products to be installed with 'linuxDataplane: None'#4954
feat: add a 'no-dataplane' option which allows for valid products to be installed with 'linuxDataplane: None'#4954doucol wants to merge 25 commits into
Conversation
BSD tar stops option parsing at the first non-option argument, so the trailing -O in the helm download recipe was treated as a file pattern rather than --to-stdout. On macOS this extracted a stray hack/bin/helm, wrote a 0-byte helm-<arch>-<version> via the shell redirection, and failed -- and since the empty file was left behind, subsequent runs died with "Permission denied". GNU tar permutes arguments, which is why this only broke on macOS. Move -O before the member name (portable to both tars) and drop the no-op -C/--strip-components flags. Also add .DELETE_ON_ERROR so a failed recipe can't leave a partial target that poisons later builds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ter issue in kind
…ll valid products
this address a couple of issues codex uncovered:
1. P1: Overlay-driven dataplane mode causes an infinite restart loop.
Startup detection reads only the raw default Installation in
cmd/main.go:520, but reconcile later applies the overlay Installation in
pkg/controller/installation/core_controller.go:960 before comparing
against r.dataplaneDisabled and calling osExitOverride(0) in
pkg/controller/installation/core_controller.go:1057. If linuxDataplane:
None is supplied via
overlay, the pod starts with DataplaneDisabled=false, reconcile
computes true, exits, and the next pod repeats the same startup read.
The inverse case also loops. Startup detection needs to use the same
effective spec calculation as GetInstallationSpec/reconcile.
2. P2: The new no-dataplane FV setup does not run in the intended
controller mode.
setupManager registers controllers before the no-dataplane
Installation exists and never sets ControllerOptions.DataplaneDisabled
in test/mainline_test.go:339. The no-dataplane specs create the
Installation afterward, e.g. test/no_dataplane_test.go:212, so the first
reconcile sees a mode mismatch and calls the real os.Exit(0) path from
pkg/controller/
installation/core_controller.go:1057. That makes the new FV target
either terminate the test binary or avoid exercising the corrected
no-dataplane controller set. The FV helper should create the
Installation before AddToManager, pass DataplaneDisabled: true, or
override the exit path in-process.
electricjesus
left a comment
There was a problem hiding this comment.
looks pretty solid to me! just a few comments..
| // fixed when the manager starts. If the dataplane mode has changed since the operator | ||
| // started, reboot so the correct controllers are (de)registered. This mirrors the enterprise | ||
| // switch below; controller-runtime cannot add or remove controllers from a running manager. | ||
| if instance.DeletionTimestamp.IsZero() && r.dataplaneDisabled != instance.Spec.DataplaneDisabled() { |
There was a problem hiding this comment.
This reboot might never fire on a None→Iptables switch. We boot with the dataplane off, so the IPPool controller isn't registered. Flip back to Iptables and needsIPPools goes true, but there are no pools and nothing running to create them, so I think we'd hit the "waiting for enabled IP pools" return above and bail before reaching this check. Looks like the operator would get stuck and couldn't restart itself out of it..
Could we move this block up to right after dataplaneDisabled :=, above the cni and pool gates? That direction isn't covered by the test either, so might be worth a case.
Also might be worth a return after osExitOverride(0). os.Exit kills us in prod so it doesn't bite there, but checkActive returns right after its exit for the test no-op case and this one just falls through.
| // There is no Calico dataplane, so spec.cni (which describes the CNI the | ||
| // operator integrates Felix with) must be omitted entirely. | ||
| if instance.Spec.CNI != nil { | ||
| return fmt.Errorf("spec.cni must not be set when spec.calicoNetwork.linuxDataplane is %s", |
There was a problem hiding this comment.
I think this might reject a config the user never actually wrote when None comes in via the overlay. The base install defaults spec.cni and we persist it, then the overlay sets linuxDataplane: None. OverrideInstallationSpec doesn't seem to clear CNI when only the base has it set (no AOnlySet case in merge.go), so the merged spec would carry both cni and None and land here degraded. main.go says the mode can come from the overlay, so I think this path is supposed to work..?
| components = append(components, kubecontrollers.NewCalicoKubeControllers(&kubeControllersCfg)) | ||
| // Build a configuration for rendering calico/kube-controllers, which is not rendered | ||
| // when the dataplane is disabled. | ||
| var kubeControllersCfg kubecontrollers.KubeControllersConfiguration |
There was a problem hiding this comment.
Looks like kube-controllers gets gated out here, but the WAF webhook cert above (wafWebhookTLS) and the cert render don't. So I think a no-dataplane cluster that turns on WAF in a GatewayAPI CR would end up with an orphan tigera-waf-webhook secret and a webhook that never runs, with no status to say why.. Might be worth gating the cert the same way, or surfacing that WAF-on-gateway isn't supported without the dataplane.
|
|
||
| // Reject any remaining dataplane-only field that is explicitly set, reporting them | ||
| // all at once so the user can fix the spec in a single pass. | ||
| var disallowed []string |
There was a problem hiding this comment.
nit, not blocking: this is a deny-list, so I think a new calicoNetwork field added later might slip through in None mode by default. Some of these (hostPorts and friends) look like the ones that'd nil-panic on spec.cni downstream, which is why the block exists. An allow-list would fail closed. Worth a comment at least so the next person adding a field knows to touch this.
| // BPF dataplane requires IP autodetection even if we're not using Calico IPAM. | ||
| // With the Linux dataplane disabled there is no calico-node to perform autodetection, | ||
| // so skip defaulting these fields (validation rejects them when the dataplane is disabled). | ||
| needIPv4Autodetection := *instance.Spec.CalicoNetwork.LinuxDataplane == operatorv1.LinuxDataplaneBPF |
There was a problem hiding this comment.
tiny: you deref LinuxDataplane here, then zero it out right below for the disabled case. Could maybe fold it: needIPv4Autodetection := !instance.Spec.DataplaneDisabled() && *...LinuxDataplane == LinuxDataplaneBPF.
Description
A suggestion from Arvind for the hackathon: add support to the operator for installing our 2 cni/dataplane agnostic products (Calico Ingress Gateway and Istio Service mesh) without our CNI nor our dataplane enabled.
This results in cni.type = nil and linuxDataplane: None
"None" is a new enum value for our linuxDataplane.
Release Note
For PR author
make gen-filesmake gen-versionsFor PR reviewers
A note for code reviewers - all pull requests must have the following:
kind/bugif this is a bugfix.kind/enhancementif this is a a new feature.enterpriseif this PR applies to Calico Enterprise only.