THREESCALE-14652 Prevent perpetual reconcile by hardcoding K8s-defaulted fields in component specs#1174
THREESCALE-14652 Prevent perpetual reconcile by hardcoding K8s-defaulted fields in component specs#1174urbanikb wants to merge 2 commits into
Conversation
f0e8544 to
2efbfb9
Compare
|
@tkan145 the required tests on prow are failing due to CI already using go 1.25 Other than this the PR is ready for review. I will rebase this on master when #1173 is merged to clear the CI errors. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1174 +/- ##
==========================================
+ Coverage 41.84% 44.28% +2.43%
==========================================
Files 203 204 +1
Lines 20859 21087 +228
==========================================
+ Hits 8729 9338 +609
+ Misses 11350 10947 -403
- Partials 780 802 +22
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
…ponent specs
The K8s API server fills zero-value fields with defaults when a resource is
written. Mutators using reflect.DeepEqual then see a mismatch between the
desired spec (Go zero values) and the live spec (K8s-filled defaults),
triggering an update on every reconcile cycle.
Explicitly set all K8s-defaulted fields in every component Deployment spec:
- Probe fields: Scheme, TimeoutSeconds, PeriodSeconds, SuccessThreshold,
FailureThreshold
- Container / init-container fields: TerminationMessagePath,
TerminationMessagePolicy, ImagePullPolicy (init containers are most
sensitive — DeploymentPodInitContainerMutator does a full struct DeepEqual)
- Pod spec fields: RestartPolicy, DNSPolicy, SecurityContext,
TerminationGracePeriodSeconds, SchedulerName
- Volume source fields: DefaultMode on Secret, ConfigMap, and Projected
volume sources
- Use nil (not []T{}) for optional volume/volumemount slices so
reflect.DeepEqual treats K8s-normalised absent and locally-absent the same
Extend UpdateResource to log the object's namespace and APIManager owner
name as structured fields, enabling the integration test's ReconcileCounter
to attribute each Deployment update to the correct CR instance.
Add ReconcileCounter and verifyNoDeploymentUpdates to the integration test
suite to assert ≤50 total Deployment update calls per APIManager install,
providing a regression test for this class of bug.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
08edbf7 to
dcac2e2
Compare
|
/retest |
1 similar comment
|
/retest |
| typeName := strings.Replace(fmt.Sprintf("%T", obj), "*", "", 1) | ||
| crName := "" | ||
| for _, ref := range obj.GetOwnerReferences() { | ||
| if ref.Kind == "APIManager" { | ||
| crName = ref.Name | ||
| break | ||
| } | ||
| } |
There was a problem hiding this comment.
What is this and why do we need this?
| } | ||
|
|
||
| func (backend *Backend) backendContainerVolumeMounts() []v1.VolumeMount { | ||
| res := []v1.VolumeMount{} |
There was a problem hiding this comment.
Any reason why we need to change this? Seems a bit weird to only change the format here and leave everything else the same.
There was a problem hiding this comment.
There is a mutator for the volume that does DeepEqual and detects this as a change - we need to create object that has nil instead of empty array because that what Apiserver does...
| func verifyNoDeploymentUpdates(namespace, name string, w io.Writer) { | ||
| updateCounts := reconcileCounter.GetUpdateCounts(namespace, name) | ||
| totalUpdates := reconcileCounter.GetTotalUpdates(namespace, name) | ||
|
|
||
| fmt.Fprintf(w, "\n=== Deployment Update Report (%s/%s, session ceiling %d) ===\n", | ||
| namespace, name, maxTotalDeploymentUpdates) | ||
| fmt.Fprintf(w, "Total: %d\n", totalUpdates) | ||
|
|
||
| names := make([]string, 0, len(updateCounts)) | ||
| for n := range updateCounts { | ||
| names = append(names, n) | ||
| } | ||
| sort.Strings(names) | ||
| for _, n := range names { | ||
| fmt.Fprintf(w, " %s: %d\n", n, updateCounts[n]) | ||
| } | ||
| fmt.Fprintf(w, "=============================================================\n\n") | ||
|
|
||
| Expect(totalUpdates).To(BeNumerically(">=", minTotalDeploymentUpdates), | ||
| fmt.Sprintf("%s/%s: total deployment updates is 0 — counter is likely misconfigured (namespace/name fields not captured)", namespace, name)) | ||
| Expect(totalUpdates).To(BeNumerically("<=", maxTotalDeploymentUpdates), | ||
| deploymentUpdateDetail(namespace, name, updateCounts, totalUpdates)) | ||
| } |
There was a problem hiding this comment.
I really dislike this test; it's like we don't know how many times the reconciliation will happen, so we have to guess and hope the result falls within the acceptable range. It also doesn't tell us much except that the reconcile function was called x times, which function called it, and what caused that call are still unknown.
There was a problem hiding this comment.
I've updated to expect exactly 0 updates at reconcile and exactly 1 update after the part where update is triggered. Does this look better?
- Remove WithRequest, pass BaseReconciler directly - Remove cosmetic K8s defaults (RestartPolicy, DNSPolicy, etc.) that have no corresponding mutator — keep only fields compared by DeploymentPodInitContainerMutator, DeploymentVolumesMutator, and DeploymentProbesMutator - Revert apicast.go and memcached.go fully to master (no mutators require defaults in those components) - Simplify integration test to assert exact update counts (0 before synthetic trigger, 1 after) instead of ceiling/floor ranges Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@urbanikb: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
https://redhat.atlassian.net/browse/THREESCALE-14652
Summary
Fixes perpetual Deployment update loops caused by the K8s API server silently filling zero-value fields with defaults on write. Mutators using
reflect.DeepEqualthen always see a mismatch between the desired spec (Go zero values) and the live spec (K8s-filled defaults), triggering an update on every reconcile cycle.Root cause fields explicitly set in all component Deployment specs:
Scheme,TimeoutSeconds,PeriodSeconds,SuccessThreshold,FailureThresholdTerminationMessagePath,TerminationMessagePolicy,ImagePullPolicy(init containers most sensitive —DeploymentPodInitContainerMutatordoes a full structDeepEqual)RestartPolicy,DNSPolicy,SecurityContext,TerminationGracePeriodSeconds,SchedulerNameDefaultModeonSecret,ConfigMap, andProjectedvolume sourcesnil(not[]T{}) for optional volume/volumemount slicesObservability:
UpdateResourcenow logsnamespaceand APIManager ownernameas structured fields, enabling the integration test counter to attribute each Deployment update to the correct CR instance.Regression test:
ReconcileCounter+verifyNoDeploymentUpdatesassert that total Deployment update calls per APIManager install stay within[1, 50]. The floor of 1 guards against a silent counter misconfiguration; the ceiling of 50 stays well below what the perpetual-reconcile bug produced (hundreds of updates per install).Test plan
WATCH_NAMESPACE=dummy make test-integrationTotalbetween 1 and 50🤖 Generated with Claude Code