TRT-2723: Ignore NodeReady=False during a CNI rollout#31319
Conversation
A CNI rollout is expected to cause each Node to become briefly not Ready while the ovn-kubernetes pod is replaced. This change adds an additional exclusion to the node watch test for NodeReady=False when: * the network ClusterOperator reports Progressing=True, and * the message contains NetworkPluginNotReady reported by cri-o
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@mdbooth: This pull request references TRT-2723 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughIn ChangesNetworkPluginNotReady Failure Suppression
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/monitortests/node/watchnodes/node_test.go (1)
324-449: ⚡ Quick winAdd a non-overlapping rollout case to lock time-window behavior.
The new cases cover “overlap” and “no rollout,” but not “rollout present with non-overlapping timestamps.” Adding that case (expecting one failure) will guard against regressions in overlap matching.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/node/watchnodes/node_test.go` around lines 324 - 449, Add a new test case after the existing ones that covers the scenario where a node unexpected ready event with NetworkPluginNotReady cause occurs at the same time as a network operator rollout (Progressing=True), but with non-overlapping timestamps. Create a test case with the UnexpectedNotReady interval with NetworkPluginNotReady in the AnnotationCause (similar to the first case), but have the network operator Progressing interval occur at a different time that does not bracket the node error (for example, have the network rollout occur earlier or later). Set the expected field to contain one failure string matching the node error format (similar to the second case) and set unexpectedReason to monitorapi.NodeUnexpectedReadyReason to verify that the overlap matching logic correctly requires temporal intersection before suppressing the failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/node/watchnodes/node.go`:
- Around line 409-414: The issue is that the `intervalStartDuring` helper
function used at line 413 has overly permissive comparison logic that returns
true whenever any network Progressing interval exists, regardless of actual
timestamp overlap. The helper currently uses an OR condition (>= from || <= to)
which matches too broadly. Fix the `intervalStartDuring` helper function to
implement proper interval overlap logic that correctly verifies the event's
timestamp actually falls within the bounds of the network progressing interval,
rather than just checking if the timestamp is on one side or the other of the
interval.
---
Nitpick comments:
In `@pkg/monitortests/node/watchnodes/node_test.go`:
- Around line 324-449: Add a new test case after the existing ones that covers
the scenario where a node unexpected ready event with NetworkPluginNotReady
cause occurs at the same time as a network operator rollout (Progressing=True),
but with non-overlapping timestamps. Create a test case with the
UnexpectedNotReady interval with NetworkPluginNotReady in the AnnotationCause
(similar to the first case), but have the network operator Progressing interval
occur at a different time that does not bracket the node error (for example,
have the network rollout occur earlier or later). Set the expected field to
contain one failure string matching the node error format (similar to the second
case) and set unexpectedReason to monitorapi.NodeUnexpectedReadyReason to verify
that the overlap matching logic correctly requires temporal intersection before
suppressing the failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 7382de1a-2355-4672-9096-b8013baaa6df
📒 Files selected for processing (2)
pkg/monitortests/node/watchnodes/node.gopkg/monitortests/node/watchnodes/node_test.go
| // Skip NotReady events caused by NetworkPluginNotReady during network operator rollout | ||
| // NetworkPluginNotReady is a RuntimeStatus reported by cri-o and exposed by kubelet in the condition's message. | ||
| conditionMsg := unexpectedNodeUnready.Message.Annotations[monitorapi.AnnotationCause] | ||
| if strings.Contains(conditionMsg, "NetworkPluginNotReady") { | ||
| if intervalStartDuring(unexpectedNodeUnready, networkProgressingIntervals) { | ||
| continue |
There was a problem hiding this comment.
Network rollout suppression is currently over-broad due to overlap helper semantics.
At Line 413, suppression depends on intervalStartDuring(...), but that helper currently returns true for any non-empty haystack (it uses >= from || <= to). This makes NetworkPluginNotReady events get skipped whenever any network Progressing=True interval exists, even when timestamps do not overlap.
💡 Proposed fix
# pkg/monitortests/node/watchnodes/monitortest.go
- if needleStartEqualOrAfterFrom || needleStartEqualOrBeforeTo {
+ if needleStartEqualOrAfterFrom && needleStartEqualOrBeforeTo {
return true
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitortests/node/watchnodes/node.go` around lines 409 - 414, The issue
is that the `intervalStartDuring` helper function used at line 413 has overly
permissive comparison logic that returns true whenever any network Progressing
interval exists, regardless of actual timestamp overlap. The helper currently
uses an OR condition (>= from || <= to) which matches too broadly. Fix the
`intervalStartDuring` helper function to implement proper interval overlap logic
that correctly verifies the event's timestamp actually falls within the bounds
of the network progressing interval, rather than just checking if the timestamp
is on one side or the other of the interval.
There was a problem hiding this comment.
It looks like code rabbit is correct here. However, this would be a latent bug in intervalStartDuring which is used by several other tests. It is possible that this bug is masking failures in these tests.
This PR is attempting to fix a current regression. I do not want to inadvertently trigger a new one by fixing this here. I will propose a follow-on.
There was a problem hiding this comment.
@mdbooth, that's a sensible call — fixing the intervalStartDuring logic independently in a focused follow-up avoids unintended side-effects on the other tests that rely on it, and keeps this regression fix clean.
Would you like me to open a GitHub issue to track the intervalStartDuring bug fix? I can include a reference back to this PR and the specific OR→AND logic change needed.
🧠 Learnings used
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When reviewing Go code that uses OpenShift configv1 (`github.com/openshift/api/config/v1`), note that `ImageSpec.RegistrySources` is declared as a non-pointer value (`RegistrySources RegistrySources`). It is always safely zero-initialized by Go, so it can never be `nil`. Therefore, do not add nil-guards for `imageSpec.RegistrySources` before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`).
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When using the OpenShift `configv1` API (`github.com/openshift/api/config/v1`), treat `ImageSpec.RegistrySources` as a non-nil Go struct value (`RegistrySources`, not `*RegistrySources`). Because it can never be nil (it’s always zero-initialized), don’t add nil-guards before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`). You may still need to handle zero-value contents, but a nil check on `RegistrySources` itself is unnecessary.
|
Scheduling required tests: |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mdbooth, smg247 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
|
/test e2e-aws-ovn-fips |
|
This previously passed. In the passing run I have confirmed:
As the test passed, the exclusion seems to have worked. I'll run it again anyway for confidence as we're waiting on the FIPS job. /test e2e-gcp-ovn-upgrade |
|
/verified by CI |
|
@mdbooth: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/override ci/prow/e2e-aws-ovn-fips |
|
@smg247: Overrode contexts on behalf of smg247: ci/prow/e2e-aws-ovn-fips DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mdbooth: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
e2e-aws-ovn-fips failed 3 times for 3 different reasons that seem unrelated to this change |
A CNI rollout is expected to cause each Node to become briefly not Ready while the ovn-kubernetes pod is replaced. This change adds an additional exclusion to the node watch test for NodeReady=False when:
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests