TRT-2723: Ignore NodeReady=False during a CNI rollout by mdbooth · Pull Request #31319 · openshift/origin

mdbooth · 2026-06-19T21:05:21Z

A CNI rollout is expected to cause each Node to become briefly not Ready while the ovn-kubernetes pod is replaced. This change adds an exclusion to the node watch test for NodeReady=False when:

- the network ClusterOperator reports Progressing=True, and
- the NodeReady condition message contains NetworkPluginNotReady, as reported by cri-o

Summary by CodeRabbit

Release Notes

New Features
- Enhanced node readiness error messages to include underlying condition details for better diagnostics.
Bug Fixes
- Improved handling of network-related node readiness issues to prevent false failure reporting during network operator updates.
Tests
- Added comprehensive test coverage for network plugin readiness scenarios.

openshift-merge-bot · 2026-06-19T21:05:24Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-06-19T21:05:25Z

@mdbooth: This pull request references TRT-2723 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-19T21:05:43Z

Walkthrough

In node.go, the unexpected-not-ready interval builder now extracts the NodeReady condition message and stores it as AnnotationCause. reportUnexpectedNodeDownFailures pre-computes networkProgressingIntervals (ClusterOperator network, Progressing=True) and skips failures whose AnnotationCause contains NetworkPluginNotReady when they overlap those intervals. Three new test cases validate all three suppression branches.
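The flow described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code from node.go: the `interval` type, `startsWithin`, `countUnexpectedNotReady`, and `demo` are hypothetical names, the condition messages are illustrative strings, and the window check uses a strict "start falls inside the window" comparison, which is the intended behavior rather than necessarily what the existing helper does.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// interval is a simplified, hypothetical stand-in for a monitor interval.
type interval struct {
	from, to time.Time
	cause    string // NodeReady condition message, as kept under the cause annotation
}

// startsWithin reports whether iv starts inside any of the given windows
// (bounds inclusive).
func startsWithin(iv interval, windows []interval) bool {
	for _, w := range windows {
		if !iv.from.Before(w.from) && !iv.from.After(w.to) {
			return true
		}
	}
	return false
}

// countUnexpectedNotReady counts not-ready intervals as failures, skipping
// those whose cause mentions NetworkPluginNotReady and that start during a
// network Progressing=True window.
func countUnexpectedNotReady(notReady, networkProgressing []interval) int {
	failures := 0
	for _, iv := range notReady {
		if strings.Contains(iv.cause, "NetworkPluginNotReady") && startsWithin(iv, networkProgressing) {
			continue
		}
		failures++
	}
	return failures
}

// demo builds one rollout window and three not-ready intervals.
func demo() int {
	t0 := time.Date(2026, 6, 19, 12, 0, 0, 0, time.UTC)
	rollout := []interval{{from: t0, to: t0.Add(10 * time.Minute)}}
	notReady := []interval{
		// Suppressed: network cause, starts during the rollout.
		{from: t0.Add(2 * time.Minute), to: t0.Add(3 * time.Minute), cause: "NetworkReady=false reason:NetworkPluginNotReady"},
		// Counted: starts during the rollout, but the cause is unrelated.
		{from: t0.Add(4 * time.Minute), to: t0.Add(5 * time.Minute), cause: "Kubelet stopped posting node status."},
		// Counted: network cause, but an hour after the rollout started.
		{from: t0.Add(time.Hour), to: t0.Add(61 * time.Minute), cause: "NetworkReady=false reason:NetworkPluginNotReady"},
	}
	return countUnexpectedNotReady(notReady, rollout)
}

func main() {
	fmt.Println(demo()) // prints 2
}
```

Both conditions must hold for an interval to be skipped, so a network-related NotReady outside a rollout, or an unrelated NotReady during one, is still reported.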

Changes

NetworkPluginNotReady Failure Suppression

| Layer / File(s) | Summary |
| --- | --- |
| Cause annotation and network-rollout suppression logic: `pkg/monitortests/node/watchnodes/node.go` | Unexpected-not-ready intervals now store the `NodeReady` condition message in `AnnotationCause`. `reportUnexpectedNodeDownFailures` pre-filters `clusteroperator/network` `Progressing=True` intervals and skips counting a failure when `AnnotationCause` contains `NetworkPluginNotReady` and the interval overlaps a network progressing window. Comment updated to mention CNI rollout exclusion. |
| Test cases for the three suppression branches: `pkg/monitortests/node/watchnodes/node_test.go` | Three new `TestReportUnexpectedNodeDownFailures` cases: `NetworkPluginNotReady` during a network rollout (expects zero failures), `NetworkPluginNotReady` with no rollout (expects one failure), and a non-`NetworkPluginNotReady` cause during a rollout (expects one failure). |
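The three branches listed above lend themselves to a table-driven layout. The sketch below is an assumption-laden reduction, not the cases from node_test.go: `shouldSuppress`, `suppressionCase`, `cases`, and `failuresFor` are hypothetical names, and the rollout overlap is collapsed into a single boolean for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSuppress is a hypothetical reduction of the suppression rule:
// skip only when the cause is network-related AND a rollout is in progress.
func shouldSuppress(cause string, duringNetworkRollout bool) bool {
	return duringNetworkRollout && strings.Contains(cause, "NetworkPluginNotReady")
}

// suppressionCase describes one not-ready interval and the expected outcome.
type suppressionCase struct {
	name          string
	cause         string
	duringRollout bool
	wantFailures  int
}

// cases mirrors the three branches named in the summary table.
func cases() []suppressionCase {
	return []suppressionCase{
		{"NetworkPluginNotReady during network rollout", "reason:NetworkPluginNotReady", true, 0},
		{"NetworkPluginNotReady with no rollout", "reason:NetworkPluginNotReady", false, 1},
		{"other cause during network rollout", "Kubelet stopped posting node status.", true, 1},
	}
}

// failuresFor returns how many failures a single not-ready interval yields.
func failuresFor(c suppressionCase) int {
	if shouldSuppress(c.cause, c.duringRollout) {
		return 0
	}
	return 1
}

func main() {
	for _, c := range cases() {
		fmt.Printf("%s: got %d, want %d\n", c.name, failuresFor(c), c.wantFailures)
	}
}
```

Keeping one row per branch makes it cheap to add further rows later, such as a rollout whose window does not overlap the node event.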

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: ignoring NodeReady=False during CNI rollouts, which directly aligns with the primary modifications across both modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR are stable and deterministic. The three new test cases use descriptive static strings without dynamic values like pod names, timestamps, UUIDs, node names, IP addresses, or... |
| Test Structure And Quality | ✅ Passed | The three new test cases follow Go unit test best practices: each tests one scenario, all assertions include expected/actual values prefixed with auto-generated test names via t.Run(), and they con... |
| Microshift Test Compatibility | ✅ Passed | The PR adds standard Go unit tests (using testing.T), not Ginkgo e2e tests. The custom check applies only to new Ginkgo e2e tests (It(), Describe(), etc.), so it does not apply here. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests added. Changes are limited to unit test cases in a standard Go testing file, which do not require SNO compatibility checks. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only test/monitoring framework code (pkg/monitortests) that observes node state; contains no deployment manifests, operators, controllers, or scheduling constraints (affinity, topology... |
| Ote Binary Stdout Contract | ✅ Passed | PR changes contain no stdout writes in process-level code (init, main, TestMain, BeforeSuite, top-level var/const). All fmt/logrus calls are within function bodies or return strings without printing. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The changes add standard Go unit tests, not Ginkgo e2e tests. The custom check applies only to Ginkgo e2e tests; it is not applicable here. |
| No-Weak-Crypto | ✅ Passed | The modified files (node.go and node_test.go) are Kubernetes node monitoring and test code with no cryptographic implementations, weak cipher usage, or secret comparison logic. |
| Container-Privileges | ✅ Passed | PR modifies only Go test files (node.go, node_test.go) with no container manifests or privilege-escalation settings introduced. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The logging in node.go line 418 uses OldLocator() and OldMessage() to log node readiness status. The data logged includes node names, timestamps, and Kubernetes condition messages (like NetworkPlug... |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pkg/monitortests/node/watchnodes/node_test.go (1)

324-449: ⚡ Quick win

Add a non-overlapping rollout case to lock time-window behavior.

The new cases cover “overlap” and “no rollout,” but not “rollout present with non-overlapping timestamps.” Adding that case (expecting one failure) will guard against regressions in overlap matching.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/node/watchnodes/node_test.go` around lines 324 - 449, add a
new test case after the existing ones that covers a node unexpected not-ready
event with a NetworkPluginNotReady cause occurring in the same run as a network
operator rollout (Progressing=True), but with non-overlapping timestamps. Build
the case from an UnexpectedNotReady interval with NetworkPluginNotReady in the
AnnotationCause (similar to the first case), and place the network operator
Progressing interval at a different time that does not bracket the node error
(for example, have the network rollout occur earlier or later). Set the expected
field to contain one failure string matching the node error format (similar to
the second case) and set unexpectedReason to
monitorapi.NodeUnexpectedReadyReason, so that the test verifies the overlap
matching logic requires temporal intersection before suppressing the failure.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/node/watchnodes/node.go`:
- Around line 409-414: The issue is that the `intervalStartDuring` helper
function used at line 413 has overly permissive comparison logic that returns
true whenever any network Progressing interval exists, regardless of actual
timestamp overlap. The helper currently uses an OR condition (>= from || <= to)
which matches too broadly. Fix the `intervalStartDuring` helper function to
implement proper interval overlap logic that correctly verifies the event's
timestamp actually falls within the bounds of the network progressing interval,
rather than just checking if the timestamp is on one side or the other of the
interval.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7382de1a-2355-4672-9096-b8013baaa6df

📥 Commits

Reviewing files that changed from the base of the PR and between 135cf91 and 3426c8b.

📒 Files selected for processing (2)

pkg/monitortests/node/watchnodes/node.go
pkg/monitortests/node/watchnodes/node_test.go

coderabbitai · 2026-06-19T21:11:13Z

+			// Skip NotReady events caused by NetworkPluginNotReady during network operator rollout
+			// NetworkPluginNotReady is a RuntimeStatus reported by cri-o and exposed by kubelet in the condition's message.
+			conditionMsg := unexpectedNodeUnready.Message.Annotations[monitorapi.AnnotationCause]
+			if strings.Contains(conditionMsg, "NetworkPluginNotReady") {
+				if intervalStartDuring(unexpectedNodeUnready, networkProgressingIntervals) {
+					continue


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Network rollout suppression is currently over-broad due to overlap helper semantics.

At Line 413, suppression depends on intervalStartDuring(...), but that helper currently returns true for any non-empty haystack (it uses >= from || <= to). This makes NetworkPluginNotReady events get skipped whenever any network Progressing=True interval exists, even when timestamps do not overlap.

💡 Proposed fix

# pkg/monitortests/node/watchnodes/monitortest.go
- if needleStartEqualOrAfterFrom || needleStartEqualOrBeforeTo {
+ if needleStartEqualOrAfterFrom && needleStartEqualOrBeforeTo {
      return true
  }
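To see why the OR form matches too broadly: for any window with from <= to, every timestamp is either at or after from, or at or before to, so the condition is always true. The sketch below contrasts the two comparisons. It is not the helper from monitortest.go; `window`, `startDuringOr`, `startDuringAnd`, and `demo` are hypothetical names that mirror only the comparison described in the review.

```go
package main

import (
	"fmt"
	"time"
)

// window is a simplified, hypothetical time range with inclusive bounds.
type window struct{ from, to time.Time }

// startDuringOr mirrors the OR comparison described in the review comment.
func startDuringOr(start time.Time, windows []window) bool {
	for _, w := range windows {
		equalOrAfterFrom := !start.Before(w.from)
		equalOrBeforeTo := !start.After(w.to)
		if equalOrAfterFrom || equalOrBeforeTo {
			return true
		}
	}
	return false
}

// startDuringAnd applies the proposed AND comparison: the start must lie
// between from and to.
func startDuringAnd(start time.Time, windows []window) bool {
	for _, w := range windows {
		equalOrAfterFrom := !start.Before(w.from)
		equalOrBeforeTo := !start.After(w.to)
		if equalOrAfterFrom && equalOrBeforeTo {
			return true
		}
	}
	return false
}

// demo checks a start time one hour after the beginning of a ten-minute
// window, which is well outside it.
func demo() (orResult, andResult bool) {
	t0 := time.Date(2026, 6, 19, 12, 0, 0, 0, time.UTC)
	windows := []window{{from: t0, to: t0.Add(10 * time.Minute)}}
	late := t0.Add(time.Hour)
	return startDuringOr(late, windows), startDuringAnd(late, windows)
}

func main() {
	orResult, andResult := demo()
	fmt.Println("OR:", orResult, "AND:", andResult) // prints "OR: true AND: false"
}
```

With an empty slice of windows both variants return false, which is consistent with the "no rollout" test case still reporting a failure.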

It looks like CodeRabbit is correct here. However, this would be a latent bug in intervalStartDuring, which is used by several other tests. It is possible that this bug is masking failures in those tests.

This PR is attempting to fix a current regression, and I do not want to inadvertently trigger a new one by fixing the helper here. I will propose a follow-on.

@mdbooth, that's a sensible call — fixing the intervalStartDuring logic independently in a focused follow-up avoids unintended side-effects on the other tests that rely on it, and keeps this regression fix clean.

Would you like me to open a GitHub issue to track the intervalStartDuring bug fix? I can include a reference back to this PR and the specific OR→AND logic change needed.

Proposed fix: #31320

openshift-merge-bot · 2026-06-19T21:30:06Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

smg247 · 2026-06-19T22:01:53Z

/lgtm

openshift-ci · 2026-06-19T22:02:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, smg247

Needs approval from an approver in each of these files:

~~OWNERS~~ [smg247]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smg247 · 2026-06-20T02:12:40Z

/retest-required

mdbooth · 2026-06-20T06:55:24Z

/test e2e-aws-ovn-fips

mdbooth · 2026-06-20T07:12:32Z

This previously passed. In the passing run I have confirmed:

- The CNI rollout occurred
- The Nodes were briefly marked NodeReady=False (confirmed from NodeNotReady events)

As the test passed, the exclusion seems to have worked. I'll run it again anyway for confidence as we're waiting on the FIPS job.

/test e2e-gcp-ovn-upgrade

mdbooth · 2026-06-20T09:04:31Z

/verified by CI

openshift-ci-robot · 2026-06-20T09:04:42Z

@mdbooth: This PR has been marked as verified by CI.

smg247 · 2026-06-20T10:35:30Z

/override ci/prow/e2e-aws-ovn-fips

openshift-ci · 2026-06-20T10:35:35Z

@smg247: Overrode contexts on behalf of smg247: ci/prow/e2e-aws-ovn-fips

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-06-20T10:35:37Z

@mdbooth: all tests passed!

smg247 · 2026-06-20T10:36:14Z

e2e-aws-ovn-fips failed 3 times for 3 different reasons that seem unrelated to this change

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 19, 2026

openshift-ci Bot requested review from p0lyn0mial and sjenning June 19, 2026 21:05

coderabbitai Bot requested changes Jun 19, 2026

openshift-ci Bot assigned smg247 Jun 19, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 20, 2026

mdbooth mentioned this pull request Jun 20, 2026

NO-JIRA: Fix interval filtering bug in watchnodes tests #31320

Open

openshift-merge-bot Bot merged commit dd883a7 into openshift:main Jun 20, 2026
21 checks passed

mdbooth deleted the TRT-2723 branch June 20, 2026 11:22
