
TRT-2723: Ignore NodeReady=False during a CNI rollout #31319

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from mdbooth:TRT-2723 on Jun 20, 2026

Conversation

@mdbooth

@mdbooth mdbooth commented Jun 19, 2026

Contributor

A CNI rollout is expected to cause each Node to become briefly not Ready while the ovn-kubernetes pod is replaced. This change adds an additional exclusion to the node watch test for NodeReady=False when:

  • the network ClusterOperator reports Progressing=True, and
  • the NodeReady condition message contains NetworkPluginNotReady, a runtime status reported by cri-o

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced node readiness error messages to include underlying condition details for better diagnostics.
  • Bug Fixes

    • Improved handling of network-related node readiness issues to prevent false failure reporting during network operator updates.
  • Tests

    • Added comprehensive test coverage for network plugin readiness scenarios.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot

openshift-ci-robot commented Jun 19, 2026


@mdbooth: This pull request references TRT-2723 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

A CNI rollout is expected to cause each Node to become briefly not Ready while the ovn-kubernetes pod is replaced. This change adds an additional exclusion to the node watch test for NodeReady=False when:

  • the network ClusterOperator reports Progressing=True, and
  • the message contains NetworkPluginNotReady reported by cri-o

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 19, 2026
@coderabbitai

coderabbitai Bot commented Jun 19, 2026


Walkthrough

In node.go, the unexpected-not-ready interval builder now extracts the NodeReady condition message and stores it as AnnotationCause. reportUnexpectedNodeDownFailures pre-computes networkProgressingIntervals (ClusterOperator network, Progressing=True) and skips failures whose AnnotationCause contains NetworkPluginNotReady when they overlap those intervals. Three new test cases validate all three suppression branches.
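As a rough illustration of the pre-computation step described in the walkthrough, the sketch below filters a mixed list of intervals down to the network operator's Progressing=True windows. The `interval` struct and its field names are hypothetical simplifications and do not reflect the real monitorapi types.

```go
package main

import "time"

// interval is a hypothetical, simplified stand-in for a monitor interval.
type interval struct {
	ClusterOperator string // e.g. "network"; empty for non-operator intervals
	Condition       string // e.g. "Progressing"
	Status          string // e.g. "True"
	From, To        time.Time
}

// networkProgressing keeps only the intervals in which the "network"
// ClusterOperator reported Progressing=True. Computing this once up front
// avoids re-scanning every interval for each NotReady event.
func networkProgressing(all []interval) []interval {
	var out []interval
	for _, iv := range all {
		if iv.ClusterOperator == "network" && iv.Condition == "Progressing" && iv.Status == "True" {
			out = append(out, iv)
		}
	}
	return out
}
```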

Changes

NetworkPluginNotReady Failure Suppression

Layer / File(s) and Summary:

  • Cause annotation and network-rollout suppression logic (pkg/monitortests/node/watchnodes/node.go): Unexpected-not-ready intervals now store the NodeReady condition message in AnnotationCause. reportUnexpectedNodeDownFailures pre-filters clusteroperator/network Progressing=True intervals and skips counting a failure when AnnotationCause contains NetworkPluginNotReady and the interval overlaps a network progressing window. Comment updated to mention CNI rollout exclusion.
  • Test cases for the three suppression branches (pkg/monitortests/node/watchnodes/node_test.go): Three new TestReportUnexpectedNodeDownFailures cases: NetworkPluginNotReady during a network rollout (expects zero failures), NetworkPluginNotReady with no rollout (expects one failure), and a non-NetworkPluginNotReady cause during a rollout (expects one failure).
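The three branches summarized above lend themselves to a table of cases. The sketch below shows one possible shape for such a table. All type and function names here (`span`, `notReady`, `countFailures`, `suppressionCases`) are hypothetical and are not taken from node_test.go, and the cause strings are illustrative.

```go
package main

import (
	"strings"
	"time"
)

// span and notReady are hypothetical, simplified stand-ins for monitor intervals.
type span struct{ From, To time.Time }

type notReady struct {
	At    time.Time
	Cause string
}

// within reports whether t falls inside any of the given spans (bounds inclusive).
func within(t time.Time, spans []span) bool {
	for _, s := range spans {
		if !t.Before(s.From) && !t.After(s.To) {
			return true
		}
	}
	return false
}

// countFailures counts NotReady events that are not excused by a network rollout.
func countFailures(events []notReady, rollouts []span) int {
	failures := 0
	for _, e := range events {
		if strings.Contains(e.Cause, "NetworkPluginNotReady") && within(e.At, rollouts) {
			continue // expected during a CNI rollout
		}
		failures++
	}
	return failures
}

type suppressionCase struct {
	name     string
	events   []notReady
	rollouts []span
	want     int
}

// suppressionCases returns one case per suppression branch.
func suppressionCases() []suppressionCase {
	t0 := time.Date(2026, time.June, 19, 12, 0, 0, 0, time.UTC)
	rollout := []span{{From: t0, To: t0.Add(10 * time.Minute)}}
	cni := notReady{At: t0.Add(time.Minute), Cause: "NetworkReady=false reason:NetworkPluginNotReady"}
	other := notReady{At: t0.Add(time.Minute), Cause: "Kubelet stopped posting node status."}
	return []suppressionCase{
		{name: "NetworkPluginNotReady during rollout", events: []notReady{cni}, rollouts: rollout, want: 0},
		{name: "NetworkPluginNotReady without rollout", events: []notReady{cni}, rollouts: nil, want: 1},
		{name: "other cause during rollout", events: []notReady{other}, rollouts: rollout, want: 1},
	}
}
```

In a real test file, each case would typically run under `t.Run(c.name, ...)` and compare `countFailures(c.events, c.rollouts)` against `c.want`.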

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: ignoring NodeReady=False during CNI rollouts, which directly aligns with the primary modifications across both modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the PR are stable and deterministic. The three new test cases use descriptive static strings without dynamic values like pod names, timestamps, UUIDs, node names, IP addresses, or...
Test Structure And Quality ✅ Passed The three new test cases follow Go unit test best practices: each tests one scenario, all assertions include expected/actual values prefixed with auto-generated test names via t.Run(), and they con...
Microshift Test Compatibility ✅ Passed The PR adds standard Go unit tests (using testing.T), not Ginkgo e2e tests. The custom check applies only to new Ginkgo e2e tests (It(), Describe(), etc.), so it does not apply here.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests added. Changes are limited to unit test cases in a standard Go testing file, which do not require SNO compatibility checks.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only test/monitoring framework code (pkg/monitortests) that observes node state; contains no deployment manifests, operators, controllers, or scheduling constraints (affinity, topology...
Ote Binary Stdout Contract ✅ Passed PR changes contain no stdout writes in process-level code (init, main, TestMain, BeforeSuite, top-level var/const). All fmt/logrus calls are within function bodies or return strings without printing.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The changes add standard Go unit tests, not Ginkgo e2e tests. The custom check applies only to Ginkgo e2e tests; it is not applicable here.
No-Weak-Crypto ✅ Passed The modified files (node.go and node_test.go) are Kubernetes node monitoring and test code with no cryptographic implementations, weak cipher usage, or secret comparison logic.
Container-Privileges ✅ Passed PR modifies only Go test files (node.go, node_test.go) with no container manifests or privilege-escalation settings introduced.
No-Sensitive-Data-In-Logs ✅ Passed The logging in node.go line 418 uses OldLocator() and OldMessage() to log node readiness status. The data logged includes node names, timestamps, and Kubernetes condition messages (like NetworkPlug...


@openshift-ci openshift-ci Bot requested review from p0lyn0mial and sjenning June 19, 2026 21:05

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/monitortests/node/watchnodes/node_test.go (1)

324-449: ⚡ Quick win

Add a non-overlapping rollout case to lock time-window behavior.

The new cases cover “overlap” and “no rollout,” but not “rollout present with non-overlapping timestamps.” Adding that case (expecting one failure) will guard against regressions in overlap matching.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/node/watchnodes/node_test.go` around lines 324 - 449, Add a
new test case after the existing ones that covers an unexpected node NotReady
event with a NetworkPluginNotReady cause occurring in the same run as a network
operator rollout (Progressing=True), but with non-overlapping timestamps. Create
the case with an UnexpectedNotReady interval that has NetworkPluginNotReady in
its AnnotationCause (similar to the first case), and place the network operator
Progressing interval at a different time that does not bracket the node error
(for example, entirely earlier or entirely later). Set the expected field to
contain one failure string matching the node error format (similar to the second
case) and set unexpectedReason to monitorapi.NodeUnexpectedReadyReason, so the
test verifies that the overlap matching logic requires temporal intersection
before suppressing the failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/node/watchnodes/node.go`:
- Around line 409-414: The issue is that the `intervalStartDuring` helper
function used at line 413 has overly permissive comparison logic that returns
true whenever any network Progressing interval exists, regardless of actual
timestamp overlap. The helper currently uses an OR condition (>= from || <= to)
which matches too broadly. Fix the `intervalStartDuring` helper function to
implement proper interval overlap logic that correctly verifies the event's
timestamp actually falls within the bounds of the network progressing interval,
rather than just checking if the timestamp is on one side or the other of the
interval.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7382de1a-2355-4672-9096-b8013baaa6df

📥 Commits

Reviewing files that changed from the base of the PR and between 135cf91 and 3426c8b.

📒 Files selected for processing (2)
  • pkg/monitortests/node/watchnodes/node.go
  • pkg/monitortests/node/watchnodes/node_test.go

Comment on lines +409 to +414
    // Skip NotReady events caused by NetworkPluginNotReady during network operator rollout
    // NetworkPluginNotReady is a RuntimeStatus reported by cri-o and exposed by kubelet in the condition's message.
    conditionMsg := unexpectedNodeUnready.Message.Annotations[monitorapi.AnnotationCause]
    if strings.Contains(conditionMsg, "NetworkPluginNotReady") {
        if intervalStartDuring(unexpectedNodeUnready, networkProgressingIntervals) {
            continue

@coderabbitai coderabbitai Bot Jun 19, 2026


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Network rollout suppression is currently over-broad due to overlap helper semantics.

At Line 413, suppression depends on intervalStartDuring(...), but that helper currently returns true for any non-empty haystack (it uses >= from || <= to). This makes NetworkPluginNotReady events get skipped whenever any network Progressing=True interval exists, even when timestamps do not overlap.

💡 Proposed fix
# pkg/monitortests/node/watchnodes/monitortest.go
-       if needleStartEqualOrAfterFrom || needleStartEqualOrBeforeTo {
+       if needleStartEqualOrAfterFrom && needleStartEqualOrBeforeTo {
            return true
        }

@mdbooth mdbooth Jun 19, 2026

Contributor Author

It looks like CodeRabbit is correct here. However, this is a latent bug in intervalStartDuring, which is used by several other tests, and it may be masking failures in those tests.

This PR is attempting to fix a current regression. I do not want to inadvertently trigger a new one by fixing this here. I will propose a follow-on.


@mdbooth, that's a sensible call — fixing the intervalStartDuring logic independently in a focused follow-up avoids unintended side-effects on the other tests that rely on it, and keeps this regression fix clean.

Would you like me to open a GitHub issue to track the intervalStartDuring bug fix? I can include a reference back to this PR and the specific OR→AND logic change needed.


🧠 Learnings used
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When reviewing Go code that uses OpenShift configv1 (`github.com/openshift/api/config/v1`), note that `ImageSpec.RegistrySources` is declared as a non-pointer value (`RegistrySources RegistrySources`). It is always safely zero-initialized by Go, so it can never be `nil`. Therefore, do not add nil-guards for `imageSpec.RegistrySources` before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`).

Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When using the OpenShift `configv1` API (`github.com/openshift/api/config/v1`), treat `ImageSpec.RegistrySources` as a non-nil Go struct value (`RegistrySources`, not `*RegistrySources`). Because it can never be nil (it’s always zero-initialized), don’t add nil-guards before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`). You may still need to handle zero-value contents, but a nil check on `RegistrySources` itself is unnecessary.

Contributor Author

Proposed fix: #31320

@openshift-merge-bot

Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@smg247

smg247 commented Jun 19, 2026

Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2026
@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, smg247

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2026
@smg247

smg247 commented Jun 20, 2026

Member

/retest-required

@mdbooth

mdbooth commented Jun 20, 2026

Contributor Author

/test e2e-aws-ovn-fips

@mdbooth

mdbooth commented Jun 20, 2026

Contributor Author

This previously passed. In the passing run I have confirmed:

  • The CNI rollout occurred
  • The Nodes were briefly marked NodeReady=False (confirmed from NodeNotReady events)

As the test passed, the exclusion seems to have worked. I'll run it again anyway for confidence as we're waiting on the FIPS job.

/test e2e-gcp-ovn-upgrade

@mdbooth

mdbooth commented Jun 20, 2026

Contributor Author

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 20, 2026
@openshift-ci-robot


@mdbooth: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@smg247

smg247 commented Jun 20, 2026

Member

/override ci/prow/e2e-aws-ovn-fips

@openshift-ci

openshift-ci Bot commented Jun 20, 2026

Contributor

@smg247: Overrode contexts on behalf of smg247: ci/prow/e2e-aws-ovn-fips

Details

In response to this:

/override ci/prow/e2e-aws-ovn-fips

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 20, 2026

Contributor

@mdbooth: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@smg247

smg247 commented Jun 20, 2026

Member

e2e-aws-ovn-fips failed 3 times for 3 different reasons that seem unrelated to this change

@openshift-merge-bot openshift-merge-bot Bot merged commit dd883a7 into openshift:main Jun 20, 2026
21 checks passed
@mdbooth mdbooth deleted the TRT-2723 branch June 20, 2026 11:22

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


3 participants