NO-JIRA: Fix interval filtering bug in watchnodes tests by mdbooth · Pull Request #31320 · openshift/origin

mdbooth · 2026-06-19T21:47:02Z

A bug in intervalStartDuring meant that it would always return true if the target had event had ever occurred, not just within a specified window.

The test adds a 30s grace period to the window where the network ClusterOperator reports Progressing=True. This is for 2 reasons:

Firstly, CNO reports Progressing=False before the last pod becomes available, so that Node will always be NodeReady=False outside the target window. openshift/cluster-network-operator#3034 is a proposed fix for this.

Secondly, even if the CNO fix is applied, without any grace period this test would be likely to flake under load if kubelet does not update NodeReady fast enough.

Summary by CodeRabbit

Bug Fixes
- Improved accuracy of node “unexpected down” failure detection by tightening the time-interval overlap logic used to decide whether events fall within the same window.
Tests
- Added timing-focused scenarios to verify unexpected node failures are not suppressed when the machine enters the deleted phase before vs. after the node failure interval, including UnexpectedReady and UnexpectedUnreachable cases.

openshift-merge-bot · 2026-06-19T21:47:05Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-06-19T21:47:06Z

@mdbooth: This pull request explicitly references no jira issue.

Details

In response to this:

A bug in intervalStartDuring meant that it would always return true if the target had event had ever occurred, not just within a specified window.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-19T21:47:22Z

Walkthrough

The intervalStartDuring predicate is refactored to accept a configurable grace duration and enforces stricter AND-combined bounds for interval containment. Machine-deletion overlap checks use a 0s grace period, while NetworkPluginNotReady skip logic applies a 30s grace window. Four test cases validate suppression behavior at grace-window boundaries.

Changes

Interval Overlap Predicate Fix with Grace Period Parametrization

Layer / File(s)	Summary
`intervalStartDuring` predicate refactoring `pkg/monitortests/node/watchnodes/monitortest.go`	The function now accepts a `grace` duration parameter and changes the overlap check from OR-combined bounds to AND-combined bounds requiring `needle.From` to satisfy both lower and upper constraints: `curr.From` ≤ `needle.From` ≤ `curr.To + grace`.
Grace period application to failure detection `pkg/monitortests/node/watchnodes/node.go`	Machine-deletion overlap check uses `0` grace duration; NetworkPluginNotReady skip logic applies `30*time.Second` grace, extending the suppression window relative to network operator progressing intervals.
Boundary and grace-window test coverage `pkg/monitortests/node/watchnodes/node_test.go`	Adds four table-driven cases validating that NetworkPluginNotReady failures are suppressed within the grace window but emitted beyond it, and that node failures occurring before machine deletion starts or after it completes are properly reported.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

openshift/origin#31319: Both PRs modify interval timing logic for suppressing node-down failures during network operator rollout; this PR tightens the interval containment predicate and introduces grace-period parametrization, while the related PR adds the AnnotationCause-based NetworkPluginNotReady exclusion trigger.

Suggested reviewers

p0lyn0mial
sjenning

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (14 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: fixing a bug in the intervalStartDuring function used in watchnodes tests.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	The PR contains no Ginkgo tests. Modified files use standard Go testing with table-driven tests (t.Run), not Ginkgo (It/Describe/Context). Test names are static, descriptive strings without dynamic...
Test Structure And Quality	✅ Passed	Test cases follow single responsibility, have no setup/cleanup needs, include meaningful assertion messages, and are consistent with codebase patterns. Pure function tests with static data need no...
Microshift Test Compatibility	✅ Passed	The custom check requires assessment of new Ginkgo e2e tests. This PR adds test cases to existing *testing.T unit tests only; no Ginkgo e2e tests (It(), Describe(), etc.) are present.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	The pull request adds test cases only to Go unit tests (TestReportUnexpectedNodeDownFailures in node_test.go), not to Ginkgo e2e tests. The code uses standard Go testing.T imports with no Ginkgo pa...
Topology-Aware Scheduling Compatibility	✅ Passed	Changes are to monitoring test utilities (monitortest.go, node.go, node_test.go), not deployment manifests, operator code, or controllers. No scheduling constraints are introduced.
Ote Binary Stdout Contract	✅ Passed	No OTE Binary Stdout Contract violations detected. The modified code contains no process-level functions, no direct stdout writes, no klog writes, and all fmt.Sprintf usages properly assign results...
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR modifies standard Go unit tests (testing.T), not Ginkgo e2e tests. Tests contain no IPv4 assumptions, hardcoded IP addresses, or external connectivity requirements.
No-Weak-Crypto	✅ Passed	No weak cryptography, custom crypto implementations, or non-constant-time secret comparisons detected. Code uses standard time-based interval logic only.
Container-Privileges	✅ Passed	This PR modifies only Go test source files with no container/K8s manifests. The container-privileges check is inapplicable to pure Go test code.
No-Sensitive-Data-In-Logs	✅ Passed	PR modifies monitoring test code to log Kubernetes diagnostic information (node status, condition messages, timestamps) which are not sensitive data. No passwords, tokens, API keys, PII, or custome...
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-19T21:47:41Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mdbooth
Once this PR has been reviewed and has the lgtm label, please assign jogeo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mdbooth · 2026-06-20T07:15:18Z

Unit test flaked because it couldn't reach google.com.

/test unit

openshift-merge-bot · 2026-06-20T07:20:16Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

mdbooth · 2026-06-20T09:50:32Z

Confirmed e2e-gcp-ovn-upgrade failed due to #31319

mdbooth · 2026-06-20T11:41:57Z

/pipeline auto

openshift-merge-bot · 2026-06-20T11:41:59Z

The /pipeline auto command is only available for LGTM-mode repositories. For repositories in automatic mode, second-stage tests are already triggered automatically.

openshift-merge-bot · 2026-06-20T12:01:28Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

mdbooth · 2026-06-20T14:32:01Z

e2e-gcp-ovn-upgrade failed due to a timing issue: the network CO set Progressing=False 8 seconds before the last node became NodeReady=false. This looks like a bug in CNO.

Possible fixes:

Fix CNO to wait for the CNI DS rollout to have all pods available before reporting Progressing=False
Add a fudge-factor to the interval in the monitor test

Preference is for the former.

mdbooth · 2026-06-20T16:25:43Z

/test e2e-vsphere-ovn e2e-aws-ovn-fips

mdbooth · 2026-06-20T16:30:33Z

/testwith openshift/origin/main/e2e-gcp-ovn-upgrade openshift/cluster-network-operator#3034

A bug in intervalStartDuring meant that it would always return true if the target had event had ever occurred, not just within a specified window.

openshift-ci · 2026-06-20T21:24:06Z

@mdbooth: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/unit	`e3a4331`	link	true	`/test unit`
ci/prow/e2e-gcp-ovn-upgrade	`c9c4c4d`	link	true	`/test e2e-gcp-ovn-upgrade`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 19, 2026

mdbooth mentioned this pull request Jun 19, 2026

TRT-2723: Ignore NodeReady=False during a CNI rollout #31319

Merged

openshift-ci Bot requested review from deads2k and p0lyn0mial June 19, 2026 21:47

coderabbitai Bot approved these changes Jun 19, 2026

View reviewed changes

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 19, 2026

mdbooth force-pushed the intervalStartDuring branch from d846c68 to 18a4f0a Compare June 20, 2026 06:53

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2026

mdbooth force-pushed the intervalStartDuring branch from 18a4f0a to c9c4c4d Compare June 20, 2026 11:41

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2026

mdbooth mentioned this pull request Jun 20, 2026

OCPBUGS-90541: Don't set Progressing=False until all pods available openshift/cluster-network-operator#3034

Open

Fix interval filtering bug in watchnodes tests

e3a4331

A bug in intervalStartDuring meant that it would always return true if the target had event had ever occurred, not just within a specified window.

mdbooth force-pushed the intervalStartDuring branch from c9c4c4d to e3a4331 Compare June 20, 2026 20:58

Conversation

mdbooth commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented Jun 19, 2026

Uh oh!

openshift-ci-robot commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented Jun 19, 2026

Uh oh!

mdbooth commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-merge-bot Bot commented Jun 20, 2026

Uh oh!

mdbooth commented Jun 20, 2026

Uh oh!

mdbooth commented Jun 20, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 20, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 20, 2026

Uh oh!

mdbooth commented Jun 20, 2026

Uh oh!

mdbooth commented Jun 20, 2026

Uh oh!

mdbooth commented Jun 20, 2026

Uh oh!

openshift-ci Bot commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mdbooth commented Jun 19, 2026 •

edited

Loading

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

mdbooth commented Jun 20, 2026 •

edited

Loading