Skip to content

NO-JIRA: Fix interval filtering bug in watchnodes tests#31320

Open
mdbooth wants to merge 1 commit into
openshift:mainfrom
mdbooth:intervalStartDuring
Open

NO-JIRA: Fix interval filtering bug in watchnodes tests#31320
mdbooth wants to merge 1 commit into
openshift:mainfrom
mdbooth:intervalStartDuring

Conversation

@mdbooth

@mdbooth mdbooth commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

A bug in intervalStartDuring meant that it would always return true if the target had event had ever occurred, not just within a specified window.

The test adds a 30s grace period to the window where the network ClusterOperator reports Progressing=True. This is for 2 reasons:

Firstly, CNO reports Progressing=False before the last pod becomes available, so that Node will always be NodeReady=False outside the target window. openshift/cluster-network-operator#3034 is a proposed fix for this.

Secondly, even if the CNO fix is applied, without any grace period this test would be likely to flake under load if kubelet does not update NodeReady fast enough.

Summary by CodeRabbit

  • Bug Fixes

    • Improved accuracy of node “unexpected down” failure detection by tightening the time-interval overlap logic used to decide whether events fall within the same window.
  • Tests

    • Added timing-focused scenarios to verify unexpected node failures are not suppressed when the machine enters the deleted phase before vs. after the node failure interval, including UnexpectedReady and UnexpectedUnreachable cases.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 19, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mdbooth: This pull request explicitly references no jira issue.

Details

In response to this:

A bug in intervalStartDuring meant that it would always return true if the target had event had ever occurred, not just within a specified window.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Walkthrough

The intervalStartDuring predicate is refactored to accept a configurable grace duration and enforces stricter AND-combined bounds for interval containment. Machine-deletion overlap checks use a 0s grace period, while NetworkPluginNotReady skip logic applies a 30s grace window. Four test cases validate suppression behavior at grace-window boundaries.

Changes

Interval Overlap Predicate Fix with Grace Period Parametrization

Layer / File(s) Summary
intervalStartDuring predicate refactoring
pkg/monitortests/node/watchnodes/monitortest.go
The function now accepts a grace duration parameter and changes the overlap check from OR-combined bounds to AND-combined bounds requiring needle.From to satisfy both lower and upper constraints: curr.Fromneedle.Fromcurr.To + grace.
Grace period application to failure detection
pkg/monitortests/node/watchnodes/node.go
Machine-deletion overlap check uses 0 grace duration; NetworkPluginNotReady skip logic applies 30*time.Second grace, extending the suppression window relative to network operator progressing intervals.
Boundary and grace-window test coverage
pkg/monitortests/node/watchnodes/node_test.go
Adds four table-driven cases validating that NetworkPluginNotReady failures are suppressed within the grace window but emitted beyond it, and that node failures occurring before machine deletion starts or after it completes are properly reported.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • openshift/origin#31319: Both PRs modify interval timing logic for suppressing node-down failures during network operator rollout; this PR tightens the interval containment predicate and introduces grace-period parametrization, while the related PR adds the AnnotationCause-based NetworkPluginNotReady exclusion trigger.

Suggested reviewers

  • p0lyn0mial
  • sjenning
🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing a bug in the intervalStartDuring function used in watchnodes tests.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR contains no Ginkgo tests. Modified files use standard Go testing with table-driven tests (t.Run), not Ginkgo (It/Describe/Context). Test names are static, descriptive strings without dynamic...
Test Structure And Quality ✅ Passed Test cases follow single responsibility, have no setup/cleanup needs, include meaningful assertion messages, and are consistent with codebase patterns. Pure function tests with static data need no...
Microshift Test Compatibility ✅ Passed The custom check requires assessment of new Ginkgo e2e tests. This PR adds test cases to existing *testing.T unit tests only; no Ginkgo e2e tests (It(), Describe(), etc.) are present.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The pull request adds test cases only to Go unit tests (TestReportUnexpectedNodeDownFailures in node_test.go), not to Ginkgo e2e tests. The code uses standard Go testing.T imports with no Ginkgo pa...
Topology-Aware Scheduling Compatibility ✅ Passed Changes are to monitoring test utilities (monitortest.go, node.go, node_test.go), not deployment manifests, operator code, or controllers. No scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed No OTE Binary Stdout Contract violations detected. The modified code contains no process-level functions, no direct stdout writes, no klog writes, and all fmt.Sprintf usages properly assign results...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR modifies standard Go unit tests (testing.T), not Ginkgo e2e tests. Tests contain no IPv4 assumptions, hardcoded IP addresses, or external connectivity requirements.
No-Weak-Crypto ✅ Passed No weak cryptography, custom crypto implementations, or non-constant-time secret comparisons detected. Code uses standard time-based interval logic only.
Container-Privileges ✅ Passed This PR modifies only Go test source files with no container/K8s manifests. The container-privileges check is inapplicable to pure Go test code.
No-Sensitive-Data-In-Logs ✅ Passed PR modifies monitoring test code to log Kubernetes diagnostic information (node status, condition messages, timestamps) which are not sensitive data. No passwords, tokens, API keys, PII, or custome...
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mdbooth
Once this PR has been reviewed and has the lgtm label, please assign jogeo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from deads2k and p0lyn0mial June 19, 2026 21:47
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 19, 2026
@mdbooth mdbooth force-pushed the intervalStartDuring branch from d846c68 to 18a4f0a Compare June 20, 2026 06:53
@mdbooth

mdbooth commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

Unit test flaked because it couldn't reach google.com.

/test unit

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@mdbooth

mdbooth commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

Confirmed e2e-gcp-ovn-upgrade failed due to #31319

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2026
@mdbooth mdbooth force-pushed the intervalStartDuring branch from 18a4f0a to c9c4c4d Compare June 20, 2026 11:41
@mdbooth

mdbooth commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

/pipeline auto

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

The /pipeline auto command is only available for LGTM-mode repositories. For repositories in automatic mode, second-stage tests are already triggered automatically.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@mdbooth

mdbooth commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

e2e-gcp-ovn-upgrade failed due to a timing issue: the network CO set Progressing=False 8 seconds before the last node became NodeReady=false. This looks like a bug in CNO.

Possible fixes:

  • Fix CNO to wait for the CNI DS rollout to have all pods available before reporting Progressing=False
  • Add a fudge-factor to the interval in the monitor test

Preference is for the former.

@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2026
@mdbooth

mdbooth commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

/test e2e-vsphere-ovn e2e-aws-ovn-fips

@mdbooth

mdbooth commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

/testwith openshift/origin/main/e2e-gcp-ovn-upgrade openshift/cluster-network-operator#3034

A bug in intervalStartDuring meant that it would always return true if
the target had event had ever occurred, not just within a specified
window.
@mdbooth mdbooth force-pushed the intervalStartDuring branch from c9c4c4d to e3a4331 Compare June 20, 2026 20:58
@openshift-ci

openshift-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

@mdbooth: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/unit e3a4331 link true /test unit
ci/prow/e2e-gcp-ovn-upgrade c9c4c4d link true /test e2e-gcp-ovn-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants