Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 24 additions & 3 deletions pkg/monitortests/node/watchnodes/node.go
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ func startNodeMonitoring(ctx context.Context, m monitorapi.RecorderWriter, clien
// We want to fail the monitor test if a node goes not ready
// if it is unexpected.
// Unexpected in this case means that it went not ready outside
// of a MCO config update.
// of a MCO config update or CNI rollout.
func(node, oldNode *corev1.Node) []monitorapi.Interval {
var intervals []monitorapi.Interval

Expand All @@ -167,11 +167,16 @@ func startNodeMonitoring(ctx context.Context, m monitorapi.RecorderWriter, clien

now := time.Now()
if isOldNodeReady && !isNewNodeReady && isConfigTheSame && !isNodeUnscheduable {
msg := monitorapi.NewMessage().Reason(monitorapi.NodeUnexpectedReadyReason).
HumanMessage("unexpected node not ready")
// Extract the NodeReady condition message for downstream filtering
if c := findNodeCondition(node.Status.Conditions, corev1.NodeReady, 0); c != nil && c.Message != "" {
msg = msg.WithAnnotation(monitorapi.AnnotationCause, c.Message)
}
intervals = append(intervals,
monitorapi.NewInterval(monitorapi.SourceUnexpectedReady, monitorapi.Error).
Locator(monitorapi.NewLocator().NodeFromName(node.Name)).
Message(monitorapi.NewMessage().Reason(monitorapi.NodeUnexpectedReadyReason).
HumanMessage("unexpected node not ready")).
Message(msg).
Display().
Build(now, now))
}
Expand Down Expand Up @@ -370,6 +375,13 @@ func reportUnexpectedNodeDownFailures(intervals monitorapi.Intervals, targetedRe
return false
})

// Get all network ClusterOperator Progressing=True intervals
networkProgressingIntervals := intervals.Filter(func(eventInterval monitorapi.Interval) bool {
return eventInterval.Locator.Keys[monitorapi.LocatorClusterOperatorKey] == "network" &&
eventInterval.Message.Annotations[monitorapi.AnnotationCondition] == "Progressing" &&
eventInterval.Message.Annotations[monitorapi.AnnotationStatus] == "True"
})

// We need to build a map of node to machine name
nodeNameToMachineName := map[string]string{}
// Given the deleted machine, store the deleted intervals.
Expand All @@ -394,6 +406,15 @@ func reportUnexpectedNodeDownFailures(intervals monitorapi.Intervals, targetedRe
machineDeletingIntervals := machineNameToDeletePhases[machineNameForNode]

if !intervalStartDuring(unexpectedNodeUnready, machineDeletingIntervals) {
// Skip NotReady events caused by NetworkPluginNotReady during network operator rollout
// NetworkPluginNotReady is a RuntimeStatus reported by cri-o and exposed by kubelet in the condition's message.
conditionMsg := unexpectedNodeUnready.Message.Annotations[monitorapi.AnnotationCause]
if strings.Contains(conditionMsg, "NetworkPluginNotReady") {
if intervalStartDuring(unexpectedNodeUnready, networkProgressingIntervals) {
continue
Comment on lines +409 to +414

@coderabbitai coderabbitai Bot Jun 19, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Network rollout suppression is currently over-broad due to overlap helper semantics.

At Line 413, suppression depends on intervalStartDuring(...), but that helper currently returns true for any non-empty haystack (it uses >= from || <= to). This makes NetworkPluginNotReady events get skipped whenever any network Progressing=True interval exists, even when timestamps do not overlap.

💡 Proposed fix
# pkg/monitortests/node/watchnodes/monitortest.go
-       if needleStartEqualOrAfterFrom || needleStartEqualOrBeforeTo {
+       if needleStartEqualOrAfterFrom && needleStartEqualOrBeforeTo {
            return true
        }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/node/watchnodes/node.go` around lines 409 - 414, The issue
is that the `intervalStartDuring` helper function used at line 413 has overly
permissive comparison logic that returns true whenever any network Progressing
interval exists, regardless of actual timestamp overlap. The helper currently
uses an OR condition (>= from || <= to) which matches too broadly. Fix the
`intervalStartDuring` helper function to implement proper interval overlap logic
that correctly verifies the event's timestamp actually falls within the bounds
of the network progressing interval, rather than just checking if the timestamp
is on one side or the other of the interval.

@mdbooth mdbooth Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like code rabbit is correct here. However, this would be a latent bug in intervalStartDuring which is used by several other tests. It is possible that this bug is masking failures in these tests.

This PR is attempting to fix a current regression. I do not want to inadvertently trigger a new one by fixing this here. I will propose a follow-on.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdbooth, that's a sensible call — fixing the intervalStartDuring logic independently in a focused follow-up avoids unintended side-effects on the other tests that rely on it, and keeps this regression fix clean.

Would you like me to open a GitHub issue to track the intervalStartDuring bug fix? I can include a reference back to this PR and the specific OR→AND logic change needed.


🧠 Learnings used
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When reviewing Go code that uses OpenShift configv1 (`github.com/openshift/api/config/v1`), note that `ImageSpec.RegistrySources` is declared as a non-pointer value (`RegistrySources RegistrySources`). It is always safely zero-initialized by Go, so it can never be `nil`. Therefore, do not add nil-guards for `imageSpec.RegistrySources` before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`).

Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When using the OpenShift `configv1` API (`github.com/openshift/api/config/v1`), treat `ImageSpec.RegistrySources` as a non-nil Go struct value (`RegistrySources`, not `*RegistrySources`). Because it can never be nil (it’s always zero-initialized), don’t add nil-guards before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`). You may still need to handle zero-value contents, but a nil check on `RegistrySources` itself is unnecessary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed fix: #31320

}
}

failures = append(failures, fmt.Sprintf("%v - %v at from: %v - to: %v", unexpectedNodeUnready.Locator.OldLocator(), unexpectedNodeUnready.Message.OldMessage(), unexpectedNodeUnready.From, unexpectedNodeUnready.To))
}
}
Expand Down
126 changes: 126 additions & 0 deletions pkg/monitortests/node/watchnodes/node_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -321,6 +321,132 @@ func TestReportUnexpectedNodeDownFailures(t *testing.T) {
expected: []string{},
unexpectedReason: monitorapi.NodeUnexpectedUnreachableReason,
},
{
name: "node unexpected ready caused by NetworkPluginNotReady during network operator rollout",
rawIntervals: monitorapi.Intervals{
// The UnexpectedNotReady interval with NetworkPluginNotReady in AnnotationCause
{
Condition: monitorapi.Condition{
Level: monitorapi.Error,
Locator: monitorapi.Locator{
Type: monitorapi.LocatorTypeNode,
Keys: map[monitorapi.LocatorKey]string{
"node": "node1",
},
},
Message: monitorapi.Message{
Reason: monitorapi.NodeUnexpectedReadyReason,
HumanMessage: "unexpected node not ready",
Annotations: map[monitorapi.AnnotationKey]string{
monitorapi.AnnotationReason: "UnexpectedNotReady",
monitorapi.AnnotationCause: "container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?",
},
},
},
From: utility.SystemdJournalLogTime("Nov 11 19:46:00", 2024),
To: utility.SystemdJournalLogTime("Nov 11 19:46:00", 2024),
},
// The network CO Progressing=True interval that brackets it
{
Condition: monitorapi.Condition{
Level: monitorapi.Warning,
Locator: monitorapi.Locator{
Type: monitorapi.LocatorTypeClusterOperator,
Keys: map[monitorapi.LocatorKey]string{
monitorapi.LocatorClusterOperatorKey: "network",
},
},
Message: monitorapi.Message{
HumanMessage: "Progressing because OVN is updating",
Annotations: map[monitorapi.AnnotationKey]string{
monitorapi.AnnotationCondition: "Progressing",
monitorapi.AnnotationStatus: "True",
},
},
},
From: utility.SystemdJournalLogTime("Nov 11 19:45:00", 2024),
To: utility.SystemdJournalLogTime("Nov 11 19:47:00", 2024),
},
},
expected: []string{},
unexpectedReason: monitorapi.NodeUnexpectedReadyReason,
},
{
name: "node unexpected ready caused by NetworkPluginNotReady WITHOUT network operator rollout",
rawIntervals: monitorapi.Intervals{
{
Condition: monitorapi.Condition{
Level: monitorapi.Error,
Locator: monitorapi.Locator{
Type: monitorapi.LocatorTypeNode,
Keys: map[monitorapi.LocatorKey]string{
"node": "node1",
},
},
Message: monitorapi.Message{
Reason: monitorapi.NodeUnexpectedReadyReason,
HumanMessage: "unexpected node not ready",
Annotations: map[monitorapi.AnnotationKey]string{
monitorapi.AnnotationReason: "UnexpectedNotReady",
monitorapi.AnnotationCause: "container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?",
},
},
},
From: utility.SystemdJournalLogTime("Nov 11 19:46:00", 2024),
To: utility.SystemdJournalLogTime("Nov 11 19:46:00", 2024),
},
},
expected: []string{"node/node1 - cause/container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: no CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? reason/UnexpectedNotReady unexpected node not ready at from: 2024-11-11 19:46:00 +0000 UTC - to: 2024-11-11 19:46:00 +0000 UTC"},
unexpectedReason: monitorapi.NodeUnexpectedReadyReason,
},
{
name: "node unexpected ready NOT caused by NetworkPluginNotReady during network operator rollout",
rawIntervals: monitorapi.Intervals{
{
Condition: monitorapi.Condition{
Level: monitorapi.Error,
Locator: monitorapi.Locator{
Type: monitorapi.LocatorTypeNode,
Keys: map[monitorapi.LocatorKey]string{
"node": "node1",
},
},
Message: monitorapi.Message{
Reason: monitorapi.NodeUnexpectedReadyReason,
HumanMessage: "unexpected node not ready",
Annotations: map[monitorapi.AnnotationKey]string{
monitorapi.AnnotationReason: "UnexpectedNotReady",
monitorapi.AnnotationCause: "kubelet stopped posting node status",
},
},
},
From: utility.SystemdJournalLogTime("Nov 11 19:46:00", 2024),
To: utility.SystemdJournalLogTime("Nov 11 19:46:00", 2024),
},
{
Condition: monitorapi.Condition{
Level: monitorapi.Warning,
Locator: monitorapi.Locator{
Type: monitorapi.LocatorTypeClusterOperator,
Keys: map[monitorapi.LocatorKey]string{
monitorapi.LocatorClusterOperatorKey: "network",
},
},
Message: monitorapi.Message{
HumanMessage: "Progressing because OVN is updating",
Annotations: map[monitorapi.AnnotationKey]string{
monitorapi.AnnotationCondition: "Progressing",
monitorapi.AnnotationStatus: "True",
},
},
},
From: utility.SystemdJournalLogTime("Nov 11 19:45:00", 2024),
To: utility.SystemdJournalLogTime("Nov 11 19:47:00", 2024),
},
},
expected: []string{"node/node1 - cause/kubelet stopped posting node status reason/UnexpectedNotReady unexpected node not ready at from: 2024-11-11 19:46:00 +0000 UTC - to: 2024-11-11 19:46:00 +0000 UTC"},
unexpectedReason: monitorapi.NodeUnexpectedReadyReason,
},
}

for _, tc := range testCases {
Expand Down