
OCPBUGS-79544: Add [Late] test to collect CRI-O goroutine dumps via SIGUSR1#31013

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from bitoku:crio-goroutine-dump-late-test
Apr 27, 2026

Conversation

@bitoku
Contributor

@bitoku bitoku commented Apr 15, 2026

This test runs after all other tests to capture CRI-O's goroutine state on every node, providing visibility into stuck operations after the suite has exercised CRI-O under load.

The test sends SIGUSR1 to CRI-O on each node, reads the resulting goroutine dump from /tmp/crio-goroutine-stacks-*.log, and fails hard if any goroutine is stuck in an image pull (IO wait > 30 min in docker.(*bodyReader).Read -> net.(*conn).Read).

Summary by CodeRabbit

  • Tests
    • Added a node-level diagnostic that collects container runtime goroutine dumps from each node, verifies a dump is produced, and fails if dumps are missing or show stuck image-pull goroutines with prolonged I/O wait.
    • Added unit tests for the goroutine-dump parser to validate detection of stuck image-pull stacks and edge cases (various durations, missing frames, mixed dumps).

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 15, 2026
@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from dgrisonnet and rphillips April 15, 2026 13:27
@coderabbitai

coderabbitai Bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a Ginkgo node test and helper that trigger CRI‑O to write goroutine dumps on each cluster node, collect those dumps (handling PrivateTmp), parse them for IO‑wait goroutines >30 minutes that show net.(*conn).Read followed by docker.(*bodyReader).Read, and fail if dumps are missing or stuck image‑pull goroutines are found.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CRI-O Goroutine Dump Test & Parser: test/extended/node/crio_goroutinedump.go | New Ginkgo [sig-node][Late] test that enumerates nodes, finds CRI‑O PID, sends SIGUSR1, waits for /tmp/crio-goroutine-stacks-*.log (handles PrivateTmp via /proc/<pid>/root/tmp), returns sentinel strings on failure, and aggregates per-node results. Adds findStuckImagePulls(dump string) []string to split dumps into goroutine blocks, parse IO wait minutes, filter minutes > 30, and detect blocks where net.(*conn).Read appears before docker.(*bodyReader).Read. |
| Unit tests for parser: test/extended/node/crio_goroutinedump_test.go | New table-driven Go test TestFindStuckImagePulls with multiple goroutine-dump fixtures covering stuck and non-stuck cases (including versioned containers/image stacks, boundary 30m case, missing frames, mixed dumps, and empty dump) asserting expected match counts. |
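
The parsing steps summarized above (split into goroutine blocks, read the IO wait minutes, keep blocks over 30 minutes, then check frame order) can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the regular expression `ioWaitHeaderRe`, the constant `stuckThresholdMinutes`, and the assumption that blocks are separated by blank lines are all guesses based on the standard Go goroutine dump layout. Only the function name and signature come from the summary.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// ioWaitHeaderRe is an assumed pattern for headers such as
// "goroutine 42 [IO wait, 45 minutes]:". The PR's real regexp is not shown.
var ioWaitHeaderRe = regexp.MustCompile(`^goroutine \d+ \[IO wait, (\d+) minutes\]:`)

// stuckThresholdMinutes mirrors the "> 30 min" rule from the PR description.
const stuckThresholdMinutes = 30

// findStuckImagePulls returns the goroutine blocks that have been in IO wait
// for more than the threshold and whose stack shows net.(*conn).Read
// appearing before docker.(*bodyReader).Read.
func findStuckImagePulls(dump string) []string {
	var stuck []string
	// Assumption: goroutine blocks are separated by one blank line.
	for _, block := range strings.Split(dump, "\n\n") {
		block = strings.TrimSpace(block)
		m := ioWaitHeaderRe.FindStringSubmatch(block)
		if m == nil {
			continue // not an IO-wait goroutine with a duration
		}
		minutes, err := strconv.Atoi(m[1])
		if err != nil || minutes <= stuckThresholdMinutes {
			continue
		}
		connIdx := strings.Index(block, "net.(*conn).Read")
		if connIdx < 0 {
			continue
		}
		// The inner frame is printed first in a Go stack, so the
		// bodyReader frame must appear after the conn frame.
		if strings.Contains(block[connIdx:], "docker.(*bodyReader).Read") {
			stuck = append(stuck, block)
		}
	}
	return stuck
}

func main() {
	dump := "goroutine 1 [IO wait, 45 minutes]:\n" +
		"net.(*conn).Read(0x0)\n" +
		"github.com/containers/image/v5/docker.(*bodyReader).Read(0x0)\n" +
		"\n" +
		"goroutine 2 [IO wait, 30 minutes]:\n" +
		"net.(*conn).Read(0x0)\n" +
		"github.com/containers/image/v5/docker.(*bodyReader).Read(0x0)\n"
	fmt.Println(len(findStuckImagePulls(dump))) // prints 1
}
```

The second goroutine sits exactly at the 30-minute boundary and is not reported, which matches the "boundary 30m case" listed among the unit-test fixtures.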

Sequence Diagram

sequenceDiagram
    participant TestRunner as Test Runner
    participant Node as Cluster Node
    participant CRIO as CRI‑O Process
    participant Parser as Dump Parser

    TestRunner->>Node: run shell to find CRI‑O PID
    Node-->>TestRunner: PID or CRIO_NOT_FOUND
    TestRunner->>CRIO: send SIGUSR1 to PID
    activate CRIO
    CRIO->>CRIO: write /tmp/crio-goroutine-stacks-*.log
    CRIO-->>TestRunner: dump file created (or timeout)
    deactivate CRIO
    TestRunner->>Node: fetch newest dump file (via /proc/<pid>/root/tmp)
    Node-->>TestRunner: goroutine dump content or DUMP_TIMEOUT
    TestRunner->>Parser: call findStuckImagePulls(dump)
    Parser->>Parser: split into goroutine blocks
    Parser->>Parser: extract IO wait minutes, filter > 30
    Parser->>Parser: match stack pattern (net.(*conn).Read -> docker.(*bodyReader).Read)
    Parser-->>TestRunner: list of stuck goroutines
    alt stuck goroutines found
        TestRunner->>TestRunner: fail test and emit dumps
    else no stuck goroutines
        TestRunner->>TestRunner: pass
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Test lacks meaningful assertion error messages. While it has single responsibility and appropriate timeouts, assertions at lines 146, 152, and 155-162 use bare framework.ExpectError/ExpectNoError without contextual messages explaining which operations failed or why. Add descriptive failure messages to all assertions explaining what operation was attempted and what failed, such as 'failed to collect goroutine dumps from node: ' + nodeName for context.
Ote Binary Stdout Contract ❓ Inconclusive Code files for the PR are not materialized in the repository, preventing direct inspection for OTE Binary Stdout Contract violations such as unredirected fmt.Print or klog writes at process level. Provide access to the actual code files crio_goroutinedump.go and crio_goroutinedump_test.go to verify no fmt.Print/Printf/Println or unredirected klog writes exist at process level outside of test It() blocks.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Stable And Deterministic Test Names ✅ Passed The Ginkgo test uses static string literals for test names with no dynamic elements like timestamps, UUIDs, or format specifiers.
Microshift Test Compatibility ✅ Passed The test uses only standard Kubernetes Node API and debug utilities with default namespace, contains no OpenShift-specific APIs, and demonstrates MicroShift awareness by handling PrivateTmp behavior.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Test iterates over all nodes and verifies CRI-O functionality with assertion len(nodes)>0, which passes on SNO without multi-node assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds only diagnostic test code that collects CRI-O goroutine dumps. No deployment manifests, operators, controllers, or workload scheduling constraints are introduced.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed New Ginkgo test uses cluster-internal APIs for node enumeration and local node operations with no IPv4-specific assumptions, hardcoded IPs, or external connectivity requirements.
Title check ✅ Passed The title accurately describes the main change: adding a [Late] test that collects CRI-O goroutine dumps via SIGUSR1 signal. This matches the core functionality implemented in the changeset.

Comment @coderabbitai help to get the list of available commands and usage tips.

@bitoku
Contributor Author

bitoku commented Apr 15, 2026

/test


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 84-88: The shellCmd currently grabs the newest
/tmp/crio-goroutine-stacks-*.log which can be a stale file or empty if none
exists; update the logic in the shellCmd string in crio_goroutinedump.go to
record the list (or latest mtime) of existing matching files before sending kill
-USR1, then after signalling poll/wait (with timeout) for a new file whose mtime
is greater than the pre-signal timestamp (or not in the pre-signal list), and
only then cat that new file; ensure the command returns a non-zero exit and
clear error message if no new dump appears within the timeout to avoid silent
bad reads.
- Around line 90-124: The loop currently treats missing CRI-O dumps as warnings
and continues, allowing partial node coverage to pass; modify the logic in the
loop that calls exutil.DebugNodeRetryWithOptionsAndChroot (the block handling
"CRIO_NOT_FOUND" and errors) to instead record failures (e.g., collect node
names into a missingDumps slice) and after the loop assert that missingDumps is
empty (failing the test with a clear message), or alternatively pre-filter the
nodes slice to only include nodes expected to run CRI-O before invoking
exutil.DebugNodeRetryWithOptionsAndChroot; update references to nodes, dumps,
stuckPulls, goroutineHeaderRe and findStuckImagePulls accordingly so
missing/filtered nodes are handled deterministically and the test fails if any
expected node did not produce a dump.
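
The first finding (do not read a stale dump; wait for a file that did not exist before the signal) is implemented in the PR as a shell command, which is not reproduced here. The same idea can be sketched in Go, assuming hypothetical helpers `snapshot` and `waitForNewFile` and a hypothetical error value `errNoNewDump`: record the matching files first, then poll until a path outside that set appears or the timeout expires.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// errNoNewDump is a hypothetical sentinel for "nothing new showed up in time".
var errNoNewDump = errors.New("no new goroutine dump appeared before the timeout")

// snapshot records which files match the pattern before the signal is sent.
func snapshot(pattern string) (map[string]struct{}, error) {
	paths, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]struct{}, len(paths))
	for _, p := range paths {
		seen[p] = struct{}{}
	}
	return seen, nil
}

// waitForNewFile polls until a matching file that is not in the pre-signal
// snapshot appears, and returns an error instead of a stale path on timeout.
func waitForNewFile(pattern string, before map[string]struct{}, timeout, interval time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		paths, err := filepath.Glob(pattern)
		if err != nil {
			return "", err
		}
		for _, p := range paths {
			if _, seen := before[p]; !seen {
				return p, nil
			}
		}
		if !time.Now().Before(deadline) {
			return "", errNoNewDump
		}
		time.Sleep(interval)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "dump-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	pattern := filepath.Join(dir, "crio-goroutine-stacks-*.log")

	// A stale dump left over from an earlier run.
	if err := os.WriteFile(filepath.Join(dir, "crio-goroutine-stacks-old.log"), []byte("stale"), 0o600); err != nil {
		panic(err)
	}
	before, err := snapshot(pattern)
	if err != nil {
		panic(err)
	}

	// Stand-in for "kill -USR1 <pid>": the process writes a fresh dump.
	if err := os.WriteFile(filepath.Join(dir, "crio-goroutine-stacks-new.log"), []byte("fresh"), 0o600); err != nil {
		panic(err)
	}

	got, err := waitForNewFile(pattern, before, time.Second, 10*time.Millisecond)
	fmt.Println(filepath.Base(got), err)
}
```

Comparing against a pre-signal list avoids relying on file modification times, at the cost of one extra directory listing.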
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f0f2719c-5fb3-41e3-b81a-83372c196fdc

📥 Commits

Reviewing files that changed from the base of the PR and between 9df27cd and 8c04a45.

📒 Files selected for processing (1)
  • test/extended/node/crio_goroutinedump.go

Comment thread test/extended/node/crio_goroutinedump.go Outdated
Comment thread test/extended/node/crio_goroutinedump.go Outdated
@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from 8c04a45 to 1cc9cb0 Compare April 15, 2026 14:05
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-trt

openshift-trt Bot commented Apr 15, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 1cc9cb0

Job Name New Test Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit

New tests seen in this PR at sha: 1cc9cb0

  • "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" [Total: 9, Pass: 7, Fail: 2, Flake: 0]

@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from 1cc9cb0 to 0418318 Compare April 16, 2026 00:29

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 86-93: The MARKER temporary file created by MARKER=$(mktemp) can
leak if the script exits early (e.g., kill fails); add a shell trap immediately
after creating MARKER to ensure it's removed on all exit paths (trap 'rm -f
"$MARKER"' EXIT) and optionally on INT and TERM, then proceed with the existing
kill -USR1 $CRIO_PID and wait loop; update the block containing MARKER, kill
-USR1 $CRIO_PID and the find/seq loop to rely on that trap so the marker is
always cleaned up even on early failure.
- Around line 143-147: The current block unconditionally calls g.Fail when dumps
are present which causes successful runs to fail; modify the CRI-O goroutine
dump handling in the section that references dumps and g.Fail so that dumps are
written to GinkgoWriter (or test logs/artifacts) instead of failing the test,
and only call g.Fail if there are actual fault conditions such as failedNodes or
stuckPulls being non-empty; update the fmt.Sprintf/g.Fail call to use
GinkgoWriter.Printf (or similar) to emit the joined dumps and add a conditional
that preserves failing behavior only when failedNodes or stuckPulls indicate a
real error.
- Around line 123-125: The test currently uses o.Expect(...).To(...) inside the
node loop which aborts the spec on the first malformed CRI-O dump; instead,
introduce a failedNodes collection (e.g., []string failedNodes) declared before
iterating nodes, replace the immediate
o.Expect(goroutineHeaderRe.MatchString(output)).To(...) check inside the loop
with an if !goroutineHeaderRe.MatchString(output) branch that appends a
descriptive failure message (including node.Name and output) to failedNodes and
continues, and after the node loop add a single assertion (or
o.Expect(len(failedNodes)).To(o.Equal(0))) that fails the spec if failedNodes is
non-empty and includes the aggregated failedNodes report in the failure message.
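
The second and third findings share one idea: classify every node result first, then decide once after the loop whether the spec fails. A rough, framework-free sketch is below. The `nodeResult` fields, the `failedNodes` name, and the `CRIO_NOT_FOUND` / `DUMP_TIMEOUT` sentinels come from the review text; the `classify` helper and the exact `goroutineHeaderRe` pattern are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// nodeResult mirrors the per-node struct referenced in the review.
type nodeResult struct {
	name   string
	output string
	err    error
}

// goroutineHeaderRe is an assumed pattern; the PR's actual regexp is not shown.
var goroutineHeaderRe = regexp.MustCompile(`(?m)^goroutine \d+ \[`)

// classify walks all results without aborting early. Healthy dumps are
// collected for logging; every problem becomes one failedNodes entry.
func classify(results []nodeResult) (dumps, failedNodes []string) {
	for _, r := range results {
		switch {
		case r.err != nil:
			failedNodes = append(failedNodes, fmt.Sprintf("%s: debug command failed: %v", r.name, r.err))
		case strings.Contains(r.output, "CRIO_NOT_FOUND"):
			failedNodes = append(failedNodes, fmt.Sprintf("%s: CRI-O process not found", r.name))
		case strings.Contains(r.output, "DUMP_TIMEOUT"):
			failedNodes = append(failedNodes, fmt.Sprintf("%s: no dump file appeared", r.name))
		case !goroutineHeaderRe.MatchString(r.output):
			failedNodes = append(failedNodes, fmt.Sprintf("%s: output has no goroutine stacks (len=%d)", r.name, len(r.output)))
		default:
			dumps = append(dumps, fmt.Sprintf("=== %s ===\n%s", r.name, r.output))
		}
	}
	return dumps, failedNodes
}

func main() {
	results := []nodeResult{
		{name: "node-a", output: "goroutine 1 [running]:\nmain.main()\n"},
		{name: "node-b", err: errors.New("debug pod timed out")},
		{name: "node-c", output: "CRIO_NOT_FOUND"},
	}
	dumps, failedNodes := classify(results)

	// Dumps are only logged; in the real test this would go to GinkgoWriter.
	fmt.Println(strings.Join(dumps, "\n\n"))

	// A single decision point after the loop; the real test would fail the spec here.
	if len(failedNodes) > 0 {
		fmt.Printf("would fail: %d node(s) had problems:\n%s\n", len(failedNodes), strings.Join(failedNodes, "\n"))
	}
}
```

In the actual spec, the stuck-pull check would presumably run on each entry that reaches the default branch, and the final condition would also consider `stuckPulls`.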
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 365ad2c5-fee6-4164-bdd6-d5f0aabddbc0

📥 Commits

Reviewing files that changed from the base of the PR and between 1cc9cb0 and 0418318.

📒 Files selected for processing (1)
  • test/extended/node/crio_goroutinedump.go

Comment thread test/extended/node/crio_goroutinedump.go Outdated
Comment thread test/extended/node/crio_goroutinedump.go Outdated
Comment thread test/extended/node/crio_goroutinedump.go Outdated
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-trt

openshift-trt Bot commented Apr 16, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 0418318

Job Name New Test Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-fips High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-gcp-ovn High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-vsphere-ovn High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit

New tests seen in this PR at sha: 0418318

  • "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" [Total: 9, Pass: 0, Fail: 9, Flake: 0]

@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from 0418318 to f1853da Compare April 16, 2026 08:37

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
test/extended/node/crio_goroutinedump.go (2)

178-182: ⚠️ Potential issue | 🟠 Major

Do not force-fail healthy runs just to emit dumps.

Line 180 makes the spec fail whenever dumps exist, so a healthy run can never pass. Emit dumps to GinkgoWriter and keep failure paths tied to real faults (failedNodes / stuckPulls).

Suggested change
-			// Always fail so goroutine dumps are visible in test results.
-			// The test is marked Informing so this won't block merges.
-			if len(dumps) > 0 {
-				g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-			}
+			// Emit dumps for visibility without forcing failure on success paths.
+			if len(dumps) > 0 {
+				g.By("CRI-O goroutine dumps collected")
+				_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 178 - 182, The test
currently calls g.Fail when any CRI-O goroutine dumps exist (variable dumps),
forcing healthy runs to fail; change this so dumps are written to GinkgoWriter
instead and only trigger g.Fail when real failure conditions remain (e.g.,
failedNodes or stuckPulls). Locate the block using dumps and g.Fail in
crio_goroutinedump.go, replace the unconditional g.Fail(fmt.Sprintf(...)) with
code that prints the joined dumps to GinkgoWriter (with context header) and
leaves existing failure logic that checks failedNodes and stuckPulls intact so
only those conditions cause test failures.

158-160: ⚠️ Potential issue | 🟠 Major

Avoid failing mid-loop; aggregate malformed dump nodes and fail once.

Line 158 aborts on the first bad dump format, so you lose complete per-node diagnostics. Append this as a node failure and continue, then assert once after the loop.

Suggested change
-				o.Expect(goroutineHeaderRe.MatchString(r.output)).To(o.BeTrue(),
-					"expected goroutine stacks in CRI-O dump from node %s, output length=%d, got:\n%s", r.name, len(r.output), r.output)
+				if !goroutineHeaderRe.MatchString(r.output) {
+					failedNodes = append(failedNodes,
+						fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, output length=%d, got:\n%s", r.name, len(r.output), r.output))
+					continue
+				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 158 - 160, The test
currently aborts on the first malformed per-node CRI-O dump by calling
o.Expect(goroutineHeaderRe.MatchString(r.output)) inside the loop; instead,
collect failures and assert once after the loop. Create a slice (e.g.,
malformedNodes []string) before iterating, replace the inline o.Expect check in
the loop with an if !goroutineHeaderRe.MatchString(r.output) { malformedNodes =
append(malformedNodes, fmt.Sprintf("node %s: output length=%d\n%s", r.name,
len(r.output), r.output)); continue } and after the loop assert that
malformedNodes is empty (e.g., o.Expect(len(malformedNodes)).To(o.Equal(0),
"malformed CRI-O dumps:\n%s", strings.Join(malformedNodes, "\n"))), so you
accumulate diagnostics for all nodes rather than failing mid-loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 115-130: The test spawns a goroutine per node concurrently which
can overload the cluster/APIs; limit concurrent debug sessions by adding a
concurrency limiter (e.g., a buffered semaphore channel or worker pool) around
the loop so only a fixed number (configurable, e.g., 5) of goroutines run at
once; acquire the semaphore before launching the goroutine that calls
exutil.DebugNodeRetryWithOptionsAndChroot and release it (close/receive) in the
goroutine's defer along with wg.Done(), ensuring results[i] still gets set and
g.GinkgoRecover() remains called; reference the existing loop variables (nodes,
i, node.Name), wg, results, nodeResult, shellCmd and
exutil.DebugNodeRetryWithOptionsAndChroot when implementing the limiter.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6a35fe37-1646-4179-ab57-d6e15aa73f26

📥 Commits

Reviewing files that changed from the base of the PR and between 0418318 and f1853da.

📒 Files selected for processing (1)
  • test/extended/node/crio_goroutinedump.go

Comment on lines +115 to +130
for i, node := range nodes {
	wg.Add(1)
	go func(i int, nodeName string) {
		defer g.GinkgoRecover()
		defer wg.Done()

		g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))

		output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
			oc, nodeName, "default",
			"sh", "-c", shellCmd,
		)
		results[i] = nodeResult{name: nodeName, output: output, err: err}
	}(i, node.Name)
}
wg.Wait()

⚠️ Potential issue | 🟠 Major

Bound parallel debug sessions to avoid cluster/API overload.

Line 115 currently spawns one debug flow per node at once. On larger clusters this can create a burst of debug pods and induce throttling/timeouts, making this test flaky.

Suggested change
 			results := make([]nodeResult, len(nodes))
 			var wg sync.WaitGroup
+			sem := make(chan struct{}, 10) // tune as needed
 			for i, node := range nodes {
 				wg.Add(1)
 				go func(i int, nodeName string) {
 					defer g.GinkgoRecover()
 					defer wg.Done()
+					sem <- struct{}{}
+					defer func() { <-sem }()
 
 					g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 115 - 130, The test
spawns a goroutine per node concurrently which can overload the cluster/APIs;
limit concurrent debug sessions by adding a concurrency limiter (e.g., a
buffered semaphore channel or worker pool) around the loop so only a fixed
number (configurable, e.g., 5) of goroutines run at once; acquire the semaphore
before launching the goroutine that calls
exutil.DebugNodeRetryWithOptionsAndChroot and release it (close/receive) in the
goroutine's defer along with wg.Done(), ensuring results[i] still gets set and
g.GinkgoRecover() remains called; reference the existing loop variables (nodes,
i, node.Name), wg, results, nodeResult, shellCmd and
exutil.DebugNodeRetryWithOptionsAndChroot when implementing the limiter.
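
As a standalone illustration of the limiter described in this comment, the sketch below bounds concurrency with a buffered channel used as a semaphore. It acquires the slot before launching each goroutine, as the prompt text suggests, whereas the suggested diff acquires it inside the goroutine; both cap the number of running workers. The helper name `runBounded` and the `work` callback are hypothetical stand-ins for the per-node debug call, and in the real test `defer g.GinkgoRecover()` would stay in place inside each goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded calls work once per node with at most limit calls in flight.
// Results keep the input order because each goroutine writes only its own slot.
func runBounded(nodes []string, limit int, work func(node string) string) []string {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	results := make([]string, len(nodes))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, node := range nodes {
		sem <- struct{}{} // acquire a slot before launching
		wg.Add(1)
		go func(i int, node string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the work is done
			results[i] = work(node)
		}(i, node)
	}
	wg.Wait()
	return results
}

func main() {
	var inFlight, peak int32
	work := func(node string) string {
		n := atomic.AddInt32(&inFlight, 1)
		// Track the highest number of simultaneous workers observed.
		for {
			p := atomic.LoadInt32(&peak)
			if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
				break
			}
		}
		time.Sleep(10 * time.Millisecond) // stand-in for the debug session
		atomic.AddInt32(&inFlight, -1)
		return "dump from " + node
	}

	nodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e"}
	results := runBounded(nodes, 2, work)
	fmt.Println(results)
	fmt.Println("peak within limit:", atomic.LoadInt32(&peak) <= 2)
}
```

Acquiring before launch also bounds the number of goroutines that exist at once, which only matters on very large clusters; acquiring inside the goroutine keeps the loop itself non-blocking.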

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-trt

openshift-trt Bot commented Apr 16, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: f1853da

Job Name New Test Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-fips High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-gcp-ovn High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-vsphere-ovn High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi High - "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit

New tests seen in this PR at sha: f1853da

  • "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" [Total: 9, Pass: 0, Fail: 9, Flake: 0]

@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from f1853da to 5c996c2 Compare April 16, 2026 15:50
@bitoku bitoku changed the title WIP: OCPBUGS-79544: Add [Late] test to collect CRI-O goroutine dumps via SIGUSR1 OCPBUGS-79544: Add [Late] test to collect CRI-O goroutine dumps via SIGUSR1 Apr 16, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
test/extended/node/crio_goroutinedump.go (2)

149-150: Consider aggregating malformed output failures for completeness.

This o.Expect will abort on the first node that returns output lacking goroutine headers, preventing inspection of remaining nodes. While the code now properly aggregates CRIO_NOT_FOUND, DUMP_TIMEOUT, and generic errors, this edge case (valid output without goroutine pattern) could also be collected into failedNodes.

That said, this scenario is unlikely in practice since it would require the debug pod to succeed but return completely unexpected content. The current behavior may be acceptable as a fail-fast for fundamentally broken scenarios.

Optional: aggregate instead of fail-fast
-			o.Expect(goroutineHeaderRe.MatchString(r.output)).To(o.BeTrue(),
-				"expected goroutine stacks in CRI-O dump from node %s, output length=%d, got:\n%s", r.name, len(r.output), r.output)
+			if !goroutineHeaderRe.MatchString(r.output) {
+				failedNodes = append(failedNodes,
+					fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, got (len=%d):\n%s", r.name, len(r.output), r.output))
+				continue
+			}

70-72: Misplaced defer g.GinkgoRecover() has no effect here.

GinkgoRecover() is designed to capture panics in goroutines spawned during tests. At the Describe level, this defer executes when the setup function returns (during test registration), not during test execution. It won't catch panics in the It block.

If panic recovery is needed, place GinkgoRecover() inside goroutines within the test, or simply remove this line since the test runs serially without spawning goroutines.

Suggested fix
 var _ = g.Describe("[sig-node][Late]", func() {
-	defer g.GinkgoRecover()
 	oc := exutil.NewCLIWithoutNamespace("crio-goroutine-dump")
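
The timing problem described above does not depend on Ginkgo and can be reproduced with the standard library alone. The sketch below is only an illustration: `register` and `runAll` are hypothetical stand-ins for a Describe-style container and the spec runner, not Ginkgo APIs. The deferred `recover` in `register` has already run by the time the stored callback executes, so it never sees the panic.

```go
package main

import "fmt"

// registered holds callbacks to run later, the way a test framework stores
// spec bodies at registration time. (Hypothetical stand-in, not Ginkgo.)
var registered []func()

// register mimics a Describe-style container: its defer fires when register
// returns, long before any stored callback runs.
func register(body func()) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered during registration:", r)
		}
	}()
	registered = append(registered, body)
}

// runAll executes the stored callbacks and reports whether one of them
// panicked. The recover here is the one that actually observes the panic.
func runAll() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	for _, body := range registered {
		body()
	}
	return false
}

func main() {
	register(func() { panic("boom in spec body") })
	fmt.Println("panic reached the runner:", runAll())
}
```

The registration-time handler prints nothing here, which is why removing the `defer` from the Describe block changes no behavior.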
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/crio_goroutinedump.go` around lines 70 - 72, The deferred
call to g.GinkgoRecover() inside the Describe registration (the line with "defer
g.GinkgoRecover()") is ineffective and should be removed or relocated: delete
that defer from the Describe block (the g.Describe(...) closure) and, if you
need panic recovery for any goroutine spawned in the test, call
g.GinkgoRecover() inside those specific goroutines (or at the start of the
goroutine function) instead; reference the Describe declaration (g.Describe) and
the g.GinkgoRecover symbol when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4c85c6ad-489a-46ea-a97a-109acfe81cc9

📥 Commits

Reviewing files that changed from the base of the PR and between f1853da and 5c996c2.

📒 Files selected for processing (2)
  • test/extended/node/crio_goroutinedump.go
  • test/extended/node/crio_goroutinedump_test.go

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

defer g.GinkgoRecover()
oc := exutil.NewCLIWithoutNamespace("crio-goroutine-dump")

g.It("CRI-O should report goroutine stacks on all nodes",
Contributor

The main test should probably be named for the stuck image pulls rather than for the gathering. It can fail on gathering, but I think the test name should reflect the symptom more closely.

Contributor Author

thanks! updated.

@bitoku
Contributor Author

bitoku commented Apr 17, 2026

/retest

@bitoku
Contributor Author

bitoku commented Apr 17, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Apr 17, 2026
@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Apr 17, 2026
@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @asahay19

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested a review from asahay19 April 17, 2026 00:23
@openshift-trt

openshift-trt Bot commented Apr 17, 2026

Job Failure Risk Analysis for sha: 5c996c2

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi Low
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 14 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:upi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:vsphere Procedure:none SecurityMode:default Topology:ha Upgrade:none] in the last week.

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New tests seen in this PR at sha: 5c996c2

  • "[sig-node][Late] CRI-O should report goroutine stacks on all nodes [Suite:openshift/conformance/parallel]" [Total: 10, Pass: 10, Fail: 0, Flake: 0]

@bitoku bitoku force-pushed the crio-goroutine-dump-late-test branch from 5c996c2 to 0118328 Compare April 20, 2026 12:28
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

fi
sleep 1
done
echo "DUMP_TIMEOUT"; exit 1`
Contributor

My worry is that this is going to be brittle and will need reverting once it lands and turns out not to run on some of our more obscure configurations (hypershift, ipv6, etc.).

The intervals files show this single symptom. One example (don't click, it's huge): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.20-e2e-aws-ovn-runc-techpreview/2044941453007785984/artifacts/e2e-aws-ovn-runc-techpreview/openshift-e2e-test/artifacts/junit/e2e-events_20260417-014301.json

The query is:

jq '[.items[] | select(
      .source == "PodState" and
      .message.reason == "PodWasPending" and
      .message.humanMessage == "never completed"
    ) | {pod: .locator.keys.pod, namespace: .locator.keys.namespace, from, to}]' e2e-events_*.json

If that returns results, it appears to be our problem. It would be trivial to add a monitortest that scans the intervals we collect for this and fails a junit, and that should run reliably anywhere.

Contributor Author

@bitoku bitoku Apr 21, 2026

Alright. I thought we wanted to confirm that this monitor pod issue is specifically caused by a TCP stall.
Scanning the intervals will only show that a pod stalled somewhere; it won't give us the details. We should probably collect a goroutine dump as well, though I'm not sure what the best and least invasive way is.

Contributor Author

@dgoodwin opened a new PR using intervals #31045

Contributor

That does make sense. Let's proceed with both; if this variant takes a couple of tries to land smoothly, so be it.

Contributor Author

Thanks. It's an informing test, so it may mess up the results, but it shouldn't block any jobs.

@bitoku
Contributor Author

bitoku commented Apr 21, 2026

/retest

@dgoodwin
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bitoku, dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2026
@bitoku
Contributor Author

bitoku commented Apr 21, 2026

/retest

@openshift-trt

openshift-trt Bot commented Apr 21, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 0118328

Job Name New Test Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-fips High - "[sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls on any node [Suite:openshift/conformance/parallel]" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High - "[sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls on any node [Suite:openshift/conformance/parallel]" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial High - "[sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls on any node [Suite:openshift/conformance/parallel]" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-gcp-ovn High - "[sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls on any node [Suite:openshift/conformance/parallel]" is a new test that was not present in all runs against the current commit.

New tests seen in this PR at sha: 0118328

  • "[sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls on any node [Suite:openshift/conformance/parallel]" [Total: 16, Pass: 16, Fail: 0, Flake: 0]

@bitoku
Contributor Author

bitoku commented Apr 27, 2026

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 27, 2026
@openshift-ci-robot

@bitoku: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

@openshift-merge-bot openshift-merge-bot Bot merged commit 68de8b2 into openshift:main Apr 27, 2026
20 of 21 checks passed
@openshift-ci-robot

@bitoku: Jira Issue OCPBUGS-79544: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-79544 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Details

In response to this:

This test runs after all other tests to capture CRI-O's goroutine state on every node, providing visibility into stuck operations after the suite has exercised CRI-O under load.

The test sends SIGUSR1 to CRI-O on each node, reads the resulting goroutine dump from /tmp/crio-goroutine-stacks-*.log, and fails hard if any goroutine is stuck in an image pull (IO wait > 30 min in docker.(*bodyReader).Read -> net.(*conn).Read).

Summary by CodeRabbit

  • Tests
  • Added a node-level diagnostic that collects container runtime goroutine dumps from each node, verifies a dump is produced, and fails if dumps are missing or show stuck image-pull goroutines with prolonged I/O wait.
  • Added unit tests for the goroutine-dump parser to validate detection of stuck image-pull stacks and edge cases (various durations, missing frames, mixed dumps).

@bitoku
Contributor Author

bitoku commented Apr 27, 2026

/cherry-pick release-4.22 release-4.21

@openshift-cherrypick-robot

@bitoku: new pull request created: #31074

Details

In response to this:

/cherry-pick release-4.22 release-4.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-robot
Contributor

Fix included in release 5.0.0-0.nightly-2026-04-27-183150

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

5 participants