OCPBUGS-79544: Add [Late] test to collect CRI-O goroutine dumps via SIGUSR1 (#31013)
Conversation
Pipeline controller notification: this repository is configured in automatic mode; optional jobs can be triggered by comment.
@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Adds a Ginkgo node test and helper that trigger CRI-O to write goroutine dumps on each cluster node, collect those dumps (handling PrivateTmp), and parse them for goroutines stuck in IO wait for more than 30 minutes in the image-pull read path; any matches fail the test.
Sequence Diagram

```mermaid
sequenceDiagram
    participant TestRunner as Test Runner
    participant Node as Cluster Node
    participant CRIO as CRI-O Process
    participant Parser as Dump Parser
    TestRunner->>Node: run shell to find CRI-O PID
    Node-->>TestRunner: PID or CRIO_NOT_FOUND
    TestRunner->>CRIO: send SIGUSR1 to PID
    activate CRIO
    CRIO->>CRIO: write /tmp/crio-goroutine-stacks-*.log
    CRIO-->>TestRunner: dump file created (or timeout)
    deactivate CRIO
    TestRunner->>Node: fetch newest dump file (via /proc/<pid>/root/tmp)
    Node-->>TestRunner: goroutine dump content or DUMP_TIMEOUT
    TestRunner->>Parser: call findStuckImagePulls(dump)
    Parser->>Parser: split into goroutine blocks
    Parser->>Parser: extract IO wait minutes, filter > 30
    Parser->>Parser: match stack pattern (net.(*conn).Read -> docker.(*bodyReader).Read)
    Parser-->>TestRunner: list of stuck goroutines
    alt stuck goroutines found
        TestRunner->>TestRunner: fail test and emit dumps
    else no stuck goroutines
        TestRunner->>TestRunner: pass
    end
```
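The parser steps in the diagram above (split into goroutine blocks, extract IO-wait minutes, filter > 30, match the net.(*conn).Read -> docker.(*bodyReader).Read pattern) can be sketched as a small Go helper. The function name mirrors the PR's findStuckImagePulls, but this body is an illustrative reconstruction, not the actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// ioWaitRe matches the header of a goroutine block that is parked in IO
// wait, capturing how many minutes it has been waiting.
var ioWaitRe = regexp.MustCompile(`^goroutine \d+ \[IO wait, (\d+) minutes\]`)

// findStuckImagePulls returns the goroutine blocks that have been in IO
// wait for more than 30 minutes inside the image-pull read path.
func findStuckImagePulls(dump string) []string {
	var stuck []string
	for _, block := range strings.Split(dump, "\n\n") {
		m := ioWaitRe.FindStringSubmatch(block)
		if m == nil {
			continue
		}
		minutes, _ := strconv.Atoi(m[1])
		if minutes <= 30 {
			continue
		}
		if strings.Contains(block, "net.(*conn).Read") &&
			strings.Contains(block, "docker.(*bodyReader).Read") {
			stuck = append(stuck, block)
		}
	}
	return stuck
}

func main() {
	dump := "goroutine 42 [IO wait, 45 minutes]:\n" +
		"net.(*conn).Read(...)\n" +
		"docker.(*bodyReader).Read(...)\n" +
		"\n" +
		"goroutine 7 [IO wait, 5 minutes]:\n" +
		"net.(*conn).Read(...)"
	fmt.Println(len(findStuckImagePulls(dump))) // only the 45-minute goroutine matches
}
```

In the real test the dump is fetched through /proc/&lt;pid&gt;/root/tmp because CRI-O runs with PrivateTmp, so its /tmp is not visible at the host's /tmp.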
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.
❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (7 passed)
/test
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 84-88: The shellCmd currently grabs the newest
/tmp/crio-goroutine-stacks-*.log which can be a stale file or empty if none
exists; update the logic in the shellCmd string in crio_goroutinedump.go to
record the list (or latest mtime) of existing matching files before sending kill
-USR1, then after signalling poll/wait (with timeout) for a new file whose mtime
is greater than the pre-signal timestamp (or not in the pre-signal list), and
only then cat that new file; ensure the command returns a non-zero exit and
clear error message if no new dump appears within the timeout to avoid silent
bad reads.
- Around line 90-124: The loop currently treats missing CRI-O dumps as warnings
and continues, allowing partial node coverage to pass; modify the logic in the
loop that calls exutil.DebugNodeRetryWithOptionsAndChroot (the block handling
"CRIO_NOT_FOUND" and errors) to instead record failures (e.g., collect node
names into a missingDumps slice) and after the loop assert that missingDumps is
empty (failing the test with a clear message), or alternatively pre-filter the
nodes slice to only include nodes expected to run CRI-O before invoking
exutil.DebugNodeRetryWithOptionsAndChroot; update references to nodes, dumps,
stuckPulls, goroutineHeaderRe and findStuckImagePulls accordingly so
missing/filtered nodes are handled deterministically and the test fails if any
expected node did not produce a dump.
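The deterministic handling the second comment asks for (collect nodes missing a dump, assert once after the loop) can be sketched independently of the harness. checkNodes and the per-node output map below are hypothetical stand-ins for the real exutil.DebugNodeRetryWithOptionsAndChroot calls and Ginkgo assertions:

```go
package main

import (
	"fmt"
	"strings"
)

// checkNodes walks the expected node list and records every node that
// produced no usable CRI-O dump, then fails once with the aggregate
// instead of warning and continuing per node.
func checkNodes(dumpsByNode map[string]string, nodes []string) error {
	var missing []string
	for _, n := range nodes {
		out, ok := dumpsByNode[n]
		if !ok || out == "CRIO_NOT_FOUND" {
			missing = append(missing, n)
			continue
		}
		// real code would parse the dump (out) here
	}
	if len(missing) > 0 {
		return fmt.Errorf("no CRI-O dump from %d node(s): %s",
			len(missing), strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	dumps := map[string]string{
		"node-a": "goroutine 1 [running]:",
		"node-b": "CRIO_NOT_FOUND",
	}
	// node-b reported CRIO_NOT_FOUND and node-c returned nothing at all,
	// so the aggregate error names both.
	fmt.Println(checkNodes(dumps, []string{"node-a", "node-b", "node-c"}))
}
```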
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: f0f2719c-5fb3-41e3-b81a-83372c196fdc
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
Force-pushed 8c04a45 to 1cc9cb0.
Scheduling required tests:
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 1cc9cb0
New tests seen in this PR at sha: 1cc9cb0
Force-pushed 1cc9cb0 to 0418318.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 86-93: The MARKER temporary file created by MARKER=$(mktemp) can
leak if the script exits early (e.g., kill fails); add a shell trap immediately
after creating MARKER to ensure it's removed on all exit paths (trap 'rm -f
"$MARKER"' EXIT) and optionally on INT and TERM, then proceed with the existing
kill -USR1 $CRIO_PID and wait loop; update the block containing MARKER, kill
-USR1 $CRIO_PID and the find/seq loop to rely on that trap so the marker is
always cleaned up even on early failure.
- Around line 143-147: The current block unconditionally calls g.Fail when dumps
are present which causes successful runs to fail; modify the CRI-O goroutine
dump handling in the section that references dumps and g.Fail so that dumps are
written to GinkgoWriter (or test logs/artifacts) instead of failing the test,
and only call g.Fail if there are actual fault conditions such as failedNodes or
stuckPulls being non-empty; update the fmt.Sprintf/g.Fail call to use
GinkgoWriter.Printf (or similar) to emit the joined dumps and add a conditional
that preserves failing behavior only when failedNodes or stuckPulls indicate a
real error.
- Around line 123-125: The test currently uses o.Expect(...).To(...) inside the
node loop which aborts the spec on the first malformed CRI-O dump; instead,
introduce a failedNodes collection (e.g., []string failedNodes) declared before
iterating nodes, replace the immediate
o.Expect(goroutineHeaderRe.MatchString(output)).To(...) check inside the loop
with an if !goroutineHeaderRe.MatchString(output) branch that appends a
descriptive failure message (including node.Name and output) to failedNodes and
continues, and after the node loop add a single assertion (or
o.Expect(len(failedNodes)).To(o.Equal(0))) that fails the spec if failedNodes is
non-empty and includes the aggregated failedNodes report in the failure message.
ℹ️ Review info: configuration openshift/coderabbit/.coderabbit.yaml, profile CHILL, plan Pro Plus, Run ID 365ad2c5-fee6-4164-bdd6-d5f0aabddbc0
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
Scheduling required tests:
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 0418318
New tests seen in this PR at sha: 0418318
Force-pushed 0418318 to f1853da.
Actionable comments posted: 1
♻️ Duplicate comments (2)
test/extended/node/crio_goroutinedump.go (2)
178-182: ⚠️ Potential issue | 🟠 Major: Do not force-fail healthy runs just to emit dumps.
Line 180 makes the spec fail whenever dumps exist, so a healthy run can never pass. Emit dumps to GinkgoWriter and keep failure paths tied to real faults (failedNodes/stuckPulls).
Suggested change:

```diff
-	// Always fail so goroutine dumps are visible in test results.
-	// The test is marked Informing so this won't block merges.
-	if len(dumps) > 0 {
-		g.Fail(fmt.Sprintf("CRI-O goroutine dumps collected:\n%s", strings.Join(dumps, "\n\n")))
-	}
+	// Emit dumps for visibility without forcing failure on success paths.
+	if len(dumps) > 0 {
+		g.By("CRI-O goroutine dumps collected")
+		_, _ = fmt.Fprintf(g.GinkgoWriter, "%s\n", strings.Join(dumps, "\n\n"))
+	}
```
158-160: ⚠️ Potential issue | 🟠 Major: Avoid failing mid-loop; aggregate malformed dump nodes and fail once.
Line 158 aborts on the first bad dump format, so you lose complete per-node diagnostics. Append this as a node failure and continue, then assert once after the loop.
Suggested change:

```diff
-	o.Expect(goroutineHeaderRe.MatchString(r.output)).To(o.BeTrue(),
-		"expected goroutine stacks in CRI-O dump from node %s, output length=%d, got:\n%s", r.name, len(r.output), r.output)
+	if !goroutineHeaderRe.MatchString(r.output) {
+		failedNodes = append(failedNodes,
+			fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, output length=%d, got:\n%s", r.name, len(r.output), r.output))
+		continue
+	}
```
ℹ️ Review info: configuration openshift/coderabbit/.coderabbit.yaml, profile CHILL, plan Pro Plus, Run ID 6a35fe37-1646-4179-ab57-d6e15aa73f26
📒 Files selected for processing (1)
test/extended/node/crio_goroutinedump.go
```go
for i, node := range nodes {
	wg.Add(1)
	go func(i int, nodeName string) {
		defer g.GinkgoRecover()
		defer wg.Done()

		g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))

		output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
			oc, nodeName, "default",
			"sh", "-c", shellCmd,
		)
		results[i] = nodeResult{name: nodeName, output: output, err: err}
	}(i, node.Name)
}
wg.Wait()
```
Bound parallel debug sessions to avoid cluster/API overload.
Line 115 currently spawns one debug flow per node at once. On larger clusters this can create a burst of debug pods and induce throttling/timeouts, making this test flaky.
Suggested change:

```diff
 	results := make([]nodeResult, len(nodes))
 	var wg sync.WaitGroup
+	sem := make(chan struct{}, 10) // tune as needed
 	for i, node := range nodes {
 		wg.Add(1)
 		go func(i int, nodeName string) {
 			defer g.GinkgoRecover()
 			defer wg.Done()
+			sem <- struct{}{}
+			defer func() { <-sem }()
 			g.By(fmt.Sprintf("Sending SIGUSR1 to CRI-O on node %s", nodeName))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/extended/node/crio_goroutinedump.go` around lines 115 - 130, The test
spawns a goroutine per node concurrently which can overload the cluster/APIs;
limit concurrent debug sessions by adding a concurrency limiter (e.g., a
buffered semaphore channel or worker pool) around the loop so only a fixed
number (configurable, e.g., 5) of goroutines run at once; acquire the semaphore
before launching the goroutine that calls
exutil.DebugNodeRetryWithOptionsAndChroot and release it (close/receive) in the
goroutine's defer along with wg.Done(), ensuring results[i] still gets set and
g.GinkgoRecover() remains called; reference the existing loop variables (nodes,
i, node.Name), wg, results, nodeResult, shellCmd and
exutil.DebugNodeRetryWithOptionsAndChroot when implementing the limiter.
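The buffered-channel semaphore suggested above can be isolated into a self-contained pattern. runBounded and its work callback are illustrative stand-ins for the real per-node debug invocation, and the bound of 2 is arbitrary for the demo:

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded runs work for every item concurrently, but allows at most
// maxConcurrent invocations in flight; each result lands in the slot
// matching its input index, as in the test's results slice.
func runBounded(items []string, maxConcurrent int, work func(string) string) []string {
	results := make([]string, len(items))
	sem := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup
	for i, it := range items {
		wg.Add(1)
		go func(i int, it string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it on any exit path
			results[i] = work(it)
		}(i, it)
	}
	wg.Wait()
	return results
}

func main() {
	out := runBounded([]string{"node-a", "node-b", "node-c"}, 2, func(n string) string {
		return "dump from " + n
	})
	fmt.Println(out)
}
```

Acquiring inside the goroutine (rather than before `go`) keeps the spawning loop fast while still capping how many debug sessions run at once.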
Scheduling required tests:
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: f1853da
New tests seen in this PR at sha: f1853da
Force-pushed f1853da to 5c996c2.
🧹 Nitpick comments (2)
test/extended/node/crio_goroutinedump.go (2)
149-150: Consider aggregating malformed output failures for completeness.
This o.Expect will abort on the first node that returns output lacking goroutine headers, preventing inspection of remaining nodes. While the code now properly aggregates CRIO_NOT_FOUND, DUMP_TIMEOUT, and generic errors, this edge case (valid output without the goroutine pattern) could also be collected into failedNodes. That said, this scenario is unlikely in practice since it would require the debug pod to succeed but return completely unexpected content. The current behavior may be acceptable as a fail-fast for fundamentally broken scenarios.
Optional: aggregate instead of fail-fast

```diff
-	o.Expect(goroutineHeaderRe.MatchString(r.output)).To(o.BeTrue(),
-		"expected goroutine stacks in CRI-O dump from node %s, output length=%d, got:\n%s", r.name, len(r.output), r.output)
+	if !goroutineHeaderRe.MatchString(r.output) {
+		failedNodes = append(failedNodes,
+			fmt.Sprintf("%s: expected goroutine stacks in CRI-O dump, got (len=%d):\n%s", r.name, len(r.output), r.output))
+		continue
+	}
```
70-72: Misplaced defer g.GinkgoRecover() has no effect here.
GinkgoRecover() is designed to capture panics in goroutines spawned during tests. At the Describe level, this defer executes when the setup function returns (during test registration), not during test execution. It won't catch panics in the It block. If panic recovery is needed, place GinkgoRecover() inside goroutines within the test, or simply remove this line since the test runs serially without spawning goroutines.
Suggested fix:

```diff
 var _ = g.Describe("[sig-node][Late]", func() {
-	defer g.GinkgoRecover()
 	oc := exutil.NewCLIWithoutNamespace("crio-goroutine-dump")
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/extended/node/crio_goroutinedump.go`:
- Around line 70-72: The deferred call to g.GinkgoRecover() inside the Describe
registration (the line with "defer g.GinkgoRecover()") is ineffective and should
be removed or relocated: delete that defer from the Describe block (the
g.Describe(...) closure) and, if you need panic recovery for any goroutine
spawned in the test, call g.GinkgoRecover() inside those specific goroutines (or
at the start of the goroutine function) instead; reference the Describe
declaration (g.Describe) and the g.GinkgoRecover symbol when making the change.
ℹ️ Review info: configuration openshift/coderabbit/.coderabbit.yaml, profile CHILL, plan Pro Plus, Run ID 4c85c6ad-489a-46ea-a97a-109acfe81cc9
📒 Files selected for processing (2)
test/extended/node/crio_goroutinedump.go
test/extended/node/crio_goroutinedump_test.go
Scheduling required tests:
```go
defer g.GinkgoRecover()
oc := exutil.NewCLIWithoutNamespace("crio-goroutine-dump")

g.It("CRI-O should report goroutine stacks on all nodes",
```
main test should probably be named specific to the stuck image pulls rather than the gathering. it can fail on gathering, but the test name should reflect the symptom more closely i think.
/retest

/jira refresh
@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Job Failure Risk Analysis for sha: 5c996c2
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 5c996c2
Assisted-by: Claude Code <https://claude.com/claude-code>
Force-pushed 5c996c2 to 0118328.
Scheduling required tests:
```sh
  fi
  sleep 1
done
echo "DUMP_TIMEOUT"; exit 1
```
My worry here is this is going to be brittle and need reverting because it went in and doesn't run on some of our more obscure configurations. (hypershift, ipv6, etc)
Intervals files, such as https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.20-e2e-aws-ovn-runc-techpreview/2044941453007785984/artifacts/e2e-aws-ovn-runc-techpreview/openshift-e2e-test/artifacts/junit/e2e-events_20260417-014301.json (don't click, it's huge), will show this single symptom:

```sh
jq '[.items[] | select(
      .source == "PodState" and
      .message.reason == "PodWasPending" and
      .message.humanMessage == "never completed"
    ) | {pod: .locator.keys.pod, namespace: .locator.keys.namespace, from, to}]' \
  e2e-events_*.json
```

If that returns results, it appears to be our problem. It would be trivial to add a monitortest to scan the intervals we collect for this and fail a junit, and that should run reliably anywhere.
Alright. I thought we wanted to make sure that this monitor pod issue is specifically caused by a TCP stall.
Scanning the intervals will only show that a pod stalled somewhere, but won't give us the details. We should probably get a goroutine dump as well though I'm not sure what's the best and least invasive way.
That does make sense, lets proceed with both, if this variant takes a couple tries to land smoothly so be it.
Thanks. it's an informing test, so it may mess up the result, but shouldn't block any jobs.
/retest

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bitoku, dgoodwin. The full list of commands accepted by this bot can be found here; the pull request process is described here. Needs approval from an approver in each of these files.
/retest
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 0118328
New tests seen in this PR at sha: 0118328
/verified by CI
@bitoku: This PR has been marked as verified by CI.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Merged 68de8b2 into openshift:main.
@bitoku: Jira Issue OCPBUGS-79544: Some pull requests linked via external trackers have merged, but the following pull request, linked via external tracker, has not merged. All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh. Jira Issue OCPBUGS-79544 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cherry-pick release-4.22 release-4.21
|
@bitoku: new pull request created: #31074 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Fix included in release 5.0.0-0.nightly-2026-04-27-183150
This test runs after all other tests to capture CRI-O's goroutine state on every node, providing visibility into stuck operations after the suite has exercised CRI-O under load.
The test sends SIGUSR1 to CRI-O on each node, reads the resulting goroutine dump from /tmp/crio-goroutine-stacks-*.log, and fails hard if any goroutine is stuck in an image pull (IO wait > 30 min in docker.(*bodyReader).Read -> net.(*conn).Read).