must-gather: add client-side keep-alive to prevent accessTokenInactivityTimeout failures by swshende-cmd · Pull Request #2288 · openshift/oc

swshende-cmd · 2026-06-11T11:18:43Z

Summary

Add a client-side keep-alive mechanism to oc adm must-gather that prevents OAuth access token
revocation due to accessTokenInactivityTimeout on hardened clusters.

Problem: When accessTokenInactivityTimeout is configured (e.g., 300s for regulated
environments), oc adm must-gather fails with "Unauthorized" errors mid-collection. The root
cause is that the oc client holds a single long-lived HTTP log-follow stream to watch the
gather pod. The API server does not count this persistent connection as "activity" for the OAuth
inactivity timer, causing the token to be revoked.

Fix: A background goroutine calls Discovery().ServerVersion() (an authenticated
GET /version) every 30 seconds for the duration of the gather lifecycle. Each call resets the
OAuth inactivity timer, keeping the session alive until the operation finishes.

Resolves: RFE-8595

What Changed

Two additions to pkg/cli/admin/mustgather/mustgather.go:

startClientKeepAlive(ctx) — Background goroutine making periodic authenticated API
probes (every 30s) to reset the OAuth token inactivity timer.
Integration in processNextWorkItem() — Started after pod creation, stopped via
defer after data copy completes. Covers: pod startup wait -> log streaming -> gather
completion wait -> rsync download.

stopKeepAlive := o.startClientKeepAlive(ctx)
defer stopKeepAlive()

Test Evidence

Environment

Cluster: OCP 4.20.0 (Kubernetes v1.33.11)
OAuth config: accessTokenInactivityTimeout: 300s
Token: inactivityTimeoutSeconds: 300 (confirmed on OAuthAccessToken object)
Test scenario: sleep 360 (6-min idle gap) injected before /usr/bin/gather
— forces a window with zero client-side API calls, exceeding the 5-minute timeout

WITHOUT the fix (unpatched oc)

$ ./oc-unpatched adm must-gather --timeout=15m \
    -- '/bin/bash -c "echo GATHER_START; sleep 360; echo GATHER_AFTER_SLEEP; /usr/bin/gather"'

[must-gather-5npp9] POD 05:30:13Z GATHER_START
[must-gather-5npp9] POD 05:36:13Z GATHER_AFTER_SLEEP
[must-gather-5npp9] OUT 05:43:22Z gather logs unavailable: Unauthorized
[must-gather-5npp9] OUT 05:43:22Z waiting for gather to complete
[must-gather-5npp9] OUT 05:58:22Z gather never finished: context deadline exceeded
Unauthorized
error getting cluster version: Unauthorized
error getting cluster operators: Unauthorized
error: gather never finished for pod must-gather-5npp9: context deadline exceeded

Metric	Result
Exit code	1 (FAILURE)
Duration	28 min (hung on deadline)
Data collected	12 KB (3 files - empty shell)
Unauthorized errors	4

WITH the fix (patched oc - keep-alive enabled)

$ ./oc adm must-gather --timeout=15m \
    -- '/bin/bash -c "echo GATHER_START; sleep 360; echo GATHER_AFTER_SLEEP; /usr/bin/gather"'

[must-gather-7cv6s] POD 04:04:28Z GATHER_START
[must-gather-7cv6s] POD 04:10:28Z GATHER_AFTER_SLEEP
[must-gather-7cv6s] OUT 04:17:29Z waiting for gather to complete
[must-gather-7cv6s] OUT 04:17:29Z downloading gather output
... (full collection completes successfully) ...

Metric	Result
Exit code	0 (SUCCESS)
Duration	37 min
Data collected	6.4 GB (8,489 files)
Unauthorized errors	0

Comparative Summary

	Without Fix	With Fix
Exit code	1 (failure)	0 (success)
Data collected	12 KB	6.4 GB
Files	3	8,489
Unauthorized errors	4	0
Token survived 300s policy?	Revoked at ~5 min	Alive for 37 min

Unit Tests

$ go test ./pkg/cli/admin/mustgather/ -v -count=1

=== RUN   TestStartClientKeepAlive
=== RUN   TestStartClientKeepAlive/makes_periodic_authenticated_API_calls
=== RUN   TestStartClientKeepAlive/stops_when_cancel_is_called
--- PASS: TestStartClientKeepAlive (35.10s)
PASS
ok      github.com/openshift/oc/pkg/cli/admin/mustgather    37.265s

Use Case

This addresses environments where short accessTokenInactivityTimeout values (e.g., 5-10
minutes) are mandated by regulatory or security compliance policies (Banking, Government,
PCI-DSS). Previously, administrators in such environments could not run oc adm must-gather
without either:

Temporarily relaxing the timeout (violating security policy), or
Running a manual workaround (while true; do oc whoami; sleep 60; done) in the background

This fix eliminates both workarounds by building the equivalent heartbeat directly into oc.

Design Decisions

30s interval — Well within any reasonable timeout (minimum ~60s), fires ~10 times before
a 5-minute timeout would expire
Discovery().ServerVersion() — Lightest possible authenticated API call; single GET,
no list/watch, no resource creation
Non-fatal errors — Probe failures are logged at V(5) only; transient network blips
won't kill the gather
Lifecycle scoped — Starts at pod creation, stops at data download completion via defer
Zero configuration — Always-on, no flags needed, fully backward-compatible
No cluster-side changes — Entirely client-side fix; no admin action required on the cluster

How to Reproduce

# 1. Set 5-min inactivity timeout
oc patch oauth.config.openshift.io cluster --type=merge \
  -p '{"spec":{"tokenConfig":{"accessTokenInactivityTimeout":"300s"}}}'

# 2. Wait for rollout
oc rollout status deployment/oauth-openshift -n openshift-authentication --timeout=300s

# 3. Force fresh login (new token gets the policy)
oc logout && oc login -u <user> -p <pass> <api-url> --insecure-skip-tls-verify

# 4. Run must-gather with idle gap exceeding timeout
oc adm must-gather --timeout=15m \
  -- '/bin/bash -c "sleep 360; /usr/bin/gather"'

# 5. Restore
oc patch oauth.config.openshift.io cluster --type=json \
  -p '[{"op":"remove","path":"/spec/tokenConfig"}]'

Made with Cursor

Summary by CodeRabbit

Bug Fixes
- Improved reliability of must-gather during long-running collections by maintaining an authenticated background probe.
Configuration
- Added a configurable keep-alive interval to control probe frequency during gathers.
Tests
- Added unit tests confirming background probe activity runs during gathers and stops when canceled.

coderabbitai · 2026-06-11T11:18:57Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b8a26114-6b78-4607-a5ed-66d974e7bd09

📥 Commits

Reviewing files that changed from the base of the PR and between 11351c1 and 74a3113.

📒 Files selected for processing (2)

pkg/cli/admin/mustgather/mustgather.go
pkg/cli/admin/mustgather/mustgather_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/cli/admin/mustgather/mustgather_test.go
pkg/cli/admin/mustgather/mustgather.go

Walkthrough

Adds a cancellable background keep-alive that periodically calls the Kubernetes discovery API (ServerVersion) to prevent OAuth token inactivity during long-running must-gather log-follow operations; wires it into pod processing and adds a unit test.

Changes

OAuth Keep-Alive Mechanism

Layer / File(s)	Summary
Keep-alive constant and options `pkg/cli/admin/mustgather/mustgather.go`	Adds `defaultKeepAliveInterval` and extends `MustGatherOptions` with a `keepAliveInterval` field.
Keep-alive implementation `pkg/cli/admin/mustgather/mustgather.go`	Adds `(*MustGatherOptions).startClientKeepAlive(ctx)` which starts a cancellable ticker goroutine that periodically calls `Client.Discovery().ServerVersion()`.
Integration into pod processing `pkg/cli/admin/mustgather/mustgather.go`	Starts the keep-alive before waiting for the gather container and defers cancellation until after gather completion and local file copy.
Unit tests `pkg/cli/admin/mustgather/mustgather_test.go`	Adds `TestStartClientKeepAlive` with subtests verifying repeated discovery calls while active and no further calls after cancellation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
No-Sensitive-Data-In-Logs	❌ Error	startClientKeepAlive logs raw discovery error with klog.V(2).Infof("keep-alive probe failed...: %v", err) which may include internal hostnames/other sensitive details in the error string.	In keep-alive failure logging, avoid printing raw err; log a sanitized/generic message (optionally error type) so error strings don’t leak internal hostnames/tokens.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (13 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding a client-side keep-alive mechanism to prevent OAuth access token inactivity timeout failures during must-gather operations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	Reviewed pkg/cli/admin/mustgather/mustgather_test.go: only subtests use static strings (“makes periodic authenticated API calls”, “stops when context is cancelled”); no dynamic pod/timestamp/UUID d...
Test Structure And Quality	✅ Passed	pkg/cli/admin/mustgather/mustgather_test.go has TestStartClientKeepAlive with two focused t.Run subtests; no cluster resources/timeouts/E2E waits, and failure messages are provided.
Microshift Test Compatibility	✅ Passed	PR `#2288` modifies only mustgather.go and mustgather_test.go; no new Ginkgo/Describe/It e2e tests referencing unsupported MicroShift APIs.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR `#2288` only modifies must-gather CLI/unit test (pkg/cli/admin/mustgather/mustgather.go, mustgather_test.go); no new Ginkgo e2e tests or multi-node/HA SNO assumptions to flag.
Topology-Aware Scheduling Compatibility	✅ Passed	PR changes only add a MustGatherOptions keep-alive goroutine (Discovery().ServerVersion) and its unit test; no added scheduling constraints/affinity/topology-aware replica logic in manifests or con...
Ote Binary Stdout Contract	✅ Passed	In changed files, there are no init/main/TestMain/Ginkgo suite entrypoints, and any fmt.Printf appears inside functions (e.g., cleanup), not at top level; no klog.SetOutput/LogToStderr found.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR `#2288` only changes mustgather.go and mustgather_test.go; no new Ginkgo e2e tests found (no “ginkgo”/“It(“ in diff), and the added test uses a fake clientset.
No-Weak-Crypto	✅ Passed	Scanned pkg/cli/admin/mustgather/mustgather.go and _test.go for MD5/SHA1/DES/RC4/3DES/Blowfish/ECB and crypto imports; none found, and no token/secret == comparisons detected.
Container-Privileges	✅ Passed	PR `#2288` only changes pkg/cli/admin/mustgather/mustgather.go and mustgather_test.go; the PR diff contains no privileged/hostPID/hostNetwork/hostIPC/SYS_ADMIN/allowPrivilegeEscalation fields.


openshift-ci · 2026-06-11T11:19:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swshende-cmd
Once this PR has been reviewed and has the lgtm label, please assign atiratree for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-06-11T11:20:00Z

Hi @swshende-cmd. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/admin/mustgather/mustgather_test.go`:
- Around line 703-753: The test should drive the keep-alive ticker instead of
sleeping; add a testable hook to MustGatherOptions (e.g., a field
KeepAliveInterval time.Duration or a TickerFactory/Clock) and update
startClientKeepAlive to use that hook instead of the hard-coded
defaultKeepAliveInterval (or direct time.NewTicker). Change startClientKeepAlive
to accept or read the injected interval/ticker and use
time.NewTicker(o.KeepAliveInterval) (or the provided ticker) so tests can set a
short interval or provide a controllable ticker; update the test to set
o.KeepAliveInterval to a small duration (or supply a fake ticker) and assert
that fakeClient.Actions() increases only during driven ticks and remains flat
after calling stopKeepAlive().

In `@pkg/cli/admin/mustgather/mustgather.go`:
- Around line 829-840: startClientKeepAlive currently calls
o.Client.Discovery().ServerVersion() which uses context.TODO() so in-flight
probes won’t cancel; change the probe to derive a short probeCtx from the
keep-alive ctx using context.WithTimeout(ctx, <shortDuration>) and cancel it
after the call, and perform the probe using
o.Client.Discovery().RESTClient().Get().AbsPath("/version").Do(probeCtx) (or the
ServerVersionContext equivalent if available) so each keep-alive run respects
cancellation and a bounded deadline.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a97219d1-9b54-43b7-8eaf-fd2e641f9db0

📥 Commits

Reviewing files that changed from the base of the PR and between 4007283 and 9d8933f.

📒 Files selected for processing (2)

pkg/cli/admin/mustgather/mustgather.go
pkg/cli/admin/mustgather/mustgather_test.go

coderabbitai · 2026-06-11T11:28:13Z

+		ticker := time.NewTicker(defaultKeepAliveInterval)
+		defer ticker.Stop()
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case <-ticker.C:
+				// A lightweight authenticated GET that resets the OAuth
+				// inactivity timer. Errors are expected if the cluster is
+				// temporarily unreachable and are not actionable here.
+				if _, err := o.Client.Discovery().ServerVersion(); err != nil {
+					klog.V(5).Infof("keep-alive probe failed (non-fatal): %v", err)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate keep-alive implementation and relevant constants.
rg -n "defaultKeepAliveInterval|startClientKeepAlive|ServerVersion\\(|keep-alive|keepAlive" pkg/cli/admin/mustgather/mustgather.go

# Show the exact keep-alive probe loop and the surrounding function(s).
sed -n '780,920p' pkg/cli/admin/mustgather/mustgather.go

# Also show the full startClientKeepAlive function if it spans outside the snippet.
rg -n "func \\(o \\*MustGatherOptions\\) startClientKeepAlive" -n pkg/cli/admin/mustgather/mustgather.go
python3 - <<'PY'
import re, pathlib
p = pathlib.Path("pkg/cli/admin/mustgather/mustgather.go")
txt = p.read_text()
m = re.search(r'func \\(o \\*MustGatherOptions\\) startClientKeepAlive\\b[\\s\\S]*?\\n}\\n', txt)
if m:
    print(m.group(0))
else:
    print("startClientKeepAlive function body not found by regex")
PY

# Check for any HTTP client / rest.Config timeout setup on o.Client elsewhere in mustgather.go.
rg -n "Timeout:|Dial|TLSHandshake|ResponseHeader|IdleConn|Transport|http\\.Client|rest\\.Config|WrapTransport|QPS|Burst" pkg/cli/admin/mustgather/mustgather.go
sed -n '1,260p' pkg/cli/admin/mustgather/mustgather.go

Repository: openshift/oc

Length of output: 17216

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if the discovery interface offers a context-aware ServerVersion call.
rg -n "ServerVersionWithContext|type DiscoveryInterface|ServerVersion\\(" -S vendor/k8s.io/client-go 2>/dev/null || true
rg -n "ServerVersionWithContext|ServerVersion\\(" -S . | head -n 50

# Check if mustgather or its options set rest.Config timeouts / transport timeouts.
rg -n "Config\\.Timeout|o\\.Config\\.Timeout|rest\\.Config\\{[^}]*Timeout|Timeout:\\s*.*time\\.Second|http\\.Client\\{[^}]*Timeout" pkg/cli/admin/mustgather/mustgather.go

# Inspect MustGatherOptions definition for any HTTP/rest timeout customization fields.
rg -n "type MustGatherOptions struct" -n pkg/cli/admin/mustgather/mustgather.go
sed -n '380,520p' pkg/cli/admin/mustgather/mustgather.go

# Search for any factory/rest config tweaking patterns in this command.
rg -n "ToRESTConfig\\(|NewForConfig\\(|WithTimeout|Timeout:" pkg/cli/admin/mustgather/mustgather.go

Repository: openshift/oc

Length of output: 4209

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect how DiscoveryClient.ServerVersion() performs the request.
sed -n '640,740p' vendor/k8s.io/client-go/discovery/discovery_client.go

# Look for any context usage in that function.
rg -n "func \\(d \\*DiscoveryClient\\) ServerVersion\\(" -n vendor/k8s.io/client-go/discovery/discovery_client.go
rg -n "Do\\(|WithContext|context\\." vendor/k8s.io/client-go/discovery/discovery_client.go | head -n 50

# Check whether rest.Config.Timeout exists/gets applied (HTTP client timeout).
rg -n "type Config struct" vendor/k8s.io/client-go/rest/config.go
rg -n "Timeout" vendor/k8s.io/client-go/rest/config.go | head -n 80
rg -n "HTTPClient.*Timeout|Timeout.*http\\.Client|Client\\.Timeout" -S vendor/k8s.io/client-go/rest | head -n 80

# Search this repo for rest.Config.Timeout being set (to see if factory/oc wiring sets it).
rg -n "Config\\.Timeout\\s*=" pkg cmd hack | head -n 50
rg -n "HTTPClient\\{.*Timeout|http\\.Client\\{.*Timeout" pkg cmd hack | head -n 50

Repository: openshift/oc

Length of output: 4459

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find defaultTimeout value in discovery client code.
rg -n "defaultTimeout|setDiscoveryDefaults\\(" vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '1,120p' vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '120,220p' vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '560,760p' vendor/k8s.io/client-go/discovery/discovery_client.go

# Check how kubernetes clientset builds the discovery client and whether it calls setDiscoveryDefaults.
rg -n "setDiscoveryDefaults\\(" vendor/k8s.io/client-go | head -n 50
rg -n "DiscoveryClient" vendor/k8s.io/client-go/kubernetes -S | head -n 50

# Inspect rest/request.go timeout behavior (HTTP client timeout source).
sed -n '120,220p' vendor/k8s.io/client-go/rest/request.go

Repository: openshift/oc

Length of output: 20771

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for must-gather specific config flags or request-timeout wiring.
rg -n "request-timeout|RequestTimeout|ConfigFlags|genericclioptions\\.ConfigFlags" pkg/cli/admin/mustgather/mustgather.go

# See if oc globally supports --request-timeout and how it maps to rest.Config.Timeout.
rg -n "\"request-timeout\"|request-timeout" pkg cmd | head -n 50

# Inspect any shared config flag code that sets rest.Config.Timeout (might be in pkg/helpers/cmd or similar).
rg -n "Config\\.Timeout\\s*=|clientConfig\\.Timeout\\s*=" pkg cmd | head -n 50

# If we can find a config flag named RequestTimeout, read its implementation.
rg -n "type .*RequestTimeout|RequestTimeout" pkg cmd | head -n 50

Repository: openshift/oc

Length of output: 38

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Confirm kubernetes clientset's Discovery() method type.
rg -n "func \\(.*\\) Discovery\\(" vendor/k8s.io/client-go/kubernetes/clientset.go
sed -n '450,520p' vendor/k8s.io/client-go/kubernetes/clientset.go

Repository: openshift/oc

Length of output: 2756

Use context/timeout for each keep-alive probe (don’t rely on discovery’s internal context.TODO()).

startClientKeepAlive cancels the ticker via ctx.Done(), but o.Client.Discovery().ServerVersion() issues the /version request with context.TODO(), so the in-flight probe won’t stop when you cancel stopKeepAlive(). Although client-go discovery applies a per-request HTTP timeout (defaults to 32s when unset), each probe should be executed with a probeCtx that derives from the keep-alive ctx and has a short deadline (e.g. via o.Client.Discovery().RESTClient().Get().AbsPath("/version").Do(probeCtx)).

Source: Coding guidelines

dpateriya · 2026-06-11T11:35:00Z

/ok-to-test

tchap · 2026-06-11T11:53:49Z

Shouldn't the RFE be approved first? Not saying this is not a good idea...

swshende-cmd · 2026-06-11T13:13:27Z

@tchap Hi, I agree with you, and apologies if I have missed the workflow here. I am happy to be corrected on the right workflow to follow, and if you can direct me accordingly, that'll be helpful.

- Make keep-alive interval configurable via unexported field for testability
- Add context cancellation check before each probe to ensure clean shutdown
- Filter context.Canceled errors from non-fatal log output
- Reduce test runtime from 35s to <1s by using 50ms interval in tests

Co-authored-by: Cursor <cursoragent@cursor.com>

tchap · 2026-06-11T13:26:39Z

@swshende-cmd No big deal, it's just good to clarify the situation, probably in the comments on the associated Jira issue. Your PR has a much higher probability of being accepted once the RFE is marked to be implemented.

coderabbitai

♻️ Duplicate comments (1)

pkg/cli/admin/mustgather/mustgather_test.go (1)

741-746: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Flaky stop assertion due to immediate post-cancel snapshot.

On Line 744, before := len(fakeClient.Actions()) is taken immediately after stopKeepAlive(). Because cancellation is async, one in-flight probe can still land, so Line 748 may fail nondeterministically even when shutdown is correct.

Suggested stabilization

 		stopKeepAlive()
-
-		before := len(fakeClient.Actions())
-		time.Sleep(200 * time.Millisecond)
+		// Allow any in-flight probe selected before cancel to complete.
+		time.Sleep(75 * time.Millisecond)
+		before := len(fakeClient.Actions())
+		time.Sleep(200 * time.Millisecond)
 		after := len(fakeClient.Actions())

 		if after != before {
 			t.Errorf("keep-alive goroutine continued after cancel: actions before=%d, after=%d", before, after)
 		}

Also applies to: 748-750

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/admin/mustgather/mustgather_test.go` around lines 741 - 746, The test
takes a snapshot of fakeClient.Actions() immediately after calling
stopKeepAlive(), which races with async cancellation and can observe an
in-flight probe; change the assertion to poll until the recorded action count
has stabilized instead of sampling immediately: after calling stopKeepAlive()
call a small retry loop (with a short sleep, e.g., 10–20ms, and overall timeout)
that reads fakeClient.Actions() repeatedly until two consecutive reads are
equal, then use that stable value as `before` and assert no further actions are
added (or capture `after` similarly); reference stopKeepAlive() and
fakeClient.Actions() when implementing the polling/stabilization in
mustgather_test.go.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e232dcc2-e76e-49eb-b326-677bfebea853

📥 Commits

Reviewing files that changed from the base of the PR and between 9d8933f and a6c1c51.

📒 Files selected for processing (2)

pkg/cli/admin/mustgather/mustgather.go
pkg/cli/admin/mustgather/mustgather_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/cli/admin/mustgather/mustgather.go

swshende-cmd · 2026-06-11T13:51:54Z

@tchap right 👍
The intent was to see if I could contribute, as this has been in the backlog for some time.
Would you be able to please review this?

tchap · 2026-06-11T14:07:21Z

+//
+// Returns a cancel function that stops the goroutine. The caller must invoke it
+// when the keep-alive is no longer needed.
+func (o *MustGatherOptions) startClientKeepAlive(ctx context.Context) context.CancelFunc {


You can either use the ctx argument or the cancel func, but it doesn't make sense to use both. I lean towards just using ctx and letting the caller cancel the context to stop the goroutine.

tchap · 2026-06-11T14:13:07Z

@@ -445,8 +453,9 @@ type MustGatherOptions struct {
 	Since            time.Duration
 	SinceTime        string

-	RsyncRshCmd string
-	clock       clock.PassiveClock
+	RsyncRshCmd        string
+	keepAliveInterval  time.Duration
+	clock              clock.PassiveClock

 	PrinterCreated printers.ResourcePrinter
 	PrinterDeleted printers.ResourcePrinter
@@ -805,6 +814,42 @@ func (o *MustGatherOptions) Run() error {
 	return kutilerrors.NewAggregate(errs)
 }

+// startClientKeepAlive spawns a background goroutine that periodically makes
+// an authenticated API call to prevent the user's OAuth access token from being
+// revoked due to accessTokenInactivityTimeout. The log-follow connection held
+// by getGatherContainerLogs is a single long-lived HTTP stream that the API
+// server does not count as discrete activity, so without these probes the token
+// can expire on clusters with short inactivity windows (e.g. 5-10 minutes in
+// banking/government/PCI-DSS environments).
+//
+// Returns a cancel function that stops the goroutine. The caller must invoke it
+// when the keep-alive is no longer needed.
+func (o *MustGatherOptions) startClientKeepAlive(ctx context.Context) context.CancelFunc {
+	ctx, cancel := context.WithCancel(ctx)
+	interval := o.keepAliveInterval
+	if interval == 0 {
+		interval = defaultKeepAliveInterval
+	}
+	go func() {
+		ticker := time.NewTicker(interval)
+		defer ticker.Stop()
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case <-ticker.C:
+				if ctx.Err() != nil {
+					return
+				}
+				if _, err := o.Client.Discovery().ServerVersion(); err != nil && !errors.Is(err, context.Canceled) {
+					klog.V(5).Infof("keep-alive probe failed (non-fatal): %v", err)


I would make it like V(2), this is effectively invisible.

swshende-cmd · 2026-06-11T16:10:31Z

@tchap Sure, I have updated and added the proposal and test evidence to the RFE too.
I am checking your latest review comments and addressing them.

- Remove internal context.WithCancel; let caller control goroutine lifetime via ctx cancellation (simpler API, no redundant cancel func)
- Change keep-alive probe log level from V(5) to V(2) for better visibility during troubleshooting

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

openshift-ci · 2026-06-12T05:56:17Z

@swshende-cmd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-serial-1of2	`74a3113`	link	true	`/test e2e-aws-ovn-serial-1of2`

Full PR test history. Your PR dashboard.

must-gather: add client-side keep-alive to prevent accessTokenInactivityTimeout failures

9d8933f

Co-authored-by: Cursor <cursoragent@cursor.com>

openshift-ci Bot requested review from ingvagabund and tchap June 11, 2026 11:19

openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

tchap suggested changes Jun 11, 2026

openshift-ci Bot assigned tchap Jun 11, 2026

shendeswapnil6 and others added 2 commits June 11, 2026 21:44

must-gather: fix gofmt formatting

74a3113

Co-authored-by: Cursor <cursoragent@cursor.com>
