must-gather: add client-side keep-alive to prevent accessTokenInactivityTimeout failures #2288

Open
swshende-cmd wants to merge 4 commits into openshift:main from swshende-cmd:rfe8595-must-gather-keepalive

Conversation

@swshende-cmd

@swshende-cmd swshende-cmd commented Jun 11, 2026


Summary

Add a client-side keep-alive mechanism to oc adm must-gather that prevents OAuth access token
revocation due to accessTokenInactivityTimeout on hardened clusters.

Problem: When accessTokenInactivityTimeout is configured (e.g., 300s for regulated
environments), oc adm must-gather fails with "Unauthorized" errors mid-collection. The root
cause is that the oc client holds a single long-lived HTTP log-follow stream to watch the
gather pod. The API server does not count this persistent connection as "activity" for the OAuth
inactivity timer, causing the token to be revoked.

Fix: A background goroutine that calls Discovery().ServerVersion() (authenticated
GET /version) every 30 seconds during the gather lifecycle. This resets the OAuth inactivity
timer on each call, keeping the session alive for the full duration of the operation.

Resolves: RFE-8595

What Changed

Two additions to pkg/cli/admin/mustgather/mustgather.go:

  1. startClientKeepAlive(ctx) — Background goroutine making periodic authenticated API
    probes (every 30s) to reset the OAuth token inactivity timer.

  2. Integration in processNextWorkItem() — Started after pod creation, stopped via
    defer after data copy completes. Covers: pod startup wait -> log streaming -> gather
    completion wait -> rsync download.

stopKeepAlive := o.startClientKeepAlive(ctx)
defer stopKeepAlive()
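
The start/stop pattern above can be sketched without any cluster dependency. This is a minimal illustration, not the PR's implementation: `startKeepAlive`, `countProbes`, and the injected `probe` function are hypothetical names, and the probe stands in for the authenticated `Discovery().ServerVersion()` call.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// startKeepAlive is a hypothetical stand-in for startClientKeepAlive: it calls
// probe on every tick until the returned stop function is invoked.
func startKeepAlive(ctx context.Context, interval time.Duration, probe func() error) (stop func()) {
	ctx, cancel := context.WithCancel(ctx)
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// Probe errors are treated as non-fatal, as described in the PR.
				_ = probe()
			}
		}
	}()
	// stop cancels the loop and waits for the goroutine to exit.
	return func() {
		cancel()
		<-done
	}
}

// countProbes runs the keep-alive for runFor and reports how many probes fired.
func countProbes(interval, runFor time.Duration) int64 {
	var calls atomic.Int64
	stop := startKeepAlive(context.Background(), interval, func() error {
		calls.Add(1)
		return nil
	})
	time.Sleep(runFor) // stands in for log streaming and the rsync download
	stop()
	return calls.Load()
}

func main() {
	fmt.Println("probes fired:", countProbes(10*time.Millisecond, 100*time.Millisecond))
}
```

Because `stop` waits for the goroutine to exit, no probe can start after it returns; whether the real function offers that guarantee is not stated in the description.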

Test Evidence

Environment

  • Cluster: OCP 4.20.0 (Kubernetes v1.33.11)
  • OAuth config: accessTokenInactivityTimeout: 300s
  • Token: inactivityTimeoutSeconds: 300 (confirmed on OAuthAccessToken object)
  • Test scenario: sleep 360 (6-min idle gap) injected before /usr/bin/gather
    — forces a window with zero client-side API calls, exceeding the 5-minute timeout

WITHOUT the fix (unpatched oc)

$ ./oc-unpatched adm must-gather --timeout=15m \
    -- '/bin/bash -c "echo GATHER_START; sleep 360; echo GATHER_AFTER_SLEEP; /usr/bin/gather"'

[must-gather-5npp9] POD 05:30:13Z GATHER_START
[must-gather-5npp9] POD 05:36:13Z GATHER_AFTER_SLEEP
[must-gather-5npp9] OUT 05:43:22Z gather logs unavailable: Unauthorized
[must-gather-5npp9] OUT 05:43:22Z waiting for gather to complete
[must-gather-5npp9] OUT 05:58:22Z gather never finished: context deadline exceeded
Unauthorized
error getting cluster version: Unauthorized
error getting cluster operators: Unauthorized
error: gather never finished for pod must-gather-5npp9: context deadline exceeded

| Metric | Result |
| --- | --- |
| Exit code | 1 (FAILURE) |
| Duration | 28 min (hung on deadline) |
| Data collected | 12 KB (3 files - empty shell) |
| Unauthorized errors | 4 |

WITH the fix (patched oc - keep-alive enabled)

$ ./oc adm must-gather --timeout=15m \
    -- '/bin/bash -c "echo GATHER_START; sleep 360; echo GATHER_AFTER_SLEEP; /usr/bin/gather"'

[must-gather-7cv6s] POD 04:04:28Z GATHER_START
[must-gather-7cv6s] POD 04:10:28Z GATHER_AFTER_SLEEP
[must-gather-7cv6s] OUT 04:17:29Z waiting for gather to complete
[must-gather-7cv6s] OUT 04:17:29Z downloading gather output
... (full collection completes successfully) ...

| Metric | Result |
| --- | --- |
| Exit code | 0 (SUCCESS) |
| Duration | 37 min |
| Data collected | 6.4 GB (8,489 files) |
| Unauthorized errors | 0 |

Comparative Summary

| | Without Fix | With Fix |
| --- | --- | --- |
| Exit code | 1 (failure) | 0 (success) |
| Data collected | 12 KB | 6.4 GB |
| Files | 3 | 8,489 |
| Unauthorized errors | 4 | 0 |
| Token survived 300s policy? | Revoked at ~5 min | Alive for 37 min |

Unit Tests

$ go test ./pkg/cli/admin/mustgather/ -v -count=1

=== RUN   TestStartClientKeepAlive
=== RUN   TestStartClientKeepAlive/makes_periodic_authenticated_API_calls
=== RUN   TestStartClientKeepAlive/stops_when_cancel_is_called
--- PASS: TestStartClientKeepAlive (35.10s)
PASS
ok      github.com/openshift/oc/pkg/cli/admin/mustgather    37.265s

Use Case

This addresses environments where short accessTokenInactivityTimeout values (e.g., 5-10
minutes) are mandated by regulatory or security compliance policies (Banking, Government,
PCI-DSS). Previously, administrators in such environments could not run oc adm must-gather
without either:

  • Temporarily relaxing the timeout (violating security policy), or
  • Running a manual workaround (while true; do oc whoami; sleep 60; done) in the background

This fix eliminates both workarounds by building the equivalent heartbeat directly into oc.

Design Decisions

  • 30s interval — Well within any reasonable timeout (minimum ~60s), fires ~10 times before
    a 5-minute timeout would expire
  • Discovery().ServerVersion() — Lightest possible authenticated API call; single GET,
    no list/watch, no resource creation
  • Non-fatal errors — Probe failures are logged at V(5) only; transient network blips
    won't kill the gather
  • Lifecycle scoped — Starts at pod creation, stops at data download completion via defer
  • Zero configuration — Always-on, no flags needed, fully backward-compatible
  • No cluster-side changes — Entirely client-side fix; no admin action required on the cluster
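
The interval arithmetic in the first bullet can be checked with a few lines of Go. This is only a sketch of the reasoning; `probesWithin` is a hypothetical helper and is not part of the PR.

```go
package main

import (
	"fmt"
	"time"
)

// probesWithin returns how many whole probe intervals fit inside one
// inactivity window. A result of 0 means a probe could arrive too late.
func probesWithin(timeout, interval time.Duration) int {
	if interval <= 0 {
		return 0
	}
	return int(timeout / interval)
}

func main() {
	// 5-minute policy with the 30s interval from the PR.
	fmt.Println(probesWithin(300*time.Second, 30*time.Second))
	// The ~60s minimum mentioned above still leaves room for two probes.
	fmt.Println(probesWithin(60*time.Second, 30*time.Second))
}
```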

How to Reproduce

# 1. Set 5-min inactivity timeout
oc patch oauth.config.openshift.io cluster --type=merge \
  -p '{"spec":{"tokenConfig":{"accessTokenInactivityTimeout":"300s"}}}'

# 2. Wait for rollout
oc rollout status deployment/oauth-openshift -n openshift-authentication --timeout=300s

# 3. Force fresh login (new token gets the policy)
oc logout && oc login -u <user> -p <pass> <api-url> --insecure-skip-tls-verify

# 4. Run must-gather with idle gap exceeding timeout
oc adm must-gather --timeout=15m \
  -- '/bin/bash -c "sleep 360; /usr/bin/gather"'

# 5. Restore
oc patch oauth.config.openshift.io cluster --type=json \
  -p '[{"op":"remove","path":"/spec/tokenConfig"}]'

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of must-gather during long-running collections by maintaining an authenticated background probe.
  • Configuration
    • Added a configurable keep-alive interval to control probe frequency during gathers.
  • Tests
    • Added unit tests confirming background probe activity runs during gathers and stops when canceled.

…ityTimeout failures

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b8a26114-6b78-4607-a5ed-66d974e7bd09

📥 Commits

Reviewing files that changed from the base of the PR and between 11351c1 and 74a3113.

📒 Files selected for processing (2)
  • pkg/cli/admin/mustgather/mustgather.go
  • pkg/cli/admin/mustgather/mustgather_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/cli/admin/mustgather/mustgather_test.go
  • pkg/cli/admin/mustgather/mustgather.go

Walkthrough

Adds a cancellable background keep-alive that periodically calls the Kubernetes discovery API (ServerVersion) to prevent OAuth token inactivity during long-running must-gather log-follow operations; wires it into pod processing and adds a unit test.

Changes

OAuth Keep-Alive Mechanism

| Layer / File(s) | Summary |
| --- | --- |
| Keep-alive constant and options (pkg/cli/admin/mustgather/mustgather.go) | Adds defaultKeepAliveInterval and extends MustGatherOptions with a keepAliveInterval field. |
| Keep-alive implementation (pkg/cli/admin/mustgather/mustgather.go) | Adds (*MustGatherOptions).startClientKeepAlive(ctx) which starts a cancellable ticker goroutine that periodically calls Client.Discovery().ServerVersion(). |
| Integration into pod processing (pkg/cli/admin/mustgather/mustgather.go) | Starts the keep-alive before waiting for the gather container and defers cancellation until after gather completion and local file copy. |
| Unit tests (pkg/cli/admin/mustgather/mustgather_test.go) | Adds TestStartClientKeepAlive with subtests verifying repeated discovery calls while active and no further calls after cancellation. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| No-Sensitive-Data-In-Logs | ❌ Error | startClientKeepAlive logs raw discovery error with klog.V(2).Infof("keep-alive probe failed...: %v", err) which may include internal hostnames/other sensitive details in the error string. | In keep-alive failure logging, avoid printing raw err; log a sanitized/generic message (optionally error type) so error strings don’t leak internal hostnames/tokens. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (13 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a client-side keep-alive mechanism to prevent OAuth access token inactivity timeout failures during must-gather operations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Reviewed pkg/cli/admin/mustgather/mustgather_test.go: only subtests use static strings (“makes periodic authenticated API calls”, “stops when context is cancelled”); no dynamic pod/timestamp/UUID d... |
| Test Structure And Quality | ✅ Passed | pkg/cli/admin/mustgather/mustgather_test.go has TestStartClientKeepAlive with two focused t.Run subtests; no cluster resources/timeouts/E2E waits, and failure messages are provided. |
| Microshift Test Compatibility | ✅ Passed | PR #2288 modifies only mustgather.go and mustgather_test.go; no new Ginkgo/Describe/It e2e tests referencing unsupported MicroShift APIs. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR #2288 only modifies must-gather CLI/unit test (pkg/cli/admin/mustgather/mustgather.go, mustgather_test.go); no new Ginkgo e2e tests or multi-node/HA SNO assumptions to flag. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR changes only add a MustGatherOptions keep-alive goroutine (Discovery().ServerVersion) and its unit test; no added scheduling constraints/affinity/topology-aware replica logic in manifests or con... |
| Ote Binary Stdout Contract | ✅ Passed | In changed files, there are no init/main/TestMain/Ginkgo suite entrypoints, and any fmt.Printf appears inside functions (e.g., cleanup), not at top level; no klog.SetOutput/LogToStderr found. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR #2288 only changes mustgather.go and mustgather_test.go; no new Ginkgo e2e tests found (no “ginkgo”/“It(“ in diff), and the added test uses a fake clientset. |
| No-Weak-Crypto | ✅ Passed | Scanned pkg/cli/admin/mustgather/mustgather.go and _test.go for MD5/SHA1/DES/RC4/3DES/Blowfish/ECB and crypto imports; none found, and no token/secret == comparisons detected. |
| Container-Privileges | ✅ Passed | PR #2288 only changes pkg/cli/admin/mustgather/mustgather.go and mustgather_test.go; the PR diff contains no privileged/hostPID/hostNetwork/hostIPC/SYS_ADMIN/allowPrivilegeEscalation fields. |
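
One possible reading of the No-Sensitive-Data-In-Logs resolution is to log only the error's type. The sketch below assumes that interpretation; `sanitizedProbeError` and `demoSanitized` are hypothetical names, and the hostname in the example is made up.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// sanitizedProbeError describes err without echoing its message, which may
// contain hostnames or URLs. Only the dynamic Go type is reported.
func sanitizedProbeError(err error) string {
	if err == nil {
		return ""
	}
	return fmt.Sprintf("keep-alive probe failed (non-fatal): error type %T", err)
}

// demoSanitized builds an error whose text would leak a hostname if logged raw.
func demoSanitized() string {
	err := &url.Error{
		Op:  "Get",
		URL: "https://api.internal.example:6443/version", // made-up hostname
		Err: errors.New("connection refused"),
	}
	return sanitizedProbeError(err)
}

func main() {
	fmt.Println(demoSanitized())
}
```

The trade-off is less diagnostic detail at V(2); whether that is acceptable is for the reviewers to decide.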

@openshift-ci openshift-ci Bot requested review from ingvagabund and tchap June 11, 2026 11:19
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swshende-cmd
Once this PR has been reviewed and has the lgtm label, please assign atiratree for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

Hi @swshende-cmd. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/admin/mustgather/mustgather_test.go`:
- Around line 703-753: The test should drive the keep-alive ticker instead of
sleeping; add a testable hook to MustGatherOptions (e.g., a field
KeepAliveInterval time.Duration or a TickerFactory/Clock) and update
startClientKeepAlive to use that hook instead of the hard-coded
defaultKeepAliveInterval (or direct time.NewTicker). Change startClientKeepAlive
to accept or read the injected interval/ticker and use
time.NewTicker(o.KeepAliveInterval) (or the provided ticker) so tests can set a
short interval or provide a controllable ticker; update the test to set
o.KeepAliveInterval to a small duration (or supply a fake ticker) and assert
that fakeClient.Actions() increases only during driven ticks and remains flat
after calling stopKeepAlive().

In `@pkg/cli/admin/mustgather/mustgather.go`:
- Around line 829-840: startClientKeepAlive currently calls
o.Client.Discovery().ServerVersion() which uses context.TODO() so in-flight
probes won’t cancel; change the probe to derive a short probeCtx from the
keep-alive ctx using context.WithTimeout(ctx, <shortDuration>) and cancel it
after the call, and perform the probe using
o.Client.Discovery().RESTClient().Get().AbsPath("/version").Do(probeCtx) (or the
ServerVersionContext equivalent if available) so each keep-alive run respects
cancellation and a bounded deadline.
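
The first inline comment (drive the ticker instead of sleeping) could look roughly like this. It is a sketch under assumptions: `runKeepAlive` takes the tick channel as a parameter, which is one of several hooks the comment allows, and `driveTicks` is a hypothetical test driver.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runKeepAlive calls probe once per value received on ticks until ctx is done.
// Production code would pass ticker.C; tests pass a channel they control.
func runKeepAlive(ctx context.Context, ticks <-chan time.Time, probe func()) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticks:
			probe()
		}
	}
}

// driveTicks sends n ticks by hand and returns the number of probes observed.
func driveTicks(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan time.Time)  // unbuffered: a send completes only when the loop receives it
	probed := make(chan struct{}) // signalled by every probe
	done := make(chan struct{})
	go func() {
		defer close(done)
		runKeepAlive(ctx, ticks, func() { probed <- struct{}{} })
	}()
	count := 0
	for i := 0; i < n; i++ {
		ticks <- time.Now()
		<-probed // wait for the probe triggered by this tick
		count++
	}
	cancel()
	<-done // the loop has exited, so no further probes can happen
	return count
}

func main() {
	fmt.Println("probes:", driveTicks(3))
}
```

With this shape the test needs no sleeps, so its runtime does not depend on the keep-alive interval.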

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a97219d1-9b54-43b7-8eaf-fd2e641f9db0

📥 Commits

Reviewing files that changed from the base of the PR and between 4007283 and 9d8933f.

📒 Files selected for processing (2)
  • pkg/cli/admin/mustgather/mustgather.go
  • pkg/cli/admin/mustgather/mustgather_test.go

Comment thread pkg/cli/admin/mustgather/mustgather_test.go
Comment thread pkg/cli/admin/mustgather/mustgather.go Outdated
Comment on lines +829 to +840
    ticker := time.NewTicker(defaultKeepAliveInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            // A lightweight authenticated GET that resets the OAuth
            // inactivity timer. Errors are expected if the cluster is
            // temporarily unreachable and are not actionable here.
            if _, err := o.Client.Discovery().ServerVersion(); err != nil {
                klog.V(5).Infof("keep-alive probe failed (non-fatal): %v", err)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate keep-alive implementation and relevant constants.
rg -n "defaultKeepAliveInterval|startClientKeepAlive|ServerVersion\\(|keep-alive|keepAlive" pkg/cli/admin/mustgather/mustgather.go

# Show the exact keep-alive probe loop and the surrounding function(s).
sed -n '780,920p' pkg/cli/admin/mustgather/mustgather.go

# Also show the full startClientKeepAlive function if it spans outside the snippet.
rg -n "func \\(o \\*MustGatherOptions\\) startClientKeepAlive" -n pkg/cli/admin/mustgather/mustgather.go
python3 - <<'PY'
import re, pathlib
p = pathlib.Path("pkg/cli/admin/mustgather/mustgather.go")
txt = p.read_text()
m = re.search(r'func \\(o \\*MustGatherOptions\\) startClientKeepAlive\\b[\\s\\S]*?\\n}\\n', txt)
if m:
    print(m.group(0))
else:
    print("startClientKeepAlive function body not found by regex")
PY

# Check for any HTTP client / rest.Config timeout setup on o.Client elsewhere in mustgather.go.
rg -n "Timeout:|Dial|TLSHandshake|ResponseHeader|IdleConn|Transport|http\\.Client|rest\\.Config|WrapTransport|QPS|Burst" pkg/cli/admin/mustgather/mustgather.go
sed -n '1,260p' pkg/cli/admin/mustgather/mustgather.go

Repository: openshift/oc

Length of output: 17216


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if the discovery interface offers a context-aware ServerVersion call.
rg -n "ServerVersionWithContext|type DiscoveryInterface|ServerVersion\\(" -S vendor/k8s.io/client-go 2>/dev/null || true
rg -n "ServerVersionWithContext|ServerVersion\\(" -S . | head -n 50

# Check if mustgather or its options set rest.Config timeouts / transport timeouts.
rg -n "Config\\.Timeout|o\\.Config\\.Timeout|rest\\.Config\\{[^}]*Timeout|Timeout:\\s*.*time\\.Second|http\\.Client\\{[^}]*Timeout" pkg/cli/admin/mustgather/mustgather.go

# Inspect MustGatherOptions definition for any HTTP/rest timeout customization fields.
rg -n "type MustGatherOptions struct" -n pkg/cli/admin/mustgather/mustgather.go
sed -n '380,520p' pkg/cli/admin/mustgather/mustgather.go

# Search for any factory/rest config tweaking patterns in this command.
rg -n "ToRESTConfig\\(|NewForConfig\\(|WithTimeout|Timeout:" pkg/cli/admin/mustgather/mustgather.go

Repository: openshift/oc

Length of output: 4209


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect how DiscoveryClient.ServerVersion() performs the request.
sed -n '640,740p' vendor/k8s.io/client-go/discovery/discovery_client.go

# Look for any context usage in that function.
rg -n "func \\(d \\*DiscoveryClient\\) ServerVersion\\(" -n vendor/k8s.io/client-go/discovery/discovery_client.go
rg -n "Do\\(|WithContext|context\\." vendor/k8s.io/client-go/discovery/discovery_client.go | head -n 50

# Check whether rest.Config.Timeout exists/gets applied (HTTP client timeout).
rg -n "type Config struct" vendor/k8s.io/client-go/rest/config.go
rg -n "Timeout" vendor/k8s.io/client-go/rest/config.go | head -n 80
rg -n "HTTPClient.*Timeout|Timeout.*http\\.Client|Client\\.Timeout" -S vendor/k8s.io/client-go/rest | head -n 80

# Search this repo for rest.Config.Timeout being set (to see if factory/oc wiring sets it).
rg -n "Config\\.Timeout\\s*=" pkg cmd hack | head -n 50
rg -n "HTTPClient\\{.*Timeout|http\\.Client\\{.*Timeout" pkg cmd hack | head -n 50

Repository: openshift/oc

Length of output: 4459


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find defaultTimeout value in discovery client code.
rg -n "defaultTimeout|setDiscoveryDefaults\\(" vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '1,120p' vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '120,220p' vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '560,760p' vendor/k8s.io/client-go/discovery/discovery_client.go

# Check how kubernetes clientset builds the discovery client and whether it calls setDiscoveryDefaults.
rg -n "setDiscoveryDefaults\\(" vendor/k8s.io/client-go | head -n 50
rg -n "DiscoveryClient" vendor/k8s.io/client-go/kubernetes -S | head -n 50

# Inspect rest/request.go timeout behavior (HTTP client timeout source).
sed -n '120,220p' vendor/k8s.io/client-go/rest/request.go

Repository: openshift/oc

Length of output: 20771


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for must-gather specific config flags or request-timeout wiring.
rg -n "request-timeout|RequestTimeout|ConfigFlags|genericclioptions\\.ConfigFlags" pkg/cli/admin/mustgather/mustgather.go

# See if oc globally supports --request-timeout and how it maps to rest.Config.Timeout.
rg -n "\"request-timeout\"|request-timeout" pkg cmd | head -n 50

# Inspect any shared config flag code that sets rest.Config.Timeout (might be in pkg/helpers/cmd or similar).
rg -n "Config\\.Timeout\\s*=|clientConfig\\.Timeout\\s*=" pkg cmd | head -n 50

# If we can find a config flag named RequestTimeout, read its implementation.
rg -n "type .*RequestTimeout|RequestTimeout" pkg cmd | head -n 50

Repository: openshift/oc

Length of output: 38


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Confirm kubernetes clientset's Discovery() method type.
rg -n "func \\(.*\\) Discovery\\(" vendor/k8s.io/client-go/kubernetes/clientset.go
sed -n '450,520p' vendor/k8s.io/client-go/kubernetes/clientset.go

Repository: openshift/oc

Length of output: 2756


Use context/timeout for each keep-alive probe (don’t rely on discovery’s internal context.TODO()).

startClientKeepAlive cancels the ticker via ctx.Done(), but o.Client.Discovery().ServerVersion() issues the /version request with context.TODO(), so the in-flight probe won’t stop when you cancel stopKeepAlive(). Although client-go discovery applies a per-request HTTP timeout (defaults to 32s when unset), each probe should be executed with a probeCtx that derives from the keep-alive ctx and has a short deadline (e.g. via o.Client.Discovery().RESTClient().Get().AbsPath("/version").Do(probeCtx)).
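
The per-probe deadline idea can be illustrated with the standard library alone. This sketch deliberately uses `net/http` and a local `httptest` server instead of client-go, so it shows only the context handling, not the authenticated REST client; `probeVersion` and `demoProbe` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeVersion issues GET <baseURL>/version, bounded by both the parent ctx
// and a per-probe timeout derived from it.
func probeVersion(ctx context.Context, client *http.Client, baseURL string, timeout time.Duration) error {
	probeCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(probeCtx, http.MethodGet, baseURL+"/version", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// demoProbe runs one probe against a local test server. When cancelled is
// true, the parent context is cancelled before the probe starts.
func demoProbe(cancelled bool) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelled {
		cancel()
	}
	return probeVersion(ctx, srv.Client(), srv.URL, 2*time.Second)
}

func main() {
	fmt.Println("live context:", demoProbe(false))
	fmt.Println("cancelled context:", demoProbe(true))
}
```

A probe issued with an already cancelled parent context fails immediately instead of waiting for the HTTP timeout, which is the behaviour the comment asks for.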


Source: Coding guidelines

@dpateriya


/ok-to-test

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 11, 2026
@tchap

tchap commented Jun 11, 2026

Contributor

Shouldn't the RFE be approved first? Not saying this is not a good idea...

@swshende-cmd

Author

@tchap Hi, I agree with you, and apologies if I missed the workflow here. I am happy to be corrected on the right workflow to follow; if you can direct me accordingly, that'll be helpful.

- Make keep-alive interval configurable via unexported field for testability
- Add context cancellation check before each probe to ensure clean shutdown
- Filter context.Canceled errors from non-fatal log output
- Reduce test runtime from 35s to <1s by using 50ms interval in tests

Co-authored-by: Cursor <cursoragent@cursor.com>
@tchap

tchap commented Jun 11, 2026

Contributor

@swshende-cmd No big deal, it's just good to clarify the situation, probably in the comments of the associated Jira issue. Your PR has a much higher probability of being accepted once the RFE is marked to be implemented.

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/cli/admin/mustgather/mustgather_test.go (1)

741-746: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Flaky stop assertion due to an immediate post-cancel snapshot.

On Line 744, before := len(fakeClient.Actions()) is taken immediately after stopKeepAlive(). Because cancellation is async, one in-flight probe can still land, so Line 748 may fail nondeterministically even when shutdown is correct.

Suggested stabilization
 		stopKeepAlive()
-
-		before := len(fakeClient.Actions())
-		time.Sleep(200 * time.Millisecond)
+		// Allow any in-flight probe selected before cancel to complete.
+		time.Sleep(75 * time.Millisecond)
+		before := len(fakeClient.Actions())
+		time.Sleep(200 * time.Millisecond)
 		after := len(fakeClient.Actions())

 		if after != before {
 			t.Errorf("keep-alive goroutine continued after cancel: actions before=%d, after=%d", before, after)
 		}

Also applies to: 748-750

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/admin/mustgather/mustgather_test.go` around lines 741 - 746, The test
takes a snapshot of fakeClient.Actions() immediately after calling
stopKeepAlive(), which races with async cancellation and can observe an
in-flight probe; change the assertion to poll until the recorded action count
has stabilized instead of sampling immediately: after calling stopKeepAlive()
call a small retry loop (with a short sleep, e.g., 10–20ms, and overall timeout)
that reads fakeClient.Actions() repeatedly until two consecutive reads are
equal, then use that stable value as `before` and assert no further actions are
added (or capture `after` similarly); reference stopKeepAlive() and
fakeClient.Actions() when implementing the polling/stabilization in
mustgather_test.go.
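
The polling approach described in this prompt might be written as a small helper. This is a sketch only: `waitForStableCount` is a hypothetical name, `read` stands in for `len(fakeClient.Actions())`, and "two equal consecutive reads" is a heuristic rather than a guarantee that no probe is still in flight.

```go
package main

import (
	"fmt"
	"time"
)

// waitForStableCount polls read until two consecutive reads match or the
// timeout expires. It returns the last value read and whether it stabilized.
func waitForStableCount(read func() int, poll, timeout time.Duration) (int, bool) {
	deadline := time.Now().Add(timeout)
	prev := read()
	for time.Now().Before(deadline) {
		time.Sleep(poll)
		cur := read()
		if cur == prev {
			return cur, true
		}
		prev = cur
	}
	return prev, false
}

// demoStable feeds a scripted sequence of counts (1, 2, 2, ...) to the helper,
// imitating one in-flight probe landing shortly after cancellation.
func demoStable() (int, bool) {
	counts := []int{1, 2, 2, 2}
	i := 0
	read := func() int {
		v := counts[i]
		if i < len(counts)-1 {
			i++
		}
		return v
	}
	return waitForStableCount(read, time.Millisecond, time.Second)
}

func main() {
	n, ok := demoStable()
	fmt.Println(n, ok)
}
```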

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e232dcc2-e76e-49eb-b326-677bfebea853

📥 Commits

Reviewing files that changed from the base of the PR and between 9d8933f and a6c1c51.

📒 Files selected for processing (2)
  • pkg/cli/admin/mustgather/mustgather.go
  • pkg/cli/admin/mustgather/mustgather_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cli/admin/mustgather/mustgather.go

@swshende-cmd

Author

@tchap right 👍
The intent was to see if I could contribute, as it has been in the backlog for some time.
Would you be able to review this, please?

Comment thread pkg/cli/admin/mustgather/mustgather.go Outdated
//
// Returns a cancel function that stops the goroutine. The caller must invoke it
// when the keep-alive is no longer needed.
func (o *MustGatherOptions) startClientKeepAlive(ctx context.Context) context.CancelFunc {

Contributor


You can either use the ctx argument or the cancel func, but it doesn't make sense to use both. I lean towards just using ctx and letting the caller cancel the context to stop the goroutine.

Comment thread pkg/cli/admin/mustgather/mustgather.go Outdated
Comment on lines +72 to +845
@@ -445,8 +453,9 @@ type MustGatherOptions struct {
 	Since     time.Duration
 	SinceTime string
 
-	RsyncRshCmd string
-	clock       clock.PassiveClock
+	RsyncRshCmd       string
+	keepAliveInterval time.Duration
+	clock             clock.PassiveClock
 
 	PrinterCreated printers.ResourcePrinter
 	PrinterDeleted printers.ResourcePrinter
@@ -805,6 +814,42 @@ func (o *MustGatherOptions) Run() error {
 	return kutilerrors.NewAggregate(errs)
 }
 
+// startClientKeepAlive spawns a background goroutine that periodically makes
+// an authenticated API call to prevent the user's OAuth access token from being
+// revoked due to accessTokenInactivityTimeout. The log-follow connection held
+// by getGatherContainerLogs is a single long-lived HTTP stream that the API
+// server does not count as discrete activity, so without these probes the token
+// can expire on clusters with short inactivity windows (e.g. 5-10 minutes in
+// banking/government/PCI-DSS environments).
+//
+// Returns a cancel function that stops the goroutine. The caller must invoke it
+// when the keep-alive is no longer needed.
+func (o *MustGatherOptions) startClientKeepAlive(ctx context.Context) context.CancelFunc {
+	ctx, cancel := context.WithCancel(ctx)
+	interval := o.keepAliveInterval
+	if interval == 0 {
+		interval = defaultKeepAliveInterval
+	}
+	go func() {
+		ticker := time.NewTicker(interval)
+		defer ticker.Stop()
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case <-ticker.C:
+				if ctx.Err() != nil {
+					return
+				}
+				if _, err := o.Client.Discovery().ServerVersion(); err != nil && !errors.Is(err, context.Canceled) {
+					klog.V(5).Infof("keep-alive probe failed (non-fatal): %v", err)

Contributor


I would make it like V(2), this is effectively invisible.

@swshende-cmd

Author

@tchap Sure, I have updated the RFE and added the proposal and test evidence to it too.
I am checking your latest review comments and addressing them.

shendeswapnil6 and others added 2 commits June 11, 2026 21:44
- Remove internal context.WithCancel; let caller control goroutine
  lifetime via ctx cancellation (simpler API, no redundant cancel func)
- Change keep-alive probe log level from V(5) to V(2) for better
  visibility during troubleshooting

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-ci

openshift-ci Bot commented Jun 12, 2026

Contributor

@swshende-cmd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-serial-1of2 | 74a3113 | link | true | /test e2e-aws-ovn-serial-1of2 |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

ok-to-test Indicates a non-member PR verified by an org member that is safe to test.


4 participants