must-gather: add client-side keep-alive to prevent accessTokenInactivityTimeout failures#2288
must-gather: add client-side keep-alive to prevent accessTokenInactivityTimeout failures#2288swshende-cmd wants to merge 4 commits into
Conversation
…ityTimeout failures Co-authored-by: Cursor <cursoragent@cursor.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughAdds a cancellable background keep-alive that periodically calls the Kubernetes discovery API (ServerVersion) to prevent OAuth token inactivity during long-running must-gather log-follow operations; wires it into pod processing and adds a unit test. ChangesOAuth Keep-Alive Mechanism
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (13 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: swshende-cmd The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @swshende-cmd. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/cli/admin/mustgather/mustgather_test.go`:
- Around line 703-753: The test should drive the keep-alive ticker instead of
sleeping; add a testable hook to MustGatherOptions (e.g., a field
KeepAliveInterval time.Duration or a TickerFactory/Clock) and update
startClientKeepAlive to use that hook instead of the hard-coded
defaultKeepAliveInterval (or direct time.NewTicker). Change startClientKeepAlive
to accept or read the injected interval/ticker and use
time.NewTicker(o.KeepAliveInterval) (or the provided ticker) so tests can set a
short interval or provide a controllable ticker; update the test to set
o.KeepAliveInterval to a small duration (or supply a fake ticker) and assert
that fakeClient.Actions() increases only during driven ticks and remains flat
after calling stopKeepAlive().
In `@pkg/cli/admin/mustgather/mustgather.go`:
- Around line 829-840: startClientKeepAlive currently calls
o.Client.Discovery().ServerVersion() which uses context.TODO() so in-flight
probes won’t cancel; change the probe to derive a short probeCtx from the
keep-alive ctx using context.WithTimeout(ctx, <shortDuration>) and cancel it
after the call, and perform the probe using
o.Client.Discovery().RESTClient().Get().AbsPath("/version").Do(probeCtx) (or the
ServerVersionContext equivalent if available) so each keep-alive run respects
cancellation and a bounded deadline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a97219d1-9b54-43b7-8eaf-fd2e641f9db0
📒 Files selected for processing (2)
pkg/cli/admin/mustgather/mustgather.gopkg/cli/admin/mustgather/mustgather_test.go
| ticker := time.NewTicker(defaultKeepAliveInterval) | ||
| defer ticker.Stop() | ||
| for { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return | ||
| case <-ticker.C: | ||
| // A lightweight authenticated GET that resets the OAuth | ||
| // inactivity timer. Errors are expected if the cluster is | ||
| // temporarily unreachable and are not actionable here. | ||
| if _, err := o.Client.Discovery().ServerVersion(); err != nil { | ||
| klog.V(5).Infof("keep-alive probe failed (non-fatal): %v", err) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate keep-alive implementation and relevant constants.
rg -n "defaultKeepAliveInterval|startClientKeepAlive|ServerVersion\\(|keep-alive|keepAlive" pkg/cli/admin/mustgather/mustgather.go
# Show the exact keep-alive probe loop and the surrounding function(s).
sed -n '780,920p' pkg/cli/admin/mustgather/mustgather.go
# Also show the full startClientKeepAlive function if it spans outside the snippet.
rg -n "func \\(o \\*MustGatherOptions\\) startClientKeepAlive" -n pkg/cli/admin/mustgather/mustgather.go
python3 - <<'PY'
import re, pathlib
p = pathlib.Path("pkg/cli/admin/mustgather/mustgather.go")
txt = p.read_text()
m = re.search(r'func \\(o \\*MustGatherOptions\\) startClientKeepAlive\\b[\\s\\S]*?\\n}\\n', txt)
if m:
print(m.group(0))
else:
print("startClientKeepAlive function body not found by regex")
PY
# Check for any HTTP client / rest.Config timeout setup on o.Client elsewhere in mustgather.go.
rg -n "Timeout:|Dial|TLSHandshake|ResponseHeader|IdleConn|Transport|http\\.Client|rest\\.Config|WrapTransport|QPS|Burst" pkg/cli/admin/mustgather/mustgather.go
sed -n '1,260p' pkg/cli/admin/mustgather/mustgather.goRepository: openshift/oc
Length of output: 17216
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if the discovery interface offers a context-aware ServerVersion call.
rg -n "ServerVersionWithContext|type DiscoveryInterface|ServerVersion\\(" -S vendor/k8s.io/client-go 2>/dev/null || true
rg -n "ServerVersionWithContext|ServerVersion\\(" -S . | head -n 50
# Check if mustgather or its options set rest.Config timeouts / transport timeouts.
rg -n "Config\\.Timeout|o\\.Config\\.Timeout|rest\\.Config\\{[^}]*Timeout|Timeout:\\s*.*time\\.Second|http\\.Client\\{[^}]*Timeout" pkg/cli/admin/mustgather/mustgather.go
# Inspect MustGatherOptions definition for any HTTP/rest timeout customization fields.
rg -n "type MustGatherOptions struct" -n pkg/cli/admin/mustgather/mustgather.go
sed -n '380,520p' pkg/cli/admin/mustgather/mustgather.go
# Search for any factory/rest config tweaking patterns in this command.
rg -n "ToRESTConfig\\(|NewForConfig\\(|WithTimeout|Timeout:" pkg/cli/admin/mustgather/mustgather.goRepository: openshift/oc
Length of output: 4209
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect how DiscoveryClient.ServerVersion() performs the request.
sed -n '640,740p' vendor/k8s.io/client-go/discovery/discovery_client.go
# Look for any context usage in that function.
rg -n "func \\(d \\*DiscoveryClient\\) ServerVersion\\(" -n vendor/k8s.io/client-go/discovery/discovery_client.go
rg -n "Do\\(|WithContext|context\\." vendor/k8s.io/client-go/discovery/discovery_client.go | head -n 50
# Check whether rest.Config.Timeout exists/gets applied (HTTP client timeout).
rg -n "type Config struct" vendor/k8s.io/client-go/rest/config.go
rg -n "Timeout" vendor/k8s.io/client-go/rest/config.go | head -n 80
rg -n "HTTPClient.*Timeout|Timeout.*http\\.Client|Client\\.Timeout" -S vendor/k8s.io/client-go/rest | head -n 80
# Search this repo for rest.Config.Timeout being set (to see if factory/oc wiring sets it).
rg -n "Config\\.Timeout\\s*=" pkg cmd hack | head -n 50
rg -n "HTTPClient\\{.*Timeout|http\\.Client\\{.*Timeout" pkg cmd hack | head -n 50Repository: openshift/oc
Length of output: 4459
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find defaultTimeout value in discovery client code.
rg -n "defaultTimeout|setDiscoveryDefaults\\(" vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '1,120p' vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '120,220p' vendor/k8s.io/client-go/discovery/discovery_client.go
sed -n '560,760p' vendor/k8s.io/client-go/discovery/discovery_client.go
# Check how kubernetes clientset builds the discovery client and whether it calls setDiscoveryDefaults.
rg -n "setDiscoveryDefaults\\(" vendor/k8s.io/client-go | head -n 50
rg -n "DiscoveryClient" vendor/k8s.io/client-go/kubernetes -S | head -n 50
# Inspect rest/request.go timeout behavior (HTTP client timeout source).
sed -n '120,220p' vendor/k8s.io/client-go/rest/request.goRepository: openshift/oc
Length of output: 20771
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for must-gather specific config flags or request-timeout wiring.
rg -n "request-timeout|RequestTimeout|ConfigFlags|genericclioptions\\.ConfigFlags" pkg/cli/admin/mustgather/mustgather.go
# See if oc globally supports --request-timeout and how it maps to rest.Config.Timeout.
rg -n "\"request-timeout\"|request-timeout" pkg cmd | head -n 50
# Inspect any shared config flag code that sets rest.Config.Timeout (might be in pkg/helpers/cmd or similar).
rg -n "Config\\.Timeout\\s*=|clientConfig\\.Timeout\\s*=" pkg cmd | head -n 50
# If we can find a config flag named RequestTimeout, read its implementation.
rg -n "type .*RequestTimeout|RequestTimeout" pkg cmd | head -n 50Repository: openshift/oc
Length of output: 38
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Confirm kubernetes clientset's Discovery() method type.
rg -n "func \\(.*\\) Discovery\\(" vendor/k8s.io/client-go/kubernetes/clientset.go
sed -n '450,520p' vendor/k8s.io/client-go/kubernetes/clientset.goRepository: openshift/oc
Length of output: 2756
Use context/timeout for each keep-alive probe (don’t rely on discovery’s internal context.TODO()).
startClientKeepAlive cancels the ticker via ctx.Done(), but o.Client.Discovery().ServerVersion() issues the /version request with context.TODO(), so the in-flight probe won’t stop when you cancel stopKeepAlive(). Although client-go discovery applies a per-request HTTP timeout (defaults to 32s when unset), each probe should be executed with a probeCtx that derives from the keep-alive ctx and has a short deadline (e.g. via o.Client.Discovery().RESTClient().Get().AbsPath("/version").Do(probeCtx)).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/cli/admin/mustgather/mustgather.go` around lines 829 - 840,
startClientKeepAlive currently calls o.Client.Discovery().ServerVersion() which
uses context.TODO() so in-flight probes won’t cancel; change the probe to derive
a short probeCtx from the keep-alive ctx using context.WithTimeout(ctx,
<shortDuration>) and cancel it after the call, and perform the probe using
o.Client.Discovery().RESTClient().Get().AbsPath("/version").Do(probeCtx) (or the
ServerVersionContext equivalent if available) so each keep-alive run respects
cancellation and a bounded deadline.
Source: Coding guidelines
|
/ok-to-test |
|
Shouldn't the RFE be approved first? Not saying this is not a good idea... |
|
@tchap Hi, I agree with you, and apologies if I have missed the workflow here, I am happy to be corrected for the right workflow to be followed and if you can direct me accordingly, that ll be helpful. |
- Make keep-alive interval configurable via unexported field for testability - Add context cancellation check before each probe to ensure clean shutdown - Filter context.Canceled errors from non-fatal log output - Reduce test runtime from 35s to <1s by using 50ms interval in tests Co-authored-by: Cursor <cursoragent@cursor.com>
|
@swshende-cmd No big deal, just good to clarify the situation, probably in the associated Jira issue comments. Then you have much higher probability of your PR being accepted, when the RFE is marked to be implemented. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/cli/admin/mustgather/mustgather_test.go (1)
741-746:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFlaky stop assertion due immediate post-cancel snapshot.
On Line 744,
before := len(fakeClient.Actions())is taken immediately afterstopKeepAlive(). Because cancellation is async, one in-flight probe can still land, so Line 748 may fail nondeterministically even when shutdown is correct.Suggested stabilization
stopKeepAlive() - - before := len(fakeClient.Actions()) - time.Sleep(200 * time.Millisecond) + // Allow any in-flight probe selected before cancel to complete. + time.Sleep(75 * time.Millisecond) + before := len(fakeClient.Actions()) + time.Sleep(200 * time.Millisecond) after := len(fakeClient.Actions()) if after != before { t.Errorf("keep-alive goroutine continued after cancel: actions before=%d, after=%d", before, after) }Also applies to: 748-750
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/admin/mustgather/mustgather_test.go` around lines 741 - 746, The test takes a snapshot of fakeClient.Actions() immediately after calling stopKeepAlive(), which races with async cancellation and can observe an in-flight probe; change the assertion to poll until the recorded action count has stabilized instead of sampling immediately: after calling stopKeepAlive() call a small retry loop (with a short sleep, e.g., 10–20ms, and overall timeout) that reads fakeClient.Actions() repeatedly until two consecutive reads are equal, then use that stable value as `before` and assert no further actions are added (or capture `after` similarly); reference stopKeepAlive() and fakeClient.Actions() when implementing the polling/stabilization in mustgather_test.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@pkg/cli/admin/mustgather/mustgather_test.go`:
- Around line 741-746: The test takes a snapshot of fakeClient.Actions()
immediately after calling stopKeepAlive(), which races with async cancellation
and can observe an in-flight probe; change the assertion to poll until the
recorded action count has stabilized instead of sampling immediately: after
calling stopKeepAlive() call a small retry loop (with a short sleep, e.g.,
10–20ms, and overall timeout) that reads fakeClient.Actions() repeatedly until
two consecutive reads are equal, then use that stable value as `before` and
assert no further actions are added (or capture `after` similarly); reference
stopKeepAlive() and fakeClient.Actions() when implementing the
polling/stabilization in mustgather_test.go.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e232dcc2-e76e-49eb-b326-677bfebea853
📒 Files selected for processing (2)
pkg/cli/admin/mustgather/mustgather.gopkg/cli/admin/mustgather/mustgather_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/cli/admin/mustgather/mustgather.go
|
@tchap right 👍 |
| // | ||
| // Returns a cancel function that stops the goroutine. The caller must invoke it | ||
| // when the keep-alive is no longer needed. | ||
| func (o *MustGatherOptions) startClientKeepAlive(ctx context.Context) context.CancelFunc { |
There was a problem hiding this comment.
You can either use the ctx argument or the cancel func, but it doesn't make sense to use both. I lean towards just using ctx and letting the caller cancel the context to stop the goroutine.
| @@ -445,8 +453,9 @@ type MustGatherOptions struct { | |||
| Since time.Duration | |||
| SinceTime string | |||
|
|
|||
| RsyncRshCmd string | |||
| clock clock.PassiveClock | |||
| RsyncRshCmd string | |||
| keepAliveInterval time.Duration | |||
| clock clock.PassiveClock | |||
|
|
|||
| PrinterCreated printers.ResourcePrinter | |||
| PrinterDeleted printers.ResourcePrinter | |||
| @@ -805,6 +814,42 @@ func (o *MustGatherOptions) Run() error { | |||
| return kutilerrors.NewAggregate(errs) | |||
| } | |||
|
|
|||
| // startClientKeepAlive spawns a background goroutine that periodically makes | |||
| // an authenticated API call to prevent the user's OAuth access token from being | |||
| // revoked due to accessTokenInactivityTimeout. The log-follow connection held | |||
| // by getGatherContainerLogs is a single long-lived HTTP stream that the API | |||
| // server does not count as discrete activity, so without these probes the token | |||
| // can expire on clusters with short inactivity windows (e.g. 5-10 minutes in | |||
| // banking/government/PCI-DSS environments). | |||
| // | |||
| // Returns a cancel function that stops the goroutine. The caller must invoke it | |||
| // when the keep-alive is no longer needed. | |||
| func (o *MustGatherOptions) startClientKeepAlive(ctx context.Context) context.CancelFunc { | |||
| ctx, cancel := context.WithCancel(ctx) | |||
| interval := o.keepAliveInterval | |||
| if interval == 0 { | |||
| interval = defaultKeepAliveInterval | |||
| } | |||
| go func() { | |||
| ticker := time.NewTicker(interval) | |||
| defer ticker.Stop() | |||
| for { | |||
| select { | |||
| case <-ctx.Done(): | |||
| return | |||
| case <-ticker.C: | |||
| if ctx.Err() != nil { | |||
| return | |||
| } | |||
| if _, err := o.Client.Discovery().ServerVersion(); err != nil && !errors.Is(err, context.Canceled) { | |||
| klog.V(5).Infof("keep-alive probe failed (non-fatal): %v", err) | |||
There was a problem hiding this comment.
I would make it like V(2), this is effectively invisible.
|
@tchap sure, I have updated and added proposal and test evidences to the RFE too. |
- Remove internal context.WithCancel; let caller control goroutine lifetime via ctx cancellation (simpler API, no redundant cancel func) - Change keep-alive probe log level from V(5) to V(2) for better visibility during troubleshooting Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
@swshende-cmd: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Add a client-side keep-alive mechanism to
oc adm must-gatherthat prevents OAuth access tokenrevocation due to
accessTokenInactivityTimeouton hardened clusters.Problem: When
accessTokenInactivityTimeoutis configured (e.g., 300s for regulatedenvironments),
oc adm must-gatherfails with "Unauthorized" errors mid-collection. The rootcause is that the
occlient holds a single long-lived HTTP log-follow stream to watch thegather pod. The API server does not count this persistent connection as "activity" for the OAuth
inactivity timer, causing the token to be revoked.
Fix: A background goroutine that calls
Discovery().ServerVersion()(authenticatedGET /version) every 30 seconds during the gather lifecycle. This resets the OAuth inactivity
timer on each call, keeping the session alive for the full duration of the operation.
Resolves: RFE-8595
What Changed
Two additions to
pkg/cli/admin/mustgather/mustgather.go:startClientKeepAlive(ctx)— Background goroutine making periodic authenticated APIprobes (every 30s) to reset the OAuth token inactivity timer.
Integration in
processNextWorkItem()— Started after pod creation, stopped viadeferafter data copy completes. Covers: pod startup wait -> log streaming -> gathercompletion wait -> rsync download.
Test Evidence
Environment
accessTokenInactivityTimeout: 300sinactivityTimeoutSeconds: 300(confirmed on OAuthAccessToken object)sleep 360(6-min idle gap) injected before/usr/bin/gather— forces a window with zero client-side API calls, exceeding the 5-minute timeout
WITHOUT the fix (unpatched oc)
WITH the fix (patched oc - keep-alive enabled)
Comparative Summary
Unit Tests
Use Case
This addresses environments where short
accessTokenInactivityTimeoutvalues (e.g., 5-10minutes) are mandated by regulatory or security compliance policies (Banking, Government,
PCI-DSS). Previously, administrators in such environments could not run
oc adm must-gatherwithout either:
while true; do oc whoami; sleep 60; done) in the backgroundThis fix eliminates both workarounds by building the equivalent heartbeat directly into
oc.Design Decisions
a 5-minute timeout would expire
Discovery().ServerVersion()— Lightest possible authenticated API call; single GET,no list/watch, no resource creation
won't kill the gather
deferHow to Reproduce
Made with Cursor
Summary by CodeRabbit