
Fix concurrent SetSecret calls silently clobbering each other #4345

Open
JAORMX wants to merge 2 commits into main from fix-encrypted-secrets-toctou

Conversation

@JAORMX
Collaborator

@JAORMX JAORMX commented Mar 24, 2026

Summary

  • EncryptedManager had a TOCTOU race: it loaded the secrets file at construction time and wrote that stale snapshot back under the file lock, silently overwriting changes from other processes. In practice, OAuth token refreshes from long-running proxy processes would clobber secrets set by concurrent thv secret set CLI invocations (and vice versa), causing ~8% secret loss under contention.
  • Mutations (SetSecret, DeleteSecret, Cleanup) now re-read the file from disk inside the critical section before applying changes, eliminating the stale-snapshot problem.
  • Added a per-path in-process sync.Mutex to WithFileLock because flock(2) does not provide mutual exclusion between different file descriptors within the same process.

Fixes #4339

Type of change

  • Bug fix

Test plan

  • Unit tests (task test) — existing TestEncryptedManager_Concurrency now passes reliably (previously flaky); ran 50 iterations with zero failures
  • Manual testing — verified go vet passes on both changed packages

Changes

File Change
pkg/secrets/encrypted.go SetSecret/DeleteSecret/Cleanup now read-modify-write inside the lock instead of using a stale in-memory snapshot; extracted readFileSecrets/writeFileSecrets helpers; simplified NewEncryptedManager to reuse readFileSecrets
pkg/fileutils/lock.go Added per-path sync.Mutex registry (processLocks) so WithFileLock serializes both across processes (flock) and within the same process (mutex)

Does this introduce a user-facing change?

Yes — thv secret set will no longer silently lose secrets when other processes (e.g. OAuth token refreshes) write to the secrets file concurrently. Users who previously needed retry-based workarounds (#4339) should no longer experience secret loss.

Special notes for reviewers

  • The in-memory syncmap.Map cache is now updated with targeted Store/Delete calls (not a full cache replacement) to avoid a window where concurrent GetSecret reads see an empty cache. This means the cache may not reflect keys added by other processes, but that's acceptable: CLI one-shots create a fresh manager per invocation, and long-running proxies only need their own tokens.
  • readFileSecrets handles empty files and missing files gracefully, returning an empty map rather than erroring.
  • The DeleteSecret existence check moved inside the lock and now checks the on-disk state, not the potentially-stale in-memory cache.

Generated with Claude Code

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 24, 2026
@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 46.96970% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.38%. Comparing base (074326e) to head (beb1198).
⚠️ Report is 1 commit behind head on main.

Files with missing lines     Patch %   Lines
pkg/secrets/encrypted.go     42.10%    11 Missing and 22 partials ⚠️
pkg/fileutils/lock.go        77.77%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4345      +/-   ##
==========================================
- Coverage   68.45%   68.38%   -0.08%     
==========================================
  Files         479      479              
  Lines       48642    48669      +27     
==========================================
- Hits        33300    33281      -19     
- Misses      12373    12390      +17     
- Partials     2969     2998      +29     

@JAORMX JAORMX requested a review from amirejaz March 24, 2026 14:48
Contributor

@amirejaz amirejaz left a comment

Solid fix — the read-modify-write pattern inside the lock and the processLocks in-process mutex address the root cause cleanly. The Stat() removal is correct: readFileSecrets replaces stat.Size() > 0 with len(data) == 0 after os.ReadFile, and also handles the non-existent file case. A few gaps below.

return err
}
if _, ok := secrets[name]; !ok {
return fmt.Errorf("cannot delete non-existent secret: %s", name)
Contributor

Stale cache entry is not evicted on the not-found path.

Because GetSecret reads only from e.secrets (in-memory cache) and DeleteSecret now reads from disk, the two can diverge:

  1. Manager loads — e.secrets has "mykey"
  2. Another process deletes "mykey" from disk
  3. Caller calls GetSecret("mykey") → succeeds (stale cache)
  4. Caller calls DeleteSecret("mykey") → readFileSecrets finds it gone → "cannot delete non-existent secret"

The caller just saw the key but cannot delete it. Worse, the stale entry is never cleaned up — every subsequent GetSecret keeps returning a value that no longer exists on disk.

The fix is a one-liner: evict the stale cache entry before returning the error:

Suggested change

if _, ok := secrets[name]; !ok {
	e.secrets.Delete(name) // evict stale cache entry discovered to be gone on disk
	return fmt.Errorf("cannot delete non-existent secret: %s", name)
}

Collaborator Author

Good catch — fixed in beb1198. DeleteSecret now evicts the stale cache entry before returning the error.

if err := e.writeFileSecrets(secrets); err != nil {
return err
}
e.secrets.Delete(name)
Contributor

Cache update lags behind the disk write. Between writeFileSecrets succeeding (line 88–90) and e.secrets.Delete(name) executing here, a concurrent GetSecret in the same process can still return the deleted value. The same window exists in SetSecret in the opposite direction (key written to disk but not yet in cache).

Both orderings are a deliberate trade-off (disk is the source of truth; the cache is a best-effort read optimisation). Worth a brief comment here and in SetSecret so future contributors don't try to "fix" the ordering:

// Update the in-memory cache after the disk write. There is a brief window
// where a concurrent GetSecret may return a stale value; this is acceptable
// because the file is the authoritative source of truth.
e.secrets.Delete(name)

Collaborator Author

Added the comment in beb1198 — both SetSecret and DeleteSecret now document the brief cache-lag window and why it's acceptable.

// readFileSecrets reads and decrypts the secrets file, returning the current
// on-disk secrets. Returns an empty map for an empty or non-existent file.
// Must be called while holding the file lock.
func (e *EncryptedManager) readFileSecrets() (map[string]string, error) {
Contributor

GetSecret and ListSecrets still read from the stale in-memory cache (neither method is changed by this PR). A long-running proxy whose peer process wrote a new secret after construction will never see it via GetSecret.

This is acknowledged in the PR description as acceptable, but there is no code-level comment capturing the design decision. Without one, a future contributor might "fix" this by re-reading from disk on every GetSecret call — which would bypass the cache and add per-call decrypt overhead.

Suggestion: add a comment on the struct or GetSecret documenting the intentional split:

// GetSecret and ListSecrets read from the in-memory cache only and do not
// re-read the file. They may not reflect secrets written by other processes
// after construction. This is intentional: CLI invocations create a fresh
// manager per call, and long-running proxies only need their own tokens.

Collaborator Author

Added the design comment on GetSecret in beb1198 — documents the intentional cache-only read and the reasoning (CLI one-shots vs long-running proxies).

// Load the initial snapshot into the in-memory cache.
secrets, err := manager.readFileSecrets()
if err != nil {
if strings.Contains(err.Error(), "unable to decrypt") {
Contributor

strings.Contains on error messages is fragile. If the wrapper message in readFileSecrets ever changes (e.g. "unable to decrypt" → "failed to decrypt"), the user-facing guidance is silently lost with no compile-time or test-time signal.

The original code matched "message authentication failed" from the underlying crypto library, which was even more brittle. This PR improves it by matching our own wrapper, but it's still string-matching.

The right fix is a sentinel error in the aes package:

// pkg/secrets/aes
var ErrDecryptionFailed = errors.New("decryption failed")

Then the check becomes errors.Is(err, aes.ErrDecryptionFailed), which is refactor-safe. Tracking this as a follow-up would be worthwhile.

Collaborator Author

Agreed — a sentinel error in the aes package would be the right fix. Tracking as a follow-up to keep this PR focused on the TOCTOU race.

// Because WithFileLock opens a new file descriptor per call, concurrent
// goroutines can all acquire the flock simultaneously. This in-process mutex
// ensures serialization within a single process, while the flock continues to
// protect cross-process access.
var processLocks sync.Map
Contributor

processLocks grows unbounded. Entries are stored forever — one *sync.Mutex per distinct path, never evicted. For CLI one-shots and a fixed secrets file path this is harmless, but for any long-running process that calls WithFileLock over many distinct paths it is a minor memory leak.

At minimum document the trade-off in the comment:

This map is never purged; callers should ensure the number of distinct paths remains bounded.

Collaborator Author

Documented the trade-off in beb1198 — the comment now notes the map is never pruned and callers should keep the set of distinct paths bounded.

// getProcessLock returns the in-process mutex for the given path,
// creating one if it does not already exist.
func getProcessLock(path string) *sync.Mutex {
val, _ := processLocks.LoadOrStore(path, &sync.Mutex{})
Contributor

LoadOrStore evaluates its &sync.Mutex{} argument on every call, allocating a new mutex even when the key already exists (the surplus mutex is immediately discarded). In the steady state (paths already registered) this is a GC allocation per WithFileLock call. A Load-first pattern avoids it:

Suggested change

func getProcessLock(path string) *sync.Mutex {
	if val, ok := processLocks.Load(path); ok {
		return val.(*sync.Mutex)
	}
	val, _ := processLocks.LoadOrStore(path, &sync.Mutex{})
	return val.(*sync.Mutex)
}

Collaborator Author

Good call — fixed in beb1198. getProcessLock now uses a Load-first fast path to avoid allocating a new mutex on every call in steady state.

JAORMX and others added 2 commits March 25, 2026 05:49
EncryptedManager contained a TOCTOU race: it loaded the secrets file
into an in-memory map at construction time and later wrote the stale
map back to disk under the file lock, silently overwriting changes made
by other processes (e.g. OAuth token refreshes running in background
proxy processes).

The fix has two parts:

1. Read-modify-write inside the lock (pkg/secrets/encrypted.go):
   SetSecret, DeleteSecret, and Cleanup now re-read and decrypt the
   file from disk inside the critical section before applying their
   mutation and writing back. This eliminates the stale-snapshot
   problem across both processes and goroutines.

2. In-process mutex in WithFileLock (pkg/fileutils/lock.go):
   flock(2) does not provide mutual exclusion between different file
   descriptors within the same process. Added a per-path sync.Mutex
   that serializes goroutines before acquiring the flock, so the
   cross-process lock remains effective for its intended purpose.

Fixes #4339

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Evict stale cache entry in DeleteSecret when key is gone from disk
- Add design comments on cache-lag windows in SetSecret/DeleteSecret
- Document GetSecret/ListSecrets intentional stale-cache behavior
- Document processLocks unbounded growth trade-off
- Use Load-first pattern in getProcessLock to avoid steady-state allocs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX force-pushed the fix-encrypted-secrets-toctou branch from a3c8eb0 to beb1198 Compare March 25, 2026 05:50
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 25, 2026

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Concurrent SetSecret calls silently clobber each other (TOCTOU)

2 participants