
Fix concurrent SetSecret calls silently clobbering each other #4345

Open
JAORMX wants to merge 2 commits into main from fix-encrypted-secrets-toctou

Conversation

@JAORMX
Collaborator

@JAORMX JAORMX commented Mar 24, 2026

Summary

  • EncryptedManager had a TOCTOU race: it loaded the secrets file at construction time and wrote that stale snapshot back under the file lock, silently overwriting changes from other processes. In practice, OAuth token refreshes from long-running proxy processes would clobber secrets set by concurrent thv secret set CLI invocations (and vice versa), causing ~8% secret loss under contention.
  • Mutations (SetSecret, DeleteSecret, Cleanup) now re-read the file from disk inside the critical section before applying changes, eliminating the stale-snapshot problem.
  • Added a per-path in-process sync.Mutex to WithFileLock because flock(2) does not provide mutual exclusion between different file descriptors within the same process.

Fixes #4339

Type of change

  • Bug fix

Test plan

  • Unit tests (task test) — existing TestEncryptedManager_Concurrency now passes reliably (previously flaky); ran 50 iterations with zero failures
  • Manual testing — verified go vet passes on both changed packages

Changes

File Change
pkg/secrets/encrypted.go SetSecret/DeleteSecret/Cleanup now read-modify-write inside the lock instead of using a stale in-memory snapshot; extracted readFileSecrets/writeFileSecrets helpers; simplified NewEncryptedManager to reuse readFileSecrets
pkg/fileutils/lock.go Added per-path sync.Mutex registry (processLocks) so WithFileLock serializes both across processes (flock) and within the same process (mutex)

Does this introduce a user-facing change?

Yes — thv secret set will no longer silently lose secrets when other processes (e.g. OAuth token refreshes) write to the secrets file concurrently. Users who previously needed retry-based workarounds (#4339) should no longer experience secret loss.

Special notes for reviewers

  • The in-memory syncmap.Map cache is now updated with targeted Store/Delete calls (not a full cache replacement) to avoid a window where concurrent GetSecret reads see an empty cache. This means the cache may not reflect keys added by other processes, but that's acceptable: CLI one-shots create a fresh manager per invocation, and long-running proxies only need their own tokens.
  • readFileSecrets handles empty files and missing files gracefully, returning an empty map rather than erroring.
  • The DeleteSecret existence check moved inside the lock and now checks the on-disk state, not the potentially-stale in-memory cache.

Generated with Claude Code

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 24, 2026
@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 46.96970% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.38%. Comparing base (074326e) to head (beb1198).
⚠️ Report is 1 commit behind head on main.

Files with missing lines     Patch %   Lines
pkg/secrets/encrypted.go     42.10%    11 Missing and 22 partials ⚠️
pkg/fileutils/lock.go        77.77%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4345      +/-   ##
==========================================
- Coverage   68.45%   68.38%   -0.08%     
==========================================
  Files         479      479              
  Lines       48642    48669      +27     
==========================================
- Hits        33300    33281      -19     
- Misses      12373    12390      +17     
- Partials     2969     2998      +29     

@JAORMX JAORMX requested a review from amirejaz March 24, 2026 14:48
Contributor

@amirejaz amirejaz left a comment

Solid fix — the read-modify-write pattern inside the lock and the processLocks in-process mutex address the root cause cleanly. The Stat() removal is correct: readFileSecrets replaces stat.Size() > 0 with len(data) == 0 after os.ReadFile, and also handles the non-existent file case. A few gaps below.

return err
}
if _, ok := secrets[name]; !ok {
return fmt.Errorf("cannot delete non-existent secret: %s", name)
Contributor

Stale cache entry is not evicted on the not-found path.

Because GetSecret reads only from e.secrets (in-memory cache) and DeleteSecret now reads from disk, the two can diverge:

  1. Manager loads — e.secrets has "mykey"
  2. Another process deletes "mykey" from disk
  3. Caller calls GetSecret("mykey") → succeeds (stale cache)
  4. Caller calls DeleteSecret("mykey") → readFileSecrets finds it gone → "cannot delete non-existent secret"

The caller just saw the key but cannot delete it. Worse, the stale entry is never cleaned up — every subsequent GetSecret keeps returning a value that no longer exists on disk.

The fix is a one-liner: evict the stale cache entry before returning the error:

Suggested change

if _, ok := secrets[name]; !ok {
	e.secrets.Delete(name) // evict stale cache entry discovered to be gone on disk
	return fmt.Errorf("cannot delete non-existent secret: %s", name)
}

Collaborator Author

Good catch — fixed in beb1198. DeleteSecret now evicts the stale cache entry before returning the error.

if err := e.writeFileSecrets(secrets); err != nil {
return err
}
e.secrets.Delete(name)
Contributor

Cache update lags behind the disk write. Between writeFileSecrets succeeding (line 88–90) and e.secrets.Delete(name) executing here, a concurrent GetSecret in the same process can still return the deleted value. The same window exists in SetSecret in the opposite direction (key written to disk but not yet in cache).

Both orderings are a deliberate trade-off (disk is the source of truth; the cache is a best-effort read optimisation). Worth a brief comment here and in SetSecret so future contributors don't try to "fix" the ordering:

// Update the in-memory cache after the disk write. There is a brief window
// where a concurrent GetSecret may return a stale value; this is acceptable
// because the file is the authoritative source of truth.
e.secrets.Delete(name)

Collaborator Author

Added the comment in beb1198 — both SetSecret and DeleteSecret now document the brief cache-lag window and why it's acceptable.

// readFileSecrets reads and decrypts the secrets file, returning the current
// on-disk secrets. Returns an empty map for an empty or non-existent file.
// Must be called while holding the file lock.
func (e *EncryptedManager) readFileSecrets() (map[string]string, error) {
Contributor

GetSecret and ListSecrets still read from the stale in-memory cache (neither method is changed by this PR). A long-running proxy whose peer process wrote a new secret after construction will never see it via GetSecret.

This is acknowledged in the PR description as acceptable, but there is no code-level comment capturing the design decision. Without one, a future contributor might "fix" this by re-reading from disk on every GetSecret call — which would bypass the cache and add per-call decrypt overhead.

Suggestion: add a comment on the struct or GetSecret documenting the intentional split:

// GetSecret and ListSecrets read from the in-memory cache only and do not
// re-read the file. They may not reflect secrets written by other processes
// after construction. This is intentional: CLI invocations create a fresh
// manager per call, and long-running proxies only need their own tokens.

Collaborator Author

Added the design comment on GetSecret in beb1198 — documents the intentional cache-only read and the reasoning (CLI one-shots vs long-running proxies).

// Load the initial snapshot into the in-memory cache.
secrets, err := manager.readFileSecrets()
if err != nil {
if strings.Contains(err.Error(), "unable to decrypt") {
Contributor

strings.Contains on error messages is fragile. If the wrapper message in readFileSecrets ever changes (e.g. "unable to decrypt" → "failed to decrypt"), the user-facing guidance is silently lost with no compile-time or test-time signal.

The original code matched "message authentication failed" from the underlying crypto library, which was even more brittle. This PR improves it by matching our own wrapper, but it's still string-matching.

The right fix is a sentinel error in the aes package:

// pkg/secrets/aes
var ErrDecryptionFailed = errors.New("decryption failed")

Then the check becomes errors.Is(err, aes.ErrDecryptionFailed), which is refactor-safe. Tracking this as a follow-up would be worthwhile.

Collaborator Author

Agreed — a sentinel error in the aes package would be the right fix. Tracking as a follow-up to keep this PR focused on the TOCTOU race.

// Because WithFileLock opens a new file descriptor per call, concurrent
// goroutines can all acquire the flock simultaneously. This in-process mutex
// ensures serialization within a single process, while the flock continues to
// protect cross-process access.
var processLocks sync.Map
Contributor

processLocks grows unbounded. Entries are stored forever — one *sync.Mutex per distinct path, never evicted. For CLI one-shots and a fixed secrets file path this is harmless, but for any long-running process that calls WithFileLock over many distinct paths it is a minor memory leak.

At minimum document the trade-off in the comment:

This map is never purged; callers should ensure the number of distinct paths remains bounded.

Collaborator Author

Documented the trade-off in beb1198 — the comment now notes the map is never pruned and callers should keep the set of distinct paths bounded.

// getProcessLock returns the in-process mutex for the given path,
// creating one if it does not already exist.
func getProcessLock(path string) *sync.Mutex {
val, _ := processLocks.LoadOrStore(path, &sync.Mutex{})
Contributor

LoadOrStore evaluates its &sync.Mutex{} argument on every call, allocating a new mutex even when the key already exists (the surplus mutex is immediately discarded). In the steady state (paths already registered) this is a GC allocation per WithFileLock call. A Load-first pattern avoids it:

Suggested change

func getProcessLock(path string) *sync.Mutex {
	if val, ok := processLocks.Load(path); ok {
		return val.(*sync.Mutex)
	}
	val, _ := processLocks.LoadOrStore(path, &sync.Mutex{})
	return val.(*sync.Mutex)
}

Collaborator Author

Good call — fixed in beb1198. getProcessLock now uses a Load-first fast path to avoid allocating a new mutex on every call in steady state.

JAORMX and others added 2 commits March 25, 2026 05:49
EncryptedManager contained a TOCTOU race: it loaded the secrets file
into an in-memory map at construction time and later wrote the stale
map back to disk under the file lock, silently overwriting changes made
by other processes (e.g. OAuth token refreshes running in background
proxy processes).

The fix has two parts:

1. Read-modify-write inside the lock (pkg/secrets/encrypted.go):
   SetSecret, DeleteSecret, and Cleanup now re-read and decrypt the
   file from disk inside the critical section before applying their
   mutation and writing back. This eliminates the stale-snapshot
   problem across both processes and goroutines.

2. In-process mutex in WithFileLock (pkg/fileutils/lock.go):
   flock(2) does not provide mutual exclusion between different file
   descriptors within the same process. Added a per-path sync.Mutex
   that serializes goroutines before acquiring the flock, so the
   cross-process lock remains effective for its intended purpose.

Fixes #4339

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Evict stale cache entry in DeleteSecret when key is gone from disk
- Add design comments on cache-lag windows in SetSecret/DeleteSecret
- Document GetSecret/ListSecrets intentional stale-cache behavior
- Document processLocks unbounded growth trade-off
- Use Load-first pattern in getProcessLock to avoid steady-state allocs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX force-pushed the fix-encrypted-secrets-toctou branch from a3c8eb0 to beb1198 Compare March 25, 2026 05:50
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 25, 2026

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Concurrent SetSecret calls silently clobber each other (TOCTOU)

2 participants