Fix concurrent SetSecret calls silently clobbering each other #4345
Conversation
Codecov Report — ❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #4345      +/-  ##
==========================================
- Coverage   68.45%   68.38%    -0.08%
==========================================
  Files         479      479
  Lines       48642    48669      +27
==========================================
- Hits        33300    33281      -19
- Misses      12373    12390      +17
- Partials     2969     2998      +29
```
amirejaz left a comment:
Solid fix — the read-modify-write pattern inside the lock and the `processLocks` in-process mutex address the root cause cleanly. The `Stat()` removal is correct: `readFileSecrets` replaces `stat.Size() > 0` with `len(data) == 0` after `os.ReadFile`, and also handles the non-existent file case. A few gaps below.
```go
	return err
}
if _, ok := secrets[name]; !ok {
	return fmt.Errorf("cannot delete non-existent secret: %s", name)
```
Stale cache entry is not evicted on the not-found path.
Because GetSecret reads only from e.secrets (in-memory cache) and DeleteSecret now reads from disk, the two can diverge:
1. Manager loads — `e.secrets` has `"mykey"`
2. Another process deletes `"mykey"` from disk
3. Caller calls `GetSecret("mykey")` → succeeds (stale cache)
4. Caller calls `DeleteSecret("mykey")` → `readFileSecrets` finds it gone → "cannot delete non-existent secret"
The caller just saw the key but cannot delete it. Worse, the stale entry is never cleaned up — every subsequent GetSecret keeps returning a value that no longer exists on disk.
The fix is a one-liner: evict the stale cache entry before returning the error:
```go
if _, ok := secrets[name]; !ok {
	e.secrets.Delete(name) // evict stale cache entry discovered to be gone on disk
	return fmt.Errorf("cannot delete non-existent secret: %s", name)
}
```
Good catch — fixed in beb1198. DeleteSecret now evicts the stale cache entry before returning the error.
```go
if err := e.writeFileSecrets(secrets); err != nil {
	return err
}
e.secrets.Delete(name)
```
Cache update lags behind the disk write. Between writeFileSecrets succeeding (line 88–90) and e.secrets.Delete(name) executing here, a concurrent GetSecret in the same process can still return the deleted value. The same window exists in SetSecret in the opposite direction (key written to disk but not yet in cache).
Both orderings are a deliberate trade-off (disk is the source of truth; the cache is a best-effort read optimisation). Worth a brief comment here and in SetSecret so future contributors don't try to "fix" the ordering:
```go
// Update the in-memory cache after the disk write. There is a brief window
// where a concurrent GetSecret may return a stale value; this is acceptable
// because the file is the authoritative source of truth.
e.secrets.Delete(name)
```
Added the comment in beb1198 — both SetSecret and DeleteSecret now document the brief cache-lag window and why it's acceptable.
```go
// readFileSecrets reads and decrypts the secrets file, returning the current
// on-disk secrets. Returns an empty map for an empty or non-existent file.
// Must be called while holding the file lock.
func (e *EncryptedManager) readFileSecrets() (map[string]string, error) {
```
GetSecret and ListSecrets still read from the stale in-memory cache (neither method is changed by this PR). A long-running proxy whose peer process wrote a new secret after construction will never see it via GetSecret.
This is acknowledged in the PR description as acceptable, but there is no code-level comment capturing the design decision. Without one, a future contributor might "fix" this by re-reading from disk on every GetSecret call — which would bypass the cache and add per-call decrypt overhead.
Suggestion: add a comment on the struct or GetSecret documenting the intentional split:
```go
// GetSecret and ListSecrets read from the in-memory cache only and do not
// re-read the file. They may not reflect secrets written by other processes
// after construction. This is intentional: CLI invocations create a fresh
// manager per call, and long-running proxies only need their own tokens.
```
Added the design comment on GetSecret in beb1198 — documents the intentional cache-only read and the reasoning (CLI one-shots vs long-running proxies).
```go
// Load the initial snapshot into the in-memory cache.
secrets, err := manager.readFileSecrets()
if err != nil {
	if strings.Contains(err.Error(), "unable to decrypt") {
```
strings.Contains on error messages is fragile. If the wrapper message in readFileSecrets ever changes (e.g. "unable to decrypt" → "failed to decrypt"), the user-facing guidance is silently lost with no compile-time or test-time signal.
The original code matched "message authentication failed" from the underlying crypto library, which was even more brittle. This PR improves it by matching our own wrapper, but it's still string-matching.
The right fix is a sentinel error in the aes package:
```go
// pkg/secrets/aes
var ErrDecryptionFailed = errors.New("decryption failed")
```

Then the check becomes `errors.Is(err, aes.ErrDecryptionFailed)`, which is refactor-safe. Tracking this as a follow-up would be worthwhile.
Agreed — a sentinel error in the aes package would be the right fix. Tracking as a follow-up to keep this PR focused on the TOCTOU race.
```go
// WithFileLock opens a new file descriptor, concurrent goroutines can all acquire
// the flock simultaneously. This in-process mutex ensures serialization within a
// single process, while the flock continues to protect cross-process access.
var processLocks sync.Map
```
`processLocks` grows unbounded. Entries are stored forever — one `*sync.Mutex` per distinct path, never evicted. For CLI one-shots and a fixed secrets file path this is harmless, but for any long-running process that calls `WithFileLock` over many distinct paths it is a minor memory leak.
At minimum document the trade-off in the comment:
> This map is never purged; callers should ensure the number of distinct paths remains bounded.
Documented the trade-off in beb1198 — the comment now notes the map is never pruned and callers should keep the set of distinct paths bounded.
```go
// getProcessLock returns the in-process mutex for the given path,
// creating one if it does not already exist.
func getProcessLock(path string) *sync.Mutex {
	val, _ := processLocks.LoadOrStore(path, &sync.Mutex{})
```
`LoadOrStore` allocates a new `&sync.Mutex{}` on every call, even when the key already exists (the surplus mutex is immediately discarded). In the steady state (paths already registered) this is one GC allocation per `WithFileLock` call. A Load-first pattern avoids it:
```go
func getProcessLock(path string) *sync.Mutex {
	if val, ok := processLocks.Load(path); ok {
		return val.(*sync.Mutex)
	}
	val, _ := processLocks.LoadOrStore(path, &sync.Mutex{})
	return val.(*sync.Mutex)
}
```
Good call — fixed in beb1198. getProcessLock now uses a Load-first fast path to avoid allocating a new mutex on every call in steady state.
`EncryptedManager` contained a TOCTOU race: it loaded the secrets file into an in-memory map at construction time and later wrote the stale map back to disk under the file lock, silently overwriting changes made by other processes (e.g. OAuth token refreshes running in background proxy processes). The fix has two parts:

1. Read-modify-write inside the lock (`pkg/secrets/encrypted.go`): `SetSecret`, `DeleteSecret`, and `Cleanup` now re-read and decrypt the file from disk inside the critical section before applying their mutation and writing back. This eliminates the stale-snapshot problem across both processes and goroutines.
2. In-process mutex in `WithFileLock` (`pkg/fileutils/lock.go`): `flock(2)` does not provide mutual exclusion between different file descriptors within the same process. Added a per-path `sync.Mutex` that serializes goroutines before acquiring the flock, so the cross-process lock remains effective for its intended purpose.

Fixes #4339

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Evict stale cache entry in `DeleteSecret` when key is gone from disk
- Add design comments on cache-lag windows in `SetSecret`/`DeleteSecret`
- Document `GetSecret`/`ListSecrets` intentional stale-cache behavior
- Document `processLocks` unbounded growth trade-off
- Use Load-first pattern in `getProcessLock` to avoid steady-state allocs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from a3c8eb0 to beb1198
Summary
`EncryptedManager` had a TOCTOU race: it loaded the secrets file at construction time and wrote that stale snapshot back under the file lock, silently overwriting changes from other processes. In practice, OAuth token refreshes from long-running proxy processes would clobber secrets set by concurrent `thv secret set` CLI invocations (and vice versa), causing ~8% secret loss under contention.

- All mutating operations (`SetSecret`, `DeleteSecret`, `Cleanup`) now re-read the file from disk inside the critical section before applying changes, eliminating the stale-snapshot problem.
- Added a per-path in-process `sync.Mutex` to `WithFileLock` because `flock(2)` does not provide mutual exclusion between different file descriptors within the same process.

Fixes #4339
Type of change
Test plan
- `task test` — existing `TestEncryptedManager_Concurrency` now passes reliably (previously flaky); ran 50 iterations with zero failures
- `go vet` passes on both changed packages

Changes
- `pkg/secrets/encrypted.go` — `SetSecret`/`DeleteSecret`/`Cleanup` now read-modify-write inside the lock instead of using a stale in-memory snapshot; extracted `readFileSecrets`/`writeFileSecrets` helpers; simplified `NewEncryptedManager` to reuse `readFileSecrets`
- `pkg/fileutils/lock.go` — per-path `sync.Mutex` registry (`processLocks`) so `WithFileLock` serializes both across processes (flock) and within the same process (mutex)
Yes — `thv secret set` will no longer silently lose secrets when other processes (e.g. OAuth token refreshes) write to the secrets file concurrently. Users who previously needed retry-based workarounds (#4339) should no longer experience secret loss.

Special notes for reviewers

- The `sync.Map` cache is now updated with targeted `Store`/`Delete` calls (not a full cache replacement) to avoid a window where concurrent `GetSecret` reads see an empty cache. This means the cache may not reflect keys added by other processes, but that's acceptable: CLI one-shots create a fresh manager per invocation, and long-running proxies only need their own tokens.
- `readFileSecrets` handles empty files and missing files gracefully, returning an empty map rather than erroring.
- `DeleteSecret` existence check moved inside the lock and now checks the on-disk state, not the potentially-stale in-memory cache.

Generated with Claude Code