Skip to content

fix: prevent data races and panics in SDK stop/heartbeat handling#457

Merged
joshiste merged 1 commit into
mainfrom
fix/sdk-stop-heartbeat-races
Jul 1, 2026
Merged

fix: prevent data races and panics in SDK stop/heartbeat handling#457
joshiste merged 1 commit into
mainfrom
fix/sdk-stop-heartbeat-races

Conversation

@joshiste

@joshiste joshiste commented Jul 1, 2026

Copy link
Copy Markdown
Member

Problem

A security/code audit of the shared kit libraries found two CRITICAL concurrency bugs in action_kit_sdk — both can crash any of the ~33 extensions that use the SDK, and neither is caught by exthttp.PanicRecovery (they occur in non-recovered goroutines / are data races).

  1. stopEvents slice data race. markAsStopped (append + stopEvents = stopEvents[1:]) and getStopEvent (range) run with no lock from the HTTP stop/status handler goroutines, the heartbeat-timeout goroutine (StopAction) and the signal handler. Concurrent append/reslice/range on a plain slice → data race → torn reads, dropped/duplicated stop events, or slice bounds out of range.

  2. Heartbeat channel double-close / send-on-closed panic. Monitor.Stop closed pulse non-idempotently and RecordHeartbeat did a blocking send. Two stop paths (the HTTP stop handler and the heartbeat-timeout goroutine) can close(pulse) twice → panic: close of closed channel; a concurrent RecordHeartbeat can send on a closed channel → panic. These run in a non-recovered goroutine → the whole extension process dies.

Fix

  • Guard stopEvents with a package sync.Mutex (both accessors).
  • Make heartbeat.Monitor self-guarding: an internal mu+closed flag makes Stop idempotent (close once) and RecordHeartbeat a non-blocking, closed-safe send (never panics, never blocks the HTTP status handler; a beat is dropped only if the buffer is already full).
  • stopMonitorHeartbeat uses sync.Map.LoadAndDelete so only one caller stops a given monitor.

Adds race tests for both paths (go test -race).

Notes

  • The identical two bugs exist in preflight-kit's copied SDK; that will be fixed the same way in a follow-up PR (per-repo, since they are separate modules — extracting a shared module isn't worth the cross-module release coupling for ~40 lines).
  • This is the first of the action-kit audit PRs; the action_kit_commons robustness findings (ociruntime nil-derefs, netfault divide-by-zero, process-wrapper nil-derefs, tc/iptables newline-injection hardening) will follow as a separate PR.

Verification

go build ./..., go vet ./..., and go test -race ./... pass (module go/action_kit_sdk); existing heartbeat tests still pass.

Two concurrency bugs in action_kit_sdk could crash any extension:

- The shared stopEvents slice was appended/resliced/ranged from the HTTP
  stop and status handlers, the heartbeat-timeout goroutine and the signal
  handler with no lock — a data race that could panic or drop stop events.
  Guard it with a mutex.
- heartbeat.Monitor.Stop closed the pulse channel non-idempotently and
  RecordHeartbeat did a blocking send, so two stop paths racing (HTTP stop
  handler vs heartbeat-timeout goroutine) could double-close, and a
  concurrent RecordHeartbeat could send on a closed channel — both panic in
  a non-recovered goroutine, crashing the process. Make Stop idempotent and
  RecordHeartbeat a non-blocking, closed-safe send (guarded by the monitor's
  mutex), and use LoadAndDelete in stopMonitorHeartbeat so only one caller
  stops a given monitor.

Adds race tests for both paths.
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown

Claude finished @joshiste's task in 1m 54s —— View job


Review complete ✅

Todo list:

  • Check previously unresolved review threads for fixes (none exist yet — this is the first review)
  • Gather context (diff, affected files)
  • Review stopEvents mutex fix (action_sdk.go)
  • Review heartbeat Monitor self-guarding fix (heartbeat.go)
  • Review new tests
  • Post final review

Summary

This is a clean, well-scoped concurrency fix. I traced all call sites for both bugs and the fix closes them correctly:

stopEvents race (action_sdk.go:30-31, 282-304)

  • stopEventsMu now guards both markAsStopped (append/reslice) and getStopEvent (range) — the only two mutators/readers of the package-level slice.
  • getStopEvent returns &event from inside the loop, but since it's a copy taken under the lock and returned immediately (not captured in a closure or stored across iterations), this is safe regardless of Go version's loop-variable semantics.
  • Confirmed via grep that getStopEvent's two callers (action_http_adapter.go:288,372) only dereference the result immediately — no aliasing/lifetime issue.

Heartbeat double-close / send-on-closed (heartbeat.go)

  • Monitor.mu + closed flag makes Stop() idempotent (checked-and-set under lock before close).
  • RecordHeartbeat() checks closed under the same lock before attempting a send, and the send itself is non-blocking (select/default), so it can neither panic on a closed channel nor block the HTTP status-handler goroutine when the buffer is full.
  • stopMonitorHeartbeat switched from Load+Delete to LoadAndDelete, so under a concurrent stop-handler-vs-timeout-goroutine race only one caller obtains the monitor to stop — though this is now a belt-and-suspenders improvement rather than strictly required, since Stop() is idempotent on its own.
  • recordHeartbeat (action_sdk.go:266-271) still uses plain Load, which is correct: even if the monitor was concurrently deleted from the map, the *heartbeat.Monitor object itself remains valid and self-guards via closed.
  • The inner goroutine started in Notify closes the caller-supplied ch (not pulse) exactly once per goroutine lifetime on either the pulse-closed or timeout branch — unaffected by this change and not a double-close risk.

TestsTestStopEvents_concurrent_access and TestMonitor_concurrent_stop_and_record directly exercise the two race/panic scenarios described in the PR body (concurrent append/range, and concurrent/repeated Stop() + RecordHeartbeat()). Both are meaningful regression tests for -race.

No bugs, security issues, or leftover race conditions found. No inline comments needed — the changes are small, correct, and match the stated intent.

Note: I was not able to run go build/go vet/go test -race locally (shell command execution was sandboxed/denied in this environment), so I relied on static analysis and tracing call sites. The PR description states these all pass; if you'd like me to actually execute them, --allowedTools would need to permit Bash(go *).

@sonarqubecloud

sonarqubecloud Bot commented Jul 1, 2026

Copy link
Copy Markdown

@joshiste joshiste merged commit 93625ea into main Jul 1, 2026
11 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 1, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant