Session liveness: finalize ACTIVE sessions whose agent has exited#1488
Soph wants to merge 4 commits into
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit b7e2f3a. Configure here.
Pull request overview
Adds immediate, process-based session liveness detection so sessions don’t remain ACTIVE indefinitely when an agent exits without emitting a SessionStop hook. This fits into the session/strategy lifecycle by recording an owning-process identity at turn start and sweeping/finalizing exited sessions on entire status / entire doctor.
Changes:
- Introduces proclive for capturing/checking a process identity (PID + start fingerprint + host/boot guards).
- Extends session state with Owner plus OwnerLiveness() / OwnerExited() and wires owner capture into turn start (InitializeSession).
- Adds an exited-session sweep (finalizeExitedSessions) and refactors canonical session-end behavior into endSessionNow, called by both lifecycle stop and the sweep.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cmd/entire/cli/strategy/owner_wiring_test.go | Strategy wiring test ensuring owner capture at turn start (unix only). |
| cmd/entire/cli/strategy/manual_commit_hooks.go | Captures owner identity on each InitializeSession (turn start). |
| cmd/entire/cli/status.go | Runs exited-session sweep before rendering active sessions; labels exited. |
| cmd/entire/cli/sessions.go | Updates stop flow for new markSessionEnded signature. |
| cmd/entire/cli/session/state.go | Adds Owner field + liveness/exited helpers on session state. |
| cmd/entire/cli/session/owner_test.go | Unit tests for nil-owner and non-active phase behavior. |
| cmd/entire/cli/session/owner_live_test.go | Live tests for OwnerExited (linux/darwin). |
| cmd/entire/cli/session_finalize.go | Implements finalizeExitedSessions sweep and in-memory state mutation. |
| cmd/entire/cli/session_finalize_test.go | Tests the sweep finalization and under-lock revalidation. |
| cmd/entire/cli/proclive/proclive.go | Core identity capture + liveness checking logic. |
| cmd/entire/cli/proclive/proclive_test.go | Unit tests for stringer, empty identity, host mismatch, transient-name detection. |
| cmd/entire/cli/proclive/proclive_live_test.go | Live-process tests for Alive/Dead, start mismatch, and ResolveOwner behavior. |
| cmd/entire/cli/proclive/proc_other.go | Unsupported-platform seam (returns Unknown / no owner). |
| cmd/entire/cli/proclive/proc_other_test.go | Verifies unsupported platforms degrade to Unknown. |
| cmd/entire/cli/proclive/proc_linux.go | Linux /proc implementation + boot ID fallback. |
| cmd/entire/cli/proclive/proc_linux_test.go | Tests adversarial /proc/<pid>/stat parsing. |
| cmd/entire/cli/proclive/proc_darwin.go | Darwin sysctl-based proc stat + boottime. |
| cmd/entire/cli/phase_wiring_test.go | Updates tests for new markSessionEnded return signature. |
| cmd/entire/cli/lifecycle.go | Extracts endSessionNow; adds guardable markSessionEnded. |
| cmd/entire/cli/doctor.go | Runs exited-session sweep up front; classifies exited reason distinctly. |
New leaf package that captures a process's identity (PID + start-time fingerprint, plus host/boot guards) and reports whether that exact process is still alive. ResolveOwner walks up the process tree to the first non-shell, non-entire ancestor (the agent that spawned our hook), skipping the Go toolchain too so local-dev's `go run` wrapper isn't mistaken for the owner; it records no owner at all when the hostname can't be determined, since a PID is only meaningful on its own machine. Check returns Alive/Dead/Unknown: Dead on a missing PID, start-time mismatch (PID reuse), or reboot, and Unknown when it can't confirm the host/boot or the platform can't introspect (Windows), so callers fail closed to a timeout rather than trusting a stale PID. darwin records no boot guard: kern.boottime drifts when the wall clock is stepped (NTP), and darwin's absolute P_starttime already distinguishes a reused PID across reboots; Linux uses ticks-since-boot and keeps the boot_id guard. Stdlib + golang.org/x/sys/unix only, so session/strategy/cli can import it without an import cycle.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: f6bcea2919b7
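The three-valued check described in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in proclive: the `Identity` fields, the `env` seam, and the `check` function are hypothetical names, and the platform probe is injected as a function so the sketch runs anywhere.

```go
package main

import "fmt"

// Liveness is the three-valued answer: Unknown means "cannot confirm",
// so a caller would fall back to its timeout instead of trusting the PID.
type Liveness int

const (
	Unknown Liveness = iota
	Alive
	Dead
)

func (l Liveness) String() string {
	switch l {
	case Alive:
		return "alive"
	case Dead:
		return "dead"
	default:
		return "unknown"
	}
}

// Identity is a hypothetical stand-in for the recorded owner identity.
type Identity struct {
	PID       int
	StartTime uint64 // start-time fingerprint
	Host      string
	BootID    string // empty when no boot guard was recorded
}

// procInfo is what a platform probe would report for a PID.
type procInfo struct {
	Found     bool
	StartTime uint64
}

// env is an assumed test seam describing the current machine.
type env struct {
	Host   string
	BootID string
	// Lookup returns ok=false when the platform cannot introspect processes.
	Lookup func(pid int) (info procInfo, ok bool)
}

func check(id Identity, e env) Liveness {
	// A PID only means something on the machine that recorded it.
	if id.PID <= 0 || id.Host == "" || e.Host == "" || id.Host != e.Host {
		return Unknown
	}
	if id.BootID != "" {
		if e.BootID == "" {
			return Unknown // cannot confirm the boot
		}
		if id.BootID != e.BootID {
			return Dead // the machine rebooted since capture
		}
	}
	info, ok := e.Lookup(id.PID)
	if !ok {
		return Unknown
	}
	if !info.Found || info.StartTime != id.StartTime {
		return Dead // missing PID, or the PID was reused by another process
	}
	return Alive
}

func main() {
	e := env{Host: "devbox", BootID: "boot-1", Lookup: func(pid int) (procInfo, bool) {
		return procInfo{Found: pid == 100, StartTime: 5000}, true
	}}
	owner := Identity{PID: 100, StartTime: 5000, Host: "devbox", BootID: "boot-1"}
	reused := owner
	reused.StartTime = 4000
	remote := owner
	remote.Host = "laptop"
	fmt.Println(check(owner, e), check(reused, e), check(remote, e))
}
```

Keeping Unknown distinct from Dead is the point of the design: only a positive confirmation that the recorded process is gone should end a session early.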
State.Owner stores the proclive.Identity captured at turn start. OwnerLiveness/OwnerExited report when an ACTIVE session's agent process has gone away (clean exit, crash, kill, terminal close, reboot) without a SessionStop hook firing, falling back to the StuckActiveThreshold timeout when liveness is Unknown (no owner, cross-host, unsupported platform).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: f99451d13eb9
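One way to picture how the liveness answer and the timeout fallback combine is the sketch below. It is an assumption-laden illustration: `treatAsGone`, the `State` fields, and the choice that a confirmed-alive owner is never flagged are all hypothetical, and only the 1-hour threshold value comes from this PR.

```go
package main

import (
	"fmt"
	"time"
)

type Liveness int

const (
	Unknown Liveness = iota
	Alive
	Dead
)

type Phase string

const (
	PhaseActive Phase = "active"
	PhaseEnded  Phase = "ended"
)

// stuckActiveThreshold mirrors the 1h timeout mentioned in the PR.
const stuckActiveThreshold = time.Hour

// State is a hypothetical, trimmed-down session state.
type State struct {
	Phase Phase
	Owner Liveness // result of checking the recorded owner
}

// ownerExited is true only for an ACTIVE session whose owner is confirmed gone.
func ownerExited(s State) bool {
	return s.Phase == PhaseActive && s.Owner == Dead
}

// treatAsGone combines the precise signal with the coarse fallback.
// idle is how long the session has gone without activity.
func treatAsGone(s State, idle time.Duration) bool {
	if s.Phase != PhaseActive {
		return false
	}
	switch s.Owner {
	case Dead:
		return true // precise: no need to wait for the timeout
	case Alive:
		return false // assumption: a confirmed-alive owner is never flagged
	default:
		return idle > stuckActiveThreshold // Unknown: fall back to the timeout
	}
}

func main() {
	exited := State{Phase: PhaseActive, Owner: Dead}
	unknown := State{Phase: PhaseActive, Owner: Unknown}
	fmt.Println(treatAsGone(exited, 2*time.Minute))   // owner gone: flagged at once
	fmt.Println(treatAsGone(unknown, 2*time.Minute))  // unknown and recent: not flagged
	fmt.Println(treatAsGone(unknown, 90*time.Minute)) // unknown and stale: timeout applies
}
```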
InitializeSession records the owning agent process via proclive.ResolveOwner alongside captureSessionBranch, on every turn start. The field is cleared first so a failed resolve never leaves a stale (possibly dead) owner from an earlier turn that would wrongly finalize a now-live session; re-resolving each turn also keeps the fingerprint current across agent restarts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 1e2b78d2d708
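The clear-then-set ordering is small but easy to get wrong, so here is a minimal sketch of it. The `captureOwner` helper, the injected `resolve` function, and the trimmed types are hypothetical; the real wiring goes through proclive.ResolveOwner, whose signature is not shown in this PR page.

```go
package main

import (
	"errors"
	"fmt"
)

// Identity and State are hypothetical, trimmed-down stand-ins.
type Identity struct {
	PID       int
	StartTime uint64
}

type State struct {
	Owner *Identity
}

// errResolveFailed is an illustrative error for a failed owner lookup.
var errResolveFailed = errors.New("resolve failed")

// captureOwner clears the previous owner before resolving, so a failed
// resolve leaves "no owner" rather than a stale identity from an earlier turn.
func captureOwner(s *State, resolve func() (*Identity, error)) {
	s.Owner = nil
	id, err := resolve()
	if err != nil || id == nil {
		return // no owner recorded: liveness will be Unknown
	}
	s.Owner = id
}

func main() {
	s := &State{Owner: &Identity{PID: 100, StartTime: 5000}} // owner from a previous turn

	// This turn's resolve fails: the old owner must not survive.
	captureOwner(s, func() (*Identity, error) { return nil, errResolveFailed })
	fmt.Println(s.Owner == nil)

	// The agent restarted: the new fingerprint replaces whatever was there.
	captureOwner(s, func() (*Identity, error) { return &Identity{PID: 200, StartTime: 9000}, nil })
	fmt.Println(s.Owner.PID, s.Owner.StartTime)
}
```

Had the assignment been skipped on error without clearing first, the session would keep pointing at PID 100, which may be dead, and a later sweep could finalize a session that is in fact live.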
Extract endSessionNow (the markSessionEnded + eager-condense sequence the SessionStop hook already runs) and share it with a new finalizeExitedSessions sweep, so the hook and the sweep stay in lockstep. markSessionEnded gains an optional guard so the sweep re-checks OwnerExited on the freshly-loaded state under the session-state lock, closing a TOCTOU race where a turn could revive a session between the list snapshot and the finalize. entire status (human and --json) and entire doctor run the sweep up front, finalizing any ACTIVE session whose owner process is gone instead of leaving it "active" until the 1h StuckActiveThreshold. After finalizing, the sweep reloads each session from disk so callers see the true post-finalize state (condense is fail-open, so StepCount/FullyCondensed are never assumed). Both surfaces also label such sessions "exited" (human output, status --json, doctor's stuck-session reason) as a fallback when finalization can't run.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 3ee2f4dbe270
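The guarded finalize that closes the TOCTOU window can be illustrated with an in-memory store. This is a sketch under assumptions: `Store`, `MarkEnded`, and the `OwnerAlive` field are hypothetical, and a `sync.Mutex` stands in for the session-state lock, whose real mechanism is not shown here.

```go
package main

import (
	"fmt"
	"sync"
)

type Phase string

const (
	PhaseActive Phase = "active"
	PhaseEnded  Phase = "ended"
)

// State is a hypothetical session record; OwnerAlive stands in for a liveness check.
type State struct {
	Phase      Phase
	OwnerAlive bool
}

// Store is an in-memory stand-in for the on-disk session state plus its lock.
type Store struct {
	mu       sync.Mutex
	sessions map[string]*State
}

// MarkEnded ends the session only if guard (when non-nil) still holds on the
// state as seen under the lock. It reports whether the session actually ended.
func (st *Store) MarkEnded(id string, guard func(*State) bool) (bool, error) {
	st.mu.Lock()
	defer st.mu.Unlock()
	s, ok := st.sessions[id]
	if !ok {
		return false, fmt.Errorf("session %q not found", id)
	}
	if guard != nil && !guard(s) {
		return false, nil // state changed since the snapshot: leave it alone
	}
	s.Phase = PhaseEnded
	return true, nil
}

func main() {
	st := &Store{sessions: map[string]*State{"s1": {Phase: PhaseActive, OwnerAlive: false}}}
	ownerExited := func(s *State) bool { return s.Phase == PhaseActive && !s.OwnerAlive }

	// The sweep's snapshot saw a dead owner, but a new turn starts before finalize.
	st.mu.Lock()
	st.sessions["s1"].OwnerAlive = true
	st.mu.Unlock()

	ended, err := st.MarkEnded("s1", ownerExited)
	fmt.Println(ended, err, st.sessions["s1"].Phase)
}
```

Passing a nil guard keeps the unconditional behaviour for a clean stop, while the sweep passes its re-check, so both callers can share one code path.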
Force-pushed: b7e2f3a to 439be1e
Pushed fixes for the CI lint failure and all review findings (force-pushed; folded into the relevant commits):
- CI lint
- Cursor Bugbot
- Copilot
- Trail #583 finding (Medium, macOS boottime/NTP): darwin no longer records a boot guard.
All commits build on darwin/linux/windows.

https://entire.io/gh/entireio/cli/trails/583
Problem
A session is marked PhaseActive while an agent takes a turn, and the lifecycle relies on a SessionStop hook firing to leave ACTIVE. When the agent process goes away mid-turn (a clean /exit/Ctrl-D, a crash, a kill, a closed terminal, or a reboot), no hook fires, so the session is stuck ACTIVE forever. The only prior mitigation was IsStuckActive(), a coarse 1-hour timeout, which is both too slow (a session whose agent exited 2 minutes ago still shows active for an hour) and imprecise (a genuinely long, still-running turn gets flagged after an hour).
Approach
Record the owning agent process's identity at each turn start and detect immediately when that process is gone.
- cmd/entire/cli/proclive: new stdlib + x/sys/unix leaf package. Captures {PID, start-time fingerprint, host, boot} by walking up the process tree from the hook to the first non-shell, non-entire (and non-go) ancestor. Check() returns Alive / Dead / Unknown: Dead on a missing PID, start-time mismatch (PID reuse), or reboot; Unknown (fail-closed) when it can't confirm the host/boot or the platform can't introspect (Windows), so callers fall back to the timeout.
- session.State gains Owner plus OwnerLiveness() / OwnerExited(). OwnerExited() is true only for an ACTIVE session whose owner is gone.
- Turn start (InitializeSession) records the owner alongside the branch, cleared-then-set each turn so a failed resolve never leaves a stale owner and the fingerprint tracks agent restarts.
- entire status and entire doctor run a shared finalizeExitedSessions sweep up front, finalizing exited sessions on the spot by replaying the missing SessionStop (PhaseEnded + condense) via the extracted endSessionNow, the same path the clean-stop hook runs. The sweep re-checks OwnerExited() under the session-state lock to avoid racing a turn that revived the session. Both surfaces also carry an exited label (human, --json, and doctor's stuck-session reason) as a fallback.
Terminology is "exited", not "crashed": it covers a clean exit as much as an abnormal one.
Commits
- proclive: add process-liveness package
- session: record owning process and detect exited sessions
- strategy: capture session owner at each turn start
- status, doctor: finalize sessions whose agent has exited
Testing
- proclive (live-process Alive/Dead, PID-reuse, /proc parsing with parens, unsupported-platform Unknown), OwnerExited, owner capture at turn start, the finalize sweep, and its under-lock revalidation.
- mise run lint → 0 issues; mise run test:ci green (unit + integration + Vogon + external-agent canary); cross-compiles on linux/darwin/windows.
- entire status finalized a planted exited session (phase: ended, fully_condensed: true) immediately rather than after an hour.
Reviewed in two passes by Codex; all findings addressed.
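For readers wondering why /proc parsing "with parens" needs its own tests: the second field of /proc/<pid>/stat is the command name in parentheses, and that name may itself contain spaces and parentheses, so splitting the whole line on whitespace miscounts fields. A common approach, sketched here with a hypothetical `parseStartTime` helper (not necessarily how proc_linux.go does it), is to cut at the last `)` and count fields from there; starttime is field 22 per proc(5).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseStartTime extracts field 22 (starttime, in clock ticks since boot)
// from the contents of /proc/<pid>/stat.
func parseStartTime(stat string) (uint64, error) {
	// The comm field ends at the LAST ')' on the line, whatever it contains.
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, errors.New("stat: no closing paren after comm")
	}
	// rest[0] is field 3 (state), so field 22 sits at index 22-3 = 19.
	rest := strings.Fields(stat[end+1:])
	if len(rest) < 20 {
		return 0, errors.New("stat: too few fields after comm")
	}
	return strconv.ParseUint(rest[19], 10, 64)
}

func main() {
	// A made-up stat line whose command name contains a space and a ')'.
	line := "42 (a b) c) S 1 42 42 0 -1 4194304 0 0 0 0 0 0 0 0 20 0 1 0 987654 1000 2"
	start, err := parseStartTime(line)
	fmt.Println(start, err)
}
```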
Out of scope
- entire session list "exited" labeling (easy follow-up reusing OwnerExited).
🤖 Generated with Claude Code
Note
Medium Risk
Touches session lifecycle, condensation, and doctor/status auto-finalization; incorrect liveness could end live sessions, though guards and fail-closed Unknown reduce that risk.
Overview
Adds process-based session liveness so ACTIVE sessions are not stuck until the 1-hour inactivity timeout when the agent exits without a SessionStop hook.
A new proclive package records the owning agent at each turn start (PID + start fingerprint, host/boot guards) and reports alive/dead/unknown. Session state gains Owner plus OwnerExited(); turn start clears then re-captures the owner via captureSessionOwner.
endSessionNow centralizes mark-ended + eager condense (shared by SessionStop and the new sweep). markSessionEnded takes an optional guard and returns whether the session actually ended.
finalizeExitedSessions runs at the start of entire status and entire doctor, ending and condensing exited sessions under lock (re-checking OwnerExited to avoid races). Doctor classifySession treats exited owners as stuck immediately; status shows an exited label when finalize could not complete.
Linux/darwin get real introspection; other platforms degrade to Unknown and keep the timeout fallback.
Reviewed by Cursor Bugbot for commit b7e2f3a. Configure here.