
Session liveness: finalize ACTIVE sessions whose agent has exited #1488

Open: Soph wants to merge 4 commits into main from soph/session-liveness-pid

@Soph Soph commented Jun 21, 2026

Collaborator

https://entire.io/gh/entireio/cli/trails/583

Problem

A session is marked PhaseActive while an agent takes a turn, and the lifecycle relies on a SessionStop hook firing to leave ACTIVE. When the agent process goes away mid-turn (a clean /exit or Ctrl-D, a crash, a kill, a closed terminal, or a reboot), no hook fires, so the session is stuck ACTIVE forever. The only prior mitigation was IsStuckActive(), a coarse 1-hour timeout, which is both too slow (a session whose agent exited 2 minutes ago still shows as active for an hour) and imprecise (a genuinely long, still-running turn gets flagged after an hour).

Approach

Record the owning agent process's identity at each turn start and detect immediately when that process is gone.

  • cmd/entire/cli/proclive — new stdlib + x/sys/unix leaf package. Captures {PID, start-time fingerprint, host, boot} by walking up the process tree from the hook to the first non-shell, non-entire (and non-go) ancestor. Check() returns Alive / Dead / Unknown — Dead on a missing PID, start-time mismatch (PID reuse), or reboot; Unknown (fail-closed) when it can't confirm the host/boot or the platform can't introspect (Windows), so callers fall back to the timeout.
  • session.State gains Owner plus OwnerLiveness()/OwnerExited(). OwnerExited() is true only for an ACTIVE session whose owner is gone.
  • Turn start (InitializeSession) records the owner alongside the branch, cleared-then-set each turn so a failed resolve never leaves a stale owner and the fingerprint tracks agent restarts.
  • entire status and entire doctor run a shared finalizeExitedSessions sweep up front, finalizing exited sessions on the spot by replaying the missing SessionStop (PhaseEnded + condense) via the extracted endSessionNow — the same path the clean-stop hook runs. The sweep re-checks OwnerExited() under the session-state lock to avoid racing a turn that revived the session. Both surfaces also carry an exited label (human, --json, and doctor's stuck-session reason) as a fallback.
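
For readers skimming the approach, the Alive / Dead / Unknown decision order can be sketched as below. This is a minimal illustration, not the code in proclive: the Identity field names, the probe type, and the check function are hypothetical stand-ins, and OS introspection is replaced by an injected lookup so the sketch runs on any platform.

```go
package main

import "fmt"

// Status is the three-valued liveness result.
type Status int

const (
	Unknown Status = iota // cannot confirm; callers fall back to the timeout
	Alive
	Dead
)

func (s Status) String() string {
	return [...]string{"unknown", "alive", "dead"}[s]
}

// Identity is a hypothetical shape for the recorded owner.
type Identity struct {
	PID       int
	StartTime uint64 // start-time fingerprint
	Host      string
	BootID    string // empty when the platform records no boot guard
}

// probe stands in for OS introspection (assumption, for illustration only).
type probe struct {
	host      string
	bootID    string
	startTime func(pid int) (start uint64, found bool, err error)
}

// check applies the guards in order: host, boot, PID presence, fingerprint.
func check(id Identity, p probe) Status {
	if id.PID <= 0 || id.Host == "" || id.Host != p.host {
		return Unknown // a PID is only meaningful on its own machine
	}
	if id.BootID != "" {
		if p.bootID == "" {
			return Unknown // guard was recorded but cannot be confirmed now
		}
		if id.BootID != p.bootID {
			return Dead // the machine rebooted since capture
		}
	}
	start, found, err := p.startTime(id.PID)
	if err != nil {
		return Unknown // introspection failed; do not guess
	}
	if !found || start != id.StartTime {
		return Dead // missing PID, or the PID was reused by another process
	}
	return Alive
}

func main() {
	p := probe{
		host:   "devbox",
		bootID: "boot-1",
		startTime: func(pid int) (uint64, bool, error) {
			if pid == 100 {
				return 5000, true, nil
			}
			return 0, false, nil
		},
	}
	owner := Identity{PID: 100, StartTime: 5000, Host: "devbox", BootID: "boot-1"}
	reused := owner
	reused.StartTime = 4000
	remote := owner
	remote.Host = "laptop"

	fmt.Println(check(owner, p), check(reused, p), check(remote, p)) // alive dead unknown
}
```

Returning Unknown rather than Dead whenever a guard cannot be confirmed is what keeps the sweep from ending a session that may still be live.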

Terminology is "exited", not "crashed" — it covers a clean exit as much as an abnormal one.

Commits

  1. proclive: add process-liveness package
  2. session: record owning process and detect exited sessions
  3. strategy: capture session owner at each turn start
  4. status, doctor: finalize sessions whose agent has exited

Testing

  • Unit tests for proclive (live-process Alive/Dead, PID-reuse, /proc parsing with parens, unsupported-platform Unknown), OwnerExited, owner capture at turn start, the finalize sweep, and its under-lock revalidation.
  • mise run lint → 0 issues; mise run test:ci green (unit + integration + Vogon + external-agent canary); cross-compiles on linux/darwin/windows.
  • Live smoke: entire status finalized a planted exited session (phase: ended, fully_condensed: true) immediately rather than after an hour.
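
As background for the "/proc parsing with parens" tests: the comm field of /proc/<pid>/stat is wrapped in parentheses and may itself contain spaces and parentheses, so a robust parser splits on the last ')' rather than the first. A minimal sketch, assuming a hypothetical parseStat helper (not the function in proc_linux.go) and a made-up sample line; per proc(5), ppid is field 4 and starttime is field 22.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// sample is an invented stat line whose comm contains spaces and parens.
const sample = "4242 (tmux: (my) srv) S 77 4242 4242 0 -1 4194560 100 0 0 0 5 3 0 0 20 0 1 0 987654 1000000"

// parseStat extracts comm, ppid (field 4) and starttime (field 22).
func parseStat(line string) (comm string, ppid int, start uint64, err error) {
	open := strings.IndexByte(line, '(')
	end := strings.LastIndexByte(line, ')') // last ')' closes comm
	if open < 0 || end < open {
		return "", 0, 0, errors.New("stat: missing comm field")
	}
	comm = line[open+1 : end]

	// Fields after comm start at field 3 (state), so field N is rest[N-3].
	rest := strings.Fields(line[end+1:])
	if len(rest) < 20 {
		return "", 0, 0, fmt.Errorf("stat: got %d fields after comm, want at least 20", len(rest))
	}
	ppid, err = strconv.Atoi(rest[1])
	if err != nil {
		return "", 0, 0, fmt.Errorf("stat: parse ppid: %w", err)
	}
	start, err = strconv.ParseUint(rest[19], 10, 64)
	if err != nil {
		return "", 0, 0, fmt.Errorf("stat: parse starttime: %w", err)
	}
	return comm, ppid, start, nil
}

func main() {
	comm, ppid, start, err := parseStat(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q %d %d\n", comm, ppid, start) // "tmux: (my) srv" 77 987654
}
```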

Reviewed in two passes by Codex; all findings addressed.

Out of scope

  • Windows process liveness (degrades to the timeout).
  • entire session list "exited" labeling (easy follow-up reusing OwnerExited).

🤖 Generated with Claude Code


Note

Medium Risk
Touches session lifecycle, condensation, and doctor/status auto-finalization; incorrect liveness could end live sessions, though guards and fail-closed Unknown reduce that risk.

Overview
Adds process-based session liveness so ACTIVE sessions are not stuck until the 1-hour inactivity timeout when the agent exits without a SessionStop hook.

A new proclive package records the owning agent at each turn start (PID + start fingerprint, host/boot guards) and reports alive/dead/unknown. Session state gains Owner plus OwnerExited(); turn start clears then re-captures the owner via captureSessionOwner.

endSessionNow centralizes mark-ended + eager condense (shared by SessionStop and the new sweep). markSessionEnded takes an optional guard and returns whether the session actually ended.

finalizeExitedSessions runs at the start of entire status and entire doctor, ending and condensing exited sessions under lock (re-checking OwnerExited to avoid races). Doctor classifySession treats exited owners as stuck immediately; status shows an exited label when finalize could not complete.

Linux/darwin get real introspection; other platforms degrade to Unknown and keep the timeout fallback.
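
The relationship between OwnerExited and the timeout fallback might look roughly like the sketch below. The State type here is a trimmed, hypothetical stand-in for session.State (it stores a precomputed liveness value instead of an owner identity), Stuck is an invented helper, and treating a confirmed-alive owner as never stuck is an assumption that the PR text does not spell out.

```go
package main

import (
	"fmt"
	"time"
)

type Phase string

const (
	PhaseActive Phase = "active"
	PhaseEnded  Phase = "ended"
)

type Liveness int

const (
	Unknown Liveness = iota // no owner, cross-host, or unsupported platform
	Alive
	Dead
)

// stuckActiveThreshold mirrors the 1-hour timeout named in the PR.
const stuckActiveThreshold = time.Hour

// State is a hypothetical, trimmed stand-in for session.State.
type State struct {
	Phase Phase
	Owner Liveness // result of the liveness check for the recorded owner
}

// OwnerExited is true only for an ACTIVE session whose owner is gone.
func (s State) OwnerExited() bool {
	return s.Phase == PhaseActive && s.Owner == Dead
}

// Stuck reports whether an ACTIVE session should be flagged, given how long
// it has been idle. Only Unknown liveness falls back to the timeout.
func (s State) Stuck(idle time.Duration) bool {
	if s.Phase != PhaseActive {
		return false
	}
	switch s.Owner {
	case Dead:
		return true // flagged immediately, no waiting for the timeout
	case Alive:
		return false // assumption: a confirmed-live turn is never flagged
	default:
		return idle > stuckActiveThreshold
	}
}

func main() {
	exited := State{Phase: PhaseActive, Owner: Dead}
	unknown := State{Phase: PhaseActive, Owner: Unknown}
	ended := State{Phase: PhaseEnded, Owner: Dead}

	fmt.Println(exited.Stuck(2 * time.Minute))  // true
	fmt.Println(unknown.Stuck(2 * time.Minute)) // false
	fmt.Println(unknown.Stuck(2 * time.Hour))   // true
	fmt.Println(ended.OwnerExited())            // false
}
```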

Reviewed by Cursor Bugbot for commit b7e2f3a.

@Soph Soph requested a review from a team as a code owner June 21, 2026 19:19
Copilot AI review requested due to automatic review settings June 21, 2026 19:19

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Comment thread cmd/entire/cli/status.go
Comment thread cmd/entire/cli/session_finalize.go Outdated

Copilot AI left a comment

Contributor

Pull request overview

Adds immediate, process-based session liveness detection so sessions don’t remain ACTIVE indefinitely when an agent exits without emitting a SessionStop hook. This fits into the session/strategy lifecycle by recording an owning-process identity at turn start and sweeping/finalizing exited sessions on entire status / entire doctor.

Changes:

  • Introduces proclive for capturing/checking a process identity (PID + start fingerprint + host/boot guards).
  • Extends session state with Owner plus OwnerLiveness() / OwnerExited() and wires owner capture into turn start (InitializeSession).
  • Adds an exited-session sweep (finalizeExitedSessions) and refactors canonical session-end behavior into endSessionNow, called by both lifecycle stop and the sweep.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cmd/entire/cli/strategy/owner_wiring_test.go | Strategy wiring test ensuring owner capture at turn start (unix only). |
| cmd/entire/cli/strategy/manual_commit_hooks.go | Captures owner identity on each InitializeSession (turn start). |
| cmd/entire/cli/status.go | Runs exited-session sweep before rendering active sessions; labels exited. |
| cmd/entire/cli/sessions.go | Updates stop flow for new markSessionEnded signature. |
| cmd/entire/cli/session/state.go | Adds Owner field + liveness/exited helpers on session state. |
| cmd/entire/cli/session/owner_test.go | Unit tests for nil-owner and non-active phase behavior. |
| cmd/entire/cli/session/owner_live_test.go | Live tests for OwnerExited (linux/darwin). |
| cmd/entire/cli/session_finalize.go | Implements finalizeExitedSessions sweep and in-memory state mutation. |
| cmd/entire/cli/session_finalize_test.go | Tests the sweep finalization and under-lock revalidation. |
| cmd/entire/cli/proclive/proclive.go | Core identity capture + liveness checking logic. |
| cmd/entire/cli/proclive/proclive_test.go | Unit tests for stringer, empty identity, host mismatch, transient-name detection. |
| cmd/entire/cli/proclive/proclive_live_test.go | Live-process tests for Alive/Dead, start mismatch, and ResolveOwner behavior. |
| cmd/entire/cli/proclive/proc_other.go | Unsupported-platform seam (returns Unknown / no owner). |
| cmd/entire/cli/proclive/proc_other_test.go | Verifies unsupported platforms degrade to Unknown. |
| cmd/entire/cli/proclive/proc_linux.go | Linux /proc implementation + boot ID fallback. |
| cmd/entire/cli/proclive/proc_linux_test.go | Tests adversarial /proc/<pid>/stat parsing. |
| cmd/entire/cli/proclive/proc_darwin.go | Darwin sysctl-based proc stat + boottime. |
| cmd/entire/cli/phase_wiring_test.go | Updates tests for new markSessionEnded return signature. |
| cmd/entire/cli/lifecycle.go | Extracts endSessionNow; adds guardable markSessionEnded. |
| cmd/entire/cli/doctor.go | Runs exited-session sweep up front; classifies exited reason distinctly. |

Comment thread cmd/entire/cli/session_finalize.go Outdated
Comment thread cmd/entire/cli/proclive/proclive.go
Soph and others added 4 commits June 22, 2026 08:59

proclive: add process-liveness package

New leaf package that captures a process's identity (PID + start-time
fingerprint, plus host/boot guards) and reports whether that exact
process is still alive. ResolveOwner walks up the process tree to the
first non-shell, non-entire ancestor (the agent that spawned our hook),
skipping the Go toolchain too so local-dev's `go run` wrapper isn't
mistaken for the owner; it records no owner at all when the hostname
can't be determined, since a PID is only meaningful on its own machine.

Check returns Alive/Dead/Unknown — Dead on a missing PID, start-time
mismatch (PID reuse), or reboot, and Unknown when it can't confirm the
host/boot or the platform can't introspect (Windows), so callers fail
closed to a timeout rather than trusting a stale PID. darwin records no
boot guard: kern.boottime drifts when the wall clock is stepped (NTP),
and darwin's absolute P_starttime already distinguishes a reused PID
across reboots; Linux uses ticks-since-boot and keeps the boot_id guard.

Stdlib + golang.org/x/sys/unix only, so session/strategy/cli can import
it without an import cycle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: f6bcea2919b7
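
The ancestor walk described in this commit message can be illustrated with a small sketch. It is not the ResolveOwner implementation: the in-memory process table, the transient name set, the hop limit, and the helper names are all assumptions made so the example runs without touching the real process tree.

```go
package main

import (
	"errors"
	"fmt"
)

// procInfo is one row of a hypothetical in-memory process table.
type procInfo struct {
	PPID int
	Name string
}

// Owner is a hypothetical, trimmed result type.
type Owner struct {
	PID  int
	Host string
}

// transient lists wrapper processes to skip (assumed set, for illustration).
var transient = map[string]bool{"sh": true, "bash": true, "zsh": true, "entire": true, "go": true}

const maxHops = 32 // assumed bound so a corrupt table cannot loop forever

func sampleTable() map[int]procInfo {
	return map[int]procInfo{
		200: {PPID: 1, Name: "agent"},
		300: {PPID: 200, Name: "bash"},
		400: {PPID: 300, Name: "go"},
		500: {PPID: 400, Name: "entire"}, // the hook itself
	}
}

func fixedHost() (string, error) { return "devbox", nil }
func noHost() (string, error)    { return "", errors.New("hostname unavailable") }

// resolveOwner walks up from the hook to the first non-transient ancestor.
// It records no owner at all when the hostname cannot be determined.
func resolveOwner(hookPID int, table map[int]procInfo, hostname func() (string, error)) (Owner, bool) {
	host, err := hostname()
	if err != nil || host == "" {
		return Owner{}, false
	}
	self, ok := table[hookPID]
	if !ok {
		return Owner{}, false
	}
	pid := self.PPID
	for hops := 0; hops < maxHops && pid > 1; hops++ {
		info, ok := table[pid]
		if !ok {
			return Owner{}, false
		}
		if !transient[info.Name] {
			return Owner{PID: pid, Host: host}, true
		}
		pid = info.PPID
	}
	return Owner{}, false
}

func main() {
	owner, ok := resolveOwner(500, sampleTable(), fixedHost)
	fmt.Println(owner.PID, ok) // 200 true

	owner, ok = resolveOwner(500, sampleTable(), noHost)
	fmt.Println(owner.PID, ok) // 0 false
}
```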

session: record owning process and detect exited sessions

State.Owner stores the proclive.Identity captured at turn start.
OwnerLiveness/OwnerExited report when an ACTIVE session's agent process
has gone away (clean exit, crash, kill, terminal close, reboot) without
a SessionStop hook firing, falling back to the StuckActiveThreshold
timeout when liveness is Unknown (no owner, cross-host, unsupported
platform).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: f99451d13eb9

strategy: capture session owner at each turn start

InitializeSession records the owning agent process via
proclive.ResolveOwner alongside captureSessionBranch, on every turn
start. The field is cleared first so a failed resolve never leaves a
stale (possibly dead) owner from an earlier turn that would wrongly
finalize a now-live session; re-resolving each turn also keeps the
fingerprint current across agent restarts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 1e2b78d2d708
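
The clear-then-set ordering this commit describes is small but easy to get wrong, so here is a hedged sketch of it. The State and Identity types are trimmed stand-ins and captureOwner is a hypothetical helper, not the captureSessionOwner function in the PR; the resolver is injected so the failure path can be exercised.

```go
package main

import (
	"errors"
	"fmt"
)

// Identity and State are hypothetical, trimmed stand-ins.
type Identity struct {
	PID       int
	StartTime uint64
}

type State struct {
	Owner *Identity
}

// captureOwner clears the previous owner before resolving a new one, so a
// failed resolve leaves no owner rather than a stale one from an earlier turn.
func captureOwner(s *State, resolve func() (*Identity, error)) {
	s.Owner = nil
	id, err := resolve()
	if err != nil || id == nil {
		return // liveness becomes Unknown; the timeout fallback applies
	}
	s.Owner = id
}

func resolveFails() (*Identity, error) { return nil, errors.New("cannot introspect") }
func resolveNew() (*Identity, error)   { return &Identity{PID: 321, StartTime: 9000}, nil }

func main() {
	s := &State{Owner: &Identity{PID: 100, StartTime: 5000}} // owner from an earlier turn

	captureOwner(s, resolveFails)
	fmt.Println(s.Owner == nil) // true

	captureOwner(s, resolveNew)
	fmt.Println(s.Owner.PID) // 321
}
```

Without the initial clear, the failed resolve would leave PID 100 recorded, and a later sweep could finalize a session whose current agent is alive.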

status, doctor: finalize sessions whose agent has exited

Extract endSessionNow — the markSessionEnded + eager-condense sequence
the SessionStop hook already runs — and share it with a new
finalizeExitedSessions sweep, so the hook and the sweep stay in lockstep.
markSessionEnded gains an optional guard so the sweep re-checks
OwnerExited on the freshly-loaded state under the session-state lock,
closing a TOCTOU race where a turn could revive a session between the
list snapshot and the finalize.

entire status (human and --json) and entire doctor run the sweep up
front, finalizing any ACTIVE session whose owner process is gone instead
of leaving it "active" until the 1h StuckActiveThreshold. After
finalizing, the sweep reloads each session from disk so callers see the
true post-finalize state (condense is fail-open, so StepCount/
FullyCondensed are never assumed). Both surfaces also label such sessions
"exited" (human output, status --json, doctor's stuck-session reason) as
a fallback when finalization can't run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 3ee2f4dbe270
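
The guard-under-lock pattern that closes the TOCTOU race can be sketched as follows. The store type, the boolean OwnerDead field, and the method names are hypothetical simplifications: the real code uses a session-state lock and a liveness check, and also condenses and reloads state from disk, none of which is modelled here.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical, trimmed session state.
type State struct {
	Phase     string // "active" or "ended"
	OwnerDead bool   // stands in for a liveness check returning Dead
}

// store stands in for persisted session state guarded by a lock.
type store struct {
	mu       sync.Mutex
	sessions map[string]*State
}

// markEnded re-reads the session under the lock and applies the optional
// guard to that fresh state. It reports whether the session actually ended.
func (st *store) markEnded(id string, guard func(*State) bool) bool {
	st.mu.Lock()
	defer st.mu.Unlock()

	s, ok := st.sessions[id]
	if !ok || s.Phase != "active" {
		return false
	}
	if guard != nil && !guard(s) {
		return false // state changed since the snapshot; leave it alone
	}
	s.Phase = "ended"
	return true
}

// finalizeExited ends every snapshot candidate that is still exited when
// re-checked under the lock, and returns the IDs it ended.
func (st *store) finalizeExited(candidates []string) []string {
	var ended []string
	for _, id := range candidates {
		stillExited := func(s *State) bool { return s.Phase == "active" && s.OwnerDead }
		if st.markEnded(id, stillExited) {
			ended = append(ended, id)
		}
	}
	return ended
}

func main() {
	st := &store{sessions: map[string]*State{
		"a": {Phase: "active", OwnerDead: true},
		"b": {Phase: "active", OwnerDead: true},
	}}
	snapshot := []string{"a", "b"} // both looked exited when listed

	// A new turn revives "b" between the snapshot and the finalize.
	st.sessions["b"].OwnerDead = false

	fmt.Println(st.finalizeExited(snapshot)) // [a]
	fmt.Println(st.sessions["b"].Phase)      // active
}
```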
Soph force-pushed the soph/session-liveness-pid branch from b7e2f3a to 439be1e on June 22, 2026 07:00
Soph commented Jun 22, 2026

Collaborator Author

Pushed fixes for the CI lint failure and all review findings (force-pushed; folded into the relevant commits):

CI lint: proc_linux.go returned unwrapped os.ReadFile/strconv.Atoi errors (wrapcheck). Local lint runs on darwin, so it never compiled the Linux file; the errors are now wrapped, and lint runs for both GOOS values locally.
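
For context on the wrapcheck finding, the usual fix is to wrap errors from other packages with fmt.Errorf and %w before returning them. A minimal sketch with a hypothetical readIntFile helper (not the code in proc_linux.go); isNotExist is included only to show that the wrapped error still matches with errors.Is.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"strconv"
	"strings"
)

// readIntFile reads a file holding a single integer. Errors from os and
// strconv are wrapped with context instead of being returned bare.
func readIntFile(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, fmt.Errorf("read %s: %w", path, err)
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return 0, fmt.Errorf("parse integer from %s: %w", path, err)
	}
	return n, nil
}

// isNotExist shows that %w keeps the underlying cause inspectable.
func isNotExist(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	_, err := readIntFile("/nonexistent/entire-example/pid")
	fmt.Println(err != nil, isNotExist(err)) // true true
}
```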

Cursor Bugbot

  • JSON status skips exited finalize (Medium): runStatusJSON now runs finalizeExitedSessions before listing, matching the human path, so --json no longer leaves orphaned ACTIVE sessions or reports them as exited.
  • FullyCondensed set optimistically (Low): the sweep no longer hand-sets FullyCondensed; it reloads the session from disk after finalizing, so the snapshot reflects the true post-finalize state (condense is fail-open). This also fixed a latent issue where stale in-memory StepCount could make doctor re-flag a just-finalized session.

Copilot

  • FullyCondensed optimism — same fix as above.
  • ResolveOwner records empty Host on hostname failure — now fails closed: it records no owner if os.Hostname() fails, so the host guard is always present.

Trail #583 finding (Medium, macOS boottime/NTP) — darwin no longer records a boot guard; kern.boottime drifts when the wall clock is stepped and could falsely declare a live session dead. darwin relies on its absolute P_starttime fingerprint (fixed at process creation, distinguishes a reused PID across reboots); Linux keeps the boot_id guard. Resolved on the trail.

All commits build on darwin/linux/windows; mise run lint clean on both GOOS; unit + integration + canary green.
