
keeper: expose CCB_KEEPER_PING_TIMEOUT_S env override for config-check ping #186

Open
SevenX77 wants to merge 1 commit into bfly123:main from SevenX77:td-006-keeper-ping-timeout-env-override

@SevenX77
Summary

Replace the hardcoded 0.2s timeout in daemon_matches_project_config's CcbdClient.ping('ccbd') call with _keeper_ping_timeout_s(), a helper that reads CCB_KEEPER_PING_TIMEOUT_S (default 2.0s) and falls back safely on invalid / empty input.

Why

The keeper's reconcile loop calls CcbdClient(socket_path, timeout_s=0.2).ping('ccbd') on every tick to verify config identity. 0.2s is aggressive: whenever ccbd is in the middle of a paste + verify cycle or a completion-tracker poll burst, the ping races against a busy event loop and can time out. The keeper then marks the lifecycle as failed:config_check_failed:timed out, and from that point every ccb ask returns socket_unreachable until the user manually runs ccb kill and restarts.

We've hit this repeatedly on a project that does heavy multi-agent dispatch (Gemini analyst + Codex reviewer in parallel). Raising the timeout to 2s (env-overridable) eliminates the false-positive "failed" transitions without weakening detection of a genuinely unreachable daemon. A healthy ccbd responds to ping in <10ms, so the extra headroom costs nothing on the success path.
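The race described above can be illustrated with a toy simulation. This is not ccbd or CcbdClient code: `ping_with_timeout`, the queue-based "daemon", and the 0.5s busy period are all hypothetical stand-ins chosen to show why a reply that arrives late fails a 0.2s timeout but succeeds with 2.0s.

```python
import queue
import threading
import time


def ping_with_timeout(busy_for_s, timeout_s):
    """Return True if a simulated daemon replies within timeout_s seconds."""
    replies = queue.Queue()

    def busy_daemon():
        # Stand-in for an event loop that is busy before it can answer.
        time.sleep(busy_for_s)
        replies.put("pong")

    threading.Thread(target=busy_daemon, daemon=True).start()
    try:
        replies.get(timeout=timeout_s)
        return True
    except queue.Empty:
        # The caller gave up before the reply arrived.
        return False
```

With a simulated 0.5s busy period, `ping_with_timeout(0.5, 0.2)` gives up early while `ping_with_timeout(0.5, 2.0)` gets its reply; an idle daemon answers within either timeout.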

Scope

  • lib/ccbd/keeper_runtime/loop.py — add _keeper_ping_timeout_s() helper; plumb into the single call site in daemon_matches_project_config.
  • lib/runtime_env/control_plane.py — allowlist CCB_KEEPER_PING_TIMEOUT_S so the keeper subprocess inherits the env var.

Mirrors the precedent set by CCB_CCBD_CLIENT_TIMEOUT_S (CLI→ccbd path).
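To show why the allowlist entry matters, here is a rough sketch of how an environment allowlist might filter what a subprocess inherits. The names `KEEPER_ENV_ALLOWLIST` and `build_keeper_env`, and the other entries in the set, are hypothetical; the real contents of lib/runtime_env/control_plane.py are not shown in this PR description.

```python
import os

# Hypothetical allowlist. Only the two CCB_* names come from this PR text;
# PATH and HOME are illustrative filler.
KEEPER_ENV_ALLOWLIST = frozenset(
    {
        "PATH",
        "HOME",
        "CCB_CCBD_CLIENT_TIMEOUT_S",
        "CCB_KEEPER_PING_TIMEOUT_S",
    }
)


def build_keeper_env(parent_env=None):
    """Return only the allowlisted variables from the parent environment."""
    source = os.environ if parent_env is None else parent_env
    return {k: v for k, v in source.items() if k in KEEPER_ENV_ALLOWLIST}


# The filtered mapping would then be passed to the child process, for
# example via subprocess.Popen([...], env=build_keeper_env()).
```

Without the allowlist entry, a variable set in the user's shell would be dropped by a filter of this kind and the keeper would silently use the default.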

What is unchanged

  • All other CcbdClient(...) call sites with explicit timeout_s= (daemon_process health probe at 0.2s, etc.) — untouched.
  • If CCB_KEEPER_PING_TIMEOUT_S is not set (or is empty / invalid / non-positive), the timeout falls back to the new default of 2.0s. Note that this is a change from upstream's 0.2s default.

Alternative: keep default at 0.2s

If you prefer the upstream default unchanged and only want the env-override capability, the helper is trivial to switch — replace return 2.0 with return 0.2 in _keeper_ping_timeout_s(). The env-override behavior itself is what resolves the operational pain; the default bump is our recommendation but not load-bearing.

Test plan

  • pytest test/ -k "keeper" → 13 passed
  • Running on personal fork for ~24h: config_check_failed transitions eliminated; no observed regression.

Keeper calls `CcbdClient(...).ping('ccbd')` with a hardcoded 0.2s timeout
during every reconcile tick. When ccbd is busy (paste+verify, poll loop),
the ping races → keeper marks lifecycle.failed:config_check_failed:timed out
→ all `ccb ask` return socket_unreachable until manual restart.

- add `CCB_KEEPER_PING_TIMEOUT_S` (default 2.0s, invalid/neg/empty → default)
- allowlist the env in runtime_env/control_plane.py

Mirrors the earlier CCB_CCBD_CLIENT_TIMEOUT_S fix (CLI→ccbd path).