keeper: expose CCB_KEEPER_PING_TIMEOUT_S env override for config-check ping#186
Open
SevenX77 wants to merge 1 commit intobfly123:mainfrom
Keeper calls `CcbdClient(...).ping('ccbd')` with a hardcoded 0.2s timeout
during every reconcile tick. When ccbd is busy (paste+verify, poll loop),
the ping races → keeper marks lifecycle.failed:config_check_failed:timed out
→ all `ccb ask` return socket_unreachable until manual restart.
- add `CCB_KEEPER_PING_TIMEOUT_S` (default 2.0s, invalid/neg/empty → default)
- allowlist the env in runtime_env/control_plane.py
Mirrors the earlier CCB_CCBD_CLIENT_TIMEOUT_S fix (CLI→ccbd path).
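The fallback rules listed above (default 2.0s; invalid, negative, or empty values revert to the default) can be sketched in Python roughly as follows. This is a minimal illustration, not the code from the PR: the function name `keeper_ping_timeout_s`, the `env` parameter, and the `DEFAULT_PING_TIMEOUT_S` constant are assumptions, and rejecting `nan`/`inf` is an extra guard the PR does not mention.

```python
import math
import os

# Assumed default, taken from the PR description (2.0s).
DEFAULT_PING_TIMEOUT_S = 2.0


def keeper_ping_timeout_s(env=None):
    """Return the ping timeout in seconds, or the default on bad input."""
    # Accept an explicit mapping so the function is easy to test;
    # fall back to the process environment otherwise.
    env = os.environ if env is None else env
    raw = (env.get("CCB_KEEPER_PING_TIMEOUT_S") or "").strip()
    if not raw:
        # Unset or empty: use the default.
        return DEFAULT_PING_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        # Not a number: use the default.
        return DEFAULT_PING_TIMEOUT_S
    # Reject zero, negatives, NaN, and infinity (the last two are an
    # assumption of this sketch, not stated in the PR).
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_PING_TIMEOUT_S
    return value
```

With this shape, switching back to the upstream default would be a one-line change to the constant, and the call site would pass the result as `timeout_s`.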
Summary
Replace the hardcoded `0.2s` timeout in `daemon_matches_project_config`'s `CcbdClient.ping('ccbd')` call with `_keeper_ping_timeout_s()`, a helper that reads `CCB_KEEPER_PING_TIMEOUT_S` (default `2.0s`) and falls back safely on invalid / empty input.
Why
The keeper's reconcile loop calls `CcbdClient(socket_path, timeout_s=0.2).ping('ccbd')` on every tick to verify config identity. 0.2s is aggressive: whenever ccbd is in the middle of a `paste + verify` cycle or a completion-tracker poll burst, the ping races against a busy event loop, the keeper marks the lifecycle as `failed:config_check_failed:timed out`, and from that point every `ccb ask` returns `socket_unreachable` until the user manually runs `ccb kill` and restarts.
We've hit this repeatedly on a project that does heavy multi-agent dispatch (Gemini analyst + Codex reviewer in parallel). Raising the timeout to 2s (env-overridable) eliminates the false-positive "failed" transitions without weakening the actual-unreachable detection: a healthy ccbd responds to ping in <10ms, so the extra headroom costs nothing on the success path.
Scope
- `lib/ccbd/keeper_runtime/loop.py`: add `_keeper_ping_timeout_s()` helper; plumb into the single call site in `daemon_matches_project_config`.
- `lib/runtime_env/control_plane.py`: allowlist `CCB_KEEPER_PING_TIMEOUT_S` so the keeper subprocess inherits the env var.
Mirrors the precedent set by `CCB_CCBD_CLIENT_TIMEOUT_S` (CLI→ccbd path).
What is unchanged
- Other `CcbdClient(...)` call sites with explicit `timeout_s=` (daemon_process health probe at `0.2s`, etc.) are untouched.
- When `CCB_KEEPER_PING_TIMEOUT_S` is not set (or is empty / invalid / non-positive), the new default is 2.0s. This is a default change from upstream's 0.2s.
Alternative: keep default at 0.2s
If you prefer the upstream default unchanged and only want the env-override capability, the helper is trivial to switch: replace `return 2.0` with `return 0.2` in `_keeper_ping_timeout_s()`. The env-override behavior itself is what resolves the operational pain; the default bump is our recommendation but not load-bearing.
Test plan
- `pytest test/ -k "keeper"` → 13 passed
- personal fork for ~24h: `config_check_failed` transitions eliminated; no observed regression.
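For reviewers unfamiliar with why the allowlist change in Scope is needed: if the control plane builds the keeper subprocess environment from an allowlist, a variable that is not listed never reaches the keeper, so the override would silently have no effect. The sketch below illustrates that filtering pattern only. `ALLOWED_ENV` and `filtered_env` are hypothetical names, and the actual contents and structure of `lib/runtime_env/control_plane.py` are not shown in this PR description.

```python
import os

# Hypothetical allowlist for illustration; the real list lives in
# lib/runtime_env/control_plane.py and is not reproduced here.
ALLOWED_ENV = frozenset({
    "PATH",
    "HOME",
    "CCB_CCBD_CLIENT_TIMEOUT_S",
    "CCB_KEEPER_PING_TIMEOUT_S",
})


def filtered_env(source=None, allowed=ALLOWED_ENV):
    """Return only the allowlisted variables from the source mapping."""
    source = os.environ if source is None else source
    return {key: value for key, value in source.items() if key in allowed}
```

A child process started with `subprocess.Popen(cmd, env=filtered_env())` would then see `CCB_KEEPER_PING_TIMEOUT_S` only because it appears in the allowlist.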