feat(subagent): add API key pool for parallel subagent execution #2369
Liewzheng wants to merge 23 commits into
api_key_override: str | None = None
if self._root_runtime.key_pool is not None:
    api_key_override = self._root_runtime.key_pool.acquire()
🔴 SubagentBuilder injects KIMI pool keys into non-Kimi provider API calls
In SubagentBuilder.build_builtin_instance, when a key pool exists (runtime.key_pool is not None), a key is unconditionally acquired from the pool and passed as api_key_override regardless of the provider type. The pool only contains KIMI_API_KEY* values. If the subagent's effective model resolves to a non-Kimi provider (e.g., the user sets model to an OpenAI or Anthropic model alias in the Agent tool call), clone_llm_with_model_alias at src/kimi_cli/llm.py:417-425 passes this KIMI key to create_llm, where resolved_api_key = api_key_override is applied unconditionally at src/kimi_cli/llm.py:228. The resulting provider (OpenAI, Anthropic, etc.) would authenticate with an invalid KIMI API key, causing all API calls to fail with authentication errors.
Prompt for agents
The key pool acquisition in SubagentBuilder.build_builtin_instance should only happen when the effective provider is a Kimi provider. Currently, api_key_override is set from the pool unconditionally. The fix should check whether the provider that will be used for this subagent is of type 'kimi' before acquiring a key from the pool.
Approach: Before acquiring from the pool, determine the effective provider type. If effective_model is None, the subagent inherits the root runtime's provider — check self._root_runtime.llm.provider_config.type. If effective_model is not None, look up the model in config to get its provider type. Only acquire from the pool if the provider type is 'kimi'.
Relevant code paths:
- src/kimi_cli/subagents/builder.py:20-26 (key acquisition)
- src/kimi_cli/llm.py:228-240 (api_key_override applied unconditionally)
- src/kimi_cli/llm.py:252-253 (KeyPoolKimi wrapping only in kimi case)
Fixed in 0546f6ce.
SubagentBuilder.build_builtin_instance now determines the effective provider type before acquiring from the key pool:
- If effective_model is None, it checks runtime.llm.provider_config.type
- If effective_model is set, it looks up the model in config.models, then resolves its provider from config.providers, and reads provider.type
- Only when the resolved provider type is "kimi" does it call key_pool.acquire() and pass both api_key_override and key_pool to clone_llm_with_model_alias
- For any non-kimi provider, both values stay None and the subagent falls back to the provider's own API key
Added tests:
- test_builder_skips_key_pool_for_non_kimi_provider: root runtime LLM has openai_legacy provider → no key injected
- test_builder_skips_key_pool_when_effective_model_is_non_kimi: effective_model resolves to a non-kimi model → no key injected
- test_builder_uses_key_pool_for_kimi_provider: kimi provider → key is acquired as before
max_concurrent = _max_foreground_concurrency(self._runtime)
running = _count_running_foreground(self._runtime)
if running >= max_concurrent:
    return ToolError(
        message=(
            f"Too many foreground subagents are already running "
            f"({running}/{max_concurrent}). Please wait for one to finish "
            f"before starting another."
        ),
        brief="Concurrency limit reached",
    )
🟡 TOCTOU race in foreground concurrency check allows exceeding the limit
The foreground concurrency check at src/kimi_cli/tools/agent/__init__.py:181-191 reads the count of running_foreground instances, but the subagent's status is not updated to running_foreground until inside runner.run(req) (after an await at line 204-205). When the LLM issues multiple concurrent Agent tool calls in a single step (dispatched via asyncio.gather), all coroutines can pass the concurrency check before any of them marks its status as running_foreground. For example, with a limit of 4, the model could issue 6 parallel Agent calls and all 6 would pass the check (seeing 0 running), exceeding the intended limit.
Prompt for agents
The concurrency check at lines 181-191 of src/kimi_cli/tools/agent/__init__.py suffers from a time-of-check-to-time-of-use (TOCTOU) race because the subagent status is only set to running_foreground inside runner.run(req), which is awaited after the check. Multiple concurrent Agent tool calls can all pass the check before any of them updates the store.
Possible fix: Eagerly create the subagent instance and set its status to running_foreground BEFORE the await, similar to how _run_in_background marks running_background synchronously before dispatching (lines 280-286). This would require creating the instance record in the foreground path as well, and rolling it back on failure. Alternatively, use a simple asyncio counter (an integer incremented before the await and decremented after) to track in-flight foreground subagents.
Fixed in 0546f6ce.
The root cause was that _count_running_foreground only counted instances whose status was already "running_foreground", but that status was only set inside runner.run — after multiple await points. When the LLM issued concurrent Agent calls via asyncio.gather, every coroutine passed the check before any of them updated the store.
Fix: Eagerly create the instance and set its status to "running_foreground" before the await runner.run:
- ForegroundSubagentRunner._prepare_instance is now a synchronous public method, prepare_instance
- AgentTool.__call__ calls prepare_instance and update_instance(status="running_foreground") right after the concurrency check, before await runner.run(req, prepared)
- runner.run accepts an optional prepared argument so it skips the internal prepare step
This mirrors the background-task pattern already used in _run_in_background, where the status is marked synchronously before dispatching the async work.
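The race and the eager-marking fix can be reproduced in isolation. The sketch below is an assumption-laden toy: `launch_lazy`, `launch_eager`, and the `running` set are hypothetical stand-ins for the tool call and the instance store, and `asyncio.sleep(0)` stands in for the awaits inside runner.run.

```python
import asyncio

LIMIT = 2  # illustrative concurrency limit


async def launch_lazy(running: set[int], agent_id: int) -> bool:
    """Racy: the check and the status update are separated by an await."""
    if len(running) >= LIMIT:
        return False
    await asyncio.sleep(0)  # other coroutines run their own check here
    running.add(agent_id)
    return True


async def launch_eager(running: set[int], agent_id: int) -> bool:
    """Check and mark in one synchronous step, before the first await."""
    if len(running) >= LIMIT:
        return False
    running.add(agent_id)  # no await between the check and the mark
    await asyncio.sleep(0)
    return True


async def count_admitted(launch) -> int:
    running: set[int] = set()
    results = await asyncio.gather(*(launch(running, i) for i in range(6)))
    return sum(results)


lazy_admitted = asyncio.run(count_admitted(launch_lazy))
eager_admitted = asyncio.run(count_admitted(launch_eager))
print(lazy_admitted, eager_admitted)  # prints "6 2"
```

Because asyncio only switches tasks at an await, keeping the check and the mark in the same synchronous stretch is enough; no lock is needed in this single-threaded setting.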
Added tests:
- test_agent_tool_eagerly_sets_running_foreground_before_await: asserts that when runner.run is entered, the instance status is already "running_foreground"
- test_concurrent_agent_calls_respect_limit_after_toctou_fix: spawns a hanging foreground runner, yields control, then verifies the second concurrent Agent call is rejected by the limit
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0546f6ce13
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
if key_pool is not None:
    chat_provider = KeyPoolKimi(chat_provider, key_pool)
Preserve Kimi provider identity in key-pool wrapper
Wrapping Kimi with KeyPoolKimi changes llm.chat_provider from a Kimi instance to a different type, which breaks downstream isinstance(..., Kimi) paths used outside the two patched callsites. For example, ReadMedia only uses Kimi's server-side video upload when that check passes (src/kimi_cli/tools/file/read_media.py), so keyed subagents now fall back to inlined data URLs for videos, which can bloat prompts and fail for larger media. Make the wrapper transparent to Kimi-specific checks (or update all Kimi type checks to unwrap) to avoid this regression.
Fixed in 69deedd6.
Added KeyPoolKimi unwrapping in the two remaining isinstance(..., Kimi) checks that were missed in the initial PR:
- src/kimi_cli/tools/file/read_media.py:121: Video upload now correctly uses the server-side files.upload_video API even when the provider is wrapped by KeyPoolKimi
- src/kimi_cli/wire/server.py:625: Web UI User-Agent suffix injection now works correctly through the wrapper
Pattern matches the existing unwraps in llm.py (thinking_keep) and auth/oauth.py (token refresh).
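The unwrap-then-check pattern might look like the sketch below. The `Kimi` and `KeyPoolKimi` classes here are bare stand-ins, and `unwrap_provider` and `video_strategy` are hypothetical names for this illustration; only the `_provider` attribute is taken from the wrapper code quoted in this thread.

```python
class Kimi:  # stand-in for the real provider class
    def upload_video(self) -> str:
        return "server-side upload"


class KeyPoolKimi:  # stand-in wrapper that delegates to an inner Kimi
    def __init__(self, provider: Kimi) -> None:
        self._provider = provider


def unwrap_provider(provider: object) -> object:  # hypothetical helper
    """Return the wrapped provider for a KeyPoolKimi, otherwise the input."""
    if isinstance(provider, KeyPoolKimi):
        return provider._provider
    return provider


def video_strategy(chat_provider: object) -> str:
    """Pick the upload path; the unwrap must happen before the type check."""
    inner = unwrap_provider(chat_provider)
    if isinstance(inner, Kimi):
        return inner.upload_video()
    return "inline data URL"
```

The design trade-off is that every Kimi-specific type check has to remember the unwrap; a wrapper that subclasses or otherwise impersonates Kimi would avoid that, at the cost of a tighter coupling.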
if runtime.key_pool is not None:
    return max(1, int(runtime.key_pool.key_count * 0.8))
Limit key-pool concurrency only for Kimi executions
_max_foreground_concurrency applies the key-pool cap whenever runtime.key_pool exists, regardless of the active subagent provider. Since runtime.key_pool is initialized from KIMI_API_KEY* env vars globally, non-Kimi sessions can be throttled incorrectly (e.g., OpenAI subagents being capped by unrelated Kimi key count). This should be conditioned on the effective provider for the pending foreground run, not just pool presence.
Fixed in 69deedd6.
_max_foreground_concurrency now takes an optional provider_type parameter:
- provider_type == "kimi" + key_pool present → max(1, int(key_pool.key_count * 0.8))
- Any other provider type (or no key pool) → max(1, int(max_running_tasks * 0.8))
AgentTool.__call__ resolves the effective provider type (from params.model if overridden, otherwise from runtime.llm.provider_config.type) and passes it into the concurrency check. Non-Kimi subagents are no longer throttled by an unrelated Kimi key count.
Updated tests:
- test_key_pool_based_limit_for_kimi / test_key_pool_single_key_fallback_to_one_for_kimi: explicit provider_type="kimi"
- test_key_pool_ignored_for_non_kimi: key pool exists but non-kimi provider uses max_running_tasks cap
prepared = runner.prepare_instance(req)
agent_id = prepared.record.agent_id
store.update_instance(agent_id, status="running_foreground")
Handle foreground prepare failures inside AgentTool error path
runner.prepare_instance(req) is now executed before entering the try block, so validation failures (unknown subagent type, invalid resume id, resume of a running instance) bypass the tool's ToolError handling and surface as uncaught exceptions to the generic tool runtime wrapper. This regresses user-facing error quality and consistency compared with the prior path. Move preparation into the guarded block (or catch around it) so these expected failures return structured ToolErrors.
Fixed in 69deedd6.
runner.prepare_instance(req) and the subsequent update_instance(status="running_foreground") are now back inside the try block, so validation failures (unknown subagent type, invalid resume id, resume of a running instance) are caught by the generic except Exception handler and converted to structured ToolErrors instead of surfacing as uncaught exceptions.
agent_id is pre-initialized to None before the try block; the except handlers only touch the store when agent_id is not None, ensuring a failed prepare_instance does not trigger a secondary NameError or UnboundLocalError.
The eager status update (TOCTOU fix from 0546f6ce) is preserved because prepare_instance + update_instance still execute synchronously before the await runner.run.
💡 Codex Review
Lines 237 to 239 in 69deedd
This log line emits the first 16 characters of the active API key whenever a subagent LLM is created with …
kimi-cli/src/kimi_cli/tools/agent/__init__.py
Line 194 in 69deedd
The concurrency cap is derived from …
Line 21 in 69deedd
This change rewrites locked package sources from …
💡 Remove API key prefix logging from LLM creation
Fixed. Removed the key prefix (prefix=resolved_api_key[:16]) from the logger.info call:
# Before
logger.info(
    "Creating LLM with overridden API key {prefix}... for provider {provider_type}",
    prefix=resolved_api_key[:16],
    provider_type=provider.type,
)
# After
logger.info(
    "Creating LLM with overridden API key for provider {provider_type}",
    provider_type=provider.type,
)
💡 Compute provider type from resume target before capping concurrency
Fixed. This ensures that resuming a non-Kimi subagent is not incorrectly throttled by the Kimi key-pool limit.
💡 Keep uv.lock registry URLs on the default PyPI index
Fixed. Restored uv.lock.
💡 Codex Review
Reviewed commit: 7b026853e2
if isinstance(exc.__cause__, asyncio.CancelledError):
    logger.warning("Foreground agent timed out after {t}s", t=timeout)
    if agent_id is not None:
        store.update_instance(agent_id, status="idle")
Keep timed-out foreground agents marked as killed
When asyncio.wait_for times out, runner.run is cancelled and already records the instance as "killed", but this handler immediately overwrites it to "idle". That state mismatch makes interrupted runs look like normal idle instances in status views/metrics and can mislead follow-up automation that relies on subagent status semantics.
Fixed in 4e8d6ac1.
Removed the store.update_instance(agent_id, status="idle") line from the TimeoutError handler in AgentTool.__call__.
When asyncio.wait_for cancels runner.run on timeout, the runner's internal except asyncio.CancelledError block already marks the instance as "killed" (and re-raises). Overwriting that to "idle" in the outer handler made interrupted runs appear as normal idle instances in status views. The instance now correctly stays "killed".
logger.info(
    "Subagent {agent_id} assigned API key {prefix}... (pool size {n})",
    agent_id=agent_id,
    prefix=api_key_override[:16],
Remove API key material from subagent assignment logs
This log line writes the first 16 characters of each assigned API key to plaintext logs. Even partial key disclosure is sensitive in shared CI/support log pipelines and creates avoidable credential leakage risk; key assignment should be logged without exposing any key bytes.
Fixed in 4e8d6ac1.
Removed the prefix=api_key_override[:16] parameter from the logger.info call in SubagentBuilder.build_builtin_instance.
# Before
logger.info(
    "Subagent {agent_id} assigned API key {prefix}... (pool size {n})",
    agent_id=agent_id,
    prefix=api_key_override[:16],
    n=self._root_runtime.key_pool.key_count,
)
# After
logger.info(
    "Subagent {agent_id} assigned API key from pool (pool size {n})",
    agent_id=agent_id,
    n=self._root_runtime.key_pool.key_count,
)
💡 Codex Review
Reviewed commit: 4e8d6ac18e
@@ -119,7 +119,12 @@ async def _read_media(self, path: KaosPath, file_type: FileType) -> ToolReturnVa
case "video":
    data = await path.read_bytes()
    if (llm := self._runtime.llm) and isinstance(llm.chat_provider, Kimi):
Unwrap KeyPoolKimi before the Kimi video check
The current diff still checks llm.chat_provider for Kimi before the new unwrap runs, so when key pooling wraps the provider as KeyPoolKimi this branch is skipped and Kimi subagents fall back to embedding video data URLs instead of using files.upload_video. Fresh evidence after the earlier thread is that the unwrap was added inside the branch, but the outer guard remains on the wrapped object; move the unwrap before this isinstance check.
async def generate(self, *args: Any, **kwargs: Any) -> Any:
    return await self._provider.generate(*args, **kwargs)

def on_retryable_error(self, error: BaseException) -> bool:
Rotate pooled keys before retrying status errors
This new wrapper only rotates keys when on_retryable_error is called, but _run_with_connection_recovery currently invokes that hook only for APIConnectionError/APITimeoutError; 429/500/503 APIStatusErrors are retried by tenacity with the same client and same exhausted key. In the rate-limit scenario this feature targets, a subagent that receives HTTP 429 will keep retrying the same key even though the pool has other keys available, so status-error retries need to trigger the same rotation path.
💡 Codex Review
Reviewed commit: 1bddb1c3f7
# Enforce foreground concurrency limit (80% of system capacity).
# For resumed instances, derive the provider type from the stored launch spec
# so that non-Kimi resumed subagents are not capped by an unrelated key pool.
effective_model = params.model
Resolve default subagent model before applying Kimi cap
For a new foreground launch with no explicit model, this leaves effective_model as None, so _resolve_effective_provider_type falls back to the root provider even though prepare_instance later records req.model or type_def.default_model and SubagentBuilder.resolve_effective_model will run the subagent on its type default. In a Kimi-root session with a key pool, a subagent type whose default model is OpenAI/Anthropic is therefore treated as Kimi for the foreground key-pool cap and can be rejected at the much lower Kimi limit before it ever starts.
effective_model = launch.model_override or launch.effective_model
provider_type = _resolve_effective_provider_type(self._runtime, effective_model)
max_concurrent = _max_foreground_concurrency(self._runtime, provider_type)
running = _count_running_foreground(self._runtime)
Count only matching-provider foreground runs for the Kimi cap
running counts every running_foreground instance, but max_concurrent may be the Kimi key-pool capacity for the pending request. When an OpenAI/default-model foreground subagent is already running and a Kimi subagent is launched with a small key pool (for example two keys gives a limit of one), the Kimi request is rejected as 1/1 even though no Kimi key is currently in use. The count needs to be scoped to the same effective provider/key-pool class as the cap being enforced.
💡 Codex Review
Reviewed commit: 681ee768c4
if timeout is not None:
    return await asyncio.wait_for(runner.run(req), timeout=timeout)
return await runner.run(req)
return await asyncio.wait_for(runner.run(req, prepared), timeout=timeout)
Clear pre-run timeouts from running_foreground
When the foreground timeout fires while runner.run() is still in its pre-execution preparation (for example prepare_soul() awaiting explore-agent git context collection), the runner has not reached its except asyncio.CancelledError block that marks the instance killed. Because this code now marks the instance running_foreground before the wait_for, the timeout handler returns without changing that status, leaving the subagent counted as running and causing later Agent calls to hit the concurrency limit until restart. Mark the record killed/failed here when it is still running_foreground.
💡 Codex Review
Reviewed commit: a2b1f2c9a9
async def generate(self, *args: Any, **kwargs: Any) -> Any:
    try:
        return await self._provider.generate(*args, **kwargs)
    except APIStatusError as exc:
Rotate keys for streaming status retries
When a retryable Kimi status error is raised while consuming the streaming response, this try has already returned the KimiStreamedMessage, so the wrapper never calls on_retryable_error; KimiStreamedMessage._convert_stream_response converts OpenAIError/httpx.HTTPError into APIStatusError during async for, and KimiSoul will retry 429/500/503/504 with the same exhausted client/key. Fresh evidence after the earlier status-retry fix is that only stream creation is covered here, not iteration, so pooled subagents can still hammer one key on streaming server/rate-limit failures.
💡 Codex Review
kimi-cli/src/kimi_cli/auth/oauth.py
Line 1090 in d1be359
When a logged-in Kimi configuration also sets …
Lines 154 to 156 in d1be359
With key health tracking enabled, failures are recorded here but no success path ever calls …
- Introduce APIKeyPool for round-robin key rotation across subagents
- Add KeyPoolKimi wrapper to retry with fresh keys on rate limits
- Inject subagent type into User-Agent for backend attribution
- Add foreground concurrency limit (80% of key count or task slots)
- Add configurable foreground timeout via KIMI_FOREGROUND_AGENT_TIMEOUT
- Fix oauth.py assertion for KeyPoolKimi-wrapped providers
…U race in foreground concurrency check
…l concurrency to kimi; guard prepare_instance failures
…ncy cap; restore uv.lock
- read_media.py: unwrap KeyPoolKimi BEFORE isinstance(Kimi) guard so video uploads correctly use server-side files.upload_video API.
- llm.py: intercept APIStatusError (429/500/503) inside KeyPoolKimi.generate() to rotate the pooled key BEFORE re-raising for tenacity retry.
- tests: add parametrized tests for retryable status codes and a negative test for non-retryable codes (400).
… for kimi cap
- Resolve default subagent model before applying Kimi cap: when no explicit model is given, look up the built-in type's default_model so that a kimi default correctly triggers the key-pool cap.
- Count only matching-provider foreground runs for the Kimi cap: _count_running_foreground now optionally filters by provider_type so that openai/anthropic/etc. subagents do not consume the kimi key-pool quota.
- Add tests for provider-filtered counting and default-model resolution.
When asyncio.wait_for times out while runner.run() is still in its pre-execution preparation (e.g. prepare_soul awaiting explore-agent git context collection), the runner's except asyncio.CancelledError block that marks the instance 'killed' may not have run yet. Because we now eagerly mark the instance 'running_foreground' before wait_for, the TimeoutError handler must clean up the status when it is still 'running_foreground', otherwise the subagent remains counted as running and later Agent calls hit the concurrency limit.
- In AgentTool.__call__, when TimeoutError.__cause__ is CancelledError, check if the instance is still 'running_foreground' and mark it 'killed'.
- Add test_timeout_cleans_running_foreground_status to verify the cleanup.
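A reduced version of this cleanup is sketched below. It is an illustration under assumptions: `slow_prepare` and `run_with_timeout` are hypothetical names, the store is a plain dict, and the sketch skips the __cause__ check mentioned above and only shows the "still running_foreground, so mark killed" step.

```python
import asyncio


async def slow_prepare() -> None:
    """Pre-execution phase with no CancelledError handler of its own."""
    await asyncio.sleep(10)


async def run_with_timeout(store: dict[str, str], agent_id: str, timeout: float) -> str:
    store[agent_id] = "running_foreground"  # eager mark before the await
    try:
        await asyncio.wait_for(slow_prepare(), timeout=timeout)
        store[agent_id] = "idle"
        return "ok"
    except asyncio.TimeoutError:
        # The cancelled coroutine never marked the instance, so clean up here;
        # a status already set to "killed" by the runner would be left alone.
        if store.get(agent_id) == "running_foreground":
            store[agent_id] = "killed"
        return "timed out"


store: dict[str, str] = {}
outcome = asyncio.run(run_with_timeout(store, "a1", timeout=0.01))
print(outcome, store)
```

Without the cleanup branch, the record would stay at "running_foreground" and keep counting against the concurrency limit.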
Centralize key rotation for retryable APIStatusError (429/500/502/503/504) in KimiSoul._run_with_connection_recovery instead of KeyPoolKimi.generate.
- KeyPoolKimi.generate no longer catches APIStatusError; this prevents double rotation when generate() fails and the error is also caught by the soul.
- _run_with_connection_recovery now treats retryable status errors the same way as connection errors: calls chat_provider.on_retryable_error once per tenacity attempt, then re-enters with _status_retried=True.
- This covers both initial generate() failures AND mid-stream failures during async iteration, which were previously missed by the generate() wrapper.
Tests:
- Update test_step_status_error to expect recovery_calls == 1.
- Add test_step_mid_stream_status_error_triggers_recovery with a RetryableChatProvider that yields partial stream then errors, verifying on_retryable_error is called before the retry.
- Remove obsolete KeyPoolKimi.generate APIStatusError tests.
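Why the rotation has to live in the caller that consumes the stream can be seen in a small model. This sketch swaps tenacity for a plain loop and uses invented names (`StatusError`, `RotatingProvider`, `run_with_recovery`); it is not the KimiSoul code, only an instance of the same idea: one rotation per failed attempt, including failures raised mid-stream.

```python
import asyncio

RETRYABLE = {429, 500, 502, 503, 504}


class StatusError(Exception):  # stand-in for APIStatusError
    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class RotatingProvider:  # toy provider with a pool of keys
    def __init__(self, keys: list[str]) -> None:
        self.keys = keys
        self.index = 0
        self.rotations = 0

    def on_retryable_error(self, error: BaseException) -> bool:
        self.index = (self.index + 1) % len(self.keys)
        self.rotations += 1
        return True

    async def stream(self):
        yield "partial"
        if self.keys[self.index] == "exhausted":
            raise StatusError(429)  # fails after the stream was handed out
        yield "done"


async def run_with_recovery(provider: RotatingProvider, max_attempts: int = 3) -> list[str]:
    for attempt in range(max_attempts):
        chunks: list[str] = []
        try:
            async for chunk in provider.stream():
                chunks.append(chunk)
            return chunks
        except StatusError as exc:
            if exc.status_code not in RETRYABLE or attempt == max_attempts - 1:
                raise
            provider.on_retryable_error(exc)  # rotate once, then retry
    raise RuntimeError("max_attempts must be positive")


provider = RotatingProvider(["exhausted", "fresh"])
chunks = asyncio.run(run_with_recovery(provider))
print(chunks, provider.rotations)  # prints "['partial', 'done'] 1"
```

A wrapper around stream creation alone would never see the 429 here, because the error is raised during iteration.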
…i_provider
Add a public utility in and replace the four manual unwrap sites (oauth.py, read_media.py, server.py, llm.py's thinking_keep check) so that future checks do not need to remember to unwrap the wrapper.
…APIKeyPool
APIKeyPool now tracks per-key failure state:
- record_failure(key) applies exponential cooldown: 30s -> 5min -> 30min
- acquire() skips keys that are in cooldown, falling back to round-robin only when every key is cooling down
- reset_key(key) clears the failure state
KeyPoolKimi.on_retryable_error records the old key's failure before rotating to the next one, so exhausted keys are not immediately reused.
Tests cover cooldown skipping, all-cooldown fallback, exponential progression, and reset behaviour.
When a key pool is active, _apply_access_token must not overwrite chat_provider.client.api_key, because the pool (not OAuth) manages the key. Overwriting would silently replace the pooled key with the OAuth token, breaking subsequent rotation.
When a key's cooldown period expires and it is re-acquired, reset its consecutive_failures counter back to zero. Without this, recovered keys start from higher cooldown tiers on their next failure, even though they have demonstrated recovery.
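The per-key cooldown with reset-on-recovery can be modelled as a small state object. This is a sketch under assumptions: `KeyState`, `try_acquire`, and the explicit `now` argument are hypothetical, and the choice to stay at the 30-minute tier after the third failure is an assumption of this example.

```python
from dataclasses import dataclass

COOLDOWNS = (30.0, 300.0, 1800.0)  # seconds: 30s -> 5min -> 30min


@dataclass
class KeyState:  # hypothetical per-key health record
    consecutive_failures: int = 0
    cooldown_until: float = 0.0

    def record_failure(self, now: float) -> None:
        """Start a cooldown whose length grows with consecutive failures."""
        tier = min(self.consecutive_failures, len(COOLDOWNS) - 1)
        self.cooldown_until = now + COOLDOWNS[tier]
        self.consecutive_failures += 1

    def try_acquire(self, now: float) -> bool:
        """Return True if the key is usable; an expired cooldown resets the state."""
        if now < self.cooldown_until:
            return False
        # Cooldown expired (or never set): the key starts from the lowest tier again.
        self.consecutive_failures = 0
        self.cooldown_until = 0.0
        return True
```

Passing `now` explicitly keeps the logic testable without sleeping; a real pool would more likely read a monotonic clock.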
ed11ebb to ee7452c
💡 Codex Review
Reviewed commit: ee7452cb80
if runtime.key_pool is not None:
    return
Only skip OAuth refresh for wrapped pooled providers
When a user is logged in with the managed OAuth Kimi provider and also sets multiple KIMI_API_KEY* values for subagents, the root runtime.key_pool is non-null even though the root runtime.llm.chat_provider was created before the pool and is not a KeyPoolKimi. In that setup a 401 refresh updates the token cache but this early return leaves the root client's api_key on the expired token, so the retry keeps failing; gate this on the active chat provider actually being the key-pool wrapper rather than on the global pool existing.
| # Cooldown expired — reset the key to healthy. | ||
| self._states[key] = _KeyState() | ||
| return key | ||
| # All keys in cooldown — fall back to the current slot. |
Advance the fallback slot when all keys are cooling down
If every key is in cooldown, the loop wraps _index back to its original value and this fallback returns that same key without advancing it. Subsequent retries while the pool is still fully cooling down will therefore hammer one key instead of round-robin falling back across the pool, which defeats the rotation behavior exactly after widespread 429/5xx failures.
When every key is in cooldown, the fallback path returned the same key without advancing _index. Subsequent acquire() calls would hammer that single key instead of round-robin falling back across the pool.
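The fallback behaviour can be illustrated with a stripped-down pool. This is a sketch, not the APIKeyPool implementation: `Pool` is a hypothetical class, and cooldown tracking is reduced to a plain `cooling` set so that only the index handling is visible.

```python
class Pool:  # hypothetical minimal round-robin pool
    def __init__(self, keys: list[str]) -> None:
        self._keys = list(keys)
        self._index = 0
        self.cooling: set[str] = set()  # keys currently in cooldown

    def acquire(self) -> str:
        n = len(self._keys)
        for _ in range(n):
            key = self._keys[self._index]
            self._index = (self._index + 1) % n
            if key not in self.cooling:
                return key
        # Every key is cooling down and the loop wrapped _index back to where
        # it started. Advance it here so the next call falls back to a
        # different key instead of returning the same one again.
        key = self._keys[self._index]
        self._index = (self._index + 1) % n
        return key
```

With all three keys of a fresh pool cooling down, successive calls return the keys in order rather than the first key three times.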
Gating _apply_access_token on runtime.key_pool is too broad: a user may have a key pool configured for subagents while the root runtime still uses an unwrapped OAuth-based provider. In that case a 401 refresh would update the token cache but leave the root client's api_key on the expired token, causing retries to keep failing. Now we check isinstance(runtime.llm.chat_provider, KeyPoolKimi) so the skip only applies when the active provider is actually managed by the pool.
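The narrower gate might be sketched as follows. The classes are bare stand-ins and `apply_access_token` is a hypothetical free function returning a bool for testability; only the isinstance check on the active chat provider reflects the change described above.

```python
class Client:  # stand-in for the HTTP client holding the credential
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


class Kimi:  # stand-in provider
    def __init__(self, client: Client) -> None:
        self.client = client


class KeyPoolKimi:  # stand-in wrapper whose key is managed by the pool
    def __init__(self, provider: Kimi) -> None:
        self._provider = provider


def apply_access_token(chat_provider: object, token: str) -> bool:
    """Write a refreshed OAuth token unless the pool owns the provider's key."""
    if isinstance(chat_provider, KeyPoolKimi):
        return False  # skip: overwriting would clobber the pooled key
    chat_provider.client.api_key = token
    return True
```

A root provider that is plain Kimi still receives the refreshed token even when a pool exists elsewhere in the runtime for subagents.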
💡 Codex Review
Reviewed commit: 53efcfde23
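| `resume` | string | Optional Agent instance ID, used to resume an existing instance |
| `run_in_background` | bool | Whether to run in the background; defaults to false |
| `timeout` | int | Timeout in seconds, range 30–3600. Foreground has no default timeout (runs to completion); background defaults to 15 minutes; the task is stopped once the timeout elapses |
| `timeout` | int | Timeout in seconds, range 30–3600. Foreground defaults to 300 seconds (or the value of `KIMI_FOREGROUND_AGENT_TIMEOUT`); background defaults to 15 minutes; the task is stopped once the timeout elapses. Set to `0` to disable the foreground timeout |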
Align the documented timeout escape hatch with the schema
When users follow this new table and pass timeout: 0 to Agent to disable the foreground timeout, the tool call is rejected before it reaches _resolve_foreground_timeout because Params.timeout is constrained with ge=30 in src/kimi_cli/tools/agent/__init__.py. The 0 escape hatch currently only works through KIMI_FOREGROUND_AGENT_TIMEOUT, so either allow 0 in the parameter schema or remove this instruction from the docs.
The docs stated that passing timeout: 0 disables the foreground timeout, but Params.timeout was constrained with ge=30, so Pydantic rejected the value before it ever reached _resolve_foreground_timeout.
- Change ge=30 to ge=0 in the Field constraint
- Treat params_timeout == 0 as 'disabled' (same semantics as KIMI_FOREGROUND_AGENT_TIMEOUT=0)
- Update inline snapshots in test_default_agent and test_tool_schemas
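The resolution order described here (explicit parameter, then environment variable, then the 300-second default, with `0` meaning "disabled") might look roughly like the sketch below. The function name `resolve_foreground_timeout` and the `env` argument are assumptions for illustration, and treating a negative environment value as invalid is also an assumption rather than something the PR states.

```python
import os

DEFAULT_FOREGROUND_TIMEOUT = 300  # seconds, the default stated in the PR description
ENV_VAR = "KIMI_FOREGROUND_AGENT_TIMEOUT"


def resolve_foreground_timeout(params_timeout, env=None):
    """Return the timeout in seconds, or None when the timeout is disabled."""
    env = os.environ if env is None else env

    # 1. An explicit parameter wins; 0 disables the timeout.
    if params_timeout is not None:
        return None if params_timeout == 0 else params_timeout

    # 2. Otherwise consult the environment variable; 0 disables the timeout.
    raw = env.get(ENV_VAR)
    if raw is not None:
        try:
            value = int(raw)
        except ValueError:
            return DEFAULT_FOREGROUND_TIMEOUT  # invalid value: fall back
        if value < 0:
            return DEFAULT_FOREGROUND_TIMEOUT  # assumption: negatives are invalid
        return None if value == 0 else value

    # 3. No parameter and no environment override: use the default.
    return DEFAULT_FOREGROUND_TIMEOUT
```

Returning `None` for "disabled" keeps the result directly usable as the `timeout` argument of `asyncio.wait_for`, where `None` means wait indefinitely.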
💡 Codex Review
Reviewed commit: 4e4267f6fd
| "KIMI_FOREGROUND_AGENT_TIMEOUT. Background: default from config " | ||
| "(15min), max 3600s (1hr). The agent is stopped if it exceeds " | ||
| "this limit." | ||
| ), |
Reject zero timeout for background agents
When run_in_background=true, this relaxed schema now accepts timeout: 0, but the background path passes that value through to BackgroundAgentRunner, which treats any non-None timeout as asyncio.wait_for(..., timeout=0) and immediately marks the task timed out. Since 0 is only documented/handled as the foreground timeout escape hatch, background Agent calls can now create tasks that fail instantly instead of being rejected or using the configured background default.
The previous fix allowed timeout=0 in the schema for foreground agents (to disable the timeout), but background tasks passed that 0 through to BackgroundAgentRunner, which treats any non-None timeout as asyncio.wait_for(..., timeout=0) and immediately marks the task timed out. Zero timeout is now explicitly rejected in the background path with a clear ToolError, since background tasks must have a positive timeout.
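A small sketch of why a zero timeout has to be rejected on the background path. The `ToolError` class here is a local stand-in, `resolve_background_timeout` is a hypothetical helper name, and the 900-second default is taken from the documented "15 minutes"; the `asyncio.wait_for` behavior is standard library behavior.

```python
import asyncio


class ToolError(Exception):
    """Local stand-in for the project's ToolError."""


def resolve_background_timeout(params_timeout, default_timeout=900):
    """Return a positive timeout for background runs, rejecting zero."""
    if params_timeout is None:
        return default_timeout  # documented background default: 15 minutes
    if params_timeout <= 0:
        raise ToolError("background agents require a positive timeout")
    return params_timeout


async def zero_timeout_expires_immediately():
    """Show that wait_for with timeout=0 times out before the work can finish."""
    try:
        await asyncio.wait_for(asyncio.sleep(1), timeout=0)
    except asyncio.TimeoutError:
        return True
    return False
```

Running `asyncio.run(zero_timeout_expires_immediately())` returns `True`, which is the instant-failure behavior the review comment warned about.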
💡 Codex Review
Reviewed commit: 100b9bcbcd
| if params.resume: | ||
| record = store.get_instance(params.resume) | ||
| if record is not None: | ||
| launch = record.launch_spec | ||
| effective_model = launch.model_override or launch.effective_model |
Respect resume model overrides when applying provider caps
When resuming an agent with an explicit model override, this branch overwrites the already-validated params.model with the stored launch spec, but ForegroundSubagentRunner.run() later runs the resumed agent with req.model. In a Kimi-key-pool session this can reject an OpenAI/Anthropic override under the Kimi cap, or let a Kimi override bypass that cap, because the concurrency check classifies the pending run using the old stored provider rather than the model that will actually execute.
When resuming an agent with an explicit model override, the concurrency check was using the stored launch_spec's old model instead of the override, which could reject an OpenAI/Anthropic override under the Kimi cap, or let a Kimi override bypass that cap. Fix: only fall back to the stored launch spec when no override is given. Also fixes _count_running_foreground to filter by provider type for ALL providers, not just kimi. Previously non-kimi caps were incorrectly consumed by unrelated-provider subagents because the filter was only applied when provider_type == 'kimi'.
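The "only fall back to the stored launch spec when no override is given" rule can be written as a small selection function. The field names `model_override` and `effective_model` follow the diff above; the `LaunchSpec` dataclass here is a simplified stand-in and `model_for_concurrency_check` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LaunchSpec:
    # Simplified stand-in; only the two fields used in the diff are modeled.
    model_override: Optional[str]
    effective_model: Optional[str]


def model_for_concurrency_check(requested_model, stored):
    """Pick the model that will actually execute, for provider classification."""
    # An explicit override on the resume request takes precedence.
    if requested_model is not None:
        return requested_model
    # No stored record (fresh launch): nothing to fall back to.
    if stored is None:
        return None
    # Otherwise reuse what the instance was launched with.
    return stored.model_override or stored.effective_model
```

Classifying the pending run by this value, rather than by the stored spec alone, keeps the concurrency check aligned with the provider that will serve the resumed run.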
💡 Codex Review
Reviewed commit: 60d70eb6dc
| instance_provider = _resolve_effective_provider_type(runtime, r.launch_spec.effective_model) | ||
| if instance_provider == provider_type: |
Count resumed override runs under the executing provider
Fresh evidence after the resume-override fix is that running counts still classify each active instance from the persisted launch_spec.effective_model, but ForegroundSubagentRunner.run() applies a resume model override only to a local launch_spec and SubagentStore.update_instance() never persists that override. If an old OpenAI/default instance is resumed with a Kimi model while a Kimi key pool cap is active, this count ignores the running Kimi execution, so subsequent Kimi foreground agents can bypass the key-pool concurrency limit.
💡 Codex Review
Reviewed commit: cb47dfcbbb
| store.update_instance( | ||
| params.resume, | ||
| launch_spec=replace( | ||
| launch, | ||
| model_override=effective_model, |
Defer persisting resume model override until launch succeeds
When resuming with an explicit model, this updates the stored launch spec before the request has passed the foreground concurrency check or prepare_instance's running-instance validation. If the override call is rejected because the provider cap is full, or because the target agent is already running_foreground/running_background, no new execution happens but the instance metadata is still changed, so later resumes inherit a model that was never successfully launched and active-provider counting can be skewed for an already-running agent. Persist the override only after the resume is accepted, or roll it back on these early failure paths.
💡 Codex Review
Reviewed commit: bcd5f0bf1b
| max_concurrent = _max_foreground_concurrency(self._runtime, provider_type) | ||
| # Count only matching-provider runs so that a cap for one provider is not | ||
| # consumed by subagents running on a different provider. | ||
| running = _count_running_foreground(self._runtime, provider_type) |
Keep the non-Kimi foreground cap global
When the pending run is not using the Kimi key-pool cap, this still filters the running count by provider type while max_concurrent comes from the global background.max_running_tasks capacity. In a mixed-provider session, for example with max_running_tasks=4 (foreground cap 3), three OpenAI foreground agents can already be running and an Anthropic foreground launch will see running == 0 and be allowed, bypassing the intended global foreground limit. Scope the count by provider only when enforcing the Kimi key-pool cap; otherwise count all running_foreground instances.
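The scoping the reviewer asks for can be sketched as below. The cap formulas come from the PR description and the `running_foreground` status name from the review thread; the `RunningInstance` dataclass, the function names, and the flat argument lists are assumptions for illustration, not the PR's actual signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningInstance:
    # Hypothetical flattened view of a stored subagent instance.
    provider_type: str
    status: str


def max_foreground_concurrency(provider_type, key_count, max_running_tasks):
    """Cap from the PR description: 80% of keys (Kimi pool) or of max tasks."""
    if provider_type == "kimi" and key_count >= 2:
        return max(1, int(key_count * 0.8))
    return max(1, int(max_running_tasks * 0.8))


def count_running_foreground(instances, provider_type, kimi_pool_active):
    """Count by provider only for the Kimi pool cap; otherwise count globally."""
    running = [r for r in instances if r.status == "running_foreground"]
    if kimi_pool_active and provider_type == "kimi":
        return sum(1 for r in running if r.provider_type == "kimi")
    return len(running)


def can_launch(instances, provider_type, key_count, max_running_tasks):
    """Return True when a new foreground run fits under the applicable cap."""
    pool_active = key_count >= 2
    cap = max_foreground_concurrency(provider_type, key_count, max_running_tasks)
    running = count_running_foreground(instances, provider_type, pool_active)
    return running < cap
```

In the reviewer's scenario (`max_running_tasks=4`, three OpenAI foreground agents running, no key pool), an Anthropic launch sees three running instances against a cap of three and is rejected.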
Related Issue
Closes #2368
Description
1. API Key Pool for Parallel Subagent Execution
This PR introduces `APIKeyPool` (`src/kimi_cli/llm_key_pool.py`), a round-robin API key allocator designed for parallel subagent execution.
- `APIKeyPool.from_env("KIMI_API_KEY")` collects keys from `KIMI_API_KEY`, `KIMI_API_KEY_1`, `KIMI_API_KEY_2`, … up to `KIMI_API_KEY_99`. A pool is only created when ≥2 keys are found; otherwise it returns `None` and subagents fall back to the root runtime's key.
- `acquire()` returns the next key in rotation, ensuring concurrent subagents are distributed across different API keys instead of hammering a single key's rate-limit quota.
2. KeyPoolKimi Wrapper
Added `KeyPoolKimi` (`src/kimi_cli/llm.py`), a thin wrapper around the Kosong `Kimi` provider:
- Forwards provider attributes and methods (`generate`, `with_thinking`, `with_generation_kwargs`, `with_extra_body`, etc.)
- `on_retryable_error()` swaps in the next key from the pool and recreates the HTTP client
- Exposes a `provider` property for unwrapping when type checks are needed
3. Subagent User-Agent Attribution
`SubagentBuilder` now injects a subagent-specific `User-Agent` header. This allows the Kimi backend to distinguish root agent calls from subagent calls for monitoring and attribution.
4. Foreground Concurrency Limit
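Added `_max_foreground_concurrency()` in `src/kimi_cli/tools/agent/__init__.py`:
- With a key pool: `max(1, int(key_count * 0.8))`
- Without a key pool: `max(1, int(background.max_running_tasks * 0.8))`
- When the limit is reached, the tool returns a `ToolError` instead of blocking
5. Configurable Foreground Timeout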
Changed the default foreground subagent timeout from "no limit" to 300 seconds (5 minutes):
- An explicit `timeout` parameter still takes highest priority
- The `KIMI_FOREGROUND_AGENT_TIMEOUT` environment variable overrides the default (set to `0` to disable)
6. OAuth Compatibility Fix
Fixed `oauth.py` `_apply_access_token()`, which had a hard `isinstance(chat_provider, Kimi)` assertion. When `KeyPoolKimi` wraps the provider, the assertion failed and crashed the subagent. The fix unwraps `KeyPoolKimi` before the type check.
7. `KIMI_MODEL_THINKING_KEEP` Compatibility Fix
The `thinking_keep` check also used a bare `isinstance(chat_provider, Kimi)` assertion. When key pooling is enabled, the provider is wrapped in `KeyPoolKimi`, so the check never matched and the extra body was never injected. Fixed by unwrapping before the type check.
8. `clone_llm_with_model_alias` Fix
When `model_alias=None` but `api_key_override`, `key_pool`, or `extra_headers` were passed, the function incorrectly returned the original LLM unchanged. This meant subagents could not receive a distinct API key when no model override was requested. Fixed to recreate the LLM from stored `provider_config`/`model_config` when override parameters are present.
9. Logger UnboundLocalError Fix
Removed a local `from kimi_cli.utils.logging import logger` inside `create_llm` that shadowed the module-level import. When `api_key_override` was set but `oauth` was `None`, the local import was skipped and the subsequent `logger.info(...)` call raised `UnboundLocalError`.
10. Default HTTP Timeout for OpenAI Client
Added a default `httpx.Timeout(connect=10, read=300, write=30, pool=30)` in `openai_common.py` to prevent indefinite API hangs when no timeout is configured.
11. Subagent Live Visualization
Enhanced the shell UI to show subagent progress in real time:
- Shows a step counter, spinner, and elapsed time (e.g. `Step 2 ⠋ 12s`)
- `_LiveView` handles `TextPart` and `StepBegin` events from the subagent wire stream
Breaking Changes
- Foreground subagents now time out after 300 seconds by default instead of running without a limit. To restore the previous behavior, set `KIMI_FOREGROUND_AGENT_TIMEOUT=0` or pass an explicit `timeout` parameter.
Files Changed
- `src/kimi_cli/llm.py`, `src/kimi_cli/llm_key_pool.py`, `src/kimi_cli/app.py`, `src/kimi_cli/soul/agent.py`
- `src/kimi_cli/subagents/builder.py`, `src/kimi_cli/tools/agent/__init__.py`
- `src/kimi_cli/auth/oauth.py`
- `src/kimi_cli/ui/shell/visualize/_blocks.py`, `src/kimi_cli/ui/shell/visualize/_live_view.py`
- `packages/kosong/src/kosong/chat_provider/openai_common.py`
- `tests/core/test_llm_key_pool.py`, `tests/core/test_key_pool_kimi.py`, `tests/core/test_foreground_concurrency.py`, `tests/core/test_foreground_agent_timeout.py`, `tests/core/test_subagent_builder.py`, `tests/tools/test_tool_schemas.py`, `tests/core/test_default_agent.py`, `tests/ui/test_usage.py`
- `CHANGELOG.md`, `docs/en/configuration/env-vars.md`, `docs/en/customization/agents.md`, `docs/en/release-notes/changelog.md`, `docs/zh/configuration/env-vars.md`, `docs/zh/customization/agents.md`, `docs/zh/release-notes/changelog.md`
Testing
- `test_llm_key_pool.py`: verifies `APIKeyPool.from_env` key collection, round-robin ordering, and empty-pool behavior
- `test_key_pool_kimi.py`: verifies `KeyPoolKimi` attribute forwarding and `on_retryable_error` key rotation
- `test_foreground_concurrency.py`: verifies concurrency cap enforcement and `ToolError` rejection when limit is reached
- `test_foreground_agent_timeout.py`: verifies env var override, default 300s, `0` = no limit, and invalid env fallback
- `test_subagent_builder.py`: adjusted mock signature for `clone_llm_with_model_alias`
- `test_tool_schemas.py`: refreshed inline snapshot for new `timeout` default
- `test_default_agent.py`: updated agent description assertion
- `test_usage.py`: adjusted for rich 14.x `grey23` empty-segment filtering
Checklist
- `make gen-changelog` to update the changelog.
- `make gen-docs` to update the user documentation.