Skip to content

feat(agent-core): add API key pool for concurrent subagent key rotation#191

Open
Liewzheng wants to merge 6 commits into
MoonshotAI:mainfrom
Liewzheng:dev-feat-subagent-optimized
Open

feat(agent-core): add API key pool for concurrent subagent key rotation#191
Liewzheng wants to merge 6 commits into
MoonshotAI:mainfrom
Liewzheng:dev-feat-subagent-optimized

Conversation

@Liewzheng
Copy link
Copy Markdown

Related Issue

Resolve #2368

Problem

When multiple subagents run in parallel, they all share the same KIMI_API_KEY. This concentrates load on a single key and quickly exhausts its rate-limit quota, causing 429 errors and degraded throughput for parallel tasks.

What changed

Add ApiKeyPool that rotates through multiple KIMI_API_KEY* environment variables with exponential cooldown on retryable errors. Integrate into ProviderManager so concurrent subagents do not share a single key's rate-limit quota. Enable via KIMI_API_KEY_POOL env var.

Key design decisions:

  • Round-robin allocation with cooldown skipping: acquire() rotates through keys and skips any that are in a cooldown period (30s → 5min → 30min based on consecutive failures). If all keys are cooling, it falls back to round-robin so the agent can still make progress.
  • Per-request key injection via resolveAuth: Instead of wrapping the provider (like kimi-cli's KeyPoolKimi), we inject rotating keys through the existing ProviderManager.resolveAuth mechanism, which already supported per-request OAuth tokens. This keeps provider construction unchanged and works naturally with Agent.generate.
  • One rotation retry before tenacity backoff: On a retryable status error (429/500/502/503/504), resolveAuth immediately records the failure and retries once with a fresh key. If that also fails, the error propagates to the existing chatWithRetry loop for normal backoff.
  • OAuth precedence preserved: If a provider has both oauth and apiKey, the existing mutual-exclusion error is thrown. The pool path is only entered when oauth is not configured.
  • Provider-type gating: Pool keys are only injected for providerConfig.type === 'kimi'. Non-kimi providers are unaffected.
  • Env var collection: ApiKeyPool.fromEnv() reads KIMI_API_KEY, KIMI_API_KEY_1KIMI_API_KEY_99. A pool is only created when ≥2 keys are found.

Checklist

  • I have read the CONTRIBUTING document.
  • I have linked a related issue, or explained the problem above.
  • I have added tests that prove my feature works.
  • Ran gen-changesets skill, or this PR needs no changeset.
  • Ran gen-docs skill, or this PR needs no doc update. (No user-facing CLI commands or config options changed; the feature is env-var only.)

Test coverage

Unit tests (api-key-pool.test.ts — 22 tests):

  • fromEnv key collection, constructor validation, round-robin rotation
  • Cooldown skipping and fallback when all keys are cooling
  • Failure progression (30s → 5min → 30min) and key recovery after cooldown
  • Concurrent stress: 100 concurrent acquire() distribute evenly (34/33/33); cooling keys skipped under 50 concurrent acquires; full-cooldown fallback; mid-stream recovery after cooldown expires

Integration tests (runtime-provider.test.ts — 37 tests):

  • Pool key injection for kimi provider; skipped for non-kimi providers
  • Key rotation on retryable errors (429/500/502/503/504)
  • OAuth precedence when both are configured
  • Concurrent stress: 5/60 concurrent resolveAuth calls distribute keys evenly; failed keys enter cooldown and are rerouted under concurrent load

Verification

  • pnpm typecheck — all packages + apps pass
  • pnpm test — 382 test files, 4960+ tests pass
  • pnpm build — all packages and apps build successfully
  • pnpm lint — 0 errors; new files have 0 warnings (existing codebase has unrelated warnings)

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 29, 2026

🦋 Changeset detected

Latest commit: 6912d04

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@moonshot-ai/agent-core Minor
@moonshot-ai/kimi-code Minor
@moonshot-ai/migration-legacy Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36a4f98489

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +194 to +198
if (providerConfig?.type === 'kimi' && this.options.apiKeyPool !== undefined) {
const pool = this.options.apiKeyPool;
return async (request) => {
const key = pool.acquire();
const auth: ProviderRequestAuth = { apiKey: key };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Respect explicit Kimi provider keys

When a global pool exists, this branch wraps every type: 'kimi' provider with per-request auth, even when that provider has its own apiKey or env-derived key. In the inspected Agent.generate path, the supplied options.auth takes precedence over the provider's constructor-time apiKey, so a user who configures a custom Kimi provider/baseUrl or separate account key while also exporting KIMI_API_KEY/KIMI_API_KEY_1 will silently send those requests with the pool keys instead, leading to auth failures or wrong-account billing. Consider gating the pool to the intended managed/env-backed provider or skipping it when providerApiKey(providerConfig) is set.

Useful? React with 👍 / 👎.

Comment on lines +41 to +42
if (primary !== undefined && primary.length > 0) {
keys.push(primary);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore whitespace-only pool keys

If an operator leaves a numbered key set to whitespace (for example KIMI_API_KEY_1=' '), this check still counts it as a usable key and can create a pool containing that value; the later per-request auth path then passes it through as apiKey, bypassing the existing provider-key normalization that trims empty strings. In that environment every rotation onto the whitespace entry will make Kimi requests fail with invalid credentials, so these env values should be trimmed/validated before adding them to the pool.

Useful? React with 👍 / 👎.

Comment on lines +197 to +198
const key = pool.acquire();
const auth: ProviderRequestAuth = { apiKey: key };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep video upload and generation on the same key

Because each withAuth invocation acquires a fresh key, a video read can upload the file through createVideoUploader with one pool key and then the next LLM generation that consumes the returned ms://... file id can run with another key. In sessions with two or more pool keys and video inputs, that crosses API-key/account boundaries for file IDs and can make the follow-up Kimi request unable to access the uploaded video. The pool should pin a key for related requests, such as at agent/turn scope, rather than rotating independently per request.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 537e0d8103

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +209 to +210
if (isRetryableGenerateError(error)) {
pool.recordFailure(key);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Quarantine pooled keys rejected with 401

When the pool contains a bad or expired KIMI_API_KEY_N, requests made with that key will fail with APIStatusError(401), but this branch only calls recordFailure for isRetryableGenerateError cases, which excludes 401. Because pooled API-key auth has no OAuth refresh path, that credential remains marked healthy and every future rotation onto it keeps failing indefinitely instead of routing around the bad key.

Useful? React with 👍 / 👎.

Comment on lines +205 to +206
const result = await request(auth);
pool.resetKey(key);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve cooldowns across concurrent successes

With parallel subagents, the same pooled key can have multiple in-flight requests; if one request for a key returns 429 and records a cooldown, a different request for that same key that was already in flight can later succeed and this unconditional resetKey clears the cooldown immediately. In that common concurrent case the rate-limited key re-enters rotation before its backoff expires, defeating the pool’s rerouting behavior.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5301a76fec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

this.resolveOAuthTokenProvider = options.resolveOAuthTokenProvider;
this.skillDirs = options.skillDirs ?? [];
this.telemetry = options.telemetry ?? noopTelemetryClient;
this.apiKeyPool = ApiKeyPool.fromEnv() ?? undefined;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate key pooling on the opt-in env var

In environments that already export multiple KIMI_API_KEY* values but have not opted into pooling, this still constructs a pool unconditionally at startup. Because that same pool is passed into each session and ProviderManager injects it for Kimi providers without explicit keys, those users get request-level rotation instead of the previous single-key behavior, which can unexpectedly send traffic to a backup or different-account key. Please check the opt-in flag before creating the pool.

Useful? React with 👍 / 👎.

Liewzheng added 6 commits May 29, 2026 14:38
Add ApiKeyPool that rotates through multiple KIMI_API_KEY* env vars
with exponential cooldown on retryable errors. Integrate into
ProviderManager so concurrent subagents do not share a single key's
rate-limit quota. Enable via KIMI_API_KEY_POOL env var.
- Use Array.at() instead of bracket index access (prefer-at)
- Wrap void expressions in arrow function bodies to avoid
  no-confusing-void-expression
- Verify round-robin key distribution under concurrent acquire/load
- Verify cooling keys are skipped even under heavy concurrent load
- Verify fallback behavior when all keys are cooling
- Verify key recovery after cooldown expires
- Fix lint warnings: avoid floating promises and non-thenable in Promise.all
… trim env whitespace

Codex review fixes:
- Pool no longer overrides provider-level explicit apiKey. The pool path
  is only entered when providerApiKey() returns undefined.
- Whitespace-only env values ('   ') are now trimmed before validation
  and treated as missing.
- Update pool tests to use a dedicated POOL_BASE_CONFIG without apiKey.
…rent success

Codex review fixes:
- Pool now records failures for 401 (Unauthorized) errors so bad/expired
  keys are quarantined instead of being retried indefinitely.
- resetKey() no longer clears an active cooldown, preventing a concurrent
  success from wiping out a cooldown that was set by another in-flight
  request for the same key.
- Update resetKey test to advance past cooldown before resetting.
Codex review fix:
- Pool construction in KimiCore is now conditional on the KIMI_API_KEY_POOL
  environment variable being set. Previously, any environment with multiple
  KIMI_API_KEY* values would get request-level rotation even without opting in,
  which could silently redirect traffic to backup/different-account keys.
@Liewzheng Liewzheng force-pushed the dev-feat-subagent-optimized branch from 6698122 to 6912d04 Compare May 29, 2026 06:39
@Liewzheng Liewzheng changed the title Dev feat subagent optimized feat(agent-core): add API key pool for subagent key rotation May 29, 2026
@Liewzheng Liewzheng changed the title feat(agent-core): add API key pool for subagent key rotation feat(agent-core): add API key pool for concurrent subagent key rotation May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant