Skip to content

fix(soul): repair orphan tool_calls when replaying history#2383

Open
Pluviobyte wants to merge 1 commit into
MoonshotAI:mainfrom
Pluviobyte:fix/normalize-history-orphan-tool-calls
Open

fix(soul): repair orphan tool_calls when replaying history#2383
Pluviobyte wants to merge 1 commit into
MoonshotAI:mainfrom
Pluviobyte:fix/normalize-history-orphan-tool-calls

Conversation

@Pluviobyte
Copy link
Copy Markdown

@Pluviobyte Pluviobyte commented May 28, 2026

Related Issue

Resolve #2336

Description

When a kimi-cli session is killed mid-turn — the reporter describes high memory pressure, but the same failure shape can come from kill -9, terminal close, OOM, etc. — the persisted context.jsonl can contain an assistant message whose tool_calls were written without the matching tool role responses. On resume the next provider request fails with:

400 ... an assistant message with 'tool_calls' must be followed by tool messages
responding to each 'tool_call_id'. The following tool_call_ids did not have
response messages: Shell:206

Because the corrupted history is permanent on disk, every retry of the same prompt hits the same error and the saved conversation becomes a write-off.

Fix

normalize_history() already runs immediately before each provider call (KimiSoul._step, side-question replay in btw.py). Extend it with a final pass that scans for assistant messages carrying tool_calls, looks at the immediately following tool role messages, and inserts a synthetic tool placeholder for any tool_call_id that was never responded to.

  • The persisted history is not modified — only the message list handed to the provider.
  • Well-formed histories take the same path as before: when no orphan is found the helper returns the input unchanged.
  • The placeholder body is short ((tool result unavailable: the previous session was interrupted before this tool call completed)) so it adds minimal context tokens on resume while giving the model a clear signal that the previous tool result is missing.
  • Partial orphans (parallel tool calls where some have responses and some don't) only get placeholders for the missing ids.

What this PR does not change

It does not change the persistence layer to flush tool responses atomically — the on-disk history can still be left in the same state under a hard crash. The aim here is to make a saved session recoverable, not to prevent the corruption in the first place. Happy to land an atomic-flush follow-up if that direction is in scope.

Checklist

  • I have read the CONTRIBUTING document.
  • I have linked the related issue.
  • I have added tests that prove the fix is effective.
  • I have updated CHANGELOG.md (manual Unreleased entry, following the existing one-line-per-bullet style; make gen-changelog not run because the change is mechanical).
  • make gen-docs not run — N/A: no user-visible CLI / config change.

Test plan

  • uv run ruff check src/kimi_cli/soul/dynamic_injection.py tests/core/test_normalize_history.py → clean
  • uv run ruff format --check src/kimi_cli/soul/dynamic_injection.py tests/core/test_normalize_history.py → clean
  • uv run pyright src/kimi_cli/soul/dynamic_injection.py tests/core/test_normalize_history.py0 errors, 0 warnings, 0 informations
  • uv run pytest tests/core/test_normalize_history.py tests/core/test_kimisoul_steer.py tests/core/test_soul_message.py -q52 passed

Proof of fix

New tests in tests/core/test_normalize_history.py:

  • test_orphan_tool_call_synthesized_when_followed_by_user — exact [Bug] Session corruption under memory pressure: lost conversation + 400 tool_call response error on resume #2336 shape (assistant with tool_calls, then user turn, no tool message in between).
  • test_orphan_tool_call_synthesized_at_history_tail — assistant with tool_calls is the last message; resume must still work.
  • test_complete_tool_response_not_duplicated — well-formed pair is untouched.
  • test_partial_orphan_only_missing_ids_synthesized — parallel tool_calls, only the missing id gets a placeholder.
  • test_multiple_assistant_tool_call_groups_independent — an earlier orphan does not steal responses meant for a later assistant message.
  • test_assistant_without_tool_calls_untouched — assistant with no tool_calls is passed through unchanged.

Made with Cursor


Open in Devin Review

If a previous session was killed mid-turn (e.g. under memory pressure),
the persisted history can contain an assistant message whose tool_calls
were written without the matching tool-role responses. On resume, the
provider rejects the next request with `400 ... tool_call_ids did not
have response messages`, leaving the conversation permanently
unresumable.

normalize_history() now scans for orphan tool_call_ids and inserts a
short placeholder tool message for each before the request goes out.
The persisted history is untouched -- only the wire-shape sent to the
provider is patched, so existing well-formed turns are left as-is and
the next user message can proceed normally.

Fixes MoonshotAI#2336

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Session corruption under memory pressure: lost conversation + 400 tool_call response error on resume

1 participant