feat(server): optionally prepend tool schemas to the system prompt #68

Open
unsaltedbutter-ai wants to merge 1 commit into antirez:main from unsaltedbutter-ai:feat/tools-prepend-system

@unsaltedbutter-ai

Summary

Adds --tools-prepend-system, an opt-in flag that renders ds4's auto-injected tool boilerplate at the start of the system prompt instead of after the client's system content. The model sees the same content in the same role; only the order changes. With the flag enabled, the client's system content (including any dynamic tail it contains) sits immediately before <|User|>, so the existing --kv-cache-boundary-trim-tokens knob can chop just those bytes and keep the tool-schema region in the cached prefix.

Motivation

render_chat_prompt_text injects "## Tools" and "### Available Tool Schemas" instructions into every prompt whose request carries a tools field. Today that block is appended after the client's own system content, producing this layout:

<|begin▁of▁sentence|>
[client's system content, possibly with a small dynamic tail]
[ds4-injected tool schemas + boilerplate]
<|User|>
[user message]

For tool-using agents whose system message has a small dynamic field, the variable bytes sit between two stable regions. There is no length-based cache cut that excludes the dynamic content while including any of the tool schemas. Every cross-session lookup misses for the same structural reason regardless of how --kv-cache-boundary-trim-tokens is tuned.

For example, a Hermes-style agent typically emits a few lines at the end of its system prompt summarizing the session context:

Conversation started: Sunday, Oct 31, 2008 03:00 PM
Model: deepseek-v4-flash
Provider: custom

Host: macOS (26.4.1)
User home directory: /Users/snaka
Current working directory: /Users/snaka/Documents

The first line varies between sessions (timestamp) while the rest is stable. Under the current layout that block lives in the middle of the rendered system region, with the much larger tool-schema region after it. No prefix cut can include any of the tool schemas without also including the variable timestamp, so the KV cache cannot be reused across sessions.

With --tools-prepend-system, the layout becomes:

<|begin▁of▁sentence|>
[ds4-injected tool schemas + boilerplate]   <-- byte-stable across requests
[client's system content, dynamic tail intact]
<|User|>
[user message]

The dynamic content is now at the tail of the system block. The length-based heuristic in kv_cache_store_len produces a cut that falls before the variable bytes (at a lower token offset), so the cached prefix is identical across sessions and cross-session cache hits start working without any other knob changes.
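
The effect of the reordering can be illustrated with a toy model. The sketch below is not ds4 code: toy_render, shared_prefix_for_layout, and the sample strings are hypothetical stand-ins that only join a client system string and a tool block in one of the two orders, then measure how many leading bytes two sessions share.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Length in bytes of the longest common prefix of two strings. */
static size_t common_prefix_len(const char *a, const char *b) {
    size_t n = 0;
    while (a[n] != '\0' && a[n] == b[n]) n++;
    return n;
}

/* Toy stand-in for the system region (hypothetical, not ds4's renderer):
 * joins the client system text and the tool block in the requested order. */
static void toy_render(char *out, size_t cap, const char *client_system,
                       const char *tool_block, int tools_first) {
    const char *first  = tools_first ? tool_block : client_system;
    const char *second = tools_first ? client_system : tool_block;
    /* The comparison is only meaningful if nothing is truncated. */
    assert(strlen(first) + strlen(second) + strlen("<|User|>") < cap);
    snprintf(out, cap, "%s%s<|User|>", first, second);
}

/* Bytes shared by two sessions whose client system text differs only
 * in its dynamic tail, for the chosen layout. */
static size_t shared_prefix_for_layout(const char *client_a,
                                       const char *client_b,
                                       const char *tool_block,
                                       int tools_first) {
    char a[512], b[512];
    toy_render(a, sizeof a, client_a, tool_block, tools_first);
    toy_render(b, sizeof b, client_b, tool_block, tools_first);
    return common_prefix_len(a, b);
}
```

Called with two client strings that differ only in a trailing timestamp, the appended layout (tools_first = 0) yields a shared prefix that ends inside the client text and contains no tool bytes, while the prepended layout (tools_first = 1) yields a shared prefix that is longer by exactly the length of the tool block.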

What this changes

  • New CLI flag --tools-prepend-system. Off by default. Adds a bool tools_prepend_system field on server_config and struct server.
  • render_chat_prompt_text gains a bool tools_prepend_system parameter that selects whether the tool block is appended (default) or prepended.
  • Named macros TOOLS_AFTER_SYSTEM and TOOLS_BEFORE_SYSTEM are used at call sites so the position argument reads naturally next to the existing DS4_THINK_* constants instead of as a bare boolean.
  • A forward-declared accessor server_tools_prepend_system(const server *s) lets parsers read the flag before the full struct server definition is in scope, matching the existing pattern used by tool_memory_attach_to_messages and friends.
  • Startup logs "tool schemas rendered before client system content" when the flag is active, so operators can confirm the flag took effect without running a request first.
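
The macro and accessor items above could look roughly like the following. This is a sketch under assumptions: the PR names TOOLS_AFTER_SYSTEM, TOOLS_BEFORE_SYSTEM, and server_tools_prepend_system but does not show their definitions here, so the false/true values, the typedef, the single-field struct, and the helper tool_position_for are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values: only the macro names come from the PR text. */
#define TOOLS_AFTER_SYSTEM  false
#define TOOLS_BEFORE_SYSTEM true

/* Forward declaration: code placed before the full definition sees
 * only an opaque type plus the accessor prototype. */
typedef struct server server;
bool server_tools_prepend_system(const server *s);

/* Hypothetical parser-side helper. It compiles without knowing the
 * layout of struct server because it goes through the accessor. */
static bool tool_position_for(const server *s) {
    return server_tools_prepend_system(s) ? TOOLS_BEFORE_SYSTEM
                                          : TOOLS_AFTER_SYSTEM;
}

/* Full definition, later in the file. Only the field this PR adds is
 * shown; the real struct has many more members. */
struct server {
    bool tools_prepend_system;
};

bool server_tools_prepend_system(const server *s) {
    assert(s != NULL);
    return s->tools_prepend_system;
}
```

The named constants matter mostly at call sites: a call that passes TOOLS_AFTER_SYSTEM documents itself, whereas a bare false would force the reader to look up the parameter.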

Behavior

  • Off by default. Existing behavior is unchanged when the flag is not passed.
  • When enabled, the auto-injected tool block is placed at the start of the system content. All other rendering (BOS, role markers, thinking tags, tool-result rendering) is unchanged.
  • Has no effect on requests without tools (no tool block to position).
  • Has no effect on raw /v1/completions requests (no chat-template tool injection runs).
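
The first three rules above can be condensed into a small model. The function below is a hypothetical simplification, not the body of render_chat_prompt_text: it covers only the system region, takes the tool block as an already-rendered string, and leaves out BOS, role markers, thinking tags, and the /v1/completions path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the system region only.
 * Returns what snprintf returns: the length the output needs. */
static int toy_render_system(char *out, size_t cap,
                             const char *client_system,
                             const char *tool_block,
                             bool tools_prepend_system) {
    assert(out != NULL && cap > 0 && client_system != NULL);

    /* No tools in the request: there is no block to position, so the
     * flag has no effect. */
    if (tool_block == NULL || strlen(tool_block) == 0)
        return snprintf(out, cap, "%s", client_system);

    /* Same two pieces in both branches; only the order differs. */
    if (tools_prepend_system)
        return snprintf(out, cap, "%s%s", tool_block, client_system);
    return snprintf(out, cap, "%s%s", client_system, tool_block);
}
```

Because both branches emit the same two strings, the rendered length is identical with the flag on or off, which matches the claim that the model sees the same content and only the order changes.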

Benchmark

Workload: an agent with a 16K-token system prompt that ends with a small dynamic tail (a Conversation started: … block of roughly 50 tokens immediately before the user message). Two back-to-back sessions with clean context.

Server launched with --kv-cache-boundary-trim-tokens 1000 --kv-cache-boundary-align-tokens 2048 --tools-prepend-system and the disk cache directory unmodified between the two sessions.

| Metric | First request (cold) | Second request (cache hit) | Saved |
|---|---|---|---|
| Prefill wall | 45.7 s | 6.9 s | 38.8 s |
| Prefill tokens | 16081 | 1703 | 14378 tokens served from cache |
| Total request | ~47 s | ~26 s | ~21 s |
| Cache file | written | reused | one file serves every subsequent session |

Without --tools-prepend-system, the same two-session test produces zero cross-session cache hits. The cold cut lands after the dynamic tail (because the appended tool block sits between the tail and the user marker), so the cached prefix includes the timestamp, the SHA always differs across sessions, and "prompt done" reports the full ~45 s on every request.

Tests

  • test_render_chat_prompt_text_tools_prepend_system (new) asserts both renderings contain the same set of marker strings and that the tool block precedes the client system content when the flag is true and follows it when the flag is false.
  • All existing tests pass; call sites updated to pass TOOLS_AFTER_SYSTEM explicitly so behavior is unchanged when the flag is off.

Run with:

make test
# or, to run just the model-free tests:
./ds4_test --server
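
For readers who want a feel for the ordering assertion described in the Tests section, here is a minimal sketch of that kind of check. It is not the code of test_render_chat_prompt_text_tools_prepend_system: the helper names are hypothetical, and the rendered prompt is passed in as a plain string rather than produced by ds4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when both needles occur in hay and the first occurrence of
 * `first` starts before the first occurrence of `second`. */
static bool appears_before(const char *hay, const char *first,
                           const char *second) {
    const char *p = strstr(hay, first);
    const char *q = strstr(hay, second);
    return p != NULL && q != NULL && p < q;
}

/* Hypothetical check: does the order of the tool marker and the client
 * marker in `rendered` match what the flag value promises? */
static bool order_matches_flag(const char *rendered,
                               const char *tool_marker,
                               const char *client_marker,
                               bool tools_prepend_system) {
    assert(rendered != NULL && tool_marker != NULL && client_marker != NULL);
    return tools_prepend_system
        ? appears_before(rendered, tool_marker, client_marker)
        : appears_before(rendered, client_marker, tool_marker);
}
```

Requiring both markers to be present before comparing positions means a rendering that drops one of them fails the check instead of passing by accident.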

Usage

./ds4-server [other flags] --tools-prepend-system

Startup will print:

ds4-server: tool schemas rendered before client system content

Recommended companion flags for tool-using agents with a small dynamic tail:

./ds4-server [other flags] \
  --kv-disk-dir /path/to/cache \
  --kv-disk-space-mb 8192 \
  --kv-cache-boundary-align-tokens 2048 \
  --kv-cache-boundary-trim-tokens 1000 \
  --tools-prepend-system

The trim of 1000 with align of 2048 lands the cold cut at a prefill-batch-aligned position safely below typical dynamic tails. Adjust trim downward if the dynamic tail is smaller, or upward if it is larger.
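
To reason about how trim and align interact, the arithmetic can be sketched as below. This is an assumption about what the two flags mean (drop trim tokens from the end, then round down to a multiple of align), inferred from their names; it is not taken from kv_cache_store_len, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed reading of the two knobs, NOT ds4's actual heuristic:
 * remove trim_tokens from the end of the prompt, then round the
 * remainder down to a multiple of align_tokens. */
static size_t assumed_cold_cut(size_t prompt_tokens, size_t trim_tokens,
                               size_t align_tokens) {
    assert(align_tokens > 0);
    if (prompt_tokens <= trim_tokens) return 0;
    size_t kept = prompt_tokens - trim_tokens;
    return kept - (kept % align_tokens);
}

/* With the tool block prepended, the session-specific tokens are the
 * last `variable_tokens` of the prompt (dynamic tail plus user turn).
 * The cached prefix is reusable only if the cut stops before them. */
static bool cut_excludes_variable_tail(size_t cut, size_t prompt_tokens,
                                       size_t variable_tokens) {
    return cut + variable_tokens <= prompt_tokens;
}
```

Under this assumed model a 16000-token prompt with trim 1000 and align 2048 is cut at 14336 tokens (7 × 2048), well clear of a tail of a few hundred tokens. The figures are illustrative and are not meant to reproduce the benchmark table.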

ds4 auto-injects "## Tools" and "### Available Tool Schemas"
instructions into the rendered system prompt whenever a request
includes tools. That block is appended after the client's own system
content, which puts any dynamic tail in the client's system message
(for example a per-request timestamp emitted by an agent runtime) in
the middle of the system region rather than at the end. Length-based
KV cache cuts then land inside the variable bytes and cross-session
lookups always miss.

Add --tools-prepend-system (off by default). When set, the auto-
injected tool block is rendered at the start of the system content
instead of after it. The model still sees the same content in the
same role; only the order changes. The client's dynamic tail becomes
the last thing before <|User|>, so --kv-cache-boundary-trim-tokens
can chop just those bytes and keep the entire tool-schema region in
the cached prefix.

Two named macros TOOLS_AFTER_SYSTEM and TOOLS_BEFORE_SYSTEM keep call
sites readable next to the existing DS4_THINK_* constants.

Startup logs "tool schemas rendered before client system content"
when the flag is active so operators can verify it without running
a request.

Includes a unit test asserting both renderings contain the same set
of marker strings and that the tool block sits before or after the
client system content according to the flag.
@unsaltedbutter-ai
Author

I found this one while testing #66 inside Hermes. I was confused why Hermes new-session prompts weren't being cached. Yes, the date-time was one reason, but even when we trimmed off enough to go past the date-time suffix, I was still not getting cache hits. I dumped the raw query and saw that ds4 was appending the tool summary to the system prompt, putting the date-time near the middle of the system prompt, so trimming xxx bytes off the prompt wasn't enough to get a cache hit on the new-session prompt. By prepending the tool block to the system prompt, we have a larger fully static section, and the chop can be smaller and still cache hit. There is a pretty big performance win on Hermes Agent.

@unsaltedbutter-ai
Author

@antirez prepending the tools block that ds4 generates lets ds4 cache-hit cold queries from agents like Hermes. They send tools (which currently get appended), and somewhere in the middle is the current date/time. The date/time triggers a cache miss, but when the tools block is prepended and we trim a few tokens off the tail of the request, we wind up with a clean cache hit on all cold requests from agents. I've seen tens of seconds saved at the start of a new conversation.

If we combine this feature with #66, caching only the system prompt on cold requests, we can use a smaller --kv-cache-boundary-trim-tokens to cache more of the cold prompt, saving a bit more time.

I made this and #66 both opt-in with a new flag, but if you'd prefer either of these to be default behavior, let me know and I'll rework the PRs.
