feat(server): optionally prepend tool schemas to the system prompt #68
unsaltedbutter-ai wants to merge 1 commit into
ds4 auto-injects "## Tools" and "### Available Tool Schemas" instructions into the rendered system prompt whenever a request includes tools. That block is appended after the client's own system content, which puts any dynamic tail in the client's system message (for example a per-request timestamp emitted by an agent runtime) in the middle of the system region rather than at the end. Length-based KV cache cuts then land inside the variable bytes and cross-session lookups always miss.

Add --tools-prepend-system (off by default). When set, the auto-injected tool block is rendered at the start of the system content instead of after it. The model still sees the same content in the same role; only the order changes. The client's dynamic tail becomes the last thing before <|User|>, so --kv-cache-boundary-trim-tokens can chop just those bytes and keep the entire tool-schema region in the cached prefix.

Two named macros TOOLS_AFTER_SYSTEM and TOOLS_BEFORE_SYSTEM keep call sites readable next to the existing DS4_THINK_* constants. Startup logs "tool schemas rendered before client system content" when the flag is active so operators can verify it without running a request.

Includes a unit test asserting both renderings contain the same set of marker strings and that the tool block sits before or after the client system content according to the flag.
I found this one while testing #66 inside Hermes. I was confused about why Hermes new-session prompts weren't being cached. The date-time was one reason, but even when we trimmed off enough to go past the date-time suffix, I was still not getting cache hits. I dumped the raw query and saw that ds4 was appending the tool summary to the system prompt, which put the date-time near the middle of the system prompt, so trimming xxx bytes off the prompt wasn't enough to get a cache hit on the new-session prompt. By prepending the tool block to the system prompt, we have a larger fully static section, and the chop can be smaller and still cache hit. There is a pretty big performance win on Hermes Agent.
@antirez prepending the tools block that ds4 generates lets ds4 cache-hit cold queries from agents like Hermes. They send up tools (which currently get appended), and somewhere in the middle is the current date/time. The date/time triggers a cache miss, but when the tools block is prepended and we trim a few tokens off the tail of the request, we wind up with a clean cache hit on all cold requests from agents. I've seen tens of seconds saved at the start of a new conversation. If we combine this feature with #66, caching only the system prompt on cold, we can use a smaller

I made this and #66 both opt-in with a new flag, but if you'd prefer either of these to be default behavior, let me know and I'll rework the PRs.
Summary
Adds `--tools-prepend-system`, an opt-in flag that renders ds4's auto-injected tool boilerplate at the start of the system prompt instead of after the client's system content. The model sees the same content in the same role; only the order changes. With the flag enabled, the client's system content (including any dynamic tail it contains) sits immediately before `<|User|>`, so the existing `--kv-cache-boundary-trim-tokens` knob can chop just those bytes and keep the tool-schema region in the cached prefix.

Motivation
`render_chat_prompt_text` injects `## Tools` and `### Available Tool Schemas` instructions into every prompt whose request carries a `tools` field. Today that block is appended after the client's own system content, producing this layout:

For tool-using agents whose system message has a small dynamic field, the variable bytes sit between two stable regions. There is no length-based cache cut that excludes the dynamic content while including any of the tool schemas. Every cross-session lookup misses for the same structural reason regardless of how `--kv-cache-boundary-trim-tokens` is tuned.

For example, a Hermes-style agent typically emits a few lines at the end of its system prompt summarizing the session context:
The first line varies between sessions (timestamp) while the rest is stable. Under the current layout that block lives in the middle of the rendered system region, with the much larger tool-schema region after it. There is no horizontal cut that captures any tool schemas without also capturing the variable timestamp, so the KV cache cannot be reused across sessions.
With `--tools-prepend-system`, the layout becomes:

The dynamic content is now at the tail of the system block. The length-based heuristic in `kv_cache_store_len` produces a cut below the variable bytes, and cross-session cache hits start working without any other knob changes.

What this changes
- New flag `--tools-prepend-system`. Off by default.
- Adds a `bool tools_prepend_system` field on `server_config` and `struct server`.
- `render_chat_prompt_text` gains a `bool tools_prepend_system` parameter that selects whether the tool block is appended (default) or prepended.
- `TOOLS_AFTER_SYSTEM` and `TOOLS_BEFORE_SYSTEM` are used at call sites so the position argument reads naturally next to the existing `DS4_THINK_*` constants instead of as a bare boolean.
- `server_tools_prepend_system(const server *s)` lets parsers read the flag before the full `struct server` definition is in scope, matching the existing pattern used by `tool_memory_attach_to_messages` and friends.
- Startup logs `tool schemas rendered before client system content` when the flag is active, so operators can confirm the flag took effect without running a request first.

Behavior
`/v1/completions` requests are unaffected (no chat-template tool injection runs).

Benchmark
Workload: an agent with a 16K-token system prompt that ends with a small dynamic tail (a `Conversation started: …` block of roughly 50 tokens immediately before the user message). Two back-to-back sessions with clean context.

Server launched with `--kv-cache-boundary-trim-tokens 1000 --kv-cache-boundary-align-tokens 2048 --tools-prepend-system` and the disk cache directory unmodified between the two sessions.

Without `--tools-prepend-system`, the same two-session test produces zero cross-session cache hits. The cold cut lands above the dynamic tail (because the appended tool block sits between the tail and the user marker), the SHA always differs across sessions, and `prompt done` reports the full ~45 s on every request.

Tests
- `test_render_chat_prompt_text_tools_prepend_system` (new) asserts both renderings contain the same set of marker strings and that the tool block precedes the client system content when the flag is true and follows it when the flag is false.
- `TOOLS_AFTER_SYSTEM` explicitly so behavior is unchanged when the flag is off.

Run with:
Usage
Startup will print:
Recommended companion flags for tool-using agents with a small dynamic tail:
The trim of 1000 with align of 2048 lands the cold cut at a prefill-batch-aligned position safely below typical dynamic tails. Adjust trim downward if the dynamic tail is smaller, or upward if it is larger.