server: reserve context budget for DSML tool calls #105
Draft
gmontana wants to merge 1 commit into
For tool-enabled DeepSeek V4 chat requests, keep ordinary text generation out of the final 256 context tokens and allow that reserve only while a DSML tool call or partial tool-start marker is in progress. This mitigates antirez#48 without adding constrained decoding. Oversized tool arguments can still reach the hard context limit.
Summary
Mitigates #48 for the ds4 DeepSeek V4 Flash tool-enabled chat path by reserving the last 256 decode tokens for DSML tool-call closure.
For requests with tools, ordinary text generation stops at a soft limit before the hard context limit. If generation is already inside a DSML tool call, or is at the soft limit with a partial tool-start marker at the end of the generated text, decoding can use the reserve.
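The decision described above can be sketched roughly as follows. This is an illustrative Python model, not the PR's code: the function names, the argument list, and the `TOOL_START` placeholder string are assumptions made for the example (the real DSML tool-start marker is not given in this description). Only the 256-token reserve comes from the PR text.

```python
RESERVE_TOKENS = 256  # reserve size stated in the PR description

# Hypothetical placeholder: the actual DSML tool-start marker is not shown here.
TOOL_START = "<tool_call>"


def ends_with_partial_marker(text, marker=TOOL_START):
    """Return True if text ends with a non-empty proper prefix of marker."""
    for n in range(len(marker) - 1, 0, -1):
        if text.endswith(marker[:n]):
            return True
    return False


def may_decode(pos, ctx_size, has_tools, in_tool_call, generated_text):
    """Decide whether one more token may be decoded at position pos.

    Assumed model: ordinary text stops at the soft limit; the reserve is
    usable only inside a tool call or while a tool-start marker is forming.
    """
    if pos >= ctx_size:
        return False  # hard context limit always wins
    if not has_tools:
        return True  # requests without tools keep the full context
    soft_limit = max(0, ctx_size - RESERVE_TOKENS)
    if pos < soft_limit:
        return True  # below the soft limit nothing changes
    return in_tool_call or ends_with_partial_marker(generated_text)
```

With a 1024-token context, this model stops plain text at position 768, while a request that is inside a tool call (or has just emitted the first characters of the marker) may continue up to position 1023.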
Notes
This is intentionally a small server-side budget guard, not constrained decoding.
Tool-enabled chats may finish with finish_reason=length up to 256 tokens earlier than the hard context limit when no tool call is in progress.
Oversized tool arguments can still reach the hard limit and hit the existing "unterminated tool call" backstop.
The KV continued-checkpoint gate is left with its previous condition; this PR only changes decode budget decisions.
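To make the three outcomes in these notes concrete, here is a toy simulation in Python. It is a sketch under stated assumptions, not the server's decode loop: `run_decode`, its parameters, and the outcome labels `"tool_call_closed"` and `"unterminated_tool_call"` are hypothetical names chosen for the example, and a tool call is modeled simply as a start position plus the number of tokens it needs to close.

```python
RESERVE_TOKENS = 256  # reserve size stated in the PR description


def run_decode(ctx_size, start_pos, tool_start=None, tool_len=0):
    """Return (stop_position, outcome) for a toy token stream.

    tool_start: position at which a tool call opens, or None for plain text.
    tool_len:   tokens the tool call needs before it is closed.
    Outcome labels are illustrative, not the server's actual strings.
    """
    soft_limit = max(0, ctx_size - RESERVE_TOKENS)
    pos = start_pos
    while True:
        in_call = tool_start is not None and pos >= tool_start
        if in_call and pos >= tool_start + tool_len:
            return pos, "tool_call_closed"
        if pos >= ctx_size:
            # Hard limit: an open tool call falls through to the backstop.
            outcome = "unterminated_tool_call" if in_call else "length"
            return pos, outcome
        if pos >= soft_limit and not in_call:
            return pos, "length"  # plain text stops early at the soft limit
        pos += 1


# Plain text stops 256 tokens before the hard limit.
print(run_decode(4096, 1000))
# A tool call that opened before the soft limit closes inside the reserve.
print(run_decode(4096, 1000, tool_start=3800, tool_len=200))
# Oversized arguments still run into the hard limit.
print(run_decode(4096, 1000, tool_start=3800, tool_len=500))
```

In this model the first call stops at position 3840 with a length finish, the second closes its tool call at position 4000, and the third reaches position 4096 with the call still open.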
Validation
git diff --check
make ds4_test
./ds4_test --server
make ds4-server
I did not add a full model-backed reproduction test; the new coverage is a focused server-unit test for the decode budget logic and soft-limit transition.