fix(router): production-ready request router + auto-size batch for embedding/rerank#10104
Open
richiejp wants to merge 2 commits into
Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch, which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time.

Router-facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto-iterate path.

New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert, including the backend-load-failure path that previously vanished into an xlog.Warn, with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces.

Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache.

Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes.

Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
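To illustrate the trimming described in this commit, here is a minimal sketch of trimming a conversation to a token budget. It is not the PR's code: `message`, `render`, `countTokens`, and `trimToBudget` are hypothetical names, the whitespace-based counter stands in for a real tokenizer, the one-line `render` stands in for the model's chat template, and dropping the oldest messages first is an assumption the commit message does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// message is a hypothetical chat turn; the real router types may differ.
type message struct {
	Role    string
	Content string
}

// render is a hypothetical stand-in for applying the model's chat template.
func render(m message) string { return m.Role + ": " + m.Content }

// countTokens is a hypothetical stand-in for a tokenizer call. It counts
// whitespace-separated words, which a real tokenizer would not do.
func countTokens(s string) int { return len(strings.Fields(s)) }

// trimToBudget keeps the longest suffix of msgs whose rendered form fits
// within budget tokens. Keeping the newest messages is an assumption.
func trimToBudget(msgs []message, budget int) []message {
	used := 0
	start := len(msgs)
	for i := len(msgs) - 1; i >= 0; i-- {
		n := countTokens(render(msgs[i]))
		if used+n > budget {
			break
		}
		used += n
		start = i
	}
	return msgs[start:]
}

func main() {
	msgs := []message{
		{"user", "one two three"},
		{"assistant", "four five"},
		{"user", "six"},
	}
	kept := trimToBudget(msgs, 5)
	fmt.Println(len(kept)) // prints 2
}
```

Counting tokens on the rendered text rather than the raw content matters because the template adds its own tokens; a budget checked against raw content alone could still exceed the batch.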
…dels

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.

Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.

Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
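The precedence described in this commit (explicit batch first, then context size for single-pass usecases, then the default) can be sketched roughly as follows. This is an illustration, not the PR's `EffectiveBatchSize`: the `modelOpts` struct, its fields, and `effectiveBatchSize` are hypothetical, and only raising the batch above the default (never lowering it for small contexts) is an assumption.

```go
package main

import "fmt"

// defaultBatch is the 512-token default named in the commit message.
const defaultBatch = 512

// modelOpts is a hypothetical stand-in for the relevant config fields.
type modelOpts struct {
	ContextSize int  // effective context window, in tokens
	Batch       int  // explicit `batch:` value; 0 means unset
	SinglePass  bool // true for embedding, rerank and score usecases
}

// effectiveBatchSize returns the explicit batch if set, the context size
// for single-pass usecases with a larger context, else the default.
func effectiveBatchSize(o modelOpts) int {
	if o.Batch > 0 {
		return o.Batch // an explicit batch always wins
	}
	if o.SinglePass && o.ContextSize > defaultBatch {
		return o.ContextSize // the whole input must fit in one batch
	}
	return defaultBatch
}

func main() {
	embedder := modelOpts{ContextSize: 8192, SinglePass: true}
	pinned := modelOpts{ContextSize: 8192, Batch: 256, SinglePass: true}
	chat := modelOpts{ContextSize: 8192}

	fmt.Println(effectiveBatchSize(embedder)) // prints 8192
	fmt.Println(effectiveBatchSize(pinned))   // prints 256
	fmt.Println(effectiveBatchSize(chat))     // prints 512
}
```

Keeping the decision in one function means every caller that needs the decode window gets the same answer, which is the stated reason for the extraction.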
The llama-cpp smoke failure comes from a new test added in this PR: it depends on a llama-cpp fix that is in this PR but not in the existing image.
Summary
Makes the LLM-based request router production-ready and fixes a batch-sizing bug that silently capped embedding/rerank inputs at 512 tokens.
Batch sizing (embeddings & rerank)
- Embedding and rerank models pool over the whole input in a single physical batch (`n_ubatch`). With `n_batch` left at the 512 default, the backend rejected longer inputs with `"input is too large to process"`, silently capping an 8k/32k-context embedder at 512 tokens.
- `n_batch` is now sized up to the context window for embedding/rerank/score usecases; an explicit `batch:` still wins.
- `EffectiveContextSize`/`EffectiveBatchSize` extracted from `grpcModelOpts` as the single source of truth for the effective decode window.
Router production-readiness
- Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's `n_batch`, so long probes can't overflow the backend.
- A missing `chat_message` template is a hard error at router build time rather than a runtime surprise.
- Router-facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve `ModelConfig` per call, so a model installed after startup no longer binds a stub `Backend=""` config and falls into the loader's auto-iterate path.
- A `vector_store` trace is recorded on every Search/Insert, including the backend-load-failure path that previously vanished into a log warning, with outcome tagging (`hit`/`miss`/`empty_store`/`backend_load_error`/…). Misleading `similarity:0`/`input_tokens_count:0` dropped from non-hit and text-mode traces.
Other fixes
- llama-cpp `TokenizeString` now reads the correct `prompt` JSON key (the original tokenize bug); `ModelTokenize` nil-guard.
- Gallery `local-store-development` aliases to `local-store` so the master image satisfies embedding-cache lookups.
- PII `route_local` → `allow` (REST/UI/docs in sync); model-editor small-screen layout and several config-editor template/dropdown fixes.
Tests