
fix(router): production-ready request router + auto-size batch for embedding/rerank#10104

Open
richiejp wants to merge 2 commits into mudler:master from richiejp:fix/router-config

Conversation

@richiejp
Collaborator

Summary

Makes the LLM-based request router production-ready and fixes a batch-sizing bug that silently capped embedding/rerank inputs at 512 tokens.

Batch sizing (embeddings & rerank)

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With n_batch left at the 512 default, the backend rejected longer inputs with "input is too large to process" — silently capping an 8k/32k-context embedder at 512 tokens.

  • n_batch is now sized up to the context window for embedding/rerank/score usecases; an explicit batch: still wins.
  • EffectiveContextSize/EffectiveBatchSize extracted from grpcModelOpts as the single source of truth for the effective decode window.
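The sizing rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `ModelConfig` fields, the `Usecase` constants, and the `defaultContext` value are assumptions made for the example. Only the names `EffectiveContextSize`/`EffectiveBatchSize`, the 512 default, and the "explicit batch: wins" rule come from the description.

```go
package main

import "fmt"

// Usecase is a hypothetical stand-in for the real usecase flags.
type Usecase int

const (
	UsecaseChat Usecase = iota
	UsecaseEmbedding
	UsecaseRerank
	UsecaseScore
)

const (
	defaultBatch   = 512  // the n_batch default named in the description
	defaultContext = 4096 // assumption: fallback when no context size is configured
)

// ModelConfig is a simplified, hypothetical view of a model config.
// A zero value means "not set in YAML".
type ModelConfig struct {
	ContextSize int
	Batch       int
	Usecase     Usecase
}

// EffectiveContextSize returns the configured context, or a fallback.
func EffectiveContextSize(c ModelConfig) int {
	if c.ContextSize > 0 {
		return c.ContextSize
	}
	return defaultContext
}

// EffectiveBatchSize returns the batch size to hand to the backend.
// An explicit batch always wins; single-pass usecases are sized to the
// context window; everything else keeps the default.
func EffectiveBatchSize(c ModelConfig) int {
	if c.Batch > 0 {
		return c.Batch
	}
	switch c.Usecase {
	case UsecaseEmbedding, UsecaseRerank, UsecaseScore:
		return EffectiveContextSize(c)
	}
	return defaultBatch
}

func main() {
	embedder := ModelConfig{ContextSize: 8192, Usecase: UsecaseEmbedding}
	pinned := ModelConfig{ContextSize: 8192, Batch: 1024, Usecase: UsecaseEmbedding}
	chat := ModelConfig{ContextSize: 8192, Usecase: UsecaseChat}
	fmt.Println(EffectiveBatchSize(embedder), EffectiveBatchSize(pinned), EffectiveBatchSize(chat))
}
```

Keeping both helpers as pure functions of the config is what lets other callers, such as the router's trimming budget, reuse the same numbers.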

Router production-readiness

  • Token-accurate trimming: conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the (now context-scaled) n_batch, so long probes can't overflow the backend.
  • Fail fast on misconfiguration: a missing chat_message template is a hard error at router build time rather than a runtime surprise.
  • Late-installed models: router factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call, so a model installed after startup no longer binds a stub Backend="" config and silently falls into the loader's auto-iterate path.
  • Vector-store observability: a vector_store trace is recorded on every Search/Insert — including the backend-load-failure path that previously vanished into a log warning — with outcome tagging (hit/miss/empty_store/backend_load_error/…). Misleading similarity:0 / input_tokens_count:0 dropped from non-hit and text-mode traces.

Other fixes

  • llama-cpp TokenizeString now reads the correct prompt JSON key (the original tokenize bug); ModelTokenize nil-guard.
  • Gallery local-store-development aliases to local-store so the master image satisfies embedding-cache lookups.
  • Non-fatal MITM proxy startup; PII route_local renamed to allow (REST/UI/docs in sync); model-editor small-screen layout and several config-editor template/dropdown fixes.

Tests

  • e2e router specs (classification + long-conversation trim) and an e2e-aio regression embedding a >512-token input (AIO embedding model switched to nomic-embed-text-v1.5 for a context large enough to exercise the batch fix).
  • Unit coverage for context/batch sizing, lazy factories, vector_store traces, gallery dev-alias resolution; Playwright trace-badge + scroll regressions.

richiejp added 2 commits May 31, 2026 08:35
Conversation trimming runs through the classifier model's chat template
and trims by exact token count, sized to the model's n_batch which is
now scaled to context so long probes can't crash the backend. Missing
chat_message templates are a hard error at router build time. Router-
facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve
ModelConfig per call so a model installed post-startup doesn't bind a
stub Backend="" config and silently fall into the loader's auto-
iterate path.
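
The per-call re-resolution described above can be illustrated with a small sketch. The `ConfigLoader` type, its methods, and the string-returning `Embedder` signature are hypothetical and exist only to show the closure pattern; they are not the types used in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// ModelConfig is a simplified, hypothetical model config.
type ModelConfig struct {
	Name    string
	Backend string
}

// ConfigLoader is a hypothetical registry of installed model configs.
type ConfigLoader struct {
	mu      sync.RWMutex
	configs map[string]ModelConfig
}

func NewConfigLoader() *ConfigLoader {
	return &ConfigLoader{configs: map[string]ModelConfig{}}
}

func (l *ConfigLoader) Install(cfg ModelConfig) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.configs[cfg.Name] = cfg
}

func (l *ConfigLoader) Get(name string) (ModelConfig, bool) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	cfg, ok := l.configs[name]
	return cfg, ok
}

// Embedder returns the backend it would dispatch to, for demonstration.
type Embedder func(text string) (string, error)

// NewLazyEmbedder captures the loader and model name, not a config
// snapshot, so every call sees the config as it is at call time.
func NewLazyEmbedder(l *ConfigLoader, model string) Embedder {
	return func(text string) (string, error) {
		cfg, ok := l.Get(model)
		if !ok || cfg.Backend == "" {
			return "", fmt.Errorf("model %q has no resolved backend", model)
		}
		return cfg.Backend, nil
	}
}

func main() {
	loader := NewConfigLoader()
	embed := NewLazyEmbedder(loader, "embedder")

	_, err := embed("hello") // model not installed yet
	fmt.Println(err != nil)

	loader.Install(ModelConfig{Name: "embedder", Backend: "llama-cpp"})
	backend, _ := embed("hello") // same closure now sees the new config
	fmt.Println(backend)
}
```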

New 'vector_store' backend trace recorded inside localVectorStore on
every Search/Insert — including the backend-load-failure path that
previously vanished into an xlog.Warn — with outcome tagging
(hit/miss/empty_store/backend_load_error/find_error/insert_error/ok).
Companion cleanup drops misleading similarity:0 and input_tokens_count:0
from non-hit and text-mode traces.

Gallery local-store-development aliases to 'local-store' so the master
image satisfies pkg/model.LocalStoreBackend lookups from the embedding
cache.

Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key
(the original bug); ModelTokenize nil-guard; non-fatal mitm proxy
startup; PII 'route_local' renamed to 'allow' with docs/UI in sync;
model-editor footer no longer eats the edit area on small screens;
several config-editor template/dropdown/section fixes.

Tests: e2e router specs (casual/code-hint + long-conversation trim),
vector_store trace specs, lazy-factory specs, gallery dev-alias
resolution, Playwright trace badge + scroll regression.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…dels

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.

Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.

Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp
Collaborator Author

The llama-cpp smoke failure comes from a new test added in this PR: the test depends on a llama-cpp fix that is also in this PR but is not yet in the existing image.

