feat(search): configurable remote embedding provider for codegraph embed #1716
…h embed
Lets codegraph embed call a self-hosted or third-party OpenAI-compatible /embeddings endpoint instead of only the bundled local model. Set embeddings.provider: "openai" and llm.baseUrl to point at any server implementing that request/response shape (text-embeddings-inference, Ollama, LM Studio, vLLM, etc.); this reuses the existing llm.apiKey/apiKeyCommand secret resolution. No new npm dependencies; built on Node's global fetch.
README.md and docs/guides/configuration.md updates land in a follow-up commit on this branch alongside the search-side fix (docs check acknowledged).
Closes #1713
Impact: 19 functions changed, 13 affected
…ote embedding provider
codegraph search (semantic/hybrid modes) and the semantic_search MCP tool always embedded the query text with the local model, even when the stored embeddings were built via a remote provider. That produced a dimension mismatch or nonsense similarity scores instead of actually querying the remote endpoint. Query embedding now uses the same provider that produced the stored embeddings.
docs check acknowledged: README.md/configuration.md land in the next commit.
Impact: 5 functions changed, 9 affected
Covers embeddings.provider/llm.baseUrl config, the CODEGRAPH_LLM_BASE_URL env override, and that codegraph search auto-routes queries through the same remote endpoint.
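As a reading aid, here is a minimal sketch of the config shape these commits describe, written as a TypeScript object. Only the field names (embeddings.provider, embeddings.model, llm.baseUrl, llm.apiKey, llm.apiKeyCommand, llm.requestTimeoutMs) and the CODEGRAPH_LLM_BASE_URL variable come from this PR. The type name, the example URL and model name, and the applyBaseUrlOverride helper are hypothetical, and the assumption that the env variable simply replaces llm.baseUrl is an inference from the word "override", not the project's confirmed behavior.

```typescript
// Hypothetical type: the field names come from this PR, the type itself is illustrative.
interface RemoteEmbeddingConfigSketch {
  embeddings: { provider?: "openai"; model?: string };
  llm: {
    baseUrl?: string;
    apiKey?: string;
    apiKeyCommand?: string;
    requestTimeoutMs?: number;
  };
}

// Example values only: the URL and model name are placeholders.
const exampleConfig: RemoteEmbeddingConfigSketch = {
  embeddings: { provider: "openai", model: "my-model" },
  llm: { baseUrl: "http://localhost:8080/v1", requestTimeoutMs: 120000 },
};

// Assumption: when CODEGRAPH_LLM_BASE_URL is set, it replaces llm.baseUrl.
function applyBaseUrlOverride(
  config: RemoteEmbeddingConfigSketch,
  env: Record<string, string | undefined>,
): RemoteEmbeddingConfigSketch {
  const override = env["CODEGRAPH_LLM_BASE_URL"];
  if (!override) return config;
  return { ...config, llm: { ...config.llm, baseUrl: override } };
}
```

The helper returns a new object rather than mutating the loaded config, so the original values stay available for diagnostics.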
Greptile Summary
This PR adds an OpenAI-compatible remote embedding provider to
Confidence Score: 5/5
Safe to merge. The remote provider path is well-isolated, and all three concerns raised in earlier review rounds have been properly addressed in the final code. The new remote embedding path is additive and gated entirely behind an opt-in config field. The critical correctness concern (routing search-time query embedding to the same backend that produced the stored vectors) is handled by reading storedProvider from embedding_meta rather than the live config. Validation at embed time (provider + model required together), error wrapping throughout embedRemote, and thorough integration test coverage (config-drift regression, full build+search round-trip, timeout abort, dimension consistency) give high confidence the feature behaves correctly across its advertised scenarios. No files require special attention.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant CLI as codegraph embed/search
participant Config as infrastructure/config
participant Gen as generator.ts
participant Remote as providers/remote.ts
participant Endpoint as Remote /embeddings
participant DB as graph.db (embedding_meta)
participant Semantic as search/semantic.ts
Note over CLI,DB: embed path (provider="openai")
CLI->>Config: loadConfig()
Config-->>CLI: "{embeddings.provider="openai", llm.baseUrl, llm.apiKey}"
CLI->>Gen: "buildEmbeddings(root, model, db, {remote: opts})"
Gen->>Remote: "embedRemote(texts, {baseUrl, model, apiKey, timeoutMs})"
loop batches of 32
Remote->>Endpoint: "POST /embeddings {model, input:[...]}"
Endpoint-->>Remote: "{data:[{embedding:[...], index:N}]}"
end
Remote-->>Gen: "{vectors: Float32Array[], dim}"
Gen->>DB: "persistEmbeddings(..., provider="openai")"
DB-->>Gen: "stored embedding_meta: {model, dim, provider="openai"}"
Note over CLI,Semantic: search path
CLI->>Semantic: "searchData(query, dbPath, {config})"
Semantic->>DB: prepareSearch - reads embedding_meta.provider
DB-->>Semantic: "storedProvider="openai", storedModel="my-model""
Semantic->>Remote: embedQuery - embedRemote([query], resolveRemoteEmbeddingOptions(config, model))
Remote->>Endpoint: "POST /embeddings {model, input:[query]}"
Endpoint-->>Remote: "{data:[{embedding:[...], index:0}]}"
Remote-->>Semantic: "{vectors:[queryVec], dim}"
Semantic-->>CLI: ranked similarity results
Reviews (8): Last reviewed commit: "test: verify requestTimeoutMs is actuall..."
for (let i = 0; i < texts.length; i += REMOTE_BATCH_SIZE) {
  const batch = texts.slice(i, i + REMOTE_BATCH_SIZE);

  let response: Response;
  try {
    response = await fetch(url, {
      method: 'POST',
      headers,
      body: JSON.stringify({ model: options.model, input: batch }),
    });
  } catch (err: unknown) {
    throw new EngineError(
      `Failed to reach remote embedding endpoint at ${url}: ${err instanceof Error ? err.message : String(err)}`,
      { cause: err instanceof Error ? err : undefined },
    );
  }
No request timeout on remote fetch calls
The fetch calls carry no AbortController timeout. If a self-hosted server becomes unresponsive mid-run (e.g., OOM-killed while processing a large batch), the process hangs indefinitely — there's no way for embedRemote to surface a timeout error and no user-visible progress to indicate the hang. For large codebases this is especially risky because many sequential batches are issued; a hang on batch 20 of 50 leaves nothing actionable for the user.
Consider wrapping each fetch with a scoped AbortController with a configurable or hard-coded ceiling (e.g., 120 s per batch), and throwing an EngineError with the elapsed time on abort.
Fixed in 91eb0da — each batch request is now wrapped in an AbortController with a configurable timeout (llm.requestTimeoutMs, default 120s). On abort, embedRemote throws an EngineError naming the endpoint, elapsed timeout, and which batch stalled, instead of hanging indefinitely. Added a test that simulates a hung request and asserts the abort/error path.
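The per-batch timeout described in this reply can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from 91eb0da: postWithTimeout and FetchLike are hypothetical names, a plain Error stands in for the project's EngineError, and the fetch implementation is passed in as a parameter so the abort path can be exercised without a network.

```typescript
// Minimal fetch-like signature (hypothetical) so the sketch runs without a network.
type FetchLike = (
  url: string,
  init: { method: string; body: string; signal: AbortSignal },
) => Promise<{ ok: boolean }>;

async function postWithTimeout(
  fetchImpl: FetchLike,
  url: string,
  body: unknown,
  timeoutMs: number,
  batchIndex: number,
): Promise<{ ok: boolean }> {
  // One controller per request, so a stalled batch cannot affect later ones.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, {
      method: "POST",
      body: JSON.stringify(body),
      signal: controller.signal,
    });
  } catch (err) {
    if (controller.signal.aborted) {
      // Name the endpoint, the timeout, and the batch so the failure is actionable.
      throw new Error(
        `Remote embedding request to ${url} timed out after ${timeoutMs}ms (batch ${batchIndex})`,
      );
    }
    throw err;
  } finally {
    // Always clear the timer so a fast response does not leave it pending.
    clearTimeout(timer);
  }
}
```

Checking controller.signal.aborted in the catch block distinguishes a timeout from an ordinary network failure, which is rethrown unchanged.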
const sorted = [...json.data].sort((a, b) => a.index - b.index);
for (const item of sorted) {
  const vec = Float32Array.from(item.embedding);
  if (dim === 0) dim = vec.length;
  results.push(vec);
}
Vector dimension not validated across batches
dim is captured from the first vector of the first batch (if (dim === 0) dim = vec.length) and never checked again. If a misbehaving server returns embeddings of different lengths in a later batch, the returned dim value is wrong — multi-dimensional cosine similarity later will silently produce garbage or throw a TypedArray range error far from the source. Adding a consistency assert per vector after the first is a cheap guard that makes the failure message actionable.
Fixed in 91eb0da — every vector after the first is now checked against the batch's established dim; a mismatch throws an EngineError naming the expected/actual dimensions and the offending response item index, instead of silently keeping the wrong dim. Added a test with a response containing mixed-length embeddings.
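A minimal sketch of the consistency check this reply describes, assuming a hypothetical toVectors helper and a plain Error in place of EngineError. The expectedDim parameter is an illustrative way to carry the dimension from one batch to the next; the actual fix may track it differently.

```typescript
// Shape of one entry in the response's data array, as shown in the diff above.
interface EmbeddingItem {
  index: number;
  embedding: number[];
}

// Hypothetical helper: converts one batch of items and enforces a single dimension.
// Pass the dim returned for the previous batch as expectedDim to check across batches.
function toVectors(
  items: EmbeddingItem[],
  expectedDim = 0,
): { vectors: Float32Array[]; dim: number } {
  let dim = expectedDim;
  const vectors: Float32Array[] = [];
  // Restore input order: the server is not assumed to return items sorted.
  const sorted = [...items].sort((a, b) => a.index - b.index);
  for (const item of sorted) {
    const vec = Float32Array.from(item.embedding);
    if (dim === 0) {
      dim = vec.length; // first vector establishes the dimension
    } else if (vec.length !== dim) {
      throw new Error(
        `Embedding dimension mismatch at item ${item.index}: expected ${dim}, got ${vec.length}`,
      );
    }
    vectors.push(vec);
  }
  return { vectors, dim };
}
```

Failing at the first mismatched vector keeps the error close to the offending response instead of surfacing later during similarity scoring.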
async function embedQuery(
  texts: string[],
  config: CodegraphConfig,
  modelKey: string | null,
  storedModel: string | null,
): Promise<{ vectors: Float32Array[]; dim: number }> {
  const isKnownLocalModel = modelKey != null && modelKey in MODELS;
  if (!isKnownLocalModel && config.embeddings?.provider === 'openai') {
    const remoteModel = modelKey || storedModel;
    if (remoteModel) {
      return embedRemote(texts, resolveRemoteEmbeddingOptions(config, remoteModel));
    }
  }
  return embed(texts, modelKey ?? undefined);
Search routing uses live config rather than stored embed-time provider
embedQuery decides which backend to call by inspecting config.embeddings?.provider at search time, not by reading what was recorded in embedding_meta at embed time. This creates a silent-mismatch window:
- If a user embeds with provider: "openai" then later removes that field (e.g., on a CI machine), isKnownLocalModel stays false and remoteModel is the stored name, but config.embeddings?.provider !== 'openai', so the call falls through to the local model. The dimension check (checkDimensionMismatch) will catch this when dimensions differ, but if the local and remote models happen to share the same output dimension (both 768d, for instance), search silently computes cosine similarity between incompatible vector spaces, returning misleading ranked results.
Recording the provider value ("openai" / null) in embedding_meta at persist time and reading it back in prepareSearch would make routing fully deterministic regardless of config drift.
Fixed in c383e9b — the provider that produced the stored embeddings ("openai" or local) is now recorded in embedding_meta at embed time and threaded through prepareSearch as storedProvider. embedQuery routes off that value instead of the live config.embeddings.provider, so a config drift after embed ran can no longer cause a silent fallback to the local model. llm.baseUrl/apiKey are still resolved from the live config — only the routing decision itself is now embed-time-sourced. Added a regression test that rebuilds embeddings remotely, drifts the search-time config back to local, and asserts the query is still routed to the remote endpoint.
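The routing rule in this reply can be sketched as a small pure function. Only storedProvider, storedModel, and the llm.baseUrl lookup are named in the thread; the interfaces, the routeQuery name, and the use of null to stand for the local model are hypothetical simplifications, not the code from c383e9b.

```typescript
// Hypothetical shapes; only storedProvider/storedModel are named in the thread.
interface StoredEmbeddingMeta {
  storedProvider: "openai" | null; // null stands for the local model here
  storedModel: string | null;
}
interface LiveConfig {
  embeddings?: { provider?: "openai" };
  llm?: { baseUrl?: string };
}
interface QueryRoute {
  backend: "remote" | "local";
  baseUrl: string | undefined;
}

function routeQuery(meta: StoredEmbeddingMeta, config: LiveConfig): QueryRoute {
  // The routing decision reads only what was recorded at embed time.
  if (meta.storedProvider === "openai") {
    // The destination is still resolved from the live config.
    return { backend: "remote", baseUrl: config.llm?.baseUrl };
  }
  // config.embeddings.provider is deliberately never consulted.
  return { backend: "local", baseUrl: undefined };
}
```

Because the function never reads config.embeddings.provider, clearing that field after embedding cannot change which backend serves the query.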
Codegraph Impact Analysis
28 functions changed → 16 callers affected across 9 files
…ider The llm.* fields (baseUrl, apiKey, apiKeyCommand) are now reused by embeddings.provider: "openai", not just chat-completion features — the LLM credentials section didn't mention that reuse.
const sorted = [...json.data].sort((a, b) => a.index - b.index);
for (const item of sorted) {
  const vec = Float32Array.from(item.embedding);
  if (dim === 0) dim = vec.length;
  results.push(vec);
}
Missing embedding field validation before Float32Array.from()
The code checks that json.data.length === batch.length, but doesn't verify that each item actually contains a valid embedding array. If a server returns a conforming-length response where one entry is missing the embedding field (e.g. { "index": 0 } with no embedding), Float32Array.from(undefined) throws a raw TypeError: undefined is not iterable rather than a wrapped EngineError. This produces a confusing and untraceable failure for the user.
Suggested change:
- const sorted = [...json.data].sort((a, b) => a.index - b.index);
- for (const item of sorted) {
-   const vec = Float32Array.from(item.embedding);
-   if (dim === 0) dim = vec.length;
-   results.push(vec);
- }
+ const sorted = [...json.data].sort((a, b) => a.index - b.index);
+ for (const item of sorted) {
+   if (!Array.isArray(item.embedding)) {
+     throw new EngineError(
+       `Remote embedding endpoint ${url} returned an item with a missing or non-array "embedding" field (index ${item.index})`,
+     );
+   }
+   const vec = Float32Array.from(item.embedding);
+   if (dim === 0) dim = vec.length;
+   results.push(vec);
+ }
Good catch — fixed in f1030ad. Added an explicit Array.isArray(item.embedding) check before Float32Array.from() that throws a wrapped EngineError naming the offending item's index, instead of letting a malformed entry surface as a raw TypeError: undefined is not iterable. Added a test with a response item missing the embedding field.
…te embedding provider (#1716)
Greptile review: embedRemote's fetch calls had no timeout, so an unresponsive self-hosted server would hang the process indefinitely with no actionable error. Wrap each batch request in an AbortController with a configurable ceiling (llm.requestTimeoutMs, default 120s) and throw an EngineError describing the elapsed time on abort.
Also validate that every vector in a response has the same dimension as the first one seen; a misbehaving server that returns mixed-length embeddings across (or within) a batch now fails fast with a clear error instead of silently corrupting `dim` and producing garbage similarity scores or a TypedArray range error far from the source.
Impact: 4 functions changed, 9 affected
…ive config (#1716)
Greptile review: embedQuery() decided which backend to call by inspecting config.embeddings?.provider at search time rather than what actually produced the stored embeddings. If a DB was embedded with provider: "openai" and the config later drifted (e.g. the field was cleared on a different machine), routing silently fell back to the local model. When the local and remote models happen to share an output dimension, the existing dimension-mismatch guard wouldn't catch it, and search would compute cosine similarity between incompatible vector spaces without any error.
Record the provider ("openai" or omitted for local) in embedding_meta at persist time, thread it through prepareSearch as `storedProvider`, and key embedQuery's routing decision off that instead of the live config. The live config is still used to resolve where to send the request (llm.baseUrl/apiKey), just not whether to send it there at all.
Impact: 7 functions changed, 11 affected
…e provider (#1716)
Greptile review: the response-shape check only verified json.data.length matched the batch size, not that each item actually carried a valid embedding array. A conforming-length response with one malformed entry (e.g. { "index": 0 } with no embedding) made Float32Array.from(undefined) throw a raw TypeError instead of the wrapped EngineError used everywhere else in this module, producing a confusing, untraceable failure.
Impact: 1 functions changed, 9 affected
…cal switch Verifies that a full local-model rebuild never inherits a stale 'openai' provider marker from a prior remote-provider build, since buildEmbeddings always wipes embedding_meta before repopulating it.
Re: the "stale provider row" finding in the summary ( This isn't actually reachable: Added |
codegraph models has no validate() gate like embed does, so a config with embeddings.provider set but embeddings.model unset would interpolate the null model straight into the banner text. Impact: 2 functions changed, 0 affected
…options fakeCtx's llm defaults omitted requestTimeoutMs, so timeoutMs was undefined on both sides of the toEqual comparison and the assertion passed without ever checking the value was threaded through.
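The pitfall this commit fixes (an equality assertion that passes because both sides are undefined) can be reproduced in a few lines. This is a sketch under stated assumptions: buggyResolve, expectedFrom, and optionsEqual are hypothetical stand-ins for the resolver under test, the expectation built from the fixture, and the test framework's deep-equality matcher.

```typescript
interface FakeCtx {
  llm: { baseUrl: string; requestTimeoutMs?: number };
}
interface RemoteOptions {
  baseUrl: string;
  timeoutMs: number | undefined;
}

// Hypothetical buggy resolver: it forgets to thread the timeout through.
function buggyResolve(ctx: FakeCtx): RemoteOptions {
  return { baseUrl: ctx.llm.baseUrl, timeoutMs: undefined };
}

// The expectation a test would build from the same fixture.
function expectedFrom(ctx: FakeCtx): RemoteOptions {
  return { baseUrl: ctx.llm.baseUrl, timeoutMs: ctx.llm.requestTimeoutMs };
}

// Field-by-field comparison standing in for a deep-equality matcher.
function optionsEqual(a: RemoteOptions, b: RemoteOptions): boolean {
  return a.baseUrl === b.baseUrl && a.timeoutMs === b.timeoutMs;
}

const fixtureWithoutTimeout: FakeCtx = { llm: { baseUrl: "http://localhost:8080/v1" } };
const fixtureWithTimeout: FakeCtx = {
  llm: { baseUrl: "http://localhost:8080/v1", requestTimeoutMs: 5000 },
};

// undefined === undefined, so the buggy resolver slips through unnoticed.
const bugMissed = optionsEqual(
  buggyResolve(fixtureWithoutTimeout),
  expectedFrom(fixtureWithoutTimeout),
);
// 5000 !== undefined, so the same bug is caught once the fixture sets a value.
const bugCaught = !optionsEqual(
  buggyResolve(fixtureWithTimeout),
  expectedFrom(fixtureWithTimeout),
);
```

Giving the fixture an explicit, non-default value is what turns the assertion from a tautology into a real check.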
Addressed both minor items from the latest summary:
Summary
- codegraph embed can now call a self-hosted or third-party OpenAI-compatible /embeddings endpoint instead of only the bundled local model: set embeddings.provider: "openai" and llm.baseUrl (works with text-embeddings-inference, Ollama, LM Studio, vLLM, or OpenAI itself; reuses the existing llm.apiKey/apiKeyCommand secret resolution). No new npm dependencies; built on Node's global fetch.
- codegraph search (semantic/hybrid modes) and the semantic_search MCP tool now embed the query through the same remote provider that produced the stored embeddings, instead of always falling back to the local model (would otherwise silently produce a dimension mismatch or nonsense scores).
- README.md and docs/guides/configuration.md, including the new CODEGRAPH_LLM_BASE_URL env override.
Test plan
- npm test: full suite (197 files / 3320 tests) passes
- npm run lint / tsc --noEmit: clean
- codegraph embed, codegraph search (semantic/hybrid/keyword), codegraph models, and codegraph config --explain against a real local HTTP server implementing the OpenAI-compatible /embeddings shape: confirmed correct request bodies, auth header, model name propagation, and that search results route through the remote endpoint end-to-end
Closes #1713