feat(search): configurable remote embedding provider for codegraph embed #1716

Merged
carlos-alm merged 10 commits into main from feat/remote-embedding-provider on Jul 2, 2026
Conversation

@carlos-alm

Contributor

Summary

  • codegraph embed can now call a self-hosted or third-party OpenAI-compatible /embeddings endpoint instead of only the bundled local model — set embeddings.provider: "openai" and llm.baseUrl (works with text-embeddings-inference, Ollama, LM Studio, vLLM, or OpenAI itself; reuses the existing llm.apiKey/apiKeyCommand secret resolution). No new npm dependencies — built on Node's global fetch.
  • codegraph search (semantic/hybrid modes) and the semantic_search MCP tool now embed the query through the same remote provider that produced the stored embeddings, instead of always falling back to the local model (would otherwise silently produce a dimension mismatch or nonsense scores).
  • Docs updated in README.md and docs/guides/configuration.md, including the new CODEGRAPH_LLM_BASE_URL env override.
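
The configuration described above might look roughly like the following sketch. The interface name, the `resolveBaseUrl` helper, the example URL, and the example model name are all illustrative assumptions, not taken from the PR; only the key names `embeddings.provider`, `embeddings.model`, `llm.baseUrl`, `llm.apiKey`, and the `CODEGRAPH_LLM_BASE_URL` variable come from the description. The sketch also assumes the environment variable takes precedence over the file value, which is what "env override" suggests.

```typescript
// Hypothetical shape covering only the keys named in this PR.
interface RemoteEmbeddingConfigSketch {
  embeddings: { provider: "openai" | null; model: string | null };
  llm: { baseUrl: string | null; apiKey: string | null };
}

// Illustrative values: a local OpenAI-compatible server and an example model name.
const exampleConfig: RemoteEmbeddingConfigSketch = {
  embeddings: { provider: "openai", model: "nomic-embed-text" },
  llm: { baseUrl: "http://localhost:11434/v1", apiKey: null },
};

// Assumed precedence: a non-empty CODEGRAPH_LLM_BASE_URL wins over llm.baseUrl.
function resolveBaseUrl(
  config: RemoteEmbeddingConfigSketch,
  env: Record<string, string | undefined>,
): string | null {
  const fromEnv = env["CODEGRAPH_LLM_BASE_URL"];
  return fromEnv !== undefined && fromEnv !== "" ? fromEnv : config.llm.baseUrl;
}
```

Passing the environment in as a plain record keeps the helper easy to test without touching the real process environment.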

Test plan

  • npm test — full suite (197 files / 3320 tests) passes
  • npm run lint / tsc --noEmit — clean
  • Manual smoke test: ran codegraph embed, codegraph search (semantic/hybrid/keyword), codegraph models, and codegraph config --explain against a real local HTTP server implementing the OpenAI-compatible /embeddings shape — confirmed correct request bodies, auth header, model name propagation, and that search results route through the remote endpoint end-to-end
  • Unit/integration tests added for the new provider module, config env override, CLI validation, and the search-side dispatch fix

Closes #1713

…h embed

Lets codegraph embed call a self-hosted or third-party OpenAI-compatible
/embeddings endpoint instead of only the bundled local model. Set
embeddings.provider: "openai" and llm.baseUrl to point at any server
implementing that request/response shape (text-embeddings-inference, Ollama,
LM Studio, vLLM, etc.) — reuses the existing llm.apiKey/apiKeyCommand secret
resolution. No new npm dependencies; built on Node's global fetch.

README.md and docs/guides/configuration.md updates land in a follow-up
commit on this branch alongside the search-side fix (docs check acknowledged).

Closes #1713

Impact: 19 functions changed, 13 affected
…ote embedding provider

codegraph search (semantic/hybrid modes) and the semantic_search MCP tool
always embedded the query text with the local model, even when the stored
embeddings were built via a remote provider. That produced a dimension
mismatch or nonsense similarity scores instead of actually querying the
remote endpoint. Query embedding now uses the same provider that produced
the stored embeddings.

docs check acknowledged — README.md/configuration.md land in the next commit.

Impact: 5 functions changed, 9 affected
Covers embeddings.provider/llm.baseUrl config, the CODEGRAPH_LLM_BASE_URL
env override, and that codegraph search auto-routes queries through the
same remote endpoint.
@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Contributor

Greptile Summary

This PR adds an OpenAI-compatible remote embedding provider to codegraph embed and codegraph search/semantic_search MCP tool. When embeddings.provider: "openai" is set, calls are routed to any configurable llm.baseUrl endpoint (self-hosted or cloud) instead of the bundled local @huggingface/transformers model.

  • Adds src/domain/search/providers/remote.ts — a batched, timeout-guarded HTTP client for /embeddings, with all three issues raised in prior review rounds addressed (AbortController timeout, cross-batch dimension consistency, and embedding field validation).
  • Persists the embed-time provider ("openai") in embedding_meta so searchData/multiSearchData route the query vector through the same backend, even if the live config has changed since embed ran.
  • Adds CODEGRAPH_LLM_BASE_URL env override and llm.requestTimeoutMs config key; updates docs and tests comprehensively (4 new integration test suites, 197-file suite continues to pass).
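
As a rough illustration of the batched `/embeddings` client summarized above, the sketch below splits input texts into request bodies of the OpenAI-compatible shape `{ model, input }`. The function names `joinEmbeddingsUrl` and `buildBatchBodies` are hypothetical, and the batch size of 32 is taken from the sequence diagram in this summary; how the real `embeddingsEndpoint` helper normalizes the URL is not shown in the PR, so the trailing-slash handling here is an assumption.

```typescript
// Batch size as shown in the sequence diagram ("loop batches of 32").
const SKETCH_BATCH_SIZE = 32;

interface EmbeddingRequestBody {
  model: string;
  input: string[];
}

// Hypothetical helper: append /embeddings to a base URL, tolerating trailing slashes.
function joinEmbeddingsUrl(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "") + "/embeddings";
}

// Split texts into consecutive batches and wrap each one in a request body.
function buildBatchBodies(
  texts: string[],
  model: string,
  batchSize: number = SKETCH_BATCH_SIZE,
): EmbeddingRequestBody[] {
  const bodies: EmbeddingRequestBody[] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    bodies.push({ model, input: texts.slice(i, i + batchSize) });
  }
  return bodies;
}
```

Each body would then be sent with one POST request, so 70 texts produce three requests of 32, 32, and 6 inputs.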

Confidence Score: 5/5

Safe to merge. The remote provider path is well-isolated, and all three concerns raised in earlier review rounds have been properly addressed in the final code.

The new remote embedding path is additive and gated entirely behind an opt-in config field. The critical correctness concern — routing search-time query embedding to the same backend that produced the stored vectors — is handled by reading storedProvider from embedding_meta rather than the live config. Validation at embed time (provider + model required together), error wrapping throughout embedRemote, and thorough integration test coverage (config-drift regression, full build+search round-trip, timeout abort, dimension consistency) give high confidence the feature behaves correctly across its advertised scenarios.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/domain/search/providers/remote.ts | New core module: batched HTTP client for OpenAI-compatible /embeddings endpoints. Handles AbortController timeout per batch, response-order sorting by index, per-vector dimension consistency check, and missing-embedding-field guard; all three prior review concerns are addressed. |
| src/domain/search/search/semantic.ts | Adds embedQuery() that routes query vectors via storedProvider (embed-time truth from embedding_meta) rather than the live config, preventing silent local-model fallback after config drift. Both searchData and multiSearchData use it. |
| src/domain/search/generator.ts | Routes embed calls to embedRemote or local embed based on options.remote; persists provider="openai" in embedding_meta when remote was used; uses DEFAULT_REMOTE_CONTEXT_WINDOW for unknown remote models. |
| src/domain/search/search/prepare.ts | Extends PreparedSearch with storedModel and storedProvider read from embedding_meta; threads both into the return value for use by embedQuery routing. |
| src/infrastructure/config.ts | Adds embeddings.provider and llm.requestTimeoutMs to DEFAULTS; adds CODEGRAPH_LLM_BASE_URL to env override map. Changes are additive and non-breaking. |
| src/cli/commands/embed.ts | Adds validate() check for unsupported provider values and missing model when provider is set; resolves RemoteEmbeddingOptions and passes to buildEmbeddings when provider="openai". |
| tests/search/embedding-remote-provider.test.ts | Comprehensive unit tests for embedRemote and resolveRemoteEmbeddingOptions: covers happy path, index sorting, batching, auth header omission, non-2xx errors, network failure, timeout abort, dimension mismatch, and missing embedding field. |
| tests/search/embedding-remote-search.test.ts | Integration tests verifying end-to-end remote routing for searchData: confirms fetch is called for both index and query steps, and that routing still goes remote when the live embeddings.provider config drifts to null after embed ran. |
| src/types.ts | Adds provider: string |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as codegraph embed/search
    participant Config as infrastructure/config
    participant Gen as generator.ts
    participant Remote as providers/remote.ts
    participant Endpoint as Remote /embeddings
    participant DB as graph.db (embedding_meta)
    participant Semantic as search/semantic.ts

    Note over CLI,DB: embed path (provider="openai")
    CLI->>Config: loadConfig()
    Config-->>CLI: "{embeddings.provider="openai", llm.baseUrl, llm.apiKey}"
    CLI->>Gen: "buildEmbeddings(root, model, db, {remote: opts})"
    Gen->>Remote: "embedRemote(texts, {baseUrl, model, apiKey, timeoutMs})"
    loop batches of 32
        Remote->>Endpoint: "POST /embeddings {model, input:[...]}"
        Endpoint-->>Remote: "{data:[{embedding:[...], index:N}]}"
    end
    Remote-->>Gen: "{vectors: Float32Array[], dim}"
    Gen->>DB: "persistEmbeddings(..., provider="openai")"
    DB-->>Gen: "stored embedding_meta: {model, dim, provider="openai"}"

    Note over CLI,Semantic: search path
    CLI->>Semantic: "searchData(query, dbPath, {config})"
    Semantic->>DB: prepareSearch - reads embedding_meta.provider
    DB-->>Semantic: "storedProvider="openai", storedModel="my-model""
    Semantic->>Remote: embedQuery - embedRemote([query], resolveRemoteEmbeddingOptions(config, model))
    Remote->>Endpoint: "POST /embeddings {model, input:[query]}"
    Endpoint-->>Remote: "{data:[{embedding:[...], index:0}]}"
    Remote-->>Semantic: "{vectors:[queryVec], dim}"
    Semantic-->>CLI: ranked similarity results

Reviews (8): Last reviewed commit: "test: verify requestTimeoutMs is actuall..." | Re-trigger Greptile

Comment on lines +76 to +91
for (let i = 0; i < texts.length; i += REMOTE_BATCH_SIZE) {
  const batch = texts.slice(i, i + REMOTE_BATCH_SIZE);

  let response: Response;
  try {
    response = await fetch(url, {
      method: 'POST',
      headers,
      body: JSON.stringify({ model: options.model, input: batch }),
    });
  } catch (err: unknown) {
    throw new EngineError(
      `Failed to reach remote embedding endpoint at ${url}: ${err instanceof Error ? err.message : String(err)}`,
      { cause: err instanceof Error ? err : undefined },
    );
  }

Contributor

P2 No request timeout on remote fetch calls

The fetch calls carry no AbortController timeout. If a self-hosted server becomes unresponsive mid-run (e.g., OOM-killed while processing a large batch), the process hangs indefinitely — there's no way for embedRemote to surface a timeout error and no user-visible progress to indicate the hang. For large codebases this is especially risky because many sequential batches are issued; a hang on batch 20 of 50 leaves nothing actionable for the user.

Consider wrapping each fetch with a scoped AbortController with a configurable or hard-coded ceiling (e.g., 120 s per batch), and throwing an EngineError with the elapsed time on abort.

Contributor Author

Fixed in 91eb0da — each batch request is now wrapped in an AbortController with a configurable timeout (llm.requestTimeoutMs, default 120s). On abort, embedRemote throws an EngineError naming the endpoint, elapsed timeout, and which batch stalled, instead of hanging indefinitely. Added a test that simulates a hung request and asserts the abort/error path.
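
A per-request timeout of the kind described in this reply can be sketched as follows. This is not the code from 91eb0da: `postWithTimeout` and the `FetchLike` type are hypothetical names, a plain `Error` stands in for the project's `EngineError`, and the fetch implementation is injected so the sketch can be exercised without a network.

```typescript
// Minimal subset of the fetch contract that this sketch relies on.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// POST a JSON body and abort the request if it exceeds timeoutMs.
async function postWithTimeout(
  url: string,
  body: unknown,
  timeoutMs: number,
  fetchImpl: FetchLike,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`Remote endpoint ${url} responded with HTTP ${response.status}`);
    }
    return await response.json();
  } catch (err) {
    // Translate an abort into an actionable message; rethrow anything else unchanged.
    if (controller.signal.aborted) {
      throw new Error(`Request to ${url} timed out after ${timeoutMs} ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer); // never leave the timer pending after the request settles
  }
}
```

With Node's global fetch, a caller could pass `(url, init) => fetch(url, init)` as `fetchImpl`. Creating a fresh `AbortController` per call keeps one stalled batch from cancelling later ones.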

Comment on lines +110 to +115
const sorted = [...json.data].sort((a, b) => a.index - b.index);
for (const item of sorted) {
  const vec = Float32Array.from(item.embedding);
  if (dim === 0) dim = vec.length;
  results.push(vec);
}

Contributor

P2 Vector dimension not validated across batches

dim is captured from the first vector of the first batch (if (dim === 0) dim = vec.length) and never checked again. If a misbehaving server returns embeddings of different lengths in a later batch, the returned dim value is wrong — multi-dimensional cosine similarity later will silently produce garbage or throw a TypedArray range error far from the source. Adding a consistency assert per vector after the first is a cheap guard that makes the failure message actionable.

Contributor Author

Fixed in 91eb0da — every vector after the first is now checked against the batch's established dim; a mismatch throws an EngineError naming the expected/actual dimensions and the offending response item index, instead of silently keeping the wrong dim. Added a test with a response containing mixed-length embeddings.
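
The consistency check described here could look something like the sketch below, which fixes the dimension from the first vector seen and rejects any later vector of a different length, whether it arrives in the same batch or a later one. The name `collectVectors` is hypothetical, and a plain `Error` stands in for `EngineError`.

```typescript
// Convert batches of raw embeddings to Float32Array and enforce one shared dimension.
function collectVectors(batches: number[][][]): { vectors: Float32Array[]; dim: number } {
  const vectors: Float32Array[] = [];
  let dim = 0;
  batches.forEach((batch, batchIndex) => {
    batch.forEach((embedding, itemIndex) => {
      const vec = Float32Array.from(embedding);
      if (vectors.length === 0) {
        dim = vec.length; // the first vector establishes the expected dimension
      } else if (vec.length !== dim) {
        throw new Error(
          `Embedding dimension mismatch in batch ${batchIndex}, item ${itemIndex}: ` +
            `expected ${dim}, got ${vec.length}`,
        );
      }
      vectors.push(vec);
    });
  });
  return { vectors, dim };
}
```

Failing at the point of ingestion keeps the error close to the misbehaving response rather than surfacing later during similarity scoring.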

Comment on lines +16 to +29
async function embedQuery(
  texts: string[],
  config: CodegraphConfig,
  modelKey: string | null,
  storedModel: string | null,
): Promise<{ vectors: Float32Array[]; dim: number }> {
  const isKnownLocalModel = modelKey != null && modelKey in MODELS;
  if (!isKnownLocalModel && config.embeddings?.provider === 'openai') {
    const remoteModel = modelKey || storedModel;
    if (remoteModel) {
      return embedRemote(texts, resolveRemoteEmbeddingOptions(config, remoteModel));
    }
  }
  return embed(texts, modelKey ?? undefined);

Contributor

P2 Search routing uses live config rather than stored embed-time provider

embedQuery decides which backend to call by inspecting config.embeddings?.provider at search time, not by reading what was recorded in embedding_meta at embed time. This creates a silent-mismatch window:

  • If a user embeds with provider: "openai" then later removes that field (e.g., on a CI machine), isKnownLocalModel stays false and remoteModel is the stored name — but config.embeddings?.provider !== 'openai', so the call falls through to the local model. The dimension check (checkDimensionMismatch) will catch this when dimensions differ, but if the local and remote models happen to share the same output dimension (both 768d, for instance), search silently computes cosine similarity between incompatible vector spaces, returning misleading ranked results.

Recording the provider value ("openai" / null) in embedding_meta at persist time and reading it back in prepareSearch would make routing fully deterministic regardless of config drift.

Contributor Author

Fixed in c383e9b — the provider that produced the stored embeddings ("openai" or local) is now recorded in embedding_meta at embed time and threaded through prepareSearch as storedProvider. embedQuery routes off that value instead of the live config.embeddings.provider, so a config drift after embed ran can no longer cause a silent fallback to the local model. llm.baseUrl/apiKey are still resolved from the live config — only the routing decision itself is now embed-time-sourced. Added a regression test that rebuilds embeddings remotely, drifts the search-time config back to local, and asserts the query is still routed to the remote endpoint.
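
The routing rule described in this reply can be illustrated with a small decision function that looks only at what was stored at embed time. This is a simplified sketch, not the actual `embedQuery` from c383e9b: the names `EmbeddingMetaSketch`, `QueryBackend`, and `chooseQueryBackend` are hypothetical, and the real function may apply additional checks (such as the known-local-model test in the excerpt under review).

```typescript
// What embedding_meta is described as recording at embed time.
interface EmbeddingMetaSketch {
  model: string | null;
  provider: "openai" | null; // null means the local model produced the vectors
}

type QueryBackend =
  | { kind: "remote"; model: string }
  | { kind: "local"; model: string | null };

// Decide where the query is embedded. The live config is deliberately not an input,
// so a later change to embeddings.provider cannot alter the outcome.
function chooseQueryBackend(meta: EmbeddingMetaSketch, modelOverride: string | null): QueryBackend {
  const model = modelOverride ?? meta.model;
  if (meta.provider === "openai" && model) {
    return { kind: "remote", model };
  }
  return { kind: "local", model };
}
```

The live config would still supply `llm.baseUrl` and the API key once the remote branch is chosen, as the reply notes.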

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

Codegraph Impact Analysis

28 functions changed, 16 callers affected across 9 files

  • command.validate in src/cli/commands/embed.ts:52 (0 transitive callers)
  • validate in src/cli/commands/embed.ts:52 (0 transitive callers)
  • command.execute in src/cli/commands/embed.ts:70 (0 transitive callers)
  • execute in src/cli/commands/embed.ts:70 (0 transitive callers)
  • command.execute in src/cli/commands/models.ts:7 (0 transitive callers)
  • execute in src/cli/commands/models.ts:7 (0 transitive callers)
  • persistEmbeddings in src/domain/search/generator.ts:168 (2 transitive callers)
  • BuildEmbeddingsOptions.remote in src/domain/search/generator.ts:216 (0 transitive callers)
  • buildEmbeddings in src/domain/search/generator.ts:222 (1 transitive callers)
  • RemoteEmbeddingOptions.baseUrl in src/domain/search/providers/remote.ts:18 (0 transitive callers)
  • RemoteEmbeddingOptions.model in src/domain/search/providers/remote.ts:19 (0 transitive callers)
  • RemoteEmbeddingOptions.apiKey in src/domain/search/providers/remote.ts:20 (0 transitive callers)
  • RemoteEmbeddingOptions.timeoutMs in src/domain/search/providers/remote.ts:22 (0 transitive callers)
  • OpenAIEmbeddingItem.embedding in src/domain/search/providers/remote.ts:33 (0 transitive callers)
  • OpenAIEmbeddingItem.index in src/domain/search/providers/remote.ts:34 (0 transitive callers)
  • OpenAIEmbeddingResponse.data in src/domain/search/providers/remote.ts:38 (0 transitive callers)
  • embeddingsEndpoint in src/domain/search/providers/remote.ts:41 (6 transitive callers)
  • resolveRemoteEmbeddingOptions in src/domain/search/providers/remote.ts:52 (8 transitive callers)
  • embedRemote in src/domain/search/providers/remote.ts:77 (9 transitive callers)
  • PreparedSearch.storedModel in src/domain/search/search/prepare.ts:25 (0 transitive callers)

…ider

The llm.* fields (baseUrl, apiKey, apiKeyCommand) are now reused by
embeddings.provider: "openai", not just chat-completion features —
the LLM credentials section didn't mention that reuse.
Comment on lines +110 to +115
const sorted = [...json.data].sort((a, b) => a.index - b.index);
for (const item of sorted) {
  const vec = Float32Array.from(item.embedding);
  if (dim === 0) dim = vec.length;
  results.push(vec);
}

Contributor

P1 Missing embedding field validation before Float32Array.from()

The code checks that json.data.length === batch.length, but doesn't verify that each item actually contains a valid embedding array. If a server returns a conforming-length response where one entry is missing the embedding field (e.g. { "index": 0 } with no embedding), Float32Array.from(undefined) throws a raw TypeError: undefined is not iterable rather than a wrapped EngineError. This produces a confusing and untraceable failure for the user.

Suggested change
- const sorted = [...json.data].sort((a, b) => a.index - b.index);
- for (const item of sorted) {
-   const vec = Float32Array.from(item.embedding);
-   if (dim === 0) dim = vec.length;
-   results.push(vec);
- }
+ const sorted = [...json.data].sort((a, b) => a.index - b.index);
+ for (const item of sorted) {
+   if (!Array.isArray(item.embedding)) {
+     throw new EngineError(
+       `Remote embedding endpoint ${url} returned an item with a missing or non-array "embedding" field (index ${item.index})`,
+     );
+   }
+   const vec = Float32Array.from(item.embedding);
+   if (dim === 0) dim = vec.length;
+   results.push(vec);
+ }

Contributor Author

Good catch — fixed in f1030ad. Added an explicit Array.isArray(item.embedding) check before Float32Array.from() that throws a wrapped EngineError naming the offending item's index, instead of letting a malformed entry surface as a raw TypeError: undefined is not iterable. Added a test with a response item missing the embedding field.
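
Putting the response checks from this thread together, a standalone parser might look like the sketch below: it verifies the item count, restores input order by `index`, and rejects items without an array-valued `embedding`. The name `parseEmbeddingItems` is hypothetical and a plain `Error` stands in for `EngineError`.

```typescript
// One entry of the OpenAI-compatible response's `data` array, typed defensively.
interface RawEmbeddingItem {
  index: number;
  embedding?: unknown;
}

// Validate a batch response and return vectors in the order the inputs were sent.
function parseEmbeddingItems(data: RawEmbeddingItem[], expectedCount: number): Float32Array[] {
  if (data.length !== expectedCount) {
    throw new Error(`Expected ${expectedCount} embeddings, received ${data.length}`);
  }
  // Servers may return items out of order; `index` identifies the originating input.
  const sorted = [...data].sort((a, b) => a.index - b.index);
  return sorted.map((item) => {
    if (!Array.isArray(item.embedding)) {
      throw new Error(`Item at index ${item.index} has a missing or non-array "embedding" field`);
    }
    return Float32Array.from(item.embedding as number[]);
  });
}
```

Typing `embedding` as `unknown` forces the check to happen before conversion, which is what turns a raw `TypeError` into a descriptive error.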

…te embedding provider (#1716)

Greptile review: embedRemote's fetch calls had no timeout, so an
unresponsive self-hosted server would hang the process indefinitely with
no actionable error. Wrap each batch request in an AbortController with a
configurable ceiling (llm.requestTimeoutMs, default 120s) and throw an
EngineError describing the elapsed time on abort.

Also validate that every vector in a response has the same dimension as
the first one seen; a misbehaving server that returns mixed-length
embeddings across (or within) a batch now fails fast with a clear error
instead of silently corrupting `dim` and producing garbage similarity
scores or a TypedArray range error far from the source.

Impact: 4 functions changed, 9 affected
…ive config (#1716)

Greptile review: embedQuery() decided which backend to call by inspecting
config.embeddings?.provider at search time rather than what actually
produced the stored embeddings. If a DB was embedded with provider:
"openai" and the config later drifted (e.g. the field was cleared on a
different machine), routing silently fell back to the local model. When
the local and remote models happen to share an output dimension, the
existing dimension-mismatch guard wouldn't catch it, and search would
compute cosine similarity between incompatible vector spaces without any
error.

Record the provider ("openai" or omitted for local) in embedding_meta at
persist time, thread it through prepareSearch as `storedProvider`, and
key embedQuery's routing decision off that instead of the live config.
The live config is still used to resolve where to send the request
(llm.baseUrl/apiKey), just not whether to send it there at all.

Impact: 7 functions changed, 11 affected
…e provider (#1716)

Greptile review: the response-shape check only verified json.data.length
matched the batch size, not that each item actually carried a valid
embedding array. A conforming-length response with one malformed entry
(e.g. { "index": 0 } with no embedding) made Float32Array.from(undefined)
throw a raw TypeError instead of the wrapped EngineError used everywhere
else in this module, producing a confusing, untraceable failure.

Impact: 1 functions changed, 9 affected
@carlos-alm

Contributor Author

@greptileai

…cal switch

Verifies that a full local-model rebuild never inherits a stale 'openai'
provider marker from a prior remote-provider build, since buildEmbeddings
always wipes embedding_meta before repopulating it.
@carlos-alm

Contributor Author

Re: the "stale provider row" finding in the summary (src/domain/search/generator.ts, persistEmbeddings skipping the provider key when null).

This isn't actually reachable: buildEmbeddings always calls loadNodesByFile(db) first, which unconditionally runs DELETE FROM embedding_meta before persistEmbeddings re-populates it (same function call, same DB transaction sequence). So a local-model rebuild after a remote build starts from an empty embedding_meta table every time — there's no code path where an old 'openai' row can survive into a later build. embedding_meta is only ever written from this one place; there's no incremental/partial update path.

Added tests/search/embedding-provider-metadata.test.ts (commit a68dd67) which runs a remote build followed by a local-model build against the same DB and asserts the stored provider is no longer 'openai' afterward — confirming the current code already behaves correctly here. No source change made since there's nothing to fix.
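
The argument above can be modeled in a few lines. In this sketch a `Map` stands in for the `embedding_meta` table, and a single hypothetical function `rebuildMeta` collapses the two steps the comment describes (the unconditional delete, then the repopulation that skips the provider key when it is null). It is a toy model of the reasoning, not the project's database code.

```typescript
interface BuildResultSketch {
  model: string;
  dim: number;
  provider: "openai" | null;
}

// Wipe the metadata, then write the new build's values.
function rebuildMeta(meta: Map<string, string>, build: BuildResultSketch): void {
  meta.clear(); // stands in for DELETE FROM embedding_meta
  meta.set("model", build.model);
  meta.set("dim", String(build.dim));
  if (build.provider !== null) {
    meta.set("provider", build.provider); // key is skipped entirely for local builds
  }
}
```

Because every build starts from an empty table, skipping the key on a local build cannot leave an earlier "openai" value behind.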

@carlos-alm

Contributor Author

@greptileai

codegraph models has no validate() gate like embed does, so a config
with embeddings.provider set but embeddings.model unset would
interpolate the null model straight into the banner text.

Impact: 2 functions changed, 0 affected
…options

fakeCtx's llm defaults omitted requestTimeoutMs, so timeoutMs was
undefined on both sides of the toEqual comparison and the assertion
passed without ever checking the value was threaded through.
@carlos-alm

Contributor Author

Addressed both minor items from the latest summary:

  • src/cli/commands/models.ts: fixed — the banner now falls back to "(not configured — set embeddings.model)" instead of interpolating the literal null when embeddings.provider is set without embeddings.model. Added tests/unit/models-command.test.ts covering both the missing-model and configured-model cases. (commit 4fcdc9c)
  • tests/unit/embed-command.test.ts: fixed the coverage gap — fakeCtx's llm override now sets requestTimeoutMs, and the toEqual assertion now includes timeoutMs, so the test actually verifies the value is threaded through (previously both sides were undefined and the assertion passed vacuously). (commit 4e54788)

@carlos-alm

Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 3ec424a into main Jul 2, 2026
23 of 24 checks passed
@carlos-alm carlos-alm deleted the feat/remote-embedding-provider branch July 2, 2026 03:30
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 2, 2026

Development

Successfully merging this pull request may close these issues.

[Feature]: Embedding model from url endpoint

1 participant