Skip to content

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services#878

Open
snyaggarwal wants to merge 2 commits into
masterfrom
ocl_online/issues#116
Open

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services#878
snyaggarwal wants to merge 2 commits into
masterfrom
ocl_online/issues#116

Conversation

@snyaggarwal

Copy link
Copy Markdown
Contributor

Linked Issue

Closes OpenConceptLab/ocl_online#116

…nking with fallback to inline load of model in api/indexing services
@snyaggarwal snyaggarwal requested a review from paynejd June 12, 2026 01:45
@snyaggarwal snyaggarwal self-assigned this Jun 12, 2026

@paynejd paynejd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed end-to-end. Strong PR — using off-the-shelf Infinity instead of a hand-rolled /$embed/+/$rerank/ service is the right call, the NO_LM removal is clean (no lingering settings.LM/settings.ENCODER refs), and the VectorEmbed/Reranker unit tests are good. A few items to address before merge (infra items are on oclinfrastructure#6):

Should-fix

  • #1 + #3 (inline suggestion below): the local fallback rebuilds SentenceTransformer on every call (was a preloaded singleton) and returns np.float32 instead of .tolist() native floats. Both fixed in one suggestion.
  • #2 (infra): the memory right-sizing on #6 assumes models never load in-process, but the fallback loads them (the api fallback also loads the ~1GB+ CrossEncoder). A service outage — the exact case the fallback exists for — is when the slimmed container is most likely to OOM. Keep headroom for a fallback load, or make the fallback explicitly disable-able so a slimmed container fails fast instead of OOM-thrashing.

Verify

  • #5 — re-index + normalization parity: Infinity may L2-normalize all-MiniLM-L6-v2 output by default; SentenceTransformer.encode does not. Index-time and query-time are consistent going forward (both via the service), but (a) a full re-index is required at cutover since existing prod vectors were built in-process, and (b) a local fallback query against a service-built index could silently degrade kNN if normalization differs. Please confirm cosine(same text) ≈ 1.0 between Infinity and the local model before relying on the fallback.

Nits

  • (a) .env.example:12 and docker-compose.override.yml.bak:40 still set NO_LM=TRUE — stale, replace with EMBEDDING_SERVICE_URL / INFINITY_API_KEY. (Couldn't inline-suggest — those lines aren't in the diff.)
  • (b) inline below — hoist import requests to module top.
  • (f) leaving api without a depends_on on ocl-embeddings-api is correct — the existing "do not depend on other services" comment is intentional and the fallback handles startup ordering gracefully. No change.

Testing: the fallback tests mock _get_embedding_locally, so they never exercise the real per-call reload (#1) or the np.float32 return (#3), and nothing covers the indexing path (documents.py) going through the service. Worth one test that lets the real local path run on a tiny input and asserts the returned vector is a list of plain floats.

Comment thread core/common/search.py
Comment on lines +364 to +367
def _get_embedding_locally(self, txt):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(self.model_name)
return list(model.encode(str(txt)))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1 + #3 — cache the local model and return native floats. The fallback currently reloads a ~400MB SentenceTransformer on every embed() call (the old path used the preloaded settings.LM singleton), and returns list(np.ndarray) = np.float32 elements where the old call site relied on .tolist() to get JSON-safe native floats. Caching per model name makes a mid-indexing outage cheap instead of catastrophic, and .tolist() restores native floats for the ES query_vector/index payload:

Suggested change
def _get_embedding_locally(self, txt):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(self.model_name)
return list(model.encode(str(txt)))
_LOCAL_MODELS = {}
def _get_embedding_locally(self, txt):
model = self._LOCAL_MODELS.get(self.model_name)
if model is None:
from sentence_transformers import SentenceTransformer
model = self._LOCAL_MODELS[self.model_name] = SentenceTransformer(self.model_name)
return model.encode(str(txt)).tolist()

Comment thread core/common/search.py

def _get_embedding_from_service(self, txt):
try:
import requests as req

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (b): import requests as req here and again in _get_rerank_scores_from_service (~line 418). requests is already a top-level dependency — hoist a single import requests into the module import block and drop the per-call aliased imports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants