OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services by snyaggarwal · Pull Request #878 · OpenConceptLab/oclapi2

snyaggarwal · 2026-06-12T01:45:50Z

Linked Issue

Closes OpenConceptLab/ocl_online#116

…nking with fallback to inline load of model in api/indexing services

paynejd

Reviewed end-to-end. Strong PR — using off-the-shelf Infinity instead of a hand-rolled /$embed/+/$rerank/ service is the right call, the NO_LM removal is clean (no lingering settings.LM/settings.ENCODER refs), and the VectorEmbed/Reranker unit tests are good. A few items to address before merge (infra items are on oclinfrastructure#6):

Should-fix

#1 + #3 (inline suggestion below): the local fallback rebuilds SentenceTransformer on every call (was a preloaded singleton) and returns np.float32 instead of .tolist() native floats. Both fixed in one suggestion.
#2 (infra): the memory right-sizing on #6 assumes models never load in-process, but the fallback loads them (the api fallback also loads the ~1GB+ CrossEncoder). A service outage — the exact case the fallback exists for — is when the slimmed container is most likely to OOM. Keep headroom for a fallback load, or make the fallback explicitly disable-able so a slimmed container fails fast instead of OOM-thrashing.
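
For the "explicitly disable-able" option in #2, a minimal sketch of what a fail-fast toggle could look like. The `EMBEDDING_LOCAL_FALLBACK` variable, the exception class, and the `embed_with_fallback` helper are hypothetical names, not part of this PR; the service and local calls are passed in as plain callables so the sketch stays independent of the real classes.

```python
import os


class EmbeddingServiceUnavailable(RuntimeError):
    """Raised when the service call fails and the local fallback is disabled."""


def local_fallback_enabled(env=None):
    # Hypothetical flag; defaults to enabled, matching the PR's current behavior.
    env = os.environ if env is None else env
    value = env.get("EMBEDDING_LOCAL_FALLBACK", "true")
    return value.strip().lower() in ("1", "true", "yes")


def embed_with_fallback(txt, from_service, from_local, env=None):
    """Try the service first; fall back to the in-process model only if allowed."""
    try:
        return from_service(txt)
    except Exception as exc:  # real code would catch narrower errors
        if not local_fallback_enabled(env):
            # Slimmed container: fail fast rather than load a model and risk OOM.
            raise EmbeddingServiceUnavailable(
                "embedding service failed and local fallback is disabled"
            ) from exc
        return from_local(txt)
```

A slimmed container would set the flag to `false`; a container sized with headroom for a model load would leave it at the default.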

Verify

#5 — re-index + normalization parity: Infinity may L2-normalize all-MiniLM-L6-v2 output by default; SentenceTransformer.encode does not. Index-time and query-time are consistent going forward (both via the service), but (a) a full re-index is required at cutover since existing prod vectors were built in-process, and (b) a local fallback query against a service-built index could silently degrade kNN if normalization differs. Please confirm cosine(same text) ≈ 1.0 between Infinity and the local model before relying on the fallback.
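
A rough sketch of the parity check in #5, using only the standard library. The vectors in the usage lines are made-up numbers, not real model output, and `parity_report` is a hypothetical helper. One caveat worth keeping in mind: cosine is scale-invariant, so cosine ≈ 1.0 confirms the two sides agree on direction but will not reveal that one side is L2-normalized and the other is not. Comparing the norms as well covers that. Whether a scale difference matters depends on the similarity configured for the index, which this thread does not state.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def parity_report(service_vec, local_vec, tol=1e-3):
    """Compare two embeddings of the same text from the service and the local model."""
    if len(service_vec) != len(local_vec):
        raise ValueError("embedding dimensions differ")
    cos = cosine(service_vec, local_vec)
    service_norm = l2_norm(service_vec)
    local_norm = l2_norm(local_vec)
    return {
        "cosine": cos,
        "service_norm": service_norm,
        "local_norm": local_norm,
        # Direction parity: what cosine(same text) ~= 1.0 actually verifies.
        "same_direction": abs(cos - 1.0) <= tol,
        # Scale parity: catches normalized vs. unnormalized output.
        "same_scale": abs(service_norm - local_norm) <= tol,
    }


# Made-up example: same direction, but only the "service" vector is unit length.
report = parity_report([0.6, 0.8], [3.0, 4.0])
```

In this example `same_direction` is true while `same_scale` is false, which is the case a cosine-only check would miss.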

Nits

(a) .env.example:12 and docker-compose.override.yml.bak:40 still set NO_LM=TRUE — stale, replace with EMBEDDING_SERVICE_URL / INFINITY_API_KEY. (Couldn't inline-suggest — those lines aren't in the diff.)
(b) inline below — hoist import requests to module top.
(f) leaving api without a depends_on on ocl-embeddings-api is correct — the existing "do not depend on other services" comment is intentional and the fallback handles startup ordering gracefully. No change.

Testing: the fallback tests mock _get_embedding_locally, so they never exercise the real per-call reload (#1) or the np.float32 return (#3), and nothing covers the indexing path (documents.py) going through the service. Worth one test that lets the real local path run on a tiny input and asserts the returned vector is a list of plain floats.
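
A sketch of the assertion such a test could make. `FakeFloat32` and `fake_encode` are hypothetical stand-ins so the example runs without NumPy or a model download; a real test would call the unmocked `_get_embedding_locally` instead. The check uses an exact `type(x) is float` comparison rather than `isinstance`, because `np.float64` subclasses Python's `float` and would slip through `isinstance`.

```python
import json


def is_plain_float_list(vec):
    # Exact type checks: reject ndarray-like containers and NumPy-style scalars.
    return type(vec) is list and all(type(x) is float for x in vec)


class FakeFloat32:
    """Hypothetical stand-in for np.float32: float-convertible, not a float."""

    def __init__(self, value):
        self.value = value

    def __float__(self):
        return float(self.value)


def fake_encode(txt):
    # Stand-in for model.encode(); yields non-native scalars like an ndarray would.
    return [FakeFloat32(0.25), FakeFloat32(-0.5)]


bad = list(fake_encode("tiny input"))                  # mirrors list(model.encode(...))
good = [float(x) for x in fake_encode("tiny input")]   # mirrors .tolist()

assert not is_plain_float_list(bad)
assert is_plain_float_list(good)
json.dumps(good)  # native floats serialize cleanly
```

Serializing `bad` with `json.dumps` raises `TypeError`, which is the same failure mode as a payload carrying `np.float32` elements.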

paynejd · 2026-06-12T10:47:13Z

+    def _get_embedding_locally(self, txt):
+        from sentence_transformers import SentenceTransformer
+        model = SentenceTransformer(self.model_name)
+        return list(model.encode(str(txt)))


#1 + #3 — cache the local model and return native floats. The fallback currently reloads a ~400MB SentenceTransformer on every embed() call (the old path used the preloaded settings.LM singleton), and returns list(np.ndarray) = np.float32 elements where the old call site relied on .tolist() to get JSON-safe native floats. Caching per model name makes a mid-indexing outage cheap instead of catastrophic, and .tolist() restores native floats for the ES query_vector/index payload:

Suggested change

-    def _get_embedding_locally(self, txt):
-        from sentence_transformers import SentenceTransformer
-        model = SentenceTransformer(self.model_name)
-        return list(model.encode(str(txt)))
+    _LOCAL_MODELS = {}
+
+    def _get_embedding_locally(self, txt):
+        model = self._LOCAL_MODELS.get(self.model_name)
+        if model is None:
+            from sentence_transformers import SentenceTransformer
+            model = self._LOCAL_MODELS[self.model_name] = SentenceTransformer(self.model_name)
+        return model.encode(str(txt)).tolist()
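
To illustrate the caching behavior of the suggestion above, here is a small standalone sketch. `LocalEmbedder` and the injected `loader` are hypothetical (the suggestion constructs `SentenceTransformer` directly); the counting loader only exists to show that the expensive load happens once per model name, even across instances.

```python
class LocalEmbedder:
    # Class-level cache shared by all instances, keyed by model name.
    _LOCAL_MODELS = {}

    def __init__(self, model_name, loader):
        self.model_name = model_name
        self._loader = loader  # hypothetical hook standing in for SentenceTransformer

    def _get_model(self):
        model = self._LOCAL_MODELS.get(self.model_name)
        if model is None:
            # First miss for this name: load once and remember the instance.
            model = self._LOCAL_MODELS[self.model_name] = self._loader(self.model_name)
        return model


load_calls = []


def counting_loader(name):
    load_calls.append(name)
    return object()  # stand-in for a loaded model


a = LocalEmbedder("all-MiniLM-L6-v2", counting_loader)
b = LocalEmbedder("all-MiniLM-L6-v2", counting_loader)

# Both instances share the same cached model; the loader ran exactly once.
assert a._get_model() is b._get_model()
assert load_calls == ["all-MiniLM-L6-v2"]
```

The cache is not lock-guarded, so two threads that miss at the same moment could each load the model once. If that matters for the workers here, wrapping the miss path in a `threading.Lock` would be one option.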

paynejd · 2026-06-12T10:47:13Z

+
+    def _get_embedding_from_service(self, txt):
+        try:
+            import requests as req


Nit (b): import requests as req here and again in _get_rerank_scores_from_service (~line 418). requests is already a top-level dependency — hoist a single import requests into the module import block and drop the per-call aliased imports.

OpenConceptLab/ocl_online#116 | Using Infinity for embedding and rera…

5601a89

…nking with fallback to inline load of model in api/indexing services

snyaggarwal requested a review from paynejd June 12, 2026 01:45

snyaggarwal self-assigned this Jun 12, 2026

OpenConceptLab/ocl_online#116 | suppressing tracking

ab846d9

paynejd reviewed Jun 12, 2026

snyaggarwal wants to merge 2 commits into master from ocl_online/issues#116
