OpenConceptLab/ocl_online#116 | Using Infinity for embedding and reranking with fallback to inline load of model in api/indexing services#878
Conversation
…nking with fallback to inline load of model in api/indexing services
paynejd
left a comment
There was a problem hiding this comment.
Reviewed end-to-end. Strong PR — using off-the-shelf Infinity instead of a hand-rolled /$embed/+/$rerank/ service is the right call, the NO_LM removal is clean (no lingering settings.LM/settings.ENCODER refs), and the VectorEmbed/Reranker unit tests are good. A few items to address before merge (infra items are on oclinfrastructure#6):
Should-fix
- #1 + #3 (inline suggestion below): the local fallback rebuilds
SentenceTransformeron every call (was a preloaded singleton) and returnsnp.float32instead of.tolist()native floats. Both fixed in one suggestion. - #2 (infra): the memory right-sizing on #6 assumes models never load in-process, but the fallback loads them (the
apifallback also loads the ~1GB+ CrossEncoder). A service outage — the exact case the fallback exists for — is when the slimmed container is most likely to OOM. Keep headroom for a fallback load, or make the fallback explicitly disable-able so a slimmed container fails fast instead of OOM-thrashing.
Verify
- #5 — re-index + normalization parity: Infinity may L2-normalize
all-MiniLM-L6-v2output by default;SentenceTransformer.encodedoes not. Index-time and query-time are consistent going forward (both via the service), but (a) a full re-index is required at cutover since existing prod vectors were built in-process, and (b) a local fallback query against a service-built index could silently degrade kNN if normalization differs. Please confirm cosine(same text) ≈ 1.0 between Infinity and the local model before relying on the fallback.
Nits
- (a)
.env.example:12anddocker-compose.override.yml.bak:40still setNO_LM=TRUE— stale, replace withEMBEDDING_SERVICE_URL/INFINITY_API_KEY. (Couldn't inline-suggest — those lines aren't in the diff.) - (b) inline below — hoist
import requeststo module top. - (f) leaving
apiwithout adepends_ononocl-embeddings-apiis correct — the existing "do not depend on other services" comment is intentional and the fallback handles startup ordering gracefully. No change.
Testing: the fallback tests mock _get_embedding_locally, so they never exercise the real per-call reload (#1) or the np.float32 return (#3), and nothing covers the indexing path (documents.py) going through the service. Worth one test that lets the real local path run on a tiny input and asserts the returned vector is a list of plain floats.
| def _get_embedding_locally(self, txt): | ||
| from sentence_transformers import SentenceTransformer | ||
| model = SentenceTransformer(self.model_name) | ||
| return list(model.encode(str(txt))) |
There was a problem hiding this comment.
#1 + #3 — cache the local model and return native floats. The fallback currently reloads a ~400MB SentenceTransformer on every embed() call (the old path used the preloaded settings.LM singleton), and returns list(np.ndarray) = np.float32 elements where the old call site relied on .tolist() to get JSON-safe native floats. Caching per model name makes a mid-indexing outage cheap instead of catastrophic, and .tolist() restores native floats for the ES query_vector/index payload:
| def _get_embedding_locally(self, txt): | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer(self.model_name) | |
| return list(model.encode(str(txt))) | |
| _LOCAL_MODELS = {} | |
| def _get_embedding_locally(self, txt): | |
| model = self._LOCAL_MODELS.get(self.model_name) | |
| if model is None: | |
| from sentence_transformers import SentenceTransformer | |
| model = self._LOCAL_MODELS[self.model_name] = SentenceTransformer(self.model_name) | |
| return model.encode(str(txt)).tolist() |
|
|
||
| def _get_embedding_from_service(self, txt): | ||
| try: | ||
| import requests as req |
There was a problem hiding this comment.
Nit (b): import requests as req here and again in _get_rerank_scores_from_service (~line 418). requests is already a top-level dependency — hoist a single import requests into the module import block and drop the per-call aliased imports.
Linked Issue
Closes OpenConceptLab/ocl_online#116