Vector Similarity Search
We would like to add vector search support to Vortex.
Plan
At a minimum, we likely need to support (some of this already exists):
- a stable vector representation
- reusable normalization (either as an encoding or potentially a serialized scalar fn - issue potetially coming soon)
- exact and approximate execution paths (since quantization is lossy)
- vector-specific encodings (like TurboQuant or Spherical transforms)
- compressor/runtime optimizations for common query shapes
- persisted indexes (probably supported by a new layout)
- a clear user-facing model for lossy storage and approximate compute
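To make the exact-vs-approximate split concrete, here is a minimal sketch of a symmetric int8 scalar quantizer. This is a hypothetical illustration, not Vortex code and not how TurboQuant works; it only shows why a lossy encoding forces two execution paths: an exact path that keeps (or reconstructs) full-precision values, and an approximate path that scores the codes directly.

```rust
// Hypothetical sketch (not Vortex's TurboQuant): a symmetric int8 scalar
// quantizer, illustrating that a lossy encoding cannot round-trip exactly,
// which is why exact and approximate execution paths must both exist.

/// Quantize f32 values to i8 codes plus a single scale factor.
fn quantize(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let codes = v.iter().map(|x| (x / scale).round() as i8).collect();
    (codes, scale)
}

/// Reconstruct approximate f32 values from the codes.
fn dequantize(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

fn main() {
    let v = vec![0.12f32, -0.98, 0.55, 0.0];
    let (codes, scale) = quantize(&v);
    let approx = dequantize(&codes, scale);
    // The round-trip is close but not exact: the encoding is lossy, and the
    // reconstruction error is bounded by half the quantization step.
    for (orig, rec) in v.iter().zip(&approx) {
        assert!((orig - rec).abs() <= scale / 2.0 + 1e-6);
    }
    println!("scale = {scale}, reconstruction = {approx:?}");
}
```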
Note that we still have not decided how to communicate or expose approximation/lossiness in a principled way.
Design
Details
`Vector` should be a first-class data model for fixed-length float embeddings, with clean integration into the dtype system, expression system, serialization, and compression pipeline.
- Similarity metrics should be modeled as first-class operations, not just ad hoc helper functions. At a minimum this includes `inner_product`, `cosine_similarity`, and `l2_norm`.
- Unit-norm factoring should be a reusable encoding or representation rather than something hardcoded inside TurboQuant. This gives us a common substrate for cosine similarity, normalized-dot-product rewrites, and future vector encodings.
- Vector encodings should be pluggable and composable in the same spirit as the rest of Vortex. TurboQuant is the first serious candidate, but it should not be the only path and should not have to own normalization, approximation semantics, and query execution by itself (especially since it is lossy).
- Similarity search needs dedicated execution paths for the common “constant query vector against a column of vectors” case. This shows up naturally as `ConstantArray` on one side, and we should not pay the same cost as the fully general pairwise case.
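The metrics and the unit-norm factoring above can be sketched in a few lines. These are illustrative scalar implementations, not Vortex's kernels; the function names follow the operations listed above, and `factor` is a hypothetical name for the normalization split. The point is the rewrite that a normalization encoding enables: cosine over raw vectors equals a plain dot product over pre-normalized vectors.

```rust
// Illustrative scalar implementations (not Vortex kernels) of the three
// metrics named above, plus the unit-norm factoring that lets cosine
// similarity be rewritten as an inner product over normalized vectors.

fn inner_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn l2_norm(a: &[f32]) -> f32 {
    inner_product(a, a).sqrt()
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    inner_product(a, b) / (l2_norm(a) * l2_norm(b))
}

/// Unit-norm factoring (hypothetical helper): split a vector into its norm
/// and its unit direction once, at write time.
fn factor(a: &[f32]) -> (f32, Vec<f32>) {
    let n = l2_norm(a);
    (n, a.iter().map(|x| x / n).collect())
}

fn main() {
    let a = [3.0f32, 4.0];
    let b = [4.0f32, 3.0];
    let (_, ua) = factor(&a);
    let (_, ub) = factor(&b);
    // Cosine on raw vectors == inner product on the stored unit vectors.
    assert!((cosine_similarity(&a, &b) - inner_product(&ua, &ub)).abs() < 1e-6);
    println!("cosine = {}", cosine_similarity(&a, &b));
}
```

Because the factoring is done once at write time, every cosine query against the column degrades into the cheaper inner product, which is one argument for making normalization a reusable encoding rather than something private to TurboQuant.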
There are also more high-level things that likely deserve their own issues:
- We should distinguish exact scan, encoded scan, and indexed search as separate but interoperable execution strategies. These are different tradeoffs and should be visible to the planner.
- Vector search likely wants a first-class top-k / nearest-neighbor operator rather than forcing users to manually compose “compute similarity for every row, then sort, then limit”.
- Persisted vector indexes should be part of the story. Even if we start with exact flat scan and encoded scan, the architecture should leave room for ANN indexes and row-id based retrieval.
- We need a principled API for approximation and lossiness. Right now `ApproxOptions` is a useful start, but it is not yet a full user-facing contract for what is allowed to be approximate, when approximation is chosen automatically, or how accuracy tradeoffs are surfaced.
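As a sketch of what a first-class top-k operator buys over “compute similarity for every row, then sort, then limit”: a bounded min-heap of size k scores the column in one pass without materializing and sorting every score. This is a hypothetical standalone example (`top_k`, `Scored`, and `dot` are illustrative names, not Vortex APIs), and it also exercises the constant-query shape, since any per-query work happens exactly once.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical sketch (not a Vortex operator): top-k nearest neighbors for
// the "constant query vector against a column of vectors" shape, using a
// bounded min-heap instead of a full sort followed by a limit.

/// Wrapper giving f32 scores a total order usable inside a BinaryHeap.
#[derive(PartialEq)]
struct Scored(f32, usize);
impl Eq for Scored {}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Scored {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.0.total_cmp(&other.0)
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Row ids of the k highest-scoring rows, best first. Because the query is
/// constant, per-query work (e.g. normalizing it) would happen exactly once.
fn top_k(query: &[f32], rows: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Reverse<Scored>> = BinaryHeap::with_capacity(k);
    for (id, row) in rows.iter().enumerate() {
        let score = dot(query, row);
        if heap.len() < k {
            heap.push(Reverse(Scored(score, id)));
        } else if score > heap.peek().map(|Reverse(Scored(s, _))| *s).unwrap() {
            // Evict the current worst of the top-k.
            heap.pop();
            heap.push(Reverse(Scored(score, id)));
        }
    }
    // into_sorted_vec on Reverse-wrapped entries yields descending scores.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse(Scored(_, id))| id)
        .collect()
}

fn main() {
    let query = [1.0f32, 0.0];
    let rows = vec![
        vec![0.0f32, 1.0], // orthogonal to the query, score 0.0
        vec![0.9f32, 0.1], // score 0.9
        vec![0.5f32, 0.5], // score 0.5
    ];
    assert_eq!(top_k(&query, &rows, 2), vec![1, 2]);
    println!("top-2 row ids: {:?}", top_k(&query, &rows, 2));
}
```

A planner-visible operator of this shape is also the natural seam where an ANN index could later substitute for the scan without changing the query surface.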
Steps / History
- `Vector` extension type: Vector Extension Type #6964
- `cosine_similarity`: Vortex Fixed-Shape Tensor #6812
- `l2_norm`: Vector Extension Type #6964
- `l2_denorm`: L2 Denorm expression #7329
- `inner_product`: TurboQuant encoding for Vectors #7269
- `sorf` or some "make random" reversible expression
- `ApproxOptions` model for tensor expressions: Approximate expressions for tensor types #7226
- `TurboQuant` metadata to be protobuf #7301
- `L2Denorm(norms, Sorf(matrix, Dict(centroids, codes)))`: `L2Denorm` from TurboQuant #7349
- `ScalarValue::Array` support for performant list scalars (in progress): `ScalarValue::Array` #6717
- `ConstantArray` with `Constant` children #7394

And then, in no particular order:
Unresolved Questions
- What is the right abstraction for unit normalization: an encoding, a wrapper type, or just a rewrite rule?
- Should cosine similarity be implemented primarily as normalized dot product, or should it remain a distinct primitive all the way through planning and execution?
- What should the first-class query surface look like for vector search?
- Should Vortex expose a dedicated nearest-neighbor operator, or should this remain expression + order-by + limit for now?
- What is the right boundary between exact compute and approximate compute in the public API?
- How do we meaningfully communicate “lossy” storage to users?
- How do we meaningfully communicate “approximate” compute to users?
- Should the compressor automatically choose a lossy encoding or approximate execution path, or must that always be opt-in (I feel like this should be opt-in)?
- How should approximate results interact with determinism, reproducibility, and explainability?
- Which index strategy should we build first?
- How should vector indexes compose with the existing layout / pruning / scan infrastructure?