Direct COSINE_SIMILARITY metric #582

Draft
rbs333 wants to merge 1 commit into main from cosine_similarity_metric

rbs333 (Collaborator) commented Apr 13, 2026

See: rbs333/RediSearch#1

If this were merged into core Redis, then we could remove our conversion logic and have a cleaner implementation of direct cosine_similarity.

PR Summary: Add COSINE_SIMILARITY support in RediSearch

Summary

This change adds public RediSearch support for DISTANCE_METRIC COSINE_SIMILARITY while preserving the existing internal cosine execution path.

The implementation is intentionally non-breaking:

  • Existing L2, IP, and COSINE behavior remains unchanged.
  • COSINE_SIMILARITY is exposed as a new public metric name.
  • Internally, search/index execution continues to reuse cosine-distance behavior.
  • No new VecSim ordering, heap, or comparator logic is introduced in RediSearch.

Why

  • Cosine similarity with range [-1, 1] is the industry-standard convention among vector databases.
  • Many customers' existing downstream apps assume cosine similarity, so the lack of support adds friction when adopting Redis as a replacement.
  • Many ecosystem integrations also assume this convention, which requires us to reverse-engineer the similarity value from the distance in order to support them.
  • A distance metric doesn't express exactly opposite vectors as intuitively as a negative number does.

RediSearch module changes

Metric parsing and metadata

RediSearch now accepts COSINE_SIMILARITY in schema creation and reports it back through metric stringification.

Touched areas:

  • src/spec.c
  • src/vector_index.h
  • src/vector_index.c

Internal cosine-path reuse

COSINE_SIMILARITY follows the same internal path as COSINE for query execution and vector normalization.

Touched areas:

  • src/vector_normalization.h
  • src/iterators/hybrid_reader.c

Returned score semantics

For fields defined with COSINE_SIMILARITY, RediSearch converts exposed vector scores from cosine distance to cosine similarity at the output boundary:

  • similarity = 1 - distance

This keeps internal ranking unchanged while presenting similarity-style results to users.
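The conversion above can be illustrated with a small standalone sketch. It is not the RediSearch code (which is C); it assumes cosine distance is defined as 1 - cos(a, b), and the helper names `cosine_distance` and `to_similarity` are hypothetical. Ranking is done on distance and only the reported score is converted.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cos(a, b), so it lies in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def to_similarity(distance):
    """Output-boundary conversion described above: similarity = 1 - distance."""
    return 1.0 - distance


query = [1.0, 0.0]
docs = {
    "same": [2.0, 0.0],        # same direction as the query
    "orthogonal": [0.0, 3.0],  # perpendicular to the query
    "opposite": [-1.0, 0.0],   # exactly opposite direction
}

# Rank by ascending distance (the unchanged internal ordering) ...
ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))

# ... and convert only the exposed score to a similarity value.
scores = {name: to_similarity(cosine_distance(query, docs[name])) for name in ranked}
```

Because the conversion is monotonically decreasing, ascending distance order and descending similarity order are the same ordering, which is why no comparator changes are needed.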

Touched areas:

  • src/vector_index.c
  • src/iterators/hybrid_reader.c

Range query semantics

For VECTOR_RANGE on COSINE_SIMILARITY fields, RediSearch interprets the provided threshold as a similarity threshold and translates it before calling the existing range query path:

  • internal radius = 1 - similarity_threshold

The public input is validated against the similarity range [-1, 1].
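A minimal sketch of that translation, written in Python for illustration only. The function name `similarity_threshold_to_radius` and the error message are assumptions, not names from the RediSearch source.

```python
def similarity_threshold_to_radius(similarity_threshold: float) -> float:
    """Translate a public similarity threshold into an internal distance radius.

    Hypothetical helper: validates the input against [-1, 1], then applies
    internal radius = 1 - similarity_threshold.
    """
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not -1.0 <= similarity_threshold <= 1.0:
        raise ValueError(
            f"similarity threshold must be in [-1, 1], got {similarity_threshold}"
        )
    return 1.0 - similarity_threshold


# A threshold of 0.75 ("at least 0.75 similar") becomes a radius of 0.25.
radius = similarity_threshold_to_radius(0.75)
```

Note the direction flips: a higher similarity threshold yields a smaller radius, so "similarity >= t" maps onto "distance <= 1 - t".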

Touched area:

  • src/vector_index.c

Validation / tests

This PR adds focused RediSearch-side coverage for:

  • FT.CREATE accepting DISTANCE_METRIC COSINE_SIMILARITY
  • KNN result ordering matching cosine behavior
  • returned scores being exposed as cosine similarity values
  • range query thresholds being interpreted as similarity thresholds

Design constraints preserved

  • No changes to existing COSINE, IP, or L2 semantics
  • No new search-time cosine-similarity math in the RediSearch module
  • No new ordering/comparator model
  • No changes to the core cosine ranking behavior

Notes

This PR is designed as a thin RediSearch-layer adaptation:

  • keep cosine-based execution internally
  • translate only the public metric name and exposed score/range semantics

If paired with the corresponding VectorSimilarity changes, this gives users a clean public COSINE_SIMILARITY metric without expanding the internal algorithm surface area.

Copilot AI review requested due to automatic review settings April 13, 2026 17:01
jit-ci bot commented Apr 13, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

Copilot AI (Contributor) left a comment

Pull request overview

Adds client-side support in RedisVL for the new COSINE_SIMILARITY vector distance metric, aligning schema generation, query behavior, and result post-processing with similarity-style semantics (higher is better).

Changes:

  • Extend VectorDistanceMetric/schema field support to include COSINE_SIMILARITY and ensure it’s passed through to Redis as DISTANCE_METRIC COSINE_SIMILARITY.
  • Adjust vector query validation to default-sort COSINE_SIMILARITY searches by vector_distance descending when using the default sort.
  • Add unit tests covering schema export, default sort behavior, and ensuring similarity scores aren’t re-normalized.
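The default-sort behavior described in the list above might look roughly like the following sketch. `Metric` is a trimmed stand-in for RedisVL's VectorDistanceMetric, and `resolve_sort` is a hypothetical helper, not a function from this PR; the assumption is that an explicit user sort always wins and only the defaulted sort flips to descending.

```python
from enum import Enum
from typing import Optional, Tuple


class Metric(Enum):
    """Stand-in for the schema's distance metric enum (hypothetical, trimmed)."""

    COSINE = "COSINE"
    L2 = "L2"
    IP = "IP"
    COSINE_SIMILARITY = "COSINE_SIMILARITY"


def resolve_sort(
    metric: Metric, user_sort: Optional[Tuple[str, bool]] = None
) -> Tuple[str, bool]:
    """Return (field, ascending), honoring an explicit user sort if one was given."""
    if user_sort is not None:
        return user_sort
    # Similarity scores: higher is better, so the default becomes descending.
    return ("vector_distance", metric is not Metric.COSINE_SIMILARITY)


rows = [
    {"id": "a", "vector_distance": 0.2},
    {"id": "b", "vector_distance": 0.9},
]
field, ascending = resolve_sort(Metric.COSINE_SIMILARITY)
best_first = sorted(rows, key=lambda r: r[field], reverse=not ascending)
```

With COSINE_SIMILARITY the row scoring 0.9 comes first; with COSINE the same call would keep the ascending default.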

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/unit/test_query_types.py | Adds tests for cosine-similarity score handling and default DESC sorting behavior. |
| tests/unit/test_fields.py | Verifies FlatVectorField exports COSINE_SIMILARITY in Redis field args. |
| redisvl/schema/fields.py | Introduces VectorDistanceMetric.COSINE_SIMILARITY and disables normalization for it in VECTOR_NORM_MAP. |
| redisvl/query/query.py | Tracks whether vector-distance sorting was defaulted and resets that state when sort_by() is called/cleared. |
| redisvl/index/index.py | Applies default DESC sort for COSINE_SIMILARITY vector queries; validates batched queries before execution. |

@@ -36,6 +41,108 @@ def test_count_query():
assert process_results(fake_result, count_query, "json") == 2

Copilot AI Apr 13, 2026

process_results now takes an IndexSchema, but this test still passes the string "json". It works only because the CountQuery early-return skips schema usage; updating the test to pass a real schema (or a minimal IndexSchema fixture) will keep it aligned with the public signature and avoid future breakage if the implementation changes.

Copilot uses AI. Check for mistakes.
Comment on lines +251 to +253
    if isinstance(query, BaseVectorQuery):
        field = self.schema.fields[query._vector_field_name]
        dist_metric = VectorDistanceMetric(field.attrs.distance_metric.upper())  # type: ignore

Copilot AI Apr 13, 2026

Accessing self.schema.fields[query._vector_field_name] will raise a raw KeyError if the query references a vector field name not present in the schema. Since this is part of user input validation, it would be better to handle the missing-field case explicitly and raise QueryValidationError with a clear message (instead of leaking KeyError).
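One way the missing-field case could be handled is sketched below. This is a standalone illustration, not the RedisVL implementation: `QueryValidationError` is redefined locally as a stand-in, and `lookup_vector_field` and its message are hypothetical.

```python
class QueryValidationError(Exception):
    """Local stand-in for the library's validation error (assumption)."""


def lookup_vector_field(fields: dict, field_name: str):
    """Return the schema field, or raise a validation error if it is missing."""
    try:
        return fields[field_name]
    except KeyError:
        # "from None" hides the raw KeyError from the user-facing traceback.
        raise QueryValidationError(
            f"Vector field '{field_name}' is not defined in the index schema"
        ) from None


fields = {"embedding": object()}
found = lookup_vector_field(fields, "embedding")
```

Callers then only need to handle one exception type for every kind of invalid query input.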

Comment on lines +1160 to +1161
    for query in queries:
        self._validate_query(query)

Copilot AI Apr 13, 2026

_query() wraps QueryValidationError to add context ("Invalid query: ..."), but batch_query() now calls _validate_query() without similar handling. Consider catching QueryValidationError here as well (and indicating which query in the batch failed) so batch and non-batch APIs report validation errors consistently.

Suggested change
    - for query in queries:
    -     self._validate_query(query)
    + for i, query in enumerate(queries):
    +     try:
    +         self._validate_query(query)
    +     except QueryValidationError as e:
    +         raise QueryValidationError(
    +             f"Invalid query at batch index {i}: {str(e)}"
    +         ) from e

Comment on lines +2085 to +2086
    for query in queries:
        self._validate_query(query)

Copilot AI Apr 13, 2026

Same as sync batch_query(): now that _validate_query() is called here, consider catching QueryValidationError and adding context about which query failed validation so async batch and single-query APIs have consistent error reporting.

Suggested change
    - for query in queries:
    -     self._validate_query(query)
    + for i, query in enumerate(queries):
    +     try:
    +         self._validate_query(query)
    +     except QueryValidationError as e:
    +         raise QueryValidationError(
    +             f"Invalid query at batch index {i}: {str(e)}"
    +         ) from e
