Add vector-first Lance ANN path for Cypher vector rerank #140

Merged
beinan merged 4 commits into lance-format:main from beinan:feature/vector-first-lance-index
Feb 19, 2026

@beinan beinan commented Feb 15, 2026

Summary

  • Add a VectorSearch.use_lance_index(True) opt-in to run a vector-first path when datasets are Lance datasets.
  • When enabled, CypherQuery.execute_with_vector_rerank uses Lance ANN (nearest) to get the top-k rows for the vector label, then executes Cypher on that reduced dataset.
  • Adds a Python test behind requires_lance to validate the path.

Notes

  • This is an explicit opt-in and only applies when the Cypher query has no WITH/WHERE clauses; otherwise it falls back to the existing rerank behavior.
  • Semantics can differ from candidate-then-rerank if you enable it on filtered queries; the guard avoids that by default.
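
To illustrate why the guard matters, here is a minimal sketch in plain Python that compares the two orderings on hypothetical toy data. It is not the lance-graph implementation: `DOCS`, `l2`, `top_k`, `vector_first`, and `filter_then_rerank` are illustrative names, and a brute-force sort stands in for the Lance ANN index.

```python
import math

# Hypothetical toy rows: (id, category, embedding). Not taken from the PR's tests.
DOCS = [
    (1, "tech", [1.0, 0.0, 0.0]),
    (2, "news", [0.9, 0.1, 0.0]),
    (3, "news", [0.8, 0.2, 0.0]),
    (4, "tech", [0.0, 1.0, 0.0]),
    (5, "tech", [-1.0, 0.0, 0.0]),
]
QUERY = [1.0, 0.0, 0.0]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(rows, query, k):
    """Brute-force nearest neighbours (stand-in for an ANN index lookup)."""
    return sorted(rows, key=lambda r: l2(r[2], query))[:k]


def vector_first(rows, query, k, predicate):
    """Take the k nearest rows first, then apply the filter to that subset."""
    return [r[0] for r in top_k(rows, query, k) if predicate(r)]


def filter_then_rerank(rows, query, k, predicate):
    """Apply the filter first, then rank the surviving candidates by distance."""
    return [r[0] for r in top_k([r for r in rows if predicate(r)], query, k)]


def is_tech(row):
    return row[1] == "tech"


def no_filter(row):
    return True


# Without a filter both orderings agree.
print(vector_first(DOCS, QUERY, 2, no_filter), filter_then_rerank(DOCS, QUERY, 2, no_filter))
# With a filter, vector-first can return fewer (and different) rows.
print(vector_first(DOCS, QUERY, 2, is_tech), filter_then_rerank(DOCS, QUERY, 2, is_tech))
```

In this toy setup the unfiltered query gives the same ids either way, while the filtered query loses a matching row under the vector-first ordering because the nearest neighbours were cut to k before the filter ran. Restricting the opt-in to queries with no WITH/WHERE clauses avoids that case.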

Motivation

Using Lance datasets with ANN indices can significantly reduce latency for GraphRAG hybrid retrieval on large datasets by avoiding full-table reranking.

@beinan beinan changed the title Add vector-first Lance ANN path for Cypher vector rerank [wip] Add vector-first Lance ANN path for Cypher vector rerank Feb 15, 2026

codecov-commenter commented Feb 15, 2026

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/lance-graph/src/lance_vector_search.rs | 0.00% | 12 Missing ⚠️ |

- Add getter methods to core VectorSearch struct (column, get_query_vector,
  get_metric, get_top_k) to allow Python bindings to access internal state
- Remove duplicated fields from Python VectorSearch, keeping only inner and
  use_lance_index
- Refactor python_datasets_to_batches functions to share common logic
- Fix Lance test to use FixedSizeListArray for vector column
- Add test for WHERE clause fallback behavior
- Improve documentation in try_execute_with_lance_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beinan beinan changed the title [wip] Add vector-first Lance ANN path for Cypher vector rerank Add vector-first Lance ANN path for Cypher vector rerank Feb 15, 2026

beinan commented Feb 18, 2026

Manual Test Results ✅

Ran manual verification of the vector-first Lance ANN feature:

Test Setup

  • Created Lance dataset with 5 documents and 3D embeddings
  • Query vector: [1.0, 0.0, 0.0]

Results

| Test | Description | Result |
| --- | --- | --- |
| Test 1 | Simple query + use_lance_index=True | ✅ Returns Doc1, Doc2, Doc5 (vector-first path) |
| Test 2 | Query with WHERE + use_lance_index=True | ✅ Falls back to standard rerank, returns only tech docs |
| Test 3 | Simple query + use_lance_index=False | ✅ Same results as Test 1 (standard rerank) |

Test Output

=== Test 1: Simple query with use_lance_index=True ===
Results (should be Doc1, Doc2, Doc5 - top 3 closest to [1,0,0]):
   d.id d.name      d.embedding
0     1   Doc1  [1.0, 0.0, 0.0]
1     2   Doc2  [0.9, 0.1, 0.0]
2     5   Doc5  [0.5, 0.5, 0.0]

=== Test 2: Query with WHERE (fallback to standard rerank) ===
Results (should only have tech docs: Doc1, Doc2, Doc4):
   d.id d.name      d.embedding  _distance
0     1   Doc1  [1.0, 0.0, 0.0]   0.000000
1     2   Doc2  [0.9, 0.1, 0.0]   0.141421
2     4   Doc4  [0.0, 0.0, 1.0]   1.414214

=== Test 3: Same query with use_lance_index=False ===
Results (should match Test 1):
   d.id d.name      d.embedding  _distance
0     1   Doc1  [1.0, 0.0, 0.0]   0.000000
1     2   Doc2  [0.9, 0.1, 0.0]   0.141421
2     5   Doc5  [0.5, 0.5, 0.0]   0.707107

Observations

  1. Vector-first path works correctly - Uses Lance's to_table(nearest={...}) API
  2. Fallback logic works - WHERE clause triggers fallback to standard rerank (note _distance column appears)
  3. Results are consistent - Both paths return same ordering for simple queries

Minor Note

The vector-first path doesn't include the _distance column in output, while standard rerank does. This could be a follow-up enhancement if users need distance values with the Lance ANN path.
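
Until the two paths are consolidated, a caller who needs distances could compute them client-side. The sketch below is only an illustration under stated assumptions: it uses the three rows from Test 1 as plain dictionaries, and it assumes Euclidean (L2) distance, which is consistent with the _distance values shown in the Test 2 and Test 3 output. The helper name `with_l2_distance` is hypothetical and not part of lance-graph.

```python
import math

QUERY = [1.0, 0.0, 0.0]

# Rows as returned by Test 1 (vector-first path, no _distance column).
ROWS = [
    {"d.id": 1, "d.name": "Doc1", "d.embedding": [1.0, 0.0, 0.0]},
    {"d.id": 2, "d.name": "Doc2", "d.embedding": [0.9, 0.1, 0.0]},
    {"d.id": 5, "d.name": "Doc5", "d.embedding": [0.5, 0.5, 0.0]},
]


def with_l2_distance(rows, query, column="d.embedding"):
    """Return copies of the rows with an added _distance key (assumed L2 metric)."""
    out = []
    for row in rows:
        dist = math.sqrt(sum((x - q) ** 2 for x, q in zip(row[column], query)))
        out.append({**row, "_distance": dist})
    return out


for row in with_l2_distance(ROWS, QUERY):
    print(row["d.name"], f"{row['_distance']:.6f}")
# Doc1 0.000000
# Doc2 0.141421
# Doc5 0.707107
```

The printed values match the _distance column from Test 3. For cosine or dot-product metrics the formula would need to change accordingly.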

beinan and others added 2 commits February 18, 2026 22:37
- Add Python tests for use_lance_index edge cases:
  - Missing query_vector error
  - Fallback for non-Lance datasets
  - Unqualified column names
  - Builder flag propagation
  - Cosine and dot product metrics
- Add Rust unit tests for helper functions:
  - split_vector_column parsing
  - alias_map_from_query extraction
  - resolve_vector_label resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@ChunxuTang ChunxuTang left a comment

Currently, use_lance_index=False produces a _distance column, but use_lance_index=True does not. I think it would be better to make the two paths consistent in a follow-up PR.

@beinan beinan merged commit 7f46f8c into lance-format:main Feb 19, 2026
11 checks passed