feat(ruvector-graph): VectorPropertyIndex — RaBitQ-backed kNN over node properties (Phase 1 item #2)#387

Open
ruvnet wants to merge 1 commit into main from feature/graph-vector-property-index

ruvnet commented Apr 26, 2026

Summary

Phase 1 item #2 from docs/research/rabitq-integration/05-roadmap.md. Adds a vector-keyed kNN index for graph nodes via direct-embed (Pattern 1) of ruvector-rabitq. Callers can now do "find the N node ids whose vector property is closest to query" without standing up a separate index crate.

let idx = VectorPropertyIndex::build(
    &graph,
    "embedding",
    VectorPropertyIndexConfig { seed: 42, rerank_factor: 20 },
)?;
let hits: Vec<(NodeId, f32)> = idx.knn(&query, k)?;

Behind the rabitq cargo feature, default-on. --no-default-features keeps the graph crate buildable without ruvector-rabitq.

Important determinism finding

DashMap iteration order is shard-dependent. Two builds in the same process can disagree on which NodeId lives at row 0. Without a fix this would silently break ADR-154's (seed, graph) → bit-identical codes guarantee across runs and shard-count changes.

Fix: VectorPropertyIndex::build sorts NodeIds before encoding. One O(n log n) string sort per build; row→NodeId mapping now stable across runs. Verified by byte_identical_query_results_for_same_seed.
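The sort-before-encode idea can be sketched with the standard library alone. This is an illustration under assumptions, not the code in this PR: `std::collections::HashMap` stands in for `DashMap` (both have an iteration order callers should not rely on), and `stable_row_order` is a hypothetical helper name.

```rust
use std::collections::HashMap;

/// Assign index rows to node ids in sorted order, so the row -> id mapping
/// does not depend on the map's iteration order.
/// (Hypothetical helper; the real build step lives in VectorPropertyIndex::build.)
fn stable_row_order(nodes: &HashMap<String, Vec<f32>>) -> Vec<String> {
    let mut ids: Vec<String> = nodes.keys().cloned().collect();
    // Keys are unique, so an unstable sort still gives a deterministic result.
    ids.sort_unstable();
    ids
}

fn main() {
    // Same three nodes, inserted in different orders.
    let mut a = HashMap::new();
    a.insert("n2".to_string(), vec![0.0_f32, 1.0]);
    a.insert("n0".to_string(), vec![1.0, 0.0]);
    a.insert("n1".to_string(), vec![0.5, 0.5]);

    let mut b = HashMap::new();
    b.insert("n1".to_string(), vec![0.5_f32, 0.5]);
    b.insert("n0".to_string(), vec![1.0, 0.0]);
    b.insert("n2".to_string(), vec![0.0, 1.0]);

    // Row i refers to the same node in both builds.
    assert_eq!(stable_row_order(&a), stable_row_order(&b));
    println!("{:?}", stable_row_order(&a));
}
```

Row `i` of the encoded codes then always belongs to the `i`-th id in sorted order, whatever order the map yields its entries in.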

Recall + memory at test sizes

  • n=1k, dim=128, rerank_factor=20:
    • recall@10 = 1.000 vs brute-force (floor: 0.85)
    • codes / originals ratio = 0.176 (rotation matrix dominates at small n; asymptotically codes ≤ originals/16 + dim²·4)
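As a rough check on the bound above, the arithmetic can be worked through directly. This sketch assumes the bound is read in bytes, that originals are 4-byte f32 values, and that the `dim²·4` term is the fixed cost of a dense f32 rotation matrix; `bound_ratio` is a hypothetical name used only here.

```rust
/// Upper bound on (code bytes) / (original f32 bytes), reading the stated
/// bound `codes <= originals/16 + dim^2 * 4` in bytes.
fn bound_ratio(n: usize, dim: usize) -> f64 {
    let originals = (n * dim * 4) as f64; // n vectors of dim f32 values
    let rotation = (dim * dim * 4) as f64; // assumed dense dim x dim f32 matrix
    // Algebraically this simplifies to 1/16 + dim/n.
    (originals / 16.0 + rotation) / originals
}

fn main() {
    for (n, dim) in [(1_000_usize, 128_usize), (100_000, 768)] {
        println!("n={n}, dim={dim}: bound = {:.4}", bound_ratio(n, dim));
    }
}
```

At n=1k, dim=128 the bound works out to 1/16 + 128/1000 ≈ 0.1905, so the measured 0.176 sits under it. The fixed term contributes `dim/n`, which is 0.128 at the test size and shrinks to roughly 0.0077 at 100k × 768d.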

Acceptance test

The roadmap's gate (100k × 768d, recall@10 ≥ 0.95, DRAM ≤ 1/16 f32) is shipped as a criterion bench at benches/vector_property_index.rs defaulting to n=2k. Override:

VECTOR_PROPERTY_INDEX_N=100000 VECTOR_PROPERTY_INDEX_DIM=768 \
  cargo bench -p ruvector-graph --features rabitq
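For readers unfamiliar with env-var-tunable benches, the override pattern might look roughly like the following. This is a sketch, not the contents of `benches/vector_property_index.rs`: `parse_or` is a hypothetical helper, and the `dim` default of 128 is a placeholder since only the n=2k default is stated above.

```rust
use std::env;

/// Parse an optional override, falling back to `default` when the variable
/// is unset or not a valid unsigned integer. (Hypothetical helper.)
fn parse_or(raw: Option<String>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

fn main() {
    // n = 2_000 is the default quoted in this PR; the dim default is assumed.
    let n = parse_or(env::var("VECTOR_PROPERTY_INDEX_N").ok(), 2_000);
    let dim = parse_or(env::var("VECTOR_PROPERTY_INDEX_DIM").ok(), 128);
    println!("benchmarking n={n}, dim={dim}");
}
```

Falling back silently on a malformed value keeps the bench runnable, at the cost of hiding typos in the override; a stricter variant could panic instead.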

No abstraction yet (deliberate)

The graph crate had no quantizer trait. Kept VectorPropertyIndex concrete (wraps RabitqPlusIndex directly). Phase 1 has one quantizer; an abstraction is unjustified now and easy to add in Phase 2 if a second backend joins.

Verification

  • cargo build --workspace → clean
  • cargo build -p ruvector-graph --no-default-features → clean
  • cargo build -p ruvector-graph --features rabitq → clean
  • cargo clippy --workspace --all-targets --no-deps -- -D warnings → clean
  • cargo fmt --all --check → clean
  • cargo test -p ruvector-graph --features rabitq → 142 pass (135 lib + 7 new integration)

7 new integration tests:

  • build_and_query_returns_self_at_distance_zero
  • recall_at_10_meets_floor_vs_brute_force
  • byte_identical_query_results_for_same_seed (determinism)
  • build_skips_nodes_without_target_property
  • build_rejects_dim_mismatch
  • len_matches_indexed_node_count
  • empty_graph_yields_empty_index
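For context on what a recall-vs-brute-force check measures, here is a minimal sketch. It is not the test in `tests/vector_property_index.rs`; the function names, the squared-L2 metric, and the `&str` ids are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Exact top-k by squared L2 distance, ties broken by id for determinism.
fn brute_force_top_k<'a>(data: &'a [(&'a str, Vec<f32>)], query: &[f32], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = data
        .iter()
        .map(|(id, v)| {
            let d: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            (d, *id)
        })
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0).then_with(|| a.1.cmp(b.1)));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}

/// Fraction of the exact top-k ids that also appear in the approximate top-k.
fn recall_at_k(approx: &[&str], exact: &[&str], k: usize) -> f32 {
    let truth: HashSet<&str> = exact.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 1.0; // nothing to find
    }
    let found = approx.iter().take(k).filter(|id| truth.contains(*id)).count();
    found as f32 / truth.len() as f32
}

fn main() {
    let data = vec![
        ("a", vec![0.0_f32, 0.0]),
        ("b", vec![1.0, 0.0]),
        ("c", vec![5.0, 5.0]),
    ];
    let exact = brute_force_top_k(&data, &[0.1, 0.0], 2);
    // Pretend an approximate index returned one right and one wrong id.
    let approx = ["a", "c"];
    println!("exact = {:?}, recall@2 = {}", exact, recall_at_k(&approx, &exact, 2));
}
```

A test in this style would average `recall_at_k` over many queries and assert the mean against the floor (0.85 at the sizes reported above).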

Independent of the DiskANN stack

Branched from main (PR #380's merge 7a599b7c). No conflicts with the DiskANN PR chain (#383-#386). Different crate, different reviewer audience.

🤖 Generated with claude-flow

…de properties

Phase 1 item #2 from `docs/research/rabitq-integration/05-roadmap.md`.
Adds a vector-keyed kNN index for graph nodes via direct-embed
(Pattern 1) of `ruvector-rabitq`. Callers can now ask "find the
N node ids whose vector property is closest to query" without
standing up a separate index crate.

## Surface

```rust
let idx = VectorPropertyIndex::build(
    &graph,
    "embedding",
    VectorPropertyIndexConfig { seed: 42, rerank_factor: 20 },
)?;
let hits: Vec<(NodeId, f32)> = idx.knn(&query, k)?;
```

Behind the `rabitq` cargo feature (default-on; `--no-default-features`
keeps the graph crate buildable without ruvector-rabitq).

## Property-table shape encountered

`NodeId = String`; `GraphDB` stores `DashMap<NodeId, Node>` where
each `Node.properties: HashMap<String, PropertyValue>` and vector
properties live as `PropertyValue::FloatArray(Vec<f32>)` —
already a contiguous f32 slab. Added one new public accessor
`GraphDB::node_ids() -> Vec<NodeId>` so the index can enumerate
without becoming a friend of the DashMap.
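To make the shape concrete, here is a self-contained sketch of pulling vector properties out of such a table. The `PropertyValue` enum below is a local stand-in (only the `FloatArray` variant is taken from the description above; `Text` is a placeholder for any other variant), and `vector_property` / `collect_rows` are hypothetical helpers. The skip-and-reject behavior mirrors the test names `build_skips_nodes_without_target_property` and `build_rejects_dim_mismatch`, not the actual implementation.

```rust
use std::collections::HashMap;

// Local stand-in for the crate's property enum (assumption, see lead-in).
#[allow(dead_code)]
enum PropertyValue {
    FloatArray(Vec<f32>),
    Text(String),
}

/// Borrow the vector stored under `key`, or `None` if the property is
/// missing or is not a float array.
fn vector_property<'a>(props: &'a HashMap<String, PropertyValue>, key: &str) -> Option<&'a [f32]> {
    match props.get(key) {
        Some(PropertyValue::FloatArray(v)) => Some(v.as_slice()),
        _ => None,
    }
}

/// Collect (id, vector) rows, skipping nodes without the property and
/// rejecting the whole build if two vectors disagree on dimension.
fn collect_rows<'a>(
    nodes: &'a [(String, HashMap<String, PropertyValue>)],
    key: &str,
) -> Result<Vec<(&'a str, &'a [f32])>, String> {
    let mut rows = Vec::new();
    let mut dim: Option<usize> = None;
    for (id, props) in nodes {
        let Some(v) = vector_property(props, key) else { continue };
        match dim {
            None => dim = Some(v.len()),
            Some(d) if d != v.len() => {
                return Err(format!("dim mismatch at {id}: {} vs {d}", v.len()));
            }
            Some(_) => {}
        }
        rows.push((id.as_str(), v));
    }
    Ok(rows)
}

fn main() {
    let mut with_vec = HashMap::new();
    with_vec.insert("embedding".to_string(), PropertyValue::FloatArray(vec![0.1, 0.2]));
    let mut without_vec = HashMap::new();
    without_vec.insert("name".to_string(), PropertyValue::Text("x".to_string()));

    let nodes = vec![("n0".to_string(), with_vec), ("n1".to_string(), without_vec)];
    let rows = collect_rows(&nodes, "embedding").unwrap();
    println!("indexed {} of {} nodes", rows.len(), nodes.len());
}
```

Because `FloatArray` already holds a contiguous `Vec<f32>`, the sketch only borrows slices; no per-node copy is needed before encoding.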

## Important determinism finding

`DashMap` iteration order is **shard-dependent**: two builds in
the same process can disagree on which `NodeId` lives at row 0.
Without a fix this would silently break ADR-154's
`(seed, graph) → bit-identical codes` guarantee across runs and
across shard-count changes.

Fix: `VectorPropertyIndex::build` sorts `NodeId`s before encoding.
The cost is one O(n log n) string sort per build; the benefit is
that the same `(seed, graph)` pair always produces the same
row→NodeId mapping. Verified by `byte_identical_query_results_for_same_seed`.

## Recall + memory at the test sizes

- n=1k, dim=128, rerank_factor=20:
    recall@10 = **1.000** vs brute-force (floor: 0.85)
    codes / originals ratio = 0.176 (rotation matrix dominates at
    small n; asymptotically codes ≤ originals/16 + dim²·4)

The 1/16 contract holds asymptotically; at small n the rotation
matrix dominates, which is the published ADR-154 behavior.

## Acceptance test

The roadmap's M1 acceptance gate (100k × 768d, recall@10 ≥ 0.95,
DRAM ≤ 1/16 of f32 baseline) is shipped as a criterion bench at
`benches/vector_property_index.rs` defaulting to n=2k. Override with
`VECTOR_PROPERTY_INDEX_N=100000 VECTOR_PROPERTY_INDEX_DIM=768
cargo bench -p ruvector-graph --features rabitq` for the full scale.

## No abstraction yet

The graph crate had no quantizer trait. Kept things concrete
(`VectorPropertyIndex` wraps `RabitqPlusIndex` directly) rather
than introducing one. Phase 1 has one quantizer; an abstraction
layer is unjustified now and easy to add in Phase 2.

## Verification

  cargo build --workspace                                              → clean
  cargo build -p ruvector-graph --no-default-features                  → clean
  cargo build -p ruvector-graph --features rabitq                      → clean
  cargo clippy --workspace --all-targets --no-deps -- -D warnings      → clean
  cargo fmt --all --check                                              → clean
  cargo test -p ruvector-graph --features rabitq --lib                 → 135 pass
  cargo test -p ruvector-graph --features rabitq                       → 142 pass total
                                                                          (135 lib + 7 new
                                                                           integration)

New tests in `tests/vector_property_index.rs`:
- `build_and_query_returns_self_at_distance_zero`
- `recall_at_10_meets_floor_vs_brute_force`
- `byte_identical_query_results_for_same_seed` (determinism)
- `build_skips_nodes_without_target_property`
- `build_rejects_dim_mismatch`
- `len_matches_indexed_node_count`
- `empty_graph_yields_empty_index`

## Files

- `src/vector_property_index.rs` (~210 LoC) — new module
- `src/lib.rs` (+8) — gated `pub mod` + re-exports
- `src/graph.rs` (+8) — `node_ids()` accessor
- `src/error.rs` (+9) — `RabitqIndex(String)` variant + gated `From<RabitqError>`
- `Cargo.toml` (+5) — optional dep + `rabitq` feature, folded into `full`
- `tests/vector_property_index.rs` (+245)
- `benches/vector_property_index.rs` (+95) — env-var-tunable

Refs: `docs/research/rabitq-integration/05-roadmap.md` Phase 1 item #2,
ADR-154 (RaBitQ determinism).

Co-Authored-By: claude-flow <ruv@ruv.net>

ruvnet added a commit that referenced this pull request Apr 26, 2026

Unblocks the 7 stacked PRs (#381-#387) and turns `main`'s CI green
for the first time in days. Two issues fixed:

## Failure 1 — Security audit (was: 8 vulnerabilities)

`cargo audit` is now exit 0. 4 of the 5 critical advisories were
fixed by version bumps; only the unfixable one is ignored.

**Dep-bumped:**
- `rustls-webpki 0.101.7` + `0.103.10` → `0.103.13` via
  `cargo update -p rustls-webpki@0.103.10`. Patches:
    RUSTSEC-2026-0098 (URI name constraints)
    RUSTSEC-2026-0099 (wildcard name constraints)
    RUSTSEC-2026-0104 (CRL parsing panic)
- `idna 0.5.0` → `1.1.0` via `validator 0.18 → 0.20` in
  `examples/scipix`. Patches RUSTSEC-2024-0421 (Punycode acceptance).
- Bonus: `reqwest 0.11 → 0.12` (in `ruvector-core` + `examples/benchmarks`)
  and `hf-hub 0.3 → 0.4` (in `ruvector-core` + `ruvllm` +
  `ruvllm-cli`). Removes the entire legacy `rustls 0.21` /
  `rustls-webpki 0.101.7` subtree from the lockfile.

**Ignored** (single advisory, with rationale):
- `RUSTSEC-2023-0071` (rsa Marvin timing sidechannel) — no upstream
  fix available; we don't expose RSA decryption services. Documented
  in `.cargo/audit.toml`.

**Unmaintained warnings** (16 total — proc-macro-error, derivative,
instant, paste, bincode 1, pqcrypto-{kyber,dilithium}, rustls-pemfile 1,
rusttype, wee_alloc, number_prefix, rand_os, core2, lru, pprof, rand) —
each given a one-line justification in `.cargo/audit.toml` so CI stays
green on them while the team decides whether to chase upstream
replacements.

## Failure 2 — Tests timeout (was: 30-min job timeout cancellation)

`.github/workflows/ci.yml` `test` job is now a `matrix` with
`fail-fast: false` and `timeout-minutes: 45`. Six parallel shards
under `cargo nextest run` (installed via `taiki-e/install-action@v2`)
plus a separate `cargo test --doc` step (nextest doesn't run
doctests):

  | Shard            | Crates                                      |
  |------------------|---------------------------------------------|
  | vector-index     | rabitq, rulake, diskann, graph, gnn, cnn    |
  | rvagent          | 10 rvagent-* crates                         |
  | ruvix            | 16 ruvix-* crates                           |
  | ruqu-quantum     | 5 ruqu* crates                              |
  | ml-research      | attention, mincut, scipix, fpga-transformer,|
  |                  | sparse-inference, sparsifier, solver,       |
  |                  | graph-transformer, domain-expansion,        |
  |                  | robotics                                    |
  | core-and-rest    | --workspace minus the above                 |

`Swatinem/rust-cache@v2` is keyed per shard. Audit job switched to
`taiki-e/install-action` for `cargo-audit` (faster than
`cargo install --locked`).

## Verification

  cargo audit                                                   → exit 0
  cargo build --workspace --exclude ruvector-postgres           → clean
  cargo clippy --workspace --exclude ruvector-postgres --no-deps -- -D warnings → exit 0
  cargo fmt --all --check                                       → exit 0

## Cargo.lock churn

166-line diff, net ~120 lines removed (more deletions than
additions). Removed: `idna 0.5.0`, `rustls-webpki 0.101.7`,
`validator 0.18`, `validator_derive 0.18`, `proc-macro-error 1.0.4`.
Added: `rustls-webpki 0.103.13`, `validator 0.20`,
`proc-macro-error2`, `hf-hub 0.4.3`, `reqwest 0.12.28`. No
suspicious crates.

## Recommended merge order

1. **This PR first** — unblocks every other PR's CI.
2. After this lands and main is green, rebase the 7 open PRs
   (#381-#387) one at a time. The DiskANN stack (#383, #384, #385, #386)
   must merge in numeric order. #381 (Python SDK), #382 (research),
   #387 (graph property index) are independent and can merge in
   any order after their CI goes green on the rebase.

Co-Authored-By: claude-flow <ruv@ruv.net>