Skip to content

feat(engine): unify constraint validation across all write surfaces#314

Open
ragnorc wants to merge 1 commit into
mainfrom
branch-ops-lance-alignment
Open

feat(engine): unify constraint validation across all write surfaces#314
ragnorc wants to merge 1 commit into
mainfrom
branch-ops-lance-alignment

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

What & why

Constraint enforcement (value/range/check, enum, uniqueness, edge referential integrity, cardinality) was implemented separately in the bulk loader, the mutation executor, and the branch-merge path, and had drifted — merge validated @range/@check but not enum, and neither the mutation nor the load path enforced uniqueness against already-committed rows. This routes all three write surfaces through one catalog-derived, delta-scoped, index-backed evaluator (crate::validate) that reuses the existing leaf checks, closing the drift class by construction and making a small-delta merge's validation cost flat in graph size instead of O(V+E).

Backing issue / RFC

  • Fixes an accepted issue: Closes #
  • Implements / is an accepted RFC:
  • Trivial fast-lane — no issue/RFC required

Maintainer-internal change (part of the ongoing branch-ops / Lance-alignment work); per the template note above, the external accepted-issue/RFC link requirement does not apply to maintainers.

Checklist

  • Change is focused (one logical change: unify validation into a single evaluator)
  • Tests added/updated for behavior changes (validators.rs cross-version uniqueness + per-table overwrite; new merge_cost.rs delta-scoped cost budget; equivalence via merge_truth_table/branching)
  • Public docs updated (docs/dev/invariants.md, docs/dev/testing.md)
  • Reviewed against docs/dev/invariants.md — no Hard Invariant weakened, no deny-list item hit

Notes for reviewers

Behavior changes are all stricter, none relaxed:

  • Enum constraints are now enforced on the merge path (was a gap).
  • A write or load whose @unique value collides with an already-committed different row is now rejected (cross-version uniqueness); re-upserting an existing @key still upserts.
  • Uniqueness distinguishes a duplicate key within one input batch (two distinct records → rejected, e.g. a bulk load listing a @key twice) from the same id reappearing across batches (ordered supersession of one logical row → coalesced, e.g. a mutation insert-then-update).
  • Overwrite loads validate per-table: a touched table's committed view is its replacement image (empty), but a table absent from the batch keeps its committed rows, so an edges-only overwrite still resolves referential integrity against retained nodes.

Also removes the per-surface validation orchestration the evaluator supersedes (~600 lines) and the now-orphaned version-pinned dataset opener from the sealed storage trait (reads route through the snapshot path). Risk areas: the within-batch-vs-across-batch uniqueness split (relies on each mutation statement emitting its own batch and never repeating an id within one batch) and the index-backed committed probes (degrade to a tail scan over uncovered fragments — the same optimize cadence dependency as the rest of the engine). Full engine suite green; cargo build --workspace --locked clean.


Note

Medium Risk
Refactors integrity enforcement on all write surfaces and changes merge validation I/O; behavior is stricter and well-tested, but index-backed probes and the within-batch vs across-batch uniqueness split are subtle regression surfaces.

Overview
Introduces crate::validate, a catalog-driven evaluator that runs value/enum, uniqueness, edge RI, and @card over a per-table change-set (delta rows + deletes), probing committed state through BTREE filters instead of scanning every table.

Merge stops building a full staged merge table for validation and drops the old whole-graph scan; it builds a projected delta change-set and calls the shared evaluator against the target snapshot. Mutations defer checks to end-of-query via validate_staged_mutation / staging.to_changeset() (including LIVE HEAD cardinality on edges). Bulk load does one evaluator pass before commit, with per-table Overwrite semantics for committed view.

Removes the duplicated orchestration (~600 lines) from merge, mutation, staging (cardinality helpers), and loader, plus open_dataset_at_state from the storage trait. Behavior is stricter only: merge now enforces enums; mutation and load reject @unique collisions with existing rows ( @key re-upsert still allowed).

Adds merge_cost.rs (delta-scoped table opens) and expands validators.rs; updates docs/dev/invariants.md and docs/dev/testing.md.

Reviewed by Cursor Bugbot for commit 0ee6c41. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR moves write constraint checks into one shared validator. The main changes are:

  • A new catalog-derived validation module for value, enum, uniqueness, edge RI, and cardinality checks.
  • Merge validation now uses staged deltas plus committed-state probes instead of full merged-table scans.
  • Mutation and loader paths validate accumulated staging at the end of the write.
  • Old per-surface validation helpers and the unused version-pinned dataset opener were removed.
  • Tests and developer docs were updated for the unified validation behavior.

Confidence Score: 4/5

The changed validation flow has one contained correctness issue in edge cardinality updates.

  • Moving an existing edge to a new source can skip the old source during cardinality validation.
  • The rest of the shared validation flow matches the staged write model from the inspected paths.

crates/omnigraph/src/validate.rs

Important Files Changed

Filename Overview
crates/omnigraph/src/validate.rs Adds the shared evaluator; cardinality misses the previous source of changed edge ids.
crates/omnigraph/src/exec/merge.rs Switches merge validation from full staged tables to delta changesets and committed target probes.
crates/omnigraph/src/exec/mutation.rs Moves constructive mutation validation to one end-of-query pass over staging.
crates/omnigraph/src/loader/mod.rs Moves bulk-load validation to one end-of-load pass over staged batches.
crates/omnigraph/src/exec/staging.rs Adds ChangeSet conversion for pending batches and removes old cardinality helpers.
crates/omnigraph/src/storage_layer.rs Removes the unused version-pinned dataset opener from the sealed storage trait.
crates/omnigraph/src/table_store.rs Removes the matching TableStore version-pinned opener implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[Mutation, Load, or Merge] --> B[Build ChangeSet]
  B --> C[Derive constraints from catalog]
  C --> D[Run shared validator]
  D --> E[Value and enum checks]
  D --> F[Unique checks against delta and committed rows]
  D --> G[Edge RI checks]
  D --> H[Cardinality checks]
  H --> I[Stage and publish after validation succeeds]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
  A[Mutation, Load, or Merge] --> B[Build ChangeSet]
  B --> C[Derive constraints from catalog]
  C --> D[Run shared validator]
  D --> E[Value and enum checks]
  D --> F[Unique checks against delta and committed rows]
  D --> G[Edge RI checks]
  D --> H[Cardinality checks]
  H --> I[Stage and publish after validation succeeds]
Loading

Fix All in Claude Code

Reviews (1): Last reviewed commit: "feat(engine): unify constraint validatio..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

Context used:

  • Context used - AGENTS.md (source)
  • Context used - CLAUDE.md (source)

Constraint enforcement (value/range/check, enum, uniqueness, edge
referential integrity, cardinality) was implemented three times — once
each in the bulk loader, the mutation executor, and the branch-merge
path — and had drifted: merge validated @range/@check but not enum, and
neither the mutation nor the load path enforced cross-version uniqueness
against already-committed rows.

Introduce one catalog-derived evaluator (`crate::validate`) that all
three surfaces route through. It is delta-scoped (checks only the change
set, not the whole graph) and index-backed (probes committed state
through the @key/@unique/src/dst BTREEs instead of full-scanning every
catalog table), reusing the existing leaf checks
(validate_value_constraints, validate_enum_constraints,
composite_unique_key) so the surfaces cannot drift again. A one-row-delta
merge now opens ~3 data tables instead of ~6+, and validation cost is
flat in graph size rather than O(V+E).

Behavior changes (all stricter, none relaxed):
- Enum constraints are now enforced on the merge path (was a gap).
- A write or load whose @unique value collides with an already-committed
  different row is now rejected (cross-version uniqueness); re-upserting
  an existing @key still upserts.
- Uniqueness distinguishes a duplicate key WITHIN one input batch (two
  distinct records -> rejected, e.g. a bulk load listing a @key twice)
  from the SAME id reappearing ACROSS batches (ordered supersession of
  one logical row -> coalesced, e.g. a mutation insert-then-update).
- Overwrite loads validate per-table: a touched table's committed view is
  its replacement image (empty), but a table absent from the batch keeps
  its committed rows, so an edges-only overwrite still resolves
  referential integrity against retained nodes.

Remove the per-surface validation orchestration the evaluator supersedes,
and the now-orphaned version-pinned dataset opener from the sealed
storage trait (reads route through the snapshot path). Docs (invariants,
testing) updated; full engine suite green.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.

}
for (id, src) in &delta_edges {
per_src.entry(src.clone()).or_default().insert(id.clone());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cardinality skips deduped edge rows

High Severity

End-of-query @card checks walk every row in every pending batch without collapsing duplicate edge ids the way dedupe_merge_batches_by_id does at commit. A Merge load or multi-statement mutation that updates the same edge id twice can double-count or leave stale src counts, causing false @card failures or missed violations versus what actually publishes.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.

}
for (id, src) in &delta_edges {
per_src.entry(src.clone()).or_default().insert(id.clone());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge card misses old src

Medium Severity

Δ-scoped @card only recounts src values appearing in the merge delta or in direct edge deletes. When a merge upsert moves an existing edge to a new src, the previous source’s committed edges are never recounted, so @card min/max can be wrong after publish even though the old path scanned the full merged edge set.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.

message: format!(
"unique constraint {:?} violated by '{}' and '{}'",
columns, first_row_id, row_id
),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopt merge skips validation

High Severity

When classify_adopt returns AdoptSourceState (advances_head false), build_merge_changeset skips the table entirely, so the unified evaluator never runs value, enum, uniqueness, RI, or cardinality on the adopted snapshot. The prior merge path still scanned the full source (or target) candidate dataset for every catalog table.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.

Comment on lines +817 to +823
let mut affected: HashSet<String> = HashSet::new();
for (_, src) in &delta_edges {
affected.insert(src.clone());
}
for (_, src) in &removed_edges {
affected.insert(src.clone());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Old Edge Source Stays Unchecked

When a merge-load or mutation update keeps an edge id but changes its src, this affected set only includes the new src and explicitly deleted edge ids. The old committed src loses that edge but is never recomputed, so moving the only edge from A to B can leave A below @card(1..) while validation still passes.

Fix in Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ee6c41075

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

let projection: Vec<&str> = projection.iter().map(String::as_str).collect();
let mut change = crate::validate::TableChange::default();
match candidate {
CandidateTableState::AdoptSourceState => continue,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include pointer-switch adopts in merge validation

When a source table is adopted by pointer switch/fork, it can still carry source-side row changes even though it does not advance Lance HEAD. For example, merging main into a target branch that deleted a node while main added an edge referencing that node classifies the edge table as AdoptSourceState (source on main, target branch pointer switch); skipping it here means EdgeRi never sees the new edge, while the target node deletion is kept, so the merge can publish an orphan edge. Represent this adopt as a source-vs-target delta for validation, or otherwise validate the adopted source table against the target snapshot before publishing.

Useful? React with 👍 / 👎.

// BTREE (a non-indexed `@unique` column falls back to a scan).
let mut expr: Option<Expr> = None;
for (column, value) in columns.iter().zip(key.iter()) {
let eq = col(column.as_str()).eq(lit(value.clone()));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve typed literals for unique probes

For @unique columns that are not strings, the key values here are already stringified by composite_unique_key, but lit(value.clone()) makes the committed-state lookup compare the real Arrow column (e.g. Date32, Date64, numeric, bool) to a Utf8 literal. In accepted schemas such as due: Date @unique or numeric composite uniques, the new cross-version check can miss the existing holder or fail with a coercion error even though intra-delta uniqueness supports those scalar types. Build the filter literal from the catalog/column type rather than the display key string.

Useful? React with 👍 / 👎.

Comment on lines +841 to +842
for (id, src) in &delta_edges {
per_src.entry(src.clone()).or_default().insert(id.clone());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Deduplicate merge-load edge deltas before @card

LoadMode::Merge still commits edge batches with last-writer-wins dedupe by id, but this validator counts every raw delta edge. If a merge-load input contains the same edge id twice with different src values, the final table keeps only the last row while the @card check can count that id under multiple sources, producing spurious violations (the removed count_pending_src_with_dedupe handled this case). Normalize merge-load edge deltas by id before adding them to per_src.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant