feat(engine): unify constraint validation across all write surfaces#314
feat(engine): unify constraint validation across all write surfaces#314ragnorc wants to merge 1 commit into
Conversation
Constraint enforcement (value/range/check, enum, uniqueness, edge referential integrity, cardinality) was implemented three times — once each in the bulk loader, the mutation executor, and the branch-merge path — and had drifted: merge validated @range/@check but not enum, and neither the mutation nor the load path enforced cross-version uniqueness against already-committed rows. Introduce one catalog-derived evaluator (`crate::validate`) that all three surfaces route through. It is delta-scoped (checks only the change set, not the whole graph) and index-backed (probes committed state through the @key/@unique/src/dst BTREEs instead of full-scanning every catalog table), reusing the existing leaf checks (validate_value_constraints, validate_enum_constraints, composite_unique_key) so the surfaces cannot drift again. A one-row-delta merge now opens ~3 data tables instead of ~6+, and validation cost is flat in graph size rather than O(V+E). Behavior changes (all stricter, none relaxed): - Enum constraints are now enforced on the merge path (was a gap). - A write or load whose @unique value collides with an already-committed different row is now rejected (cross-version uniqueness); re-upserting an existing @key still upserts. - Uniqueness distinguishes a duplicate key WITHIN one input batch (two distinct records -> rejected, e.g. a bulk load listing a @key twice) from the SAME id reappearing ACROSS batches (ordered supersession of one logical row -> coalesced, e.g. a mutation insert-then-update). - Overwrite loads validate per-table: a touched table's committed view is its replacement image (empty), but a table absent from the batch keeps its committed rows, so an edges-only overwrite still resolves referential integrity against retained nodes. Remove the per-surface validation orchestration the evaluator supersedes, and the now-orphaned version-pinned dataset opener from the sealed storage trait (reads route through the snapshot path). Docs (invariants, testing) updated; full engine suite green.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.
| } | ||
| for (id, src) in &delta_edges { | ||
| per_src.entry(src.clone()).or_default().insert(id.clone()); | ||
| } |
There was a problem hiding this comment.
Cardinality skips deduped edge rows
High Severity
End-of-query @card checks walk every row in every pending batch without collapsing duplicate edge ids the way dedupe_merge_batches_by_id does at commit. A Merge load or multi-statement mutation that updates the same edge id twice can double-count or leave stale src counts, causing false @card failures or missed violations versus what actually publishes.
Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.
| } | ||
| for (id, src) in &delta_edges { | ||
| per_src.entry(src.clone()).or_default().insert(id.clone()); | ||
| } |
There was a problem hiding this comment.
Merge card misses old src
Medium Severity
Δ-scoped @card only recounts src values appearing in the merge delta or in direct edge deletes. When a merge upsert moves an existing edge to a new src, the previous source’s committed edges are never recounted, so @card min/max can be wrong after publish even though the old path scanned the full merged edge set.
Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.
| message: format!( | ||
| "unique constraint {:?} violated by '{}' and '{}'", | ||
| columns, first_row_id, row_id | ||
| ), |
There was a problem hiding this comment.
Adopt merge skips validation
High Severity
When classify_adopt returns AdoptSourceState (advances_head false), build_merge_changeset skips the table entirely, so the unified evaluator never runs value, enum, uniqueness, RI, or cardinality on the adopted snapshot. The prior merge path still scanned the full source (or target) candidate dataset for every catalog table.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 0ee6c41. Configure here.
| let mut affected: HashSet<String> = HashSet::new(); | ||
| for (_, src) in &delta_edges { | ||
| affected.insert(src.clone()); | ||
| } | ||
| for (_, src) in &removed_edges { | ||
| affected.insert(src.clone()); | ||
| } |
There was a problem hiding this comment.
Old Edge Source Stays Unchecked
When a merge-load or mutation update keeps an edge id but changes its src, this affected set only includes the new src and explicitly deleted edge ids. The old committed src loses that edge but is never recomputed, so moving the only edge from A to B can leave A below @card(1..) while validation still passes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0ee6c41075
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let projection: Vec<&str> = projection.iter().map(String::as_str).collect(); | ||
| let mut change = crate::validate::TableChange::default(); | ||
| match candidate { | ||
| CandidateTableState::AdoptSourceState => continue, |
There was a problem hiding this comment.
Include pointer-switch adopts in merge validation
When a source table is adopted by pointer switch/fork, it can still carry source-side row changes even though it does not advance Lance HEAD. For example, merging main into a target branch that deleted a node while main added an edge referencing that node classifies the edge table as AdoptSourceState (source on main, target branch pointer switch); skipping it here means EdgeRi never sees the new edge, while the target node deletion is kept, so the merge can publish an orphan edge. Represent this adopt as a source-vs-target delta for validation, or otherwise validate the adopted source table against the target snapshot before publishing.
Useful? React with 👍 / 👎.
| // BTREE (a non-indexed `@unique` column falls back to a scan). | ||
| let mut expr: Option<Expr> = None; | ||
| for (column, value) in columns.iter().zip(key.iter()) { | ||
| let eq = col(column.as_str()).eq(lit(value.clone())); |
There was a problem hiding this comment.
Preserve typed literals for unique probes
For @unique columns that are not strings, the key values here are already stringified by composite_unique_key, but lit(value.clone()) makes the committed-state lookup compare the real Arrow column (e.g. Date32, Date64, numeric, bool) to a Utf8 literal. In accepted schemas such as due: Date @unique or numeric composite uniques, the new cross-version check can miss the existing holder or fail with a coercion error even though intra-delta uniqueness supports those scalar types. Build the filter literal from the catalog/column type rather than the display key string.
Useful? React with 👍 / 👎.
| for (id, src) in &delta_edges { | ||
| per_src.entry(src.clone()).or_default().insert(id.clone()); |
There was a problem hiding this comment.
Deduplicate merge-load edge deltas before @card
LoadMode::Merge still commits edge batches with last-writer-wins dedupe by id, but this validator counts every raw delta edge. If a merge-load input contains the same edge id twice with different src values, the final table keeps only the last row while the @card check can count that id under multiple sources, producing spurious violations (the removed count_pending_src_with_dedupe handled this case). Normalize merge-load edge deltas by id before adding them to per_src.
Useful? React with 👍 / 👎.


What & why
Constraint enforcement (value/range/check, enum, uniqueness, edge referential integrity, cardinality) was implemented separately in the bulk loader, the mutation executor, and the branch-merge path, and had drifted — merge validated
@range/@checkbut not enum, and neither the mutation nor the load path enforced uniqueness against already-committed rows. This routes all three write surfaces through one catalog-derived, delta-scoped, index-backed evaluator (crate::validate) that reuses the existing leaf checks, closing the drift class by construction and making a small-delta merge's validation cost flat in graph size instead of O(V+E).Backing issue / RFC
Maintainer-internal change (part of the ongoing branch-ops / Lance-alignment work); per the template note above, the external accepted-issue/RFC link requirement does not apply to maintainers.
Checklist
validators.rscross-version uniqueness + per-table overwrite; newmerge_cost.rsdelta-scoped cost budget; equivalence viamerge_truth_table/branching)docs/dev/invariants.md,docs/dev/testing.md)Notes for reviewers
Behavior changes are all stricter, none relaxed:
@uniquevalue collides with an already-committed different row is now rejected (cross-version uniqueness); re-upserting an existing@keystill upserts.@keytwice) from the same id reappearing across batches (ordered supersession of one logical row → coalesced, e.g. a mutation insert-then-update).Overwriteloads validate per-table: a touched table's committed view is its replacement image (empty), but a table absent from the batch keeps its committed rows, so an edges-only overwrite still resolves referential integrity against retained nodes.Also removes the per-surface validation orchestration the evaluator supersedes (~600 lines) and the now-orphaned version-pinned dataset opener from the sealed storage trait (reads route through the snapshot path). Risk areas: the within-batch-vs-across-batch uniqueness split (relies on each mutation statement emitting its own batch and never repeating an id within one batch) and the index-backed committed probes (degrade to a tail scan over uncovered fragments — the same
optimizecadence dependency as the rest of the engine). Full engine suite green;cargo build --workspace --lockedclean.Note
Medium Risk
Refactors integrity enforcement on all write surfaces and changes merge validation I/O; behavior is stricter and well-tested, but index-backed probes and the within-batch vs across-batch uniqueness split are subtle regression surfaces.
Overview
Introduces
crate::validate, a catalog-driven evaluator that runs value/enum, uniqueness, edge RI, and@cardover a per-table change-set (delta rows + deletes), probing committed state through BTREE filters instead of scanning every table.Merge stops building a full staged merge table for validation and drops the old whole-graph scan; it builds a projected delta change-set and calls the shared evaluator against the target snapshot. Mutations defer checks to end-of-query via
validate_staged_mutation/staging.to_changeset()(including LIVE HEAD cardinality on edges). Bulk load does one evaluator pass before commit, with per-table Overwrite semantics for committed view.Removes the duplicated orchestration (~600 lines) from merge, mutation, staging (cardinality helpers), and loader, plus
open_dataset_at_statefrom the storage trait. Behavior is stricter only: merge now enforces enums; mutation and load reject@uniquecollisions with existing rows (@keyre-upsert still allowed).Adds
merge_cost.rs(delta-scoped table opens) and expandsvalidators.rs; updatesdocs/dev/invariants.mdanddocs/dev/testing.md.Reviewed by Cursor Bugbot for commit 0ee6c41. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR moves write constraint checks into one shared validator. The main changes are:
Confidence Score: 4/5
The changed validation flow has one contained correctness issue in edge cardinality updates.
crates/omnigraph/src/validate.rs
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Mutation, Load, or Merge] --> B[Build ChangeSet] B --> C[Derive constraints from catalog] C --> D[Run shared validator] D --> E[Value and enum checks] D --> F[Unique checks against delta and committed rows] D --> G[Edge RI checks] D --> H[Cardinality checks] H --> I[Stage and publish after validation succeeds]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[Mutation, Load, or Merge] --> B[Build ChangeSet] B --> C[Derive constraints from catalog] C --> D[Run shared validator] D --> E[Value and enum checks] D --> F[Unique checks against delta and committed rows] D --> G[Edge RI checks] D --> H[Cardinality checks] H --> I[Stage and publish after validation succeeds]Reviews (1): Last reviewed commit: "feat(engine): unify constraint validatio..." | Re-trigger Greptile
Context used: