Don't extract NULL values into a lookup row (#186)#750
Conversation
`table.extract()` built its lookup table with
`INSERT OR IGNORE ... SELECT DISTINCT <cols> FROM <table>`, which included
the all-NULL combination. That created a spurious lookup row for NULL and
pointed every NULL source row at it, instead of leaving those rows with a
NULL foreign key.
Before:
db["creatures"].extract("type")
# type lookup: [{"id": 1, "type": None}, {"id": 2, "type": "dog"}]
# creatures: Simon -> type_id=1, Natalie -> type_id=1, Cleo -> type_id=2
After:
# type lookup: [{"id": 1, "type": "dog"}]
# creatures: Simon -> type_id=None, Natalie -> type_id=None, Cleo -> type_id=1
A row whose extracted columns are entirely NULL represents "no value", so it
now keeps a NULL foreign key and no lookup row is created for it. The fix adds
a `WHERE NOT (<col> IS NULL AND ...)` guard to the lookup INSERT; the existing
`IS`-based foreign-key UPDATE then leaves those rows NULL automatically (the
subquery finds no matching lookup row).
For multi-column extracts, only the fully-NULL combination is skipped — a
partial-NULL combination (some extracted columns set, others NULL) is a
genuine distinct value and is still extracted and shared between matching rows.
Updates test_extract_works_with_null_values to assert the corrected behaviour
and adds regression tests for the single-column and multi-column cases.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ba6eb1263
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # NULL) are a genuine distinct value and are still extracted. | ||
| self.db.execute( | ||
| "INSERT OR IGNORE INTO {} ({lookup_columns}) SELECT DISTINCT {table_cols} FROM {}".format( | ||
| "INSERT OR IGNORE INTO {} ({lookup_columns}) SELECT DISTINCT {table_cols} FROM {} WHERE NOT ({all_null})".format( |
There was a problem hiding this comment.
Leave all-null rows unlinked with existing lookups
When extract() is pointed at an existing lookup table that already contains an all-NULL row (for example one created by a previous sqlite-utils version or manually), this WHERE only prevents inserting another all-NULL lookup row; the later UPDATE ... WHERE source.col IS lookup.col still matches the existing NULL row and sets the new foreign key to its id. Since reusing lookup tables is a supported path in this method, all-NULL source rows still fail to keep a NULL foreign key in that scenario.
Useful? React with 👍 / 👎.
…lookup row Follow-up to the simonw#186 fix: skipping the all-NULL combination in the lookup INSERT is not enough on its own. When extract() reuses a lookup table that already contains an all-NULL row (e.g. one written by an older sqlite-utils version or created manually), the IS-based foreign-key UPDATE would still match that row and link all-NULL source rows to it. The UPDATE now carries a trailing `WHERE NOT (<col> IS NULL AND ...)` so all-NULL source rows are never assigned a foreign key and keep the NULL that the freshly-added column starts with, regardless of the lookup table's existing contents. Adds test_extract_all_null_stays_null_with_preexisting_null_lookup_row. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Good catch on the reuse case — confirmed it reproduces: with a lookup table that already contains an all-NULL row, the Pushed 89fe8bd: the UPDATE now carries a trailing |
Refs #186.
Problem
table.extract()populated its lookup table with:That
SELECT DISTINCTincludes the all-NULL combination, so a spurious lookup row is created for NULL and every NULL source row is pointed at it — instead of keeping a NULL foreign key.Before
After
Fix
A row whose extracted columns are entirely NULL represents "no value", so it should keep a NULL foreign key and not get a lookup row. The lookup
INSERTnow has aWHERE NOT (<col> IS NULL AND ...)guard, and the existingIS-based foreign-keyUPDATEthen leaves those rows NULL automatically (its subquery finds no matching lookup row).For multi-column extracts, only the fully-NULL combination is skipped — a partial-NULL combination (some extracted columns set, others NULL) is a genuine distinct value and is still extracted and shared between matching rows.
Tests
test_extract_works_with_null_valuesto assert the corrected behaviour (NULL stays NULL, no NULL lookup row).test_extract_does_not_create_lookup_row_for_all_null(single column).test_extract_multi_column_keeps_partial_null_but_not_all_null(multi-column partial vs all-NULL).📚 Documentation preview 📚: https://sqlite-utils--750.org.readthedocs.build/en/750/