Skip to content

Don't extract NULL values into a lookup row (#186)#750

Open
JOhnsonKC201 wants to merge 2 commits into
simonw:mainfrom
JOhnsonKC201:fix/extract-null-values-186
Open

Don't extract NULL values into a lookup row (#186)#750
JOhnsonKC201 wants to merge 2 commits into
simonw:mainfrom
JOhnsonKC201:fix/extract-null-values-186

Conversation

@JOhnsonKC201

@JOhnsonKC201 JOhnsonKC201 commented Jun 7, 2026

Copy link
Copy Markdown

Refs #186.

Problem

table.extract() populated its lookup table with:

INSERT OR IGNORE INTO lookup (cols) SELECT DISTINCT cols FROM source

That SELECT DISTINCT includes the all-NULL combination, so a spurious lookup row is created for NULL and every NULL source row is pointed at it — instead of keeping a NULL foreign key.

db["creatures"].insert_all([
    {"id": 1, "name": "Simon", "type": None},
    {"id": 2, "name": "Natalie", "type": None},
    {"id": 3, "name": "Cleo", "type": "dog"}], pk="id")
db["creatures"].extract("type")

Before

type:      [{"id": 1, "type": None}, {"id": 2, "type": "dog"}]
creatures: Simon -> type_id=1, Natalie -> type_id=1, Cleo -> type_id=2

After

type:      [{"id": 1, "type": "dog"}]
creatures: Simon -> type_id=None, Natalie -> type_id=None, Cleo -> type_id=1

Fix

A row whose extracted columns are entirely NULL represents "no value", so it should keep a NULL foreign key and not get a lookup row. The lookup INSERT now has a WHERE NOT (<col> IS NULL AND ...) guard, and the existing IS-based foreign-key UPDATE then leaves those rows NULL automatically (its subquery finds no matching lookup row).

For multi-column extracts, only the fully-NULL combination is skipped — a partial-NULL combination (some extracted columns set, others NULL) is a genuine distinct value and is still extracted and shared between matching rows.

Tests

  • Updated test_extract_works_with_null_values to assert the corrected behaviour (NULL stays NULL, no NULL lookup row).
  • Added test_extract_does_not_create_lookup_row_for_all_null (single column).
  • Added test_extract_multi_column_keeps_partial_null_but_not_all_null (multi-column partial vs all-NULL).
  • Full suite: 1041 passed, 17 skipped locally.

📚 Documentation preview 📚: https://sqlite-utils--750.org.readthedocs.build/en/750/

`table.extract()` built its lookup table with
`INSERT OR IGNORE ... SELECT DISTINCT <cols> FROM <table>`, which included
the all-NULL combination. That created a spurious lookup row for NULL and
pointed every NULL source row at it, instead of leaving those rows with a
NULL foreign key.

Before:
    db["creatures"].extract("type")
    # type lookup: [{"id": 1, "type": None}, {"id": 2, "type": "dog"}]
    # creatures:   Simon -> type_id=1, Natalie -> type_id=1, Cleo -> type_id=2

After:
    # type lookup: [{"id": 1, "type": "dog"}]
    # creatures:   Simon -> type_id=None, Natalie -> type_id=None, Cleo -> type_id=1

A row whose extracted columns are entirely NULL represents "no value", so it
now keeps a NULL foreign key and no lookup row is created for it. The fix adds
a `WHERE NOT (<col> IS NULL AND ...)` guard to the lookup INSERT; the existing
`IS`-based foreign-key UPDATE then leaves those rows NULL automatically (the
subquery finds no matching lookup row).

For multi-column extracts, only the fully-NULL combination is skipped — a
partial-NULL combination (some extracted columns set, others NULL) is a
genuine distinct value and is still extracted and shared between matching rows.

Updates test_extract_works_with_null_values to assert the corrected behaviour
and adds regression tests for the single-column and multi-column cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ba6eb1263

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread sqlite_utils/db.py
# NULL) are a genuine distinct value and are still extracted.
self.db.execute(
"INSERT OR IGNORE INTO {} ({lookup_columns}) SELECT DISTINCT {table_cols} FROM {}".format(
"INSERT OR IGNORE INTO {} ({lookup_columns}) SELECT DISTINCT {table_cols} FROM {} WHERE NOT ({all_null})".format(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Leave all-null rows unlinked with existing lookups

When extract() is pointed at an existing lookup table that already contains an all-NULL row (for example one created by a previous sqlite-utils version or manually), this WHERE only prevents inserting another all-NULL lookup row; the later UPDATE ... WHERE source.col IS lookup.col still matches the existing NULL row and sets the new foreign key to its id. Since reusing lookup tables is a supported path in this method, all-NULL source rows still fail to keep a NULL foreign key in that scenario.

Useful? React with 👍 / 👎.

…lookup row

Follow-up to the simonw#186 fix: skipping the all-NULL combination in the lookup
INSERT is not enough on its own. When extract() reuses a lookup table that
already contains an all-NULL row (e.g. one written by an older sqlite-utils
version or created manually), the IS-based foreign-key UPDATE would still
match that row and link all-NULL source rows to it.

The UPDATE now carries a trailing `WHERE NOT (<col> IS NULL AND ...)` so
all-NULL source rows are never assigned a foreign key and keep the NULL that
the freshly-added column starts with, regardless of the lookup table's
existing contents.

Adds test_extract_all_null_stays_null_with_preexisting_null_lookup_row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JOhnsonKC201

Copy link
Copy Markdown
Author

Good catch on the reuse case — confirmed it reproduces: with a lookup table that already contains an all-NULL row, the IS-based foreign-key UPDATE still linked all-NULL source rows to it.

Pushed 89fe8bd: the UPDATE now carries a trailing WHERE NOT (<col> IS NULL AND ...), so all-NULL source rows keep a NULL foreign key regardless of the lookup table's existing contents. Added test_extract_all_null_stays_null_with_preexisting_null_lookup_row to cover it. Full suite still green (1042 passed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant