Expand SQLite3 data validation by PimSanders · Pull Request #23 · fox-it/dissect.database

PimSanders · 2026-02-02T15:30:23Z

This PR close #16 by expanding the data validation capabilities in SQLite3.

The SQLite3 WAL file can store multiple versions of the same frame, when reading only valid frames should be returned. The docs define a valid frame as follows:

A frame is considered valid if and only if the following conditions are true:

The salt-1 and salt-2 values in the frame-header match salt values in the wal-header

The checksum values in the final 8 bytes of the frame-header exactly match the checksum computed consecutively on the first 24 bytes of the WAL header and the first 8 bytes and the content of all frames up to and including the current frame.

The first check was already implemented, I have interpreted the second check as:

The checksum values in the final 8 bytes of the frame-header (checksum-1 and checksum-2) exactly match the computed checksum over:

the first 24 bytes of the WAL header

the first 8 bytes of each frame header (up to and including this frame)

the page data of each frame (up to and including this frame)

When initializing a database the option validate_checksum can be passed to use the new validation. I have chosen to only calculate the salts by default (just like before) as this will probably be good enough, and a lot faster. See the example below for the time impact:

In [1]: %timeit -n10 list(list(sqlite3.SQLite3(Path("./big.sqlite"), Path("./big.sqlite-wal"), validate_checksum=False).tables())[0].rows())
33 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %timeit -n10 list(list(sqlite3.SQLite3(Path("./big.sqlite"), Path("./big.sqlite-wal"), validate_checksum=True).tables())[0].rows())
1.05 s ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

codecov · 2026-02-02T15:33:06Z

Codecov Report

❌ Patch coverage is 0% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (62f6301) to head (a335ccc).

Files with missing lines	Patch %	Lines
dissect/database/sqlite3/wal.py	0.00%	26 Missing ⚠️
dissect/database/sqlite3/sqlite3.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff          @@
##            main     #23   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files        152     152           
  Lines       4716    4738   +22     
=====================================
- Misses      4716    4738   +22

Flag	Coverage Δ
unittests	`0.00% <0.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codspeed-hq · 2026-02-02T15:48:05Z

Merging this PR will not alter performance

🎉 Hooray! `pytest-codspeed` just leveled up to 5.0.3!

A heads-up, this is a breaking change and it might affect your current performance baseline a bit. But here's the exciting part - it's packed with new, cool features and promises improved result stability 🥳!
Curious about what's new? Visit our releases page to delve into all the awesome details about this new version.

✅ 6 untouched benchmarks
🆕 2 new benchmarks

Performance Changes

	Benchmark	`BASE`	`HEAD`	Efficiency
🆕	`test_benchmark_wal_checksum_validation[False]`	N/A	164.5 ms	N/A
🆕	`test_benchmark_wal_checksum_validation[True]`	N/A	9.1 s	N/A

_{Comparing PimSanders:improvement/expand-wal-validation (a335ccc) with main (62f6301)}

Schamper

Maybe add a benchmark test too? I'll look at the actual checksum checking part later when I have a bit more time.

PimSanders · 2026-02-05T19:13:03Z

Take your time, I don't think I will be doing a whole lot of Dissect dev to coming weeks ...

Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>

Schamper

Can you add a benchmark test too, so that we can track future changes to this algorithm?

Schamper · 2026-02-19T08:33:52Z

+        first_frame_offset = len(c_sqlite3.wal_header)
+        offset = first_frame_offset
+
+        while offset <= self.offset:


It seems wasteful to "throw away" the results for every frame we pass, while we may use it in the next frame checksum calculation. But I can't think of a super nice way to keep it. Caching on the Frame object is a bit pointless since it's relatively short-lived. It's LRU cached, but then you might still lose cached checksum information and have to re-checksum half the WAL at some point. How large can WAL logs become? Otherwise we might be able to cache seeds for a given offset in the WAL object

Do you know exactly how the checksumming works if at any point in the middle of the WAL a checksum fails? You'd think that everything after it can never have a matching checksum again, unless future frames just ignore this fact and "checksum" the bad data as part of their checksummed data?

If the former is true, might it be possible to just store the "highest offset" that we verified a good checksum of? Anything that is before that offset is an automatic return True, and anything that comes after that offset can just continue calculating from that offset. If at some point a checksum no longer matches, maybe a boolean can indicate that this is the final highest offset with a valid checksum, and all future checksum checks automatically just become an offset comparison.

Ooh I like the way you're thinking, definitely going to look into this. It seems like a good way to significantly reduce the time it takes to checksum.

During testing I've chosen to save all the intermediate seeds in a dict in the WAL class. Turns out that the default max WAL size has no limit (journal_size_limit=-1) so that approach will definitely give memory issues [1].

Results with intermediate seed saves:

------------------------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- test_benchmark_wal_checksum_validation[False] (NOW) 79.1537 (1.0) 138.1534 (1.0) 93.6045 (1.0) 22.1834 (1.77) 81.8246 (1.0) 14.6140 (1.14) 2;2 10.6833 (1.0) 11 1 test_benchmark_wal_checksum_validation[False] (0009_a335ccc) 83.3417 (1.05) 149.2447 (1.08) 99.7729 (1.07) 21.8268 (1.75) 89.2470 (1.09) 16.5873 (1.29) 2;2 10.0228 (0.94) 12 1 test_benchmark_wal_checksum_validation[True] (NOW) 173.1041 (2.19) 206.6635 (1.50) 183.2272 (1.96) 12.4998 (1.0) 179.1200 (2.19) 12.8421 (1.0) 1;1 5.4577 (0.51) 6 1 test_benchmark_wal_checksum_validation[True] (0009_a335ccc) 1,306.3053 (16.50) 1,384.3105 (10.02) 1,345.4508 (14.37) 29.8012 (2.38) 1,342.6102 (16.41) 43.1484 (3.36) 2;0 0.7432 (0.07) 5 1 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

PimSanders/dissect.database@improvement/expand-wal-validation...improvement/expand-wal-validation-intermediate-seed

Now I just need to find out if it also works on failed checksummed data.

[1] https://sqlite.org/pragma.html#pragma_journal_size_limit

Results with only storing the highest valid offset:

------------------------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- test_benchmark_wal_checksum_validation[False] (NOW) 77.9749 (1.0) 141.5346 (1.0) 89.6146 (1.0) 21.0539 (1.14) 79.8784 (1.0) 9.0688 (2.11) 2;2 11.1589 (1.0) 13 1 test_benchmark_wal_checksum_validation[False] (0009_a335ccc) 83.3417 (1.07) 149.2447 (1.05) 99.7729 (1.11) 21.8268 (1.18) 89.2470 (1.12) 16.5873 (3.86) 2;2 10.0228 (0.90) 12 1 test_benchmark_wal_checksum_validation[True] (NOW) 85.5265 (1.10) 149.8305 (1.06) 102.9746 (1.15) 18.4381 (1.0) 97.8143 (1.22) 4.3002 (1.0) 2;4 9.7111 (0.87) 11 1 test_benchmark_wal_checksum_validation[True] (0009_a335ccc) 1,306.3053 (16.75) 1,384.3105 (9.78) 1,345.4508 (15.01) 29.8012 (1.62) 1,342.6102 (16.81) 43.1484 (10.03) 2;0 0.7432 (0.07) 5 1 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

PimSanders/dissect.database@improvement/expand-wal-validation...improvement/expand-wal-validation-last-seed

If I read this part of the source correctly, I would say that your suggested approach (saving the highes valid offset) exactly matches the way SQLite does it [1]:

/* ** The following object holds a copy of the wal-index header content. ** ** The actual header in the wal-index consists of two copies of this ** object followed by one instance of the WalCkptInfo object. ** For all versions of SQLite through 3.10.0 and probably beyond, ** the locking bytes (WalCkptInfo.aLock) start at offset 120 and ** the total header size is 136 bytes. ** ** The szPage value can be any power of 2 between 512 and 32768, inclusive. ** Or it can be 1 to represent a 65536-byte page. The latter case was ** added in 3.7.1 when support for 64K pages was added. */ struct WalIndexHdr { u32 iVersion; /* Wal-index version */ u32 unused; /* Unused (padding) field */ u32 iChange; /* Counter incremented each transaction */ u8 isInit; /* 1 when initialized */ u8 bigEndCksum; /* True if checksums in WAL are big-endian */ u16 szPage; /* Database page size in bytes. 1==64K */ u32 mxFrame; /* Index of last valid frame in the WAL */ u32 nPage; /* Size of database in pages */ u32 aFrameCksum[2]; /* Checksum of last frame in log */ u32 aSalt[2]; /* Two salt values copied from WAL header */ u32 aCksum[2]; /* Checksum over all prior fields */ };

So I think that means that if a checksum were to get corrupted, later checksums just ignore it and continue like nothing happened.

[1] https://github.com/sqlite/sqlite/blob/master/src/wal.c#L308-L333

Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>

Expand WAL validation

0e3eeb6

Schamper requested changes Feb 3, 2026

View reviewed changes

Comment thread dissect/database/sqlite3/sqlite3.py Outdated

Comment thread dissect/database/sqlite3/wal.py Outdated

Comment thread dissect/database/sqlite3/wal.py Outdated

Comment thread dissect/database/sqlite3/wal.py Outdated

PimSanders and others added 2 commits February 18, 2026 21:28

Apply suggestions from code review

e62cf3c

Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>

Apply suggestions from code review

3a99279

Schamper requested changes Feb 18, 2026

View reviewed changes

PimSanders and others added 3 commits February 19, 2026 07:33

Apply suggestions from code review

64470dd

Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>

Apply suggestions from code review

4572811

Fix linter

f4b6ffb

Schamper requested changes Feb 19, 2026

View reviewed changes

Comment thread dissect/database/sqlite3/wal.py Outdated

Schamper reviewed Feb 19, 2026

View reviewed changes

PimSanders and others added 6 commits May 27, 2026 20:09

Merge branch 'main' into improvement/expand-wal-validation

8333edb

Add benchmark

20ffbfb

Fix linter again, as usual

5d0cbc7

Remove commented line

c70d440

Cache initial checksum from WAL header

ba48643

Update dissect/database/sqlite3/wal.py

a335ccc

Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>

Conversation

PimSanders commented Feb 2, 2026

Uh oh!

codecov Bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

codspeed-hq Bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

🎉 Hooray! pytest-codspeed just leveled up to 5.0.3!

Performance Changes

Uh oh!

Schamper left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

PimSanders commented Feb 5, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Schamper left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Schamper Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

PimSanders Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PimSanders May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PimSanders May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PimSanders May 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Feb 2, 2026 •

edited

Loading

codspeed-hq Bot commented Feb 2, 2026 •

edited

Loading

🎉 Hooray! `pytest-codspeed` just leveled up to 5.0.3!

PimSanders Feb 19, 2026 •

edited

Loading

PimSanders May 29, 2026 •

edited

Loading

PimSanders May 31, 2026 •

edited

Loading