buzhash64: add FastCDC-style normalized chunking#9823

Open
ThomasWaldmann wants to merge 1 commit into
borgbackup:masterfrom
ThomasWaldmann:buzhash64-normalized-chunking

@ThomasWaldmann

Member

Normalized chunking switches between a stricter and a looser cut mask around the target chunk size: the stricter mask applies below the transition point, the looser one above it. This greatly tightens the chunk-size distribution (coefficient of variation ~0.9 -> ~0.3 in tests) and removes the dedup-hostile max-size-clamped chunks, without hurting deduplication.
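As a rough illustration of the two-mask idea (not the PR's actual code), the cut decision could be sketched like this. The function name `find_cut`, the `(hash & mask) == 0` cut condition, the contiguous low-bit masks and the precomputed per-position hash values are all assumptions made for the sketch.

```python
import random


def find_cut(hashes, chunk_min, normal_size, chunk_max, mask_strict, mask_loose):
    """Return the length of the next chunk.

    hashes[i] is assumed to be the rolling-hash value of the window ending at byte i.
    """
    limit = min(len(hashes), chunk_max)
    for pos in range(chunk_min, limit):
        # Strict mask (more bits) before normal_size, loose mask (fewer bits) after it.
        mask = mask_strict if pos < normal_size else mask_loose
        if (hashes[pos] & mask) == 0:
            return pos + 1  # cut right after this byte
    return limit  # no cut found: clamp at chunk_max (or at the end of the data)


# Toy parameters, far smaller than real ones, so the demo runs instantly.
random.seed(0)
hashes = [random.getrandbits(64) for _ in range(4096)]
bits, level = 8, 2
strict = (1 << (bits + level)) - 1
loose = (1 << (bits - level)) - 1
single = (1 << bits) - 1

normalized_len = find_cut(hashes, 64, 256, 2048, strict, loose)
# With both masks equal the function degenerates to a single-mask chunker.
single_len = find_cut(hashes, 64, 256, 2048, single, single)
print(normalized_len, single_len)
```

Because the strict mask makes early cuts rare and the loose mask makes late cuts likely, chunk lengths cluster around `normal_size` instead of spreading out exponentially.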

chunker-params for buzhash64 gains a required 6th field, nc_level:

buzhash64,chunk_min,chunk_max,chunk_mask,window_size,nc_level

Use nc_level=2 for the new default and nc_level=0 to disable normalized chunking (behavior is then byte-identical to the previous single-mask chunker).
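To make the six-field format concrete, here is a minimal parsing sketch. `Buzhash64Params`, `parse_params` and `masks` are hypothetical names, not borg's API, and the mask derivation (nc_level bits added to the strict mask and removed from the loose mask) is the usual FastCDC convention, assumed here rather than taken from the PR.

```python
from typing import NamedTuple


class Buzhash64Params(NamedTuple):
    chunk_min_exp: int   # minimum chunk size is 2**chunk_min_exp bytes
    chunk_max_exp: int   # maximum chunk size is 2**chunk_max_exp bytes
    mask_bits: int       # number of hash bits tested by the plain cut mask
    window_size: int     # rolling-hash window in bytes
    nc_level: int        # normalized-chunking level, 0 disables it


def parse_params(spec: str) -> Buzhash64Params:
    """Parse 'buzhash64,chunk_min,chunk_max,chunk_mask,window_size,nc_level'."""
    algo, *fields = spec.split(",")
    if algo != "buzhash64" or len(fields) != 5:
        raise ValueError("expected 'buzhash64' followed by 5 integer fields")
    return Buzhash64Params(*(int(f) for f in fields))


def masks(p: Buzhash64Params):
    # Assumed convention: strict mask gains nc_level bits, loose mask loses them.
    strict = (1 << (p.mask_bits + p.nc_level)) - 1
    loose = (1 << (p.mask_bits - p.nc_level)) - 1
    return strict, loose


params = parse_params("buzhash64,19,23,21,4095,2")
strict, loose = masks(params)
print(strict.bit_length(), loose.bit_length())
```

Under this convention nc_level=0 yields two identical masks, which matches the statement that level 0 behaves like the single-mask chunker.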

buzhash (32-bit) is untouched and stays bit-compatible with borg 1.x.

The mask transition point (normal_size) defaults to a principled formula (target minus the expected loose-phase tail) so the mean stays near the target; it can be tuned via the normal_size constructor arg.
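One plausible reading of "target minus the expected loose-phase tail" is sketched below. It assumes the target is 2**mask_bits and that the loose mask tests mask_bits - nc_level bits, so the expected distance to a cut in the loose phase is 2**(mask_bits - nc_level). The helper name `default_normal_size` is hypothetical and the PR's actual formula may differ.

```python
def default_normal_size(mask_bits: int, nc_level: int) -> int:
    """Assumed default transition point: target minus expected loose-phase tail."""
    target = 1 << mask_bits                    # mean cut distance for the plain mask
    loose_tail = 1 << (mask_bits - nc_level)   # mean cut distance under the loose mask
    return target - loose_tail


normal_size = default_normal_size(21, 2)
print(normal_size / 2**20)  # 1.5 (MiB) for the default mask_bits=21, nc_level=2
```

With nc_level=0 this reading gives a transition point of 0, so the single (loose equals plain) mask applies from the start, consistent with level 0 disabling normalization.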

scripts/chunker_bench.py: evidence harness used to measure chunk-size distribution, dedup ratio, throughput and shift-resilience.

Measurements (before = nc_level 0, after = nc_level 2; both at the default params buzhash64,19,23,21,4095; measured with scripts/chunker_bench.py):

5 GiB of incompressible data (~2000-2700 chunks, statistically stable):

before: CV 0.739, 49 max-size-clamped (8 MiB) chunks, 953 MB/s
after: CV 0.311, 0 max-size-clamped chunks, 1024 MB/s
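The CV and clamped-chunk figures can be sanity-checked with a small probabilistic model rather than real data. This sketch is not scripts/chunker_bench.py: it assumes every byte position cuts independently with probability 2**-mask_bits (uniform hash bits), FastCDC-style strict/loose masks, and a transition point of 1.5 MiB for nc_level=2. All function names are hypothetical.

```python
import math
import random
import statistics

MiB = 1 << 20
CHUNK_MIN, CHUNK_MAX, MASK_BITS = 1 << 19, 1 << 23, 21


def geometric(rng, p):
    """Number of positions up to and including the first cut (probability p each)."""
    return int(math.log(1.0 - rng.random()) / math.log1p(-p)) + 1


def sample_size(rng, level, normal_size):
    # Strict phase starts right after the minimum chunk size.
    size = CHUNK_MIN + geometric(rng, 2.0 ** -(MASK_BITS + level))
    if size > normal_size:
        # No cut in the strict phase: restart from the switch point with the loose mask.
        size = max(CHUNK_MIN, normal_size) + geometric(rng, 2.0 ** -(MASK_BITS - level))
    return min(size, CHUNK_MAX)  # clamp at the maximum chunk size


def summarize(level, normal_size, n=20000, seed=1):
    rng = random.Random(seed)
    sizes = [sample_size(rng, level, normal_size) for _ in range(n)]
    cv = statistics.pstdev(sizes) / statistics.fmean(sizes)
    clamped = sum(s == CHUNK_MAX for s in sizes)
    return cv, clamped


before = summarize(level=0, normal_size=0)
after = summarize(level=2, normal_size=3 * MiB // 2)
print(f"before: CV {before[0]:.3f}, {before[1]} clamped of 20000")
print(f"after:  CV {after[0]:.3f}, {after[1]} clamped of 20000")
```

In this model the CV comes out close to the measured values for both settings, which suggests the improvement follows from the mask switch itself rather than from properties of the test data.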

Re-backup of a 2.5 GiB file after a few scattered single-byte edits (deduplication ratio; 0.5 = v2 fully deduplicated against v1, lower is better):

64 edits: before 0.5424 -> after 0.5235
320 edits: before 0.6791 -> after 0.6142

Normalized chunking deduplicates better after edits: removing the max-size-clamped chunks means a single-byte change invalidates much less data (about 36% less dedup overhead at 320 edits, taking overhead as the ratio's excess over the ideal 0.5). Throughput was also consistently higher with nc_level=2 at this scale.
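The 36% figure can be reproduced from the ratios above, assuming "dedup overhead" means how far the ratio sits above the ideal 0.5. The helper name is hypothetical.

```python
IDEAL = 0.5  # v2 fully deduplicated against v1

# edits -> (ratio before, ratio after), copied from the measurements above
results = {64: (0.5424, 0.5235), 320: (0.6791, 0.6142)}


def overhead_reduction(before, after, ideal=IDEAL):
    """Relative reduction of the ratio's excess over the ideal value."""
    return 1.0 - (after - ideal) / (before - ideal)


for edits, (before, after) in results.items():
    print(f"{edits} edits: {overhead_reduction(before, after):.1%} less overhead")
# 64 edits: 44.6% less overhead
# 320 edits: 36.2% less overhead
```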

@codecov

codecov Bot commented Jun 27, 2026


Codecov Report

❌ Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.17%. Comparing base (106dfba) to head (0e3876d).
⚠️ Report is 10 commits behind head on master.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/borg/helpers/parseformat.py | 66.66% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9823      +/-   ##
==========================================
+ Coverage   84.94%   85.17%   +0.23%     
==========================================
  Files          92       92              
  Lines       15291    15307      +16     
  Branches     2296     2301       +5     
==========================================
+ Hits        12989    13038      +49     
+ Misses       1611     1579      -32     
+ Partials      691      690       -1     


The commit also fixes a bug in the mask computation: it must use 1ULL instead of
1 so that the shift is evaluated as a uint64, not as a 32-bit int.
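The effect of the narrow shift can be emulated in Python, with caveats. Python integers are unbounded, so the sketch truncates the shift result to 32 bits by hand. In C, shifting a 32-bit int by 32 or more is undefined behavior, so truncation is only one possible outcome. The mask form `(1 << bits) - 1` and both function names are assumptions.

```python
U32 = 0xFFFFFFFF
U64 = 0xFFFFFFFFFFFFFFFF


def mask_with_32bit_shift(bits):
    """Emulate (1 << bits) - 1 when the shift result only keeps 32 bits."""
    shifted = (1 << bits) & U32   # high bits are lost, as in a 32-bit int
    return (shifted - 1) & U64    # result is then widened to 64 bits


def mask_with_64bit_shift(bits):
    """Emulate (1ULL << bits) - 1, computed in 64 bits throughout."""
    shifted = (1 << bits) & U64
    return (shifted - 1) & U64


for bits in (21, 40):
    print(bits, hex(mask_with_32bit_shift(bits)), hex(mask_with_64bit_shift(bits)))
```

For small bit counts both variants agree. Once the bit count reaches 32, the emulated narrow shift yields an all-ones mask instead of the intended one, so the two variants diverge only for wide masks.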

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ThomasWaldmann ThomasWaldmann force-pushed the buzhash64-normalized-chunking branch from a5d995d to 0e3876d Compare June 28, 2026 10:02