buzhash64: add FastCDC-style normalized chunking by ThomasWaldmann · Pull Request #9823 · borgbackup/borg

ThomasWaldmann · 2026-06-27T18:42:04Z

Normalized chunking uses a stricter cut mask before a transition point near the target chunk size and a looser cut mask after it. This greatly tightens the chunk-size distribution (coefficient of variation ~0.9 -> ~0.3 in tests) and removes the dedup-hostile max-size-clamped chunks, without hurting deduplication.
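To make the two-mask idea concrete, here is a minimal Python sketch. It is not the code from this PR: it uses a toy Gear-style rolling hash instead of buzhash64, and it assumes the FastCDC convention that the strict mask has `mask_bits + nc_level` bits and the loose mask has `mask_bits - nc_level` bits. The names `make_table`, `cut_point` and `chunk_sizes` are hypothetical.

```python
import random

MASK64 = (1 << 64) - 1


def make_table(seed: int = 0) -> list[int]:
    """256 pseudo-random 64-bit values, one per byte value (toy Gear-style table)."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]


def cut_point(data, table, min_size, max_size, normal_size, mask_bits, nc_level):
    """Return the length of the first chunk of `data` (bytes or memoryview)."""
    # Assumption (FastCDC convention): nc_level bits are added to / removed from the mask.
    strict_mask = (1 << (mask_bits + nc_level)) - 1  # cuts are rarer before normal_size
    loose_mask = (1 << (mask_bits - nc_level)) - 1   # cuts are more likely after it
    limit = min(len(data), max_size)
    h = 0
    for i in range(limit):
        h = ((h << 1) + table[data[i]]) & MASK64  # toy rolling hash, NOT buzhash64
        size = i + 1
        if size < min_size:
            continue  # never cut below the minimum chunk size
        mask = strict_mask if size < normal_size else loose_mask
        if h & mask == 0:
            return size
    return limit  # no cut found: end of data, or clamped at max_size


def chunk_sizes(data: bytes, **params) -> list[int]:
    """Split `data` repeatedly with cut_point and return the chunk sizes."""
    view = memoryview(data)
    sizes, pos = [], 0
    while pos < len(view):
        n = cut_point(view[pos:], **params)
        sizes.append(n)
        pos += n
    return sizes


if __name__ == "__main__":
    rng = random.Random(42)
    data = bytes(rng.getrandbits(8) for _ in range(200_000))
    for nc in (0, 2):
        sizes = chunk_sizes(data, table=make_table(), min_size=256, max_size=4096,
                            normal_size=768, mask_bits=10, nc_level=nc)
        print(nc, len(sizes), max(sizes))
```

With `nc_level=0` both masks are identical, so the sketch degenerates to a single-mask chunker, which mirrors the compatibility behavior described for this PR.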

chunker-params for buzhash64 gains a required 6th field, nc_level:

buzhash64,chunk_min,chunk_max,chunk_mask,window_size,nc_level

nc_level=2 is the new default; nc_level=0 disables normalized chunking, in which case behavior is byte-identical to the previous single-mask chunker.
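As an illustration of the 6-field format, the following standalone sketch splits such a spec string into named fields. It is not borg's parser (that lives in src/borg/helpers/parseformat.py); `Buzhash64Params` and `parse_buzhash64_params` are hypothetical names. The size fields are treated as powers of two, consistent with chunk_max=23 corresponding to the 8 MiB clamp mentioned in the measurements.

```python
from typing import NamedTuple


class Buzhash64Params(NamedTuple):
    chunk_min: int    # exponent: minimum chunk size is 2**chunk_min bytes
    chunk_max: int    # exponent: maximum chunk size is 2**chunk_max bytes
    chunk_mask: int   # number of mask bits: target size is about 2**chunk_mask bytes
    window_size: int  # rolling hash window size in bytes
    nc_level: int     # normalized chunking level, 0 disables it


def parse_buzhash64_params(spec: str) -> Buzhash64Params:
    """Parse 'buzhash64,chunk_min,chunk_max,chunk_mask,window_size,nc_level'."""
    fields = spec.split(",")
    if fields[0] != "buzhash64":
        raise ValueError(f"not a buzhash64 spec: {spec!r}")
    if len(fields) != 6:
        # nc_level is required, so the old 5-field form is rejected here
        raise ValueError("expected buzhash64 plus 5 numeric fields, including nc_level")
    return Buzhash64Params(*(int(f) for f in fields[1:]))


if __name__ == "__main__":
    p = parse_buzhash64_params("buzhash64,19,23,21,4095,2")
    print(p, 1 << p.chunk_max)
```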

buzhash (32bit) is untouched and stays bit-compatible with borg 1.x.

The mask transition point (normal_size) defaults to a principled formula (target minus the expected loose-phase tail) so the mean stays near the target; it can be tuned via the normal_size constructor arg.
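The exact formula is not spelled out here, so the following is only one plausible reading of "target minus the expected loose-phase tail", not the code from this PR. It assumes the target is `2**chunk_mask` bytes and that the loose mask has `chunk_mask - nc_level` bits, so a cut in the loose phase is expected after about `2**(chunk_mask - nc_level)` bytes. `default_normal_size` is a hypothetical name.

```python
def default_normal_size(chunk_mask: int, nc_level: int) -> int:
    """One possible reading of the default transition point (an assumption)."""
    target = 1 << chunk_mask                    # assumed target chunk size in bytes
    loose_tail = 1 << (chunk_mask - nc_level)   # assumed expected bytes until a loose-mask cut
    # With nc_level=0 this yields 0, which is harmless: both masks are then the same.
    return target - loose_tail


if __name__ == "__main__":
    # Default-like parameters: chunk_mask=21, nc_level=2
    print(default_normal_size(21, 2))
```

Under these assumptions the default parameters would switch masks at 1.5 MiB for a 2 MiB target. The actual default may differ, for example by also accounting for chunk_min.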

scripts/chunker_bench.py: evidence harness used to measure chunk-size distribution, dedup ratio, throughput and shift-resilience.

Measurements (before = nc_level 0, after = nc_level 2; both at the default params buzhash64,19,23,21,4095; measured with scripts/chunker_bench.py):

5 GiB of incompressible data (~2000-2700 chunks, statistically stable):

before: CV 0.739, 49 max-size-clamped (8 MiB) chunks, 953 MB/s
after: CV 0.311, 0 max-size-clamped chunks, 1024 MB/s
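For readers unfamiliar with the two metrics: the coefficient of variation is the standard deviation of the chunk sizes divided by their mean, and a max-size-clamped chunk is one that was cut only because it reached chunk_max. The sketch below shows how both could be computed from a list of chunk sizes. It is not taken from scripts/chunker_bench.py; the helper names are hypothetical, and whether the harness uses the population or the sample standard deviation is an assumption (population is used here).

```python
import statistics


def coefficient_of_variation(sizes: list[int]) -> float:
    """CV = population standard deviation / mean of the chunk sizes."""
    return statistics.pstdev(sizes) / statistics.fmean(sizes)


def count_clamped(sizes: list[int], max_size: int) -> int:
    """Number of chunks whose size equals the maximum chunk size."""
    return sum(1 for s in sizes if s == max_size)


if __name__ == "__main__":
    max_size = 1 << 23  # 8 MiB, i.e. chunk_max=23
    sizes = [1 << 20, 3 << 20, 1 << 23, 2 << 20]  # made-up example sizes
    print(coefficient_of_variation(sizes), count_clamped(sizes, max_size))
```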

Re-backup of a 2.5 GiB file after a few scattered single-byte edits (deduplication ratio; 0.5 = v2 fully deduplicated against v1, lower is better):

64 edits: before 0.5424 -> after 0.5235
320 edits: before 0.6791 -> after 0.6142

Normalized chunking deduplicates better after edits: removing the max-size-clamped chunks means a single-byte change invalidates much less data (about 36% less dedup overhead at 320 edits). Throughput was also consistently higher with nc_level=2 at this scale.
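The "about 36%" figure can be reproduced from the ratios above by treating everything above the ideal ratio of 0.5 as dedup overhead. A small sketch of that arithmetic (the function name `overhead_reduction` is hypothetical):

```python
def overhead_reduction(before: float, after: float, ideal: float = 0.5) -> float:
    """Relative reduction of the dedup overhead, i.e. of (ratio - ideal)."""
    return 1 - (after - ideal) / (before - ideal)


if __name__ == "__main__":
    for edits, before, after in [(64, 0.5424, 0.5235), (320, 0.6791, 0.6142)]:
        print(f"{edits} edits: {overhead_reduction(before, after):.1%} less overhead")
```

At 320 edits the overhead drops from 0.1791 to 0.1142, a reduction of about 36%. The same calculation for 64 edits (0.0424 to 0.0235) gives roughly 45%.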

codecov · 2026-06-27T19:07:15Z

Codecov Report

❌ Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.17%. Comparing base (106dfba) to head (0e3876d).
⚠️ Report is 10 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/borg/helpers/parseformat.py	66.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9823      +/-   ##
==========================================
+ Coverage   84.94%   85.17%   +0.23%     
==========================================
  Files          92       92              
  Lines       15291    15307      +16     
  Branches     2296     2301       +5     
==========================================
+ Hits        12989    13038      +49     
+ Misses       1611     1579      -32     
+ Partials      691      690       -1


The commit message repeats the PR description above and adds:

Also fixes a bug in the mask computation: the shift must use 1ULL instead of 1, so that it is evaluated as a uint64 rather than as a 32-bit int.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

This was referenced Jun 27, 2026:

chunking algorithms #5721 (Open)
borg2: implement new chunker? #8841 (Open)
Add fastcdc chunker (keyed Gear hash) #9824 (Open)

ThomasWaldmann force-pushed the buzhash64-normalized-chunking branch from a5d995d to 0e3876d · June 28, 2026 10:02

ThomasWaldmann wants to merge 1 commit into borgbackup:master from ThomasWaldmann:buzhash64-normalized-chunking
