Add fastcdc chunker (keyed Gear hash) by ThomasWaldmann · Pull Request #9824 · borgbackup/borg

ThomasWaldmann · 2026-06-27T22:45:45Z

Based on #9823.

New `fastcdc` chunker (keyed Gear hash)

A FastCDC content-defined chunker using the window-less Gear rolling hash
(`fp = (fp << 1) + Gear[byte]`), which is cheaper per byte than buzhash's
cyclic-polynomial update. It chunks noticeably faster while matching buzhash64's
chunk-size distribution and deduplication within noise (see Benchmarks).
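To make the Gear update concrete, here is a minimal Python sketch of the recurrence quoted above. It is an illustration only, not the code from this PR: the helper names are hypothetical, and the table is filled from a seeded non-cryptographic PRNG purely so the example is reproducible (the PR derives a keyed table instead).

```python
import random

MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound in Python


def make_gear_table(seed: int) -> list[int]:
    """Build 256 pseudo-random 64-bit entries (demo only: NOT keyed, NOT secure)."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]


def gear_update(fp: int, byte: int, gear: list[int]) -> int:
    """One Gear step: fp = (fp << 1) + Gear[byte], truncated to 64 bits."""
    return ((fp << 1) + gear[byte]) & MASK64


def gear_hash(data: bytes, gear: list[int]) -> int:
    """Fold all bytes of `data` into a fingerprint, starting from 0."""
    fp = 0
    for b in data:
        fp = gear_update(fp, b, gear)
    return fp
```

Because every step shifts the fingerprint left by one bit, a byte's contribution leaves a 64-bit fingerprint after 64 further steps. The hash therefore depends only on the most recent 64 bytes without keeping an explicit window buffer or performing a "remove oldest byte" step.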

The Gear table is keyed: derived from the repo id key via CSPRNG (own
fastcdc domain), exactly like the buzhash64 table, so chunk cut points stay
unpredictable without the key (anti-fingerprinting). It implements the same
FastCDC techniques as buzhash64 (sub-minimum skipping, normalized chunking with a
required nc_level, min/max clamping); the mask uses the high bits of the hash.
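The following Python sketch shows how these pieces could fit together. Everything here is an assumption made for illustration: SHAKE-256 stands in for the CSPRNG (borg's actual key derivation is not shown in this PR text), the strict and loose masks are taken as `mask_bits + nc_level` and `mask_bits - nc_level` bits as in the FastCDC paper, `normal_size` is passed in explicitly rather than computed, hashing simply starts at `min_size`, and all function names are hypothetical.

```python
import hashlib

MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic


def keyed_gear_table(key: bytes) -> list[int]:
    """Derive 256 64-bit table entries from a secret key.

    ASSUMPTION: SHAKE-256 over a domain label plus the key is only a
    stand-in for the CSPRNG-based derivation the PR describes.
    """
    stream = hashlib.shake_256(b"fastcdc" + key).digest(256 * 8)
    return [int.from_bytes(stream[i:i + 8], "little") for i in range(0, len(stream), 8)]


def high_mask(bits: int) -> int:
    """Mask selecting the top `bits` bits of a 64-bit value."""
    return ((1 << bits) - 1) << (64 - bits)


def chunk_lengths(data: bytes, gear: list[int], min_size: int, max_size: int,
                  mask_bits: int, nc_level: int, normal_size: int) -> list[int]:
    """Return the chunk lengths produced by a simplified normalized chunker."""
    mask_strict = high_mask(mask_bits + nc_level)  # harder to match: fewer early cuts
    mask_loose = high_mask(mask_bits - nc_level)   # easier to match: fewer late cuts
    lengths = []
    start, n = 0, len(data)
    while start < n:
        end = min(start + max_size, n)  # max clamping (or end of data)
        cut = end
        fp = 0
        # Sub-minimum skipping: no cut can occur before min_size, so the
        # first min_size bytes of each chunk are not hashed at all.
        for pos in range(start + min_size, end):
            fp = ((fp << 1) + gear[data[pos]]) & MASK64
            size = pos - start + 1
            mask = mask_strict if size < normal_size else mask_loose
            if fp & mask == 0:
                cut = pos + 1
                break
        lengths.append(cut - start)
        start = cut
    return lengths
```

With a different key the table differs, so the same data is cut at different positions; that is the property the anti-fingerprinting argument relies on.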

chunker-params: `fastcdc,chunk_min,chunk_max,chunk_mask,nc_level`. There is no
window field, because Gear is window-less. Example: `fastcdc,19,23,21,2`.
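As a rough illustration of the parameter string, here is a small Python parser. It is not borg's parser: `FastCDCParams` and `parse_fastcdc_params` are hypothetical names, the reading of the first three numbers as powers of two is inferred from the benchmark text (21 corresponds to the 2 MiB target, 23 to the 8 MiB maximum), and the validation rules are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastCDCParams:
    """Hypothetical container; field meanings are inferred from the PR text."""
    chunk_min_exp: int
    chunk_max_exp: int
    chunk_mask_bits: int
    nc_level: int

    @property
    def min_size(self) -> int:
        return 1 << self.chunk_min_exp

    @property
    def max_size(self) -> int:
        return 1 << self.chunk_max_exp

    @property
    def target_size(self) -> int:
        return 1 << self.chunk_mask_bits


def parse_fastcdc_params(spec: str) -> FastCDCParams:
    """Parse 'fastcdc,chunk_min,chunk_max,chunk_mask,nc_level'."""
    fields = [f.strip() for f in spec.split(",")]
    if fields[0] != "fastcdc":
        raise ValueError(f"not a fastcdc spec: {spec!r}")
    if len(fields) != 5:
        # nc_level is required and there is no window field
        raise ValueError("expected fastcdc,chunk_min,chunk_max,chunk_mask,nc_level")
    chunk_min, chunk_max, chunk_mask, nc_level = (int(f) for f in fields[1:])
    if chunk_min > chunk_max:  # ASSUMPTION: a sanity check, not taken from the PR
        raise ValueError("chunk_min must not exceed chunk_max")
    if nc_level < 0:  # ASSUMPTION: negative levels are treated as invalid
        raise ValueError("nc_level must be >= 0")
    return FastCDCParams(chunk_min, chunk_max, chunk_mask, nc_level)
```

Under that reading, `fastcdc,19,23,21,2` means a 512 KiB minimum, an 8 MiB maximum, a 2 MiB target and normalization level 2.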

borg benchmark cpu now measures the fastcdc chunker; tests live in
borg.testsuite.chunkers (golden vector, size distribution, keyed gear table,
param parsing, slow fuzz); docs and changelog updated.

Benchmarks

scripts/chunker_bench.py, buzhash64 vs fastcdc, both nc_level=2, incompressible
data unless noted:

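| corpus / target | metric | buzhash64 | fastcdc |
|---|---|---|---|
| 5 GiB, 2 MiB target | CV | 0.294 | 0.295 |
| 5 GiB, 2 MiB target | throughput | 1011 MB/s | 1313 MB/s (+30%) |
| 64 MiB, 64 KiB target | CV | 0.374 | 0.359 |
| 64 MiB, 64 KiB target | shift-resilience | 0.9928 | 0.9929 |
| 64 MiB, 64 KiB target | throughput | 963 MB/s | 1331 MB/s (+38%) |
| 2.5 GiB re-backup, 64 edits | dedup ratio (lower is better) | 0.5237 | 0.5236 |
| 2.5 GiB re-backup, 320 edits | dedup ratio (lower is better) | 0.6133 | 0.6161 |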

borg benchmark cpu, 1 GB: fastcdc 3.80s, buzhash 4.36s, buzhash64 8.13s, fixed 0.56s.

Chunk-size distribution, deduplication and shift-resilience match buzhash64 within
noise; fastcdc is consistently faster.
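For readers who want to check the percentages or reproduce the CV metric on their own chunk lists, a short Python sketch follows. The helper names are hypothetical, and using the population standard deviation for CV is an assumption; scripts/chunker_bench.py may compute it differently.

```python
import statistics


def speedup_percent(baseline_mb_s: float, candidate_mb_s: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate_mb_s / baseline_mb_s - 1.0) * 100.0


def coefficient_of_variation(chunk_sizes: list[int]) -> float:
    """CV = standard deviation / mean of the chunk sizes.

    ASSUMPTION: population standard deviation; the benchmark script
    may use the sample standard deviation instead.
    """
    return statistics.pstdev(chunk_sizes) / statistics.fmean(chunk_sizes)


# Throughput figures quoted in the benchmark results (MB/s).
print(round(speedup_percent(1011, 1313)))  # 5 GiB, 2 MiB target
print(round(speedup_percent(963, 1331)))   # 64 MiB, 64 KiB target
```

A lower CV means chunk sizes cluster more tightly around the mean, which is why it serves as the distribution metric here.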

🤖 Generated with Claude Code

codecov · 2026-06-27T23:24:39Z

Codecov Report

❌ Patch coverage is 81.25000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.17%. Comparing base (106dfba) to head (afa8189).
⚠️ Report is 10 commits behind head on master.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/borg/helpers/parseformat.py | 66.66% | 3 Missing and 3 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9824      +/-   ##
==========================================
+ Coverage   84.94%   85.17%   +0.22%     
==========================================
  Files          92       92              
  Lines       15291    15325      +34     
  Branches     2296     2307      +11     
==========================================
+ Hits        12989    13053      +64     
+ Misses       1611     1581      -30     
  Partials      691      691


Normalized chunking switches between a stricter and a looser cut mask around the target chunk size. This greatly tightens the chunk-size distribution (coefficient of variation ~0.9 -> ~0.3 in tests) and removes the dedup-hostile max-size-clamped chunks, with unchanged deduplication.

chunker-params for buzhash64 gains a required 6th field, nc_level:

    buzhash64,chunk_min,chunk_max,chunk_mask,window_size,nc_level

Use nc_level=2 for the new default, nc_level=0 to disable (then behavior is byte-identical to the previous single-mask chunker). buzhash (32bit) is untouched and stays bit-compatible with borg 1.x.

The mask transition point (normal_size) defaults to a principled formula (target minus the expected loose-phase tail) so the mean stays near the target; it can be tuned via the normal_size constructor arg.

scripts/chunker_bench.py: evidence harness used to measure chunk-size distribution, dedup ratio, throughput and shift-resilience.

Measurements (before = nc_level 0, after = nc_level 2; both at the default params buzhash64,19,23,21,4095; measured with scripts/chunker_bench.py):

5 GiB of incompressible data (~2000-2700 chunks, statistically stable):

    before: CV 0.739, 49 max-size-clamped (8 MiB) chunks, 953 MB/s
    after:  CV 0.311, 0 max-size-clamped chunks, 1024 MB/s

Re-backup of a 2.5 GiB file after a few scattered single-byte edits (deduplication ratio; 0.5 = v2 fully deduplicated against v1, lower is better):

    64 edits:  before 0.5424 -> after 0.5235
    320 edits: before 0.6791 -> after 0.6142

Normalized chunking deduplicates better after edits: removing the max-size-clamped chunks means a single-byte change invalidates much less data (about 36% less dedup overhead at 320 edits). Throughput was also consistently higher with nc_level=2 at this scale.

Also: fix bug when computing the mask, one needs to use 1ULL instead of 1, so the shifting computation is done in a uint64, not in a 32bit int.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
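The "about 36% less dedup overhead" figure can be reproduced from the quoted ratios, reading "overhead" as the amount by which the deduplication ratio exceeds the ideal 0.5. That reading is an interpretation of the commit message, and the helper names below are hypothetical.

```python
IDEAL_RATIO = 0.5  # v2 fully deduplicated against v1, per the commit message


def dedup_overhead(ratio: float) -> float:
    """Extra stored data relative to the fully deduplicated case."""
    return ratio - IDEAL_RATIO


def overhead_reduction_percent(before: float, after: float) -> float:
    """How much smaller the overhead is after the change, in percent."""
    return (1.0 - dedup_overhead(after) / dedup_overhead(before)) * 100.0


# Ratios quoted for the 320-edit re-backup (before = nc_level 0, after = nc_level 2).
print(round(overhead_reduction_percent(0.6791, 0.6142)))
```

For the 320-edit case the overhead drops from 0.1791 to 0.1142, which is where the roughly 36% comes from.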

Add a new "fastcdc" content-defined chunker selectable via --chunker-params.

It uses the FastCDC Gear rolling hash (fp = (fp << 1) + Gear[byte]), which is window-less and cheaper per byte than buzhash's cyclic-polynomial update, so it chunks noticeably faster (see "borg benchmark cpu" output), while producing the same chunk-size distribution and deduplication.

The Gear table is keyed: it is derived from the repo id key via CSPRNG (own "fastcdc" domain), exactly like the buzhash64 table, so chunk cut points stay unpredictable without the key (anti-fingerprinting). It implements the same FastCDC techniques as buzhash64 (sub-minimum skipping, normalized chunking with a required nc_level, min/max clamping); the mask uses the high bits of the hash (Gear accumulates entropy there).

chunker-params: "fastcdc,chunk_min,chunk_max,chunk_mask,nc_level" - there is no window field, because Gear is window-less. e.g. fastcdc,19,23,21,2

Also: borg benchmark cpu now measures the fastcdc chunker; tests in borg.testsuite.chunkers (golden vector, size distribution, keyed gear table, param parsing, slow fuzz); docs and changelog.

Benchmarks (scripts/chunker_bench.py, buzhash64 vs fastcdc, both nc_level=2, incompressible data unless noted):

5 GiB, 2 MiB target (default params):

    buzhash64: CV 0.294, 1011 MB/s
    fastcdc:   CV 0.295, 1313 MB/s (+30%)

64 MiB, 64 KiB target:

    buzhash64: CV 0.374, shift-resilience 0.9928, 963 MB/s
    fastcdc:   CV 0.359, shift-resilience 0.9929, 1331 MB/s (+38%)

Re-backup of a 2.5 GiB file after scattered single-byte edits (dedup ratio, 0.5 = v2 fully deduplicated, lower is better):

    64 edits:  buzhash64 0.5237, fastcdc 0.5236
    320 edits: buzhash64 0.6133, fastcdc 0.6161

borg benchmark cpu, 1 GB: fastcdc 3.80s, buzhash 4.36s, buzhash64 8.13s, fixed 0.56s.

Chunk-size distribution, deduplication and shift-resilience match buzhash64 within noise; fastcdc is consistently faster.

Also: fix bug when computing the mask, one needs to use 1ULL instead of 1, so the shifting computation is done in a uint64, not in a 32bit int.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ThomasWaldmann changed the title ~~Add fastcdc chunker (keyed Gear hash); buzhash64 normalized chunking~~ Add fastcdc chunker (keyed Gear hash) Jun 27, 2026

ThomasWaldmann force-pushed the fastcdc-chunker branch 2 times, most recently from c16e0fe to f41a414 Compare June 27, 2026 22:57

ThomasWaldmann and others added 2 commits June 28, 2026 12:00

ThomasWaldmann force-pushed the fastcdc-chunker branch from f41a414 to afa8189 Compare June 28, 2026 10:41

ThomasWaldmann wants to merge 2 commits into borgbackup:master from ThomasWaldmann:fastcdc-chunker
