Add Merkle Tree–Based Chunk-Level Hashing for Dataset Verification #8
Conversation
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the review settings.
Walkthrough
Adds Merkle-tree based file hashing.
Sequence Diagram

```mermaid
sequenceDiagram
    rect rgba(200,220,255,0.5)
        participant User
    end
    rect rgba(200,255,200,0.5)
        participant Manifest as generate_manifest
    end
    rect rgba(255,230,200,0.5)
        participant Merkle as compute_merkle_root
    end
    participant FileSystem as "File System"
    participant SHA256
    User->>Manifest: request manifest(raw_path, processed_path)
    Manifest->>Merkle: compute_merkle_root(raw_path)
    Merkle->>FileSystem: read next chunk
    FileSystem-->>Merkle: chunk bytes
    Merkle->>SHA256: hash(chunk)
    SHA256-->>Merkle: leaf hash
    Merkle->>Merkle: aggregate pairs → parent hashes
    Merkle-->>Manifest: raw_merkle_root
    Manifest->>Merkle: compute_merkle_root(processed_path)
    Merkle->>FileSystem: read next chunk
    FileSystem-->>Merkle: chunk bytes
    Merkle->>SHA256: hash(chunk)
    SHA256-->>Merkle: leaf hash
    Merkle->>Merkle: aggregate → root
    Merkle-->>Manifest: processed_merkle_root
    Manifest->>Manifest: write manifest with merkle roots & chunk_size
    Manifest-->>User: return manifest JSON
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 15-40: The function compute_merkle_root currently allows
chunk_size <= 0 which lets the read loop skip data (e.g., chunk_size=0) and
produce incorrect hashes; before opening the file, validate the chunk_size
parameter (ensure it's an int and > 0) and raise a clear ValueError for invalid
values so callers cannot pass zero/negative sizes, then proceed with the
existing logic that reads chunks into leaves and hashes them; reference
compute_merkle_root, chunk_size, and leaves when adding this guard.
- Around line 126-129: Introduce a single module-level constant (e.g.,
CHUNK_SIZE_BYTES) and use it everywhere instead of the hardcoded literal:
replace the literal 1024 * 1024 in the manifest dict's "chunk_size_bytes" and
pass CHUNK_SIZE_BYTES into compute_merkle_root calls (or align
compute_merkle_root's default to this constant) so hashing and manifest metadata
share the same value; update any other uses of the 1024*1024 literal to
reference CHUNK_SIZE_BYTES and ensure the manifest key "chunk_size_bytes" stores
that constant.
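Taken together, the two comments above amount to something like the following sketch (the constant name, the guard, and the hypothetical `manifest_chunk_field` excerpt are illustrations of the review's suggestions, not the repo's actual code; the elided body is the PR's existing hashing logic):

```python
from pathlib import Path
from typing import Union

# Single source of truth shared by chunk hashing and manifest metadata.
CHUNK_SIZE_BYTES = 1024 * 1024


def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = CHUNK_SIZE_BYTES) -> str:
    # Guard from the first comment: reject non-int or non-positive sizes
    # before any file I/O, so the read loop cannot silently skip data.
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError(f"chunk_size must be a positive integer, got {chunk_size!r}")
    ...  # existing logic: read chunks into leaves, hash them, build the tree


def manifest_chunk_field() -> dict:
    # Hypothetical excerpt: the manifest records the same constant used for hashing.
    return {"chunk_size_bytes": CHUNK_SIZE_BYTES}
```

With this in place, a zero or negative `chunk_size` fails loudly instead of producing a misleading hash, and the manifest can never drift out of sync with the chunking actually used.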
In `@tests/test_merkle.py`:
- Around line 9-53: Add a regression test in tests/test_merkle.py to lock the
“duplicate last leaf” rule used by utils.compute_merkle_root (see
duplicate-last-leaf at openverifiablellm/utils.py line 51): write a file that
produces an odd number of chunks (e.g., content length 9 with chunk_size=4
yields 3 leaves), compute each chunk's SHA-256 hex digest, duplicate the final
leaf to make pairs, iteratively hash pairwise concatenated hex strings (using
hashlib.sha256(...).hexdigest()) until a single root is produced, and assert
that this manually computed root equals utils.compute_merkle_root(file,
chunk_size=4); name the test e.g.
test_merkle_root_duplicates_last_leaf_for_odd_chunks so future changes to tree
construction will be caught.
In `@tests/test_util.py`:
- Around line 143-148: The test reads the manifest as raw text and asserts
substrings, which can produce false positives; change it to parse the JSON from
manifest_file (use json.loads on manifest_file.read_text() or json.load on the
file) and then assert that the resulting dict contains keys "raw_merkle_root",
"processed_merkle_root", and "chunk_size_bytes" and validate types/values (e.g.,
merkle roots are strings and chunk_size_bytes is an int > 0) instead of checking
raw text; update references to manifest_file and manifest accordingly in
tests/test_util.py.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
.gitignore
openverifiablellm/utils.py
tests/test_merkle.py
tests/test_util.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)
129-131: ⚠️ Potential issue | 🔴 Critical

Define `MERKLE_CHUNK_SIZE_BYTES` before use (current code crashes). Lines 129, 130, and 131 reference `MERKLE_CHUNK_SIZE_BYTES`, but it is not defined in this module, so `generate_manifest()` will fail with `NameError`.

🔧 Proposed fix

```diff
 logger = logging.getLogger(__name__)
+MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024

 # Merkle Tree Chunk-Level Hashing for Large Files
-def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
+def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> str:
```

```shell
#!/bin/bash
set -e
# Verify symbol usages and ensure there is a module-level assignment
rg -n --type=py '\bMERKLE_CHUNK_SIZE_BYTES\b'
ast-grep --pattern 'MERKLE_CHUNK_SIZE_BYTES = $_'
```

Expected result: at least one assignment match plus the usage sites.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 129 - 131, generate_manifest() and the compute_merkle_root calls reference MERKLE_CHUNK_SIZE_BYTES which is not defined, causing a NameError; add a module-level constant definition (e.g., MERKLE_CHUNK_SIZE_BYTES = <appropriate integer>) near the top of the utils module so the symbol exists before it is used, or import it from the correct module if it belongs elsewhere, and then re-run tests that cover generate_manifest() and compute_merkle_root to verify the fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_util.py`:
- Around line 143-145: Remove the stray diff marker "@@" from tests/test_util.py
(it causes a syntax error) and move the "import json" statement out of the
function body to the module-level imports at the top of the file with the other
imports; ensure the test function that defines "manifest_file = tmp_path /
'data/dataset_manifest.json'" no longer contains the stray token and that all
imports (including json) are declared before any function/class definitions so
the file parses correctly.
---
Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 129-131: generate_manifest() and the compute_merkle_root calls
reference MERKLE_CHUNK_SIZE_BYTES which is not defined, causing a NameError; add
a module-level constant definition (e.g., MERKLE_CHUNK_SIZE_BYTES = <appropriate
integer>) near the top of the utils module so the symbol exists before it is
used, or import it from the correct module if it belongs elsewhere, and then
re-run tests that cover generate_manifest() and compute_merkle_root to verify
the fix.
♻️ Duplicate comments (2)
tests/test_util.py (1)
133-149: 🧹 Nitpick | 🔵 Trivial

Consider adding value/type validation for robustness. The test correctly parses JSON (addressing the past review suggestion), but only checks field presence. Adding type and value assertions would catch more regressions.

♻️ Optional enhancement

```diff
 assert "raw_merkle_root" in manifest
 assert "processed_merkle_root" in manifest
 assert "chunk_size_bytes" in manifest
+# Validate types and expected values
+assert isinstance(manifest["chunk_size_bytes"], int)
+assert manifest["chunk_size_bytes"] > 0
+assert len(manifest["raw_merkle_root"]) == 64  # SHA256 hex length
+assert len(manifest["processed_merkle_root"]) == 64
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 133-149: Update the test_manifest_contains_merkle_fields test to not only assert presence but also validate types/values: after calling utils.generate_manifest and loading manifest, assert that manifest["raw_merkle_root"] and manifest["processed_merkle_root"] are non-empty strings (or bytes-encoded hex strings) and that manifest["chunk_size_bytes"] is an integer > 0; keep references to the existing test function name and utils.generate_manifest so the assertions are added in the same test right after loading the manifest.

openverifiablellm/utils.py (1)
106-111: ⚠️ Potential issue | 🔴 Critical

`MERKLE_CHUNK_SIZE_BYTES` is undefined; the code will raise `NameError` at runtime. The constant is referenced on lines 108-110 but never defined in the module. Define it at module level and use it in the function default parameter for consistency.

🔧 Proposed fix

Add the constant after the logger (line 12):

```diff
 logger = logging.getLogger(__name__)
+MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024

 # Merkle Tree Chunk-Level Hashing for Large Files
-def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
+def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> str:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 106 - 111, The code references MERKLE_CHUNK_SIZE_BYTES when calling compute_merkle_root for "raw_merkle_root" and "processed_merkle_root" but that constant is not defined; add a module-level constant named MERKLE_CHUNK_SIZE_BYTES (e.g., near the logger definition) and then update compute_merkle_root's signature to use MERKLE_CHUNK_SIZE_BYTES as the default chunk_size parameter so callers like the code that sets raw_merkle_root/processed_merkle_root use the defined constant consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 106-111: The code references MERKLE_CHUNK_SIZE_BYTES when calling
compute_merkle_root for "raw_merkle_root" and "processed_merkle_root" but that
constant is not defined; add a module-level constant named
MERKLE_CHUNK_SIZE_BYTES (e.g., near the logger definition) and then update
compute_merkle_root's signature to use MERKLE_CHUNK_SIZE_BYTES as the default
chunk_size parameter so callers like the code that sets
raw_merkle_root/processed_merkle_root use the defined constant consistently.
In `@tests/test_util.py`:
- Around line 133-149: Update the test_manifest_contains_merkle_fields test to
not only assert presence but also validate types/values: after calling
utils.generate_manifest and loading manifest, assert that
manifest["raw_merkle_root"] and manifest["processed_merkle_root"] are non-empty
strings (or bytes-encoded hex strings) and that manifest["chunk_size_bytes"] is
an integer > 0; keep references to the existing test function name and
utils.generate_manifest so the assertions are added in the same test right after
loading the manifest.
Actionable comments posted: 1
♻️ Duplicate comments (2)
openverifiablellm/utils.py (2)
15-20: ⚠️ Potential issue | 🔴 Critical

Reject invalid `chunk_size` before reading file data. `chunk_size` is not validated. With `chunk_size=0`, line 20 reads nothing and returns the empty-file hash for non-empty files, which breaks verification semantics. This appears to be a regression of an earlier finding.

🔧 Proposed fix

```diff
-def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
+def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> str:
+    if not isinstance(chunk_size, int) or chunk_size <= 0:
+        raise ValueError("chunk_size must be a positive integer")
+
     path = Path(file_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 15 - 20, The compute_merkle_root function accepts an unvalidated chunk_size which allows 0 or negative values causing the file read loop (while chunk := f.read(chunk_size)) to behave incorrectly; at the top of compute_merkle_root validate chunk_size (ensure it's an int and > 0) and raise a ValueError for invalid values before opening the file or reading, so callers cannot pass 0/negative sizes and the rest of the logic (the file read loop and Merkle leaf construction) can assume a valid chunk size.
108-110: ⚠️ Potential issue | 🔴 Critical

`MERKLE_CHUNK_SIZE_BYTES` is undefined and will crash manifest generation. Lines 108-110 reference a missing symbol, causing `NameError` at runtime.

🔧 Proposed fix

```diff
 logger = logging.getLogger(__name__)
+MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 108 - 110, The manifest generation references an undefined constant MERKLE_CHUNK_SIZE_BYTES in the compute_merkle_root calls for raw_merkle_root and processed_merkle_root, causing a NameError; fix by either defining MERKLE_CHUNK_SIZE_BYTES (e.g., set a sensible default near the top of openverifiablellm/utils.py) or replace it with the existing correct symbol (e.g., MERKLE_CHUNK_SIZE or DEFAULT_CHUNK_SIZE) used elsewhere in the module so compute_merkle_root(raw_path, chunk_size=...) and compute_merkle_root(processed_path, chunk_size=...) use a defined constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 126-137: The type annotation of compute_sha256 is inconsistent
with its runtime check for bytearray; update the function signature for
compute_sha256 to include bytearray in the Union (e.g., Union[str, Path, bytes,
bytearray] or reorder to Union[bytes, bytearray, str, Path]) so static typing
matches the isinstance(file_path, (bytes, bytearray)) branch, and add or update
any necessary typing imports (from typing import Union) if required.
---
Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 15-20: The compute_merkle_root function accepts an unvalidated
chunk_size which allows 0 or negative values causing the file read loop (while
chunk := f.read(chunk_size)) to behave incorrectly; at the top of
compute_merkle_root validate chunk_size (ensure it's an int and > 0) and raise a
ValueError for invalid values before opening the file or reading, so callers
cannot pass 0/negative sizes and the rest of the logic (the file read loop and
Merkle leaf construction) can assume a valid chunk size.
- Around line 108-110: The manifest generation references an undefined constant
MERKLE_CHUNK_SIZE_BYTES in the compute_merkle_root calls for raw_merkle_root and
processed_merkle_root, causing a NameError; fix by either defining
MERKLE_CHUNK_SIZE_BYTES (e.g., set a sensible default near the top of
openverifiablellm/utils.py) or replace it with the existing correct symbol
(e.g., MERKLE_CHUNK_SIZE or DEFAULT_CHUNK_SIZE) used elsewhere in the module so
compute_merkle_root(raw_path, chunk_size=...) and
compute_merkle_root(processed_path, chunk_size=...) use a defined constant.
Hi @Archit381, I have made the changes you suggested. Please have a look.
@aniket866 In
Sure
hi @Archit381 I have changed
@aniket866 data doesn't seem the best if someone has to pass a
Closes #9
Summary
This PR introduces Merkle Tree–based chunk-level hashing to enhance dataset integrity verification.
Previously, the pipeline computed a single SHA256 hash for the entire raw and processed dataset. While this ensured file-level integrity, it required re-hashing the full dataset (potentially multi-gigabyte) to verify any portion of the data.
Problem Statement
Current approach has limitations:
Requires re-hashing the entire dataset to verify integrity
Not scalable for large datasets
No support for partial verification
No cryptographic structure for subset validation
For large training corpora, this becomes inefficient and impractical.
Proposed Solution
Implemented a Merkle Tree–based hashing system that:
Splits files into fixed-size chunks (default: 1MB)
Computes SHA256 hash for each chunk (leaf nodes)
Builds a Merkle Tree using raw-byte concatenation
Produces a Merkle Root representing the entire dataset
Stores the Merkle Root in the dataset manifest
This allows cryptographic verification of specific chunks without re-hashing the entire file.
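The PR stores only the roots, so per-chunk verification would additionally need audit paths (sibling hashes) to be published or recomputed. As an illustration only, and assuming the hex-digest pairing rule discussed in review, a chunk could be checked against a stored root like this (`verify_chunk` and its proof format are hypothetical, not part of this PR):

```python
import hashlib
from typing import List, Tuple


def verify_chunk(chunk: bytes, proof: List[Tuple[str, str]], expected_root: str) -> bool:
    """Recompute the Merkle root from one chunk plus its sibling hashes.

    `proof` lists (sibling_hex_digest, side) pairs from leaf level upward,
    where side says which side the sibling sits on at that level.
    """
    node = hashlib.sha256(chunk).hexdigest()
    for sibling, side in proof:
        pair = sibling + node if side == "left" else node + sibling
        node = hashlib.sha256(pair.encode()).hexdigest()
    return node == expected_root
```

The cost of a check is proportional to the proof length (logarithmic in the number of chunks), which is what makes partial verification cheap compared with re-hashing the whole file.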
Implementation Details
Added `compute_merkle_root()` function:
Reads file in chunks
Hashes each chunk using SHA256
Builds Merkle Tree bottom-up
Returns final Merkle Root (hex string)
Updated `generate_manifest()` to include:
raw_merkle_root
processed_merkle_root
chunk_size_bytes

Manifest Example (New Fields)

```json
{
  "raw_sha256": "...",
  "processed_sha256": "...",
  "raw_merkle_root": "...",
  "processed_merkle_root": "...",
  "chunk_size_bytes": 1048576
}
```

Benefits
• Enables chunk-level verification
• Scales efficiently for large datasets
• Maintains deterministic hashing
• Improves reproducibility
• Aligns with blockchain-style verification models
• Backward compatible
Testing
Added and validated tests for:
Merkle root determinism
Content modification detection
Single-chunk compatibility with SHA256
Empty file handling
Manifest field presence
All tests pass:
16 passed in 1.88s
Impact
This change upgrades the pipeline from simple file hashing to structured cryptographic verification.
It significantly improves scalability and verification robustness for large training datasets while preserving existing functionality.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
@Archit381 Please check this out