Add Merkle Tree–Based Chunk-Level Hashing for Dataset Verification #8
Conversation
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the review settings.
Walkthrough
Adds Merkle-tree based file hashing.
Sequence Diagram

```mermaid
sequenceDiagram
    rect rgba(200,220,255,0.5)
        participant User
    end
    rect rgba(200,255,200,0.5)
        participant Manifest as generate_manifest
    end
    rect rgba(255,230,200,0.5)
        participant Merkle as compute_merkle_root
    end
    participant FileSystem as "File System"
    participant SHA256
    User->>Manifest: request manifest(raw_path, processed_path)
    Manifest->>Merkle: compute_merkle_root(raw_path)
    Merkle->>FileSystem: read next chunk
    FileSystem-->>Merkle: chunk bytes
    Merkle->>SHA256: hash(chunk)
    SHA256-->>Merkle: leaf hash
    Merkle->>Merkle: aggregate pairs → parent hashes
    Merkle-->>Manifest: raw_merkle_root
    Manifest->>Merkle: compute_merkle_root(processed_path)
    Merkle->>FileSystem: read next chunk
    FileSystem-->>Merkle: chunk bytes
    Merkle->>SHA256: hash(chunk)
    SHA256-->>Merkle: leaf hash
    Merkle->>Merkle: aggregate → root
    Merkle-->>Manifest: processed_merkle_root
    Manifest->>Manifest: write manifest with merkle roots & chunk_size
    Manifest-->>User: return manifest JSON
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 15-40: The function compute_merkle_root currently allows
chunk_size <= 0 which lets the read loop skip data (e.g., chunk_size=0) and
produce incorrect hashes; before opening the file, validate the chunk_size
parameter (ensure it's an int and > 0) and raise a clear ValueError for invalid
values so callers cannot pass zero/negative sizes, then proceed with the
existing logic that reads chunks into leaves and hashes them; reference
compute_merkle_root, chunk_size, and leaves when adding this guard.
- Around line 126-129: Introduce a single module-level constant (e.g.,
CHUNK_SIZE_BYTES) and use it everywhere instead of the hardcoded literal:
replace the literal 1024 * 1024 in the manifest dict's "chunk_size_bytes" and
pass CHUNK_SIZE_BYTES into compute_merkle_root calls (or align
compute_merkle_root's default to this constant) so hashing and manifest metadata
share the same value; update any other uses of the 1024*1024 literal to
reference CHUNK_SIZE_BYTES and ensure the manifest key "chunk_size_bytes" stores
that constant.
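Taken together, the two comments above amount to something like the following sketch (the constant name, the guard, and the hypothetical `manifest_chunk_field` excerpt are illustrations of the review's suggestions, not the repo's actual code; the elided body is the PR's existing hashing logic):

```python
from pathlib import Path
from typing import Union

# Single source of truth shared by chunk hashing and manifest metadata.
CHUNK_SIZE_BYTES = 1024 * 1024


def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = CHUNK_SIZE_BYTES) -> str:
    # Guard from the first comment: reject non-int or non-positive sizes
    # before any file I/O, so the read loop cannot silently skip data.
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError(f"chunk_size must be a positive integer, got {chunk_size!r}")
    ...  # existing logic: read chunks into leaves, hash them, build the tree


def manifest_chunk_field() -> dict:
    # Hypothetical excerpt: the manifest records the same constant used for hashing.
    return {"chunk_size_bytes": CHUNK_SIZE_BYTES}
```

With this in place, a zero or negative `chunk_size` fails loudly instead of producing a misleading hash, and the manifest can never drift out of sync with the chunking actually used.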
In `@tests/test_merkle.py`:
- Around line 9-53: Add a regression test in tests/test_merkle.py to lock the
“duplicate last leaf” rule used by utils.compute_merkle_root (see
duplicate-last-leaf at openverifiablellm/utils.py line 51): write a file that
produces an odd number of chunks (e.g., content length 9 with chunk_size=4
yields 3 leaves), compute each chunk's SHA-256 hex digest, duplicate the final
leaf to make pairs, iteratively hash pairwise concatenated hex strings (using
hashlib.sha256(...).hexdigest()) until a single root is produced, and assert
that this manually computed root equals utils.compute_merkle_root(file,
chunk_size=4); name the test e.g.
test_merkle_root_duplicates_last_leaf_for_odd_chunks so future changes to tree
construction will be caught.
In `@tests/test_util.py`:
- Around line 143-148: The test reads the manifest as raw text and asserts
substrings, which can produce false positives; change it to parse the JSON from
manifest_file (use json.loads on manifest_file.read_text() or json.load on the
file) and then assert that the resulting dict contains keys "raw_merkle_root",
"processed_merkle_root", and "chunk_size_bytes" and validate types/values (e.g.,
merkle roots are strings and chunk_size_bytes is an int > 0) instead of checking
raw text; update references to manifest_file and manifest accordingly in
tests/test_util.py.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
.gitignore
openverifiablellm/utils.py
tests/test_merkle.py
tests/test_util.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)
129-131: ⚠️ Potential issue | 🔴 Critical

Define `MERKLE_CHUNK_SIZE_BYTES` before use (current code crashes). Lines 129, 130, and 131 reference `MERKLE_CHUNK_SIZE_BYTES`, but it is not defined in this module, so `generate_manifest()` will fail with `NameError`.

🔧 Proposed fix

```diff
 logger = logging.getLogger(__name__)
+MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024

 # Merkle Tree Chunk-Level Hashing for Large Files
-def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
+def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> str:
```

```shell
#!/bin/bash
set -e
# Verify symbol usages and ensure there is a module-level assignment
rg -n --type=py '\bMERKLE_CHUNK_SIZE_BYTES\b'
ast-grep --pattern 'MERKLE_CHUNK_SIZE_BYTES = $_'
```

Expected result: at least one assignment match plus the usage sites.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 129 - 131, generate_manifest() and the compute_merkle_root calls reference MERKLE_CHUNK_SIZE_BYTES which is not defined, causing a NameError; add a module-level constant definition (e.g., MERKLE_CHUNK_SIZE_BYTES = <appropriate integer>) near the top of the utils module so the symbol exists before it is used, or import it from the correct module if it belongs elsewhere, and then re-run tests that cover generate_manifest() and compute_merkle_root to verify the fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_util.py`:
- Around line 143-145: Remove the stray diff marker "@@" from tests/test_util.py
(it causes a syntax error) and move the "import json" statement out of the
function body to the module-level imports at the top of the file with the other
imports; ensure the test function that defines "manifest_file = tmp_path /
'data/dataset_manifest.json'" no longer contains the stray token and that all
imports (including json) are declared before any function/class definitions so
the file parses correctly.
---
Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 129-131: generate_manifest() and the compute_merkle_root calls
reference MERKLE_CHUNK_SIZE_BYTES which is not defined, causing a NameError; add
a module-level constant definition (e.g., MERKLE_CHUNK_SIZE_BYTES = <appropriate
integer>) near the top of the utils module so the symbol exists before it is
used, or import it from the correct module if it belongs elsewhere, and then
re-run tests that cover generate_manifest() and compute_merkle_root to verify
the fix.
♻️ Duplicate comments (2)
tests/test_util.py (1)
133-149: 🧹 Nitpick | 🔵 Trivial

Consider adding value/type validation for robustness. The test correctly parses JSON (addressing the past review suggestion), but only checks field presence. Adding type and value assertions would catch more regressions.

♻️ Optional enhancement

```diff
 assert "raw_merkle_root" in manifest
 assert "processed_merkle_root" in manifest
 assert "chunk_size_bytes" in manifest
+# Validate types and expected values
+assert isinstance(manifest["chunk_size_bytes"], int)
+assert manifest["chunk_size_bytes"] > 0
+assert len(manifest["raw_merkle_root"]) == 64  # SHA256 hex length
+assert len(manifest["processed_merkle_root"]) == 64
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 133-149: Update the test_manifest_contains_merkle_fields test to not only assert presence but also validate types/values: after calling utils.generate_manifest and loading manifest, assert that manifest["raw_merkle_root"] and manifest["processed_merkle_root"] are non-empty strings (or bytes-encoded hex strings) and that manifest["chunk_size_bytes"] is an integer > 0; keep references to the existing test function name and utils.generate_manifest so the assertions are added in the same test right after loading the manifest.

openverifiablellm/utils.py (1)
106-111: ⚠️ Potential issue | 🔴 Critical

`MERKLE_CHUNK_SIZE_BYTES` is undefined; the code will raise `NameError` at runtime. The constant is referenced on lines 108-110 but never defined in the module. Define it at module level and use it in the function default parameter for consistency.

🔧 Proposed fix

Add the constant after the logger (line 12):

```diff
 logger = logging.getLogger(__name__)
+MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024

 # Merkle Tree Chunk-Level Hashing for Large Files
-def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
+def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> str:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 106 - 111, The code references MERKLE_CHUNK_SIZE_BYTES when calling compute_merkle_root for "raw_merkle_root" and "processed_merkle_root" but that constant is not defined; add a module-level constant named MERKLE_CHUNK_SIZE_BYTES (e.g., near the logger definition) and then update compute_merkle_root's signature to use MERKLE_CHUNK_SIZE_BYTES as the default chunk_size parameter so callers like the code that sets raw_merkle_root/processed_merkle_root use the defined constant consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 106-111: The code references MERKLE_CHUNK_SIZE_BYTES when calling
compute_merkle_root for "raw_merkle_root" and "processed_merkle_root" but that
constant is not defined; add a module-level constant named
MERKLE_CHUNK_SIZE_BYTES (e.g., near the logger definition) and then update
compute_merkle_root's signature to use MERKLE_CHUNK_SIZE_BYTES as the default
chunk_size parameter so callers like the code that sets
raw_merkle_root/processed_merkle_root use the defined constant consistently.
In `@tests/test_util.py`:
- Around line 133-149: Update the test_manifest_contains_merkle_fields test to
not only assert presence but also validate types/values: after calling
utils.generate_manifest and loading manifest, assert that
manifest["raw_merkle_root"] and manifest["processed_merkle_root"] are non-empty
strings (or bytes-encoded hex strings) and that manifest["chunk_size_bytes"] is
an integer > 0; keep references to the existing test function name and
utils.generate_manifest so the assertions are added in the same test right after
loading the manifest.
Actionable comments posted: 1
♻️ Duplicate comments (2)
openverifiablellm/utils.py (2)
15-20: ⚠️ Potential issue | 🔴 Critical

Reject invalid `chunk_size` before reading file data. `chunk_size` is not validated. With `chunk_size=0`, line 20 reads nothing and returns the empty-file hash for non-empty files, which breaks verification semantics. This appears to be a regression of an earlier finding.

🔧 Proposed fix

```diff
-def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
+def compute_merkle_root(file_path: Union[str, Path], chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> str:
+    if not isinstance(chunk_size, int) or chunk_size <= 0:
+        raise ValueError("chunk_size must be a positive integer")
+
     path = Path(file_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 15 - 20, The compute_merkle_root function accepts an unvalidated chunk_size which allows 0 or negative values causing the file read loop (while chunk := f.read(chunk_size)) to behave incorrectly; at the top of compute_merkle_root validate chunk_size (ensure it's an int and > 0) and raise a ValueError for invalid values before opening the file or reading, so callers cannot pass 0/negative sizes and the rest of the logic (the file read loop and Merkle leaf construction) can assume a valid chunk size.
108-110: ⚠️ Potential issue | 🔴 Critical

`MERKLE_CHUNK_SIZE_BYTES` is undefined and will crash manifest generation. Lines 108-110 reference a missing symbol, causing `NameError` at runtime.

🔧 Proposed fix

```diff
 logger = logging.getLogger(__name__)
+MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 108 - 110, The manifest generation references an undefined constant MERKLE_CHUNK_SIZE_BYTES in the compute_merkle_root calls for raw_merkle_root and processed_merkle_root, causing a NameError; fix by either defining MERKLE_CHUNK_SIZE_BYTES (e.g., set a sensible default near the top of openverifiablellm/utils.py) or replace it with the existing correct symbol (e.g., MERKLE_CHUNK_SIZE or DEFAULT_CHUNK_SIZE) used elsewhere in the module so compute_merkle_root(raw_path, chunk_size=...) and compute_merkle_root(processed_path, chunk_size=...) use a defined constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 126-137: The type annotation of compute_sha256 is inconsistent
with its runtime check for bytearray; update the function signature for
compute_sha256 to include bytearray in the Union (e.g., Union[str, Path, bytes,
bytearray] or reorder to Union[bytes, bytearray, str, Path]) so static typing
matches the isinstance(file_path, (bytes, bytearray)) branch, and add or update
any necessary typing imports (from typing import Union) if required.
---
Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 15-20: The compute_merkle_root function accepts an unvalidated
chunk_size which allows 0 or negative values causing the file read loop (while
chunk := f.read(chunk_size)) to behave incorrectly; at the top of
compute_merkle_root validate chunk_size (ensure it's an int and > 0) and raise a
ValueError for invalid values before opening the file or reading, so callers
cannot pass 0/negative sizes and the rest of the logic (the file read loop and
Merkle leaf construction) can assume a valid chunk size.
- Around line 108-110: The manifest generation references an undefined constant
MERKLE_CHUNK_SIZE_BYTES in the compute_merkle_root calls for raw_merkle_root and
processed_merkle_root, causing a NameError; fix by either defining
MERKLE_CHUNK_SIZE_BYTES (e.g., set a sensible default near the top of
openverifiablellm/utils.py) or replace it with the existing correct symbol
(e.g., MERKLE_CHUNK_SIZE or DEFAULT_CHUNK_SIZE) used elsewhere in the module so
compute_merkle_root(raw_path, chunk_size=...) and
compute_merkle_root(processed_path, chunk_size=...) use a defined constant.
Hi @Archit381, I have made the changes you suggested. Please have a look.
@aniket866 In
Sure
hi @Archit381 I have changed
@aniket866 data doesn't seem the best if someone has to pass a
Closes #9
Summary
This PR introduces Merkle Tree–based chunk-level hashing to enhance dataset integrity verification.
Previously, the pipeline computed a single SHA256 hash for the entire raw and processed dataset. While this ensured file-level integrity, it required re-hashing the full dataset (potentially multi-gigabyte) to verify any portion of the data.
Problem Statement
Current approach has limitations:
Requires re-hashing the entire dataset to verify integrity
Not scalable for large datasets
No support for partial verification
No cryptographic structure for subset validation
For large training corpora, this becomes inefficient and impractical.
Proposed Solution
Implemented a Merkle Tree–based hashing system that:
Splits files into fixed-size chunks (default: 1MB)
Computes SHA256 hash for each chunk (leaf nodes)
Builds a Merkle Tree using raw-byte concatenation
Produces a Merkle Root representing the entire dataset
Stores the Merkle Root in the dataset manifest
This allows cryptographic verification of specific chunks without re-hashing the entire file.
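The PR stores only the roots, so per-chunk verification would additionally need audit paths (sibling hashes) to be published or recomputed. As an illustration only, and assuming the hex-digest pairing rule discussed in review, a chunk could be checked against a stored root like this (`verify_chunk` and its proof format are hypothetical, not part of this PR):

```python
import hashlib
from typing import List, Tuple


def verify_chunk(chunk: bytes, proof: List[Tuple[str, str]], expected_root: str) -> bool:
    """Recompute the Merkle root from one chunk plus its sibling hashes.

    `proof` lists (sibling_hex_digest, side) pairs from leaf level upward,
    where side says which side the sibling sits on at that level.
    """
    node = hashlib.sha256(chunk).hexdigest()
    for sibling, side in proof:
        pair = sibling + node if side == "left" else node + sibling
        node = hashlib.sha256(pair.encode()).hexdigest()
    return node == expected_root
```

The cost of a check is proportional to the proof length (logarithmic in the number of chunks), which is what makes partial verification cheap compared with re-hashing the whole file.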
Implementation Details
Added `compute_merkle_root()` function:
Reads file in chunks
Hashes each chunk using SHA256
Builds Merkle Tree bottom-up
Returns final Merkle Root (hex string)
Updated `generate_manifest()` to include:
raw_merkle_root
processed_merkle_root
chunk_size_bytes

Manifest Example (New Fields)

```json
{
  "raw_sha256": "...",
  "processed_sha256": "...",
  "raw_merkle_root": "...",
  "processed_merkle_root": "...",
  "chunk_size_bytes": 1048576
}
```

Benefits
• Enables chunk-level verification
• Scales efficiently for large datasets
• Maintains deterministic hashing
• Improves reproducibility
• Aligns with blockchain-style verification models
• Backward compatible
Testing
Added and validated tests for:
Merkle root determinism
Content modification detection
Single-chunk compatibility with SHA256
Empty file handling
Manifest field presence
All tests pass:
16 passed in 1.88s
Impact
This change upgrades the pipeline from simple file hashing to structured cryptographic verification.
It significantly improves scalability and verification robustness for large training datasets while preserving existing functionality.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
@Archit381 Please check this out