Add Merkle Proof Generation for Chunk-Level Dataset Verification#10
aniket866 wants to merge 6 commits into AOSSIE-Org:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the review settings.
Walkthrough
This pull request extends the Merkle tree functionality in the utils module by introducing proof generation and verification capabilities, integrating Merkle roots into manifest generation, and establishing a new constant for the chunk size. Comprehensive tests validate the deterministic behavior and correctness of the Merkle root computation and proof verification workflows.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant ManifestGen as Manifest Generator
    participant MerkleRoot as Compute Merkle Root
    participant MerkleProof as Generate Merkle Proof
    participant MerkleVerify as Verify Merkle Proof
    User->>ManifestGen: generate_manifest(raw_path, processed_path)
    activate ManifestGen
    ManifestGen->>MerkleRoot: compute_merkle_root(raw_path)
    activate MerkleRoot
    MerkleRoot-->>ManifestGen: raw_merkle_root
    deactivate MerkleRoot
    ManifestGen->>MerkleRoot: compute_merkle_root(processed_path)
    activate MerkleRoot
    MerkleRoot-->>ManifestGen: processed_merkle_root
    deactivate MerkleRoot
    ManifestGen-->>User: manifest {raw_merkle_root, processed_merkle_root, chunk_size_bytes}
    deactivate ManifestGen
    User->>MerkleProof: generate_merkle_proof(file_path, chunk_index)
    activate MerkleProof
    MerkleProof-->>User: proof: List[Tuple[hash, bool]]
    deactivate MerkleProof
    User->>MerkleVerify: verify_merkle_proof(chunk_bytes, proof, merkle_root)
    activate MerkleVerify
    MerkleVerify-->>User: bool (valid or invalid)
    deactivate MerkleVerify
```
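The flow in the diagram can be sketched end-to-end with an in-memory implementation. The function and constant names (`compute_merkle_root`, `generate_merkle_proof`, `verify_merkle_proof`, `MERKLE_CHUNK_SIZE_BYTES`) follow this PR, but the constant's value, the odd-leaf duplication rule, and the `bytes` input (instead of file paths) are assumptions for illustration, not the repository's exact code:

```python
import hashlib
from typing import List, Tuple

MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024  # constant name from the PR; value assumed


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _leaf_hashes(data: bytes, chunk_size: int) -> List[bytes]:
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [_sha256(c) for c in (chunks or [b""])]


def _next_level(level: List[bytes]) -> List[bytes]:
    if len(level) % 2 == 1:
        level = level + [level[-1]]  # duplicate last node on odd levels (an assumption)
    return [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def compute_merkle_root(data: bytes, chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> str:
    level = _leaf_hashes(data, chunk_size)
    while len(level) > 1:
        level = _next_level(level)
    return level[0].hex()


def generate_merkle_proof(data: bytes, chunk_index: int,
                          chunk_size: int = MERKLE_CHUNK_SIZE_BYTES) -> List[Tuple[str, bool]]:
    level, idx, proof = _leaf_hashes(data, chunk_size), chunk_index, []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = idx ^ 1
        # is_left=True means the sibling hash sits to the left of the current node
        proof.append((level[sibling].hex(), sibling < idx))
        level = _next_level(level)
        idx //= 2
    return proof


def verify_merkle_proof(chunk_bytes: bytes, proof, merkle_root: str) -> bool:
    try:
        current = _sha256(chunk_bytes)
        expected = bytes.fromhex(merkle_root)
        steps = list(proof)
    except (TypeError, ValueError):
        return False  # fail closed on malformed input
    for step in steps:
        if not isinstance(step, (tuple, list)) or len(step) != 2:
            return False
        sibling_hex, is_left = step
        if not isinstance(is_left, bool):
            return False
        try:
            sibling = bytes.fromhex(sibling_hex)
        except (TypeError, ValueError):
            return False
        current = _sha256(sibling + current) if is_left else _sha256(current + sibling)
    return current == expected
```

Verifying chunk 2 of a 64-byte input split into 16-byte chunks only touches two sibling hashes, never the other chunks' bytes, which is the whole point of the proof.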
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks | ✅ Passed checks (2 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 16-17: The functions that accept chunk_size (starting with
compute_merkle_root) do not validate chunk_size and can behave incorrectly for
non-positive values; add a guard at the start of each Merkle API function that
accepts chunk_size (e.g., compute_merkle_root and the other Merkle functions in
this file that take chunk_size) to raise a ValueError (or similar) when
chunk_size <= 0, with a clear message like "chunk_size must be a positive
integer", so the invalid input is rejected before any file/bytes/stream
processing occurs.
- Around line 95-117: verify_merkle_proof currently can raise ValueError when
parsing hex inputs (chunk_bytes, sibling_hex, merkle_root) or when the proof
structure is malformed; change it to fail closed by validating inputs and
catching parsing/structure errors and returning False instead of raising.
Specifically, inside verify_merkle_proof validate that proof is iterable of
pairs (sibling_hex, is_left), check merkle_root looks like a hex string of
correct length, and wrap bytes.fromhex(...) and any compute_sha256 conversions
in try/except (or pre-validate hex) so any ValueError/TypeError returns False;
keep the existing hashing/ordering logic (combined = sibling + current_hash or
current_hash + sibling) and only return True when final hash equals merkle_root,
otherwise return False on any malformed input or exception.
In `@tests/test_util.py`:
- Around line 200-212: Extend test_merkle_proof_verification to include a
negative/tamper case: after computing root via utils.compute_merkle_root and
obtaining proof via utils.generate_merkle_proof, read the original chunk as you
already do, then create a tampered copy (e.g., flip one byte or modify content)
and assert that utils.verify_merkle_proof(tampered_chunk, proof, root) returns
False; you can also add an additional assertion that altering the proof (instead
of the chunk) likewise causes utils.verify_merkle_proof to return False to
ensure verifier rejects tampered inputs.
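A miniature version of the suggested tamper test, using a two-leaf toy tree built with `hashlib` directly in place of the `utils` module, might look like this (the inline `verify` mirrors the hashing/ordering logic described above and is an assumption, not the repository's function):

```python
import hashlib


def test_merkle_proof_rejects_tampering():
    h = lambda b: hashlib.sha256(b).digest()
    chunk, sibling = b"chunk-0", b"chunk-1"
    # Two-leaf tree: root = H(H(chunk) + H(sibling))
    root = h(h(chunk) + h(sibling)).hex()
    proof = [(h(sibling).hex(), False)]  # sibling sits to the right

    def verify(chunk_bytes, proof, merkle_root):
        cur = h(chunk_bytes)
        for sib_hex, is_left in proof:
            sib = bytes.fromhex(sib_hex)
            cur = h(sib + cur) if is_left else h(cur + sib)
        return cur.hex() == merkle_root

    assert verify(chunk, proof, root)
    tampered = b"X" + chunk[1:]               # flip the first byte of the chunk
    assert not verify(tampered, proof, root)
    bad_proof = [(h(b"other").hex(), False)]  # tamper with the proof instead
    assert not verify(chunk, bad_proof, root)
```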
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)
116-121: ⚠️ Potential issue | 🟠 Major
Fail closed when `proof` is non-iterable.
At line 116, `for step in proof:` will raise `TypeError` for malformed input like `None`. The verifier should return `False` for invalid proof input rather than raising.
🔧 Proposed fix

```diff
 def verify_merkle_proof(
     chunk_bytes: bytes,
     proof,
     merkle_root: str
 ) -> bool:
@@
-    for step in proof:
+    try:
+        steps = iter(proof)
+    except TypeError:
+        return False
+
+    for step in steps:
         if not isinstance(step, (tuple, list)) or len(step) != 2:
             return False
         sibling_hex, is_left = step
+        if not isinstance(is_left, bool):
+            return False
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 116 - 121, The loop over proof can raise a TypeError for non-iterable input (e.g., None); update the verifier to validate proof before iterating and return False for invalid inputs: add an early guard that checks proof is an iterable container (e.g., list/tuple or use try/except around the iteration) and return False when it is not, then proceed with the existing loop that inspects each step (sibling_hex, is_left); locate the code that does "for step in proof:" in openverifiablellm/utils.py and apply this input validation to fail closed.
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)
116-130: ⚠️ Potential issue | 🟠 Major
Fail closed for non-iterable `proof` and non-boolean direction flags.
At line 116, `proof=None` (or any non-iterable) raises instead of returning `False`. Also, `is_left` should be validated as `bool` to reject malformed proof steps deterministically.
🔧 Proposed fix

```diff
 def verify_merkle_proof(
     chunk_bytes: bytes,
     proof,
     merkle_root: str
 ) -> bool:
@@
     try:
         current_hash = bytes.fromhex(compute_sha256(chunk_bytes))
         expected_root = bytes.fromhex(merkle_root)
+        proof_iter = iter(proof)
     except (TypeError, ValueError):
         return False
-    for step in proof:
+    for step in proof_iter:
         if not isinstance(step, (tuple, list)) or len(step) != 2:
             return False
         sibling_hex, is_left = step
+        if not isinstance(is_left, bool):
+            return False
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 116 - 130, The loop over proof assumes proof is iterable and that step contents are well-formed; update the start of the routine (the function iterating over proof) to first return False if proof is None or not an iterable (e.g., use isinstance(proof, collections.abc.Iterable) or try/except TypeError on iter(proof)), and inside the loop validate each step: ensure step is a 2-tuple/list, sibling_hex is a hex string convertible via bytes.fromhex, and that is_left is strictly a bool (reject non-bool values by returning False) before building combined with sibling + current_hash or current_hash + sibling; keep existing bytes.fromhex exception handling but add explicit is_left type check to deterministically reject malformed steps.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 102-106: The verify_merkle_proof function's proof parameter is
missing a type hint; update the signature in verify_merkle_proof to add an
appropriate type (e.g., use typing.Sequence[str] or typing.List[str] if proofs
are hex/hash strings) and add the necessary import from typing (Sequence or
List) at the top of the module; if proof has a more complex structure (tuples or
dicts), use typing.Sequence[Tuple[str, str]] or typing.Sequence[Mapping[str,
Any]] accordingly to reflect the actual proof element shape.
- Around line 46-50: The function generate_merkle_proof lacks a return type
annotation; update its signature to include the return type described in the
docstring (for example, -> Tuple[List[bytes], bytes] or the exact type the
docstring specifies) so IDEs and type checkers can validate usage; modify the
def generate_merkle_proof(...) line to add the appropriate return annotation
while keeping existing parameter names (file_path, chunk_index, chunk_size) and
default MERKLE_CHUNK_SIZE_BYTES.
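Putting both annotation suggestions together, the two signatures might read as follows. The `Union[str, Path]` type for `file_path`, the default value, and the docstrings are assumptions; only the `(sibling_hex, is_left)` step shape comes from the PR:

```python
from pathlib import Path
from typing import List, Sequence, Tuple, Union


def generate_merkle_proof(
    file_path: Union[str, Path],
    chunk_index: int,
    chunk_size: int = 1024 * 1024,  # assumed value of MERKLE_CHUNK_SIZE_BYTES
) -> List[Tuple[str, bool]]:
    """Return (sibling_hex, is_left) proof steps from leaf to root."""
    ...


def verify_merkle_proof(
    chunk_bytes: bytes,
    proof: Sequence[Tuple[str, bool]],
    merkle_root: str,
) -> bool:
    """Recompute the root from chunk_bytes and proof; compare to merkle_root."""
    ...
```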
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
openverifiablellm/utils.py (2)
6-11: 🧹 Nitpick | 🔵 Trivial
Remove duplicate `Union` import.
`Union` is imported twice from `typing` (lines 6 and 11). Consolidate into a single import statement.
🔧 Proposed fix

```diff
-from typing import Union
 import hashlib
 import logging
 import json
 import platform
-from typing import Union, Optional
+from typing import Union, Optional
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 6 - 11, Remove the duplicate "from typing import Union" import: consolidate the typing imports into a single statement that includes both Union and Optional (replace the two separate imports at the top that reference Union with one combined import), ensuring only one import from typing exists and there are no repeated symbols.
209-210: ⚠️ Potential issue | 🔴 Critical
Critical: All `compute_sha256` calls use positional arguments, but the function signature requires keyword-only arguments.
The `compute_sha256` function signature (line 231) declares `*`, making all parameters keyword-only. Lines 209-210 call it with positional arguments, passing `str(raw_path)` and `str(processed_path)` directly, which will raise `TypeError: compute_sha256() takes 0 positional arguments but 1 was given` at runtime.
This issue appears across multiple call sites in the file. Update all calls to use keyword arguments:
- For byte data: `compute_sha256(data=<bytes>)`
- For file paths: `compute_sha256(file_path=<path>)`
For lines 209-210 specifically, use the `file_path` parameter and remove the `str()` conversion:

```diff
-        "raw_sha256": compute_sha256(str(raw_path)),
-        "processed_sha256": compute_sha256(str(processed_path)),
+        "raw_sha256": compute_sha256(file_path=raw_path),
+        "processed_sha256": compute_sha256(file_path=processed_path),
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 209 - 210, compute_sha256 is defined with keyword-only parameters but is being called positionally (e.g., compute_sha256(str(raw_path)) and compute_sha256(str(processed_path))); update these calls (and all other positional uses) to use the correct keyword form. Replace compute_sha256(str(raw_path)) with compute_sha256(file_path=raw_path) and compute_sha256(str(processed_path)) with compute_sha256(file_path=processed_path) (and similarly use compute_sha256(data=...) where raw bytes are intended); adjust any other call sites to use the file_path= or data= keyword to match the compute_sha256 signature.
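For context, a minimal sketch of a keyword-only `compute_sha256` matching the signature the review describes (the exact parameter set, mutual-exclusion rule, and read block size are assumptions inferred from the review text):

```python
import hashlib
from pathlib import Path
from typing import Optional, Union


def compute_sha256(*, data: Optional[bytes] = None,
                   file_path: Optional[Union[str, Path]] = None) -> str:
    """The bare '*' makes both parameters keyword-only."""
    if (data is None) == (file_path is None):
        raise ValueError("pass exactly one of data= or file_path=")
    h = hashlib.sha256()
    if data is not None:
        h.update(data)
    else:
        with open(file_path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
    return h.hexdigest()
```

With this shape, `compute_sha256(b"x")` fails fast with `TypeError`, which is exactly the runtime error the review flags at the manifest call sites.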
Closes #9
Summary
Previously, the system could:
Compute SHA256 hash for full dataset
Compute Merkle root for chunk-level hashing
But it could NOT:
Prove that a specific chunk belongs to the dataset
With this PR update, users can now:
Generate a cryptographic proof for a specific data chunk
Verify that chunk without re-hashing the entire dataset
This completes the Merkle Tree implementation.
Problem:
We could compute a Merkle root
But we could not prove that a specific chunk is part of the dataset
Verification required rebuilding or re-hashing everything
Solution
This PR introduces two new functions:
generate_merkle_proof()
verify_merkle_proof()
Real-Life Example
Without Merkle proofs:
To check if one book belongs to the library, you had to:
Re-scan the entire library inventory
Now with Merkle proofs:
We get a small certificate (the proof)
This certificate shows how that book connects to the main library record
We only check a few records instead of all 1 million books
So instead of checking the entire library, we only check a small chain of references.
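That "few records" intuition is exactly the logarithmic proof size: a balanced Merkle tree over n chunks needs about ceil(log2(n)) sibling hashes per proof. A quick check (the helper below is illustrative, not part of the PR):

```python
import math


def proof_length(num_chunks: int) -> int:
    # One sibling hash per tree level: ceil(log2(n)) for a balanced tree
    if num_chunks <= 1:
        return 0
    return math.ceil(math.log2(num_chunks))


print(proof_length(1_000_000))  # 20 hashes instead of re-hashing a million chunks
```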
That is everything this PR adds.
Thank you!
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Merkle proof generation and verification for dataset chunks; Merkle roots and a chunk-size constant are now included in generated manifests.
Tests
New tests covering deterministic Merkle root computation and proof generation/verification.