
Add Merkle Proof Generation for Chunk-Level Dataset Verification#10

Open
aniket866 wants to merge 6 commits into AOSSIE-Org:main from aniket866:Merkel-root-Proof

Conversation


@aniket866 aniket866 commented Feb 26, 2026

Closes #9

Summary

  • This PR adds Merkle proof generation and verification to the project.

Previously, the system could:

  • Compute SHA256 hash for full dataset

  • Compute Merkle root for chunk-level hashing


  • But it could not:

  • Prove that a specific chunk belongs to the dataset


With this PR update, users can now:

  • Generate a cryptographic proof for a specific data chunk

  • Verify that chunk without re-hashing the entire dataset

This completes the Merkle tree implementation.

Problem

  • We could compute a Merkle root

  • But we could not prove that a specific chunk is part of the dataset

  • Verification required rebuilding or re-hashing everything

Solution

This PR introduces two new functions:

generate_merkle_proof()

  • Generates a proof for a specific chunk index.

verify_merkle_proof()

  • Verifies that the chunk belongs to the dataset using only the proof and the Merkle root, without re-reading the other chunks.
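
Conceptually, the two functions might be sketched as follows. This is an illustrative, self-contained sketch operating on in-memory chunks; the PR's actual functions in openverifiablellm/utils.py take a file path and chunk size and may differ in details:

```python
import hashlib
from typing import List, Tuple

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def compute_merkle_root(chunks: List[bytes]) -> bytes:
    """Fold chunk hashes pairwise up to a single root hash."""
    level = [_sha256(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0] if level else _sha256(b"")

def generate_merkle_proof(chunks: List[bytes], chunk_index: int) -> List[Tuple[bytes, bool]]:
    """Return (sibling_hash, sibling_is_left) pairs from leaf to root."""
    level = [_sha256(c) for c in chunks]
    index, proof = chunk_index, []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = index ^ 1  # the paired node sits immediately next to us
        proof.append((level[sibling], sibling < index))
        level = [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_merkle_proof(chunk: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Re-hash the chunk up the tree; succeed only if we land on the root."""
    current = _sha256(chunk)
    for sibling, is_left in proof:
        current = _sha256(sibling + current) if is_left else _sha256(current + sibling)
    return current == root
```

Verification touches only the proof path (about log2(n) hashes), never the other chunks.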

Real-Life Example

Imagine a big library with 1 million books.

Without a Merkle proof:

  • To check whether one book belongs to the library, you had to re-scan the entire library inventory

Now with a Merkle proof:

  • We get a small certificate (proof)

  • This certificate shows how that book connects to the main library record

  • We only check a few records instead of all 1 million books

So instead of checking the entire library, we only check a small chain of references.
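
The savings scale logarithmically with the number of chunks. As a back-of-the-envelope illustration (these numbers are not measured from this repository):

```python
import math

num_chunks = 1_000_000                     # one "book" per chunk
proof_hashes = math.ceil(math.log2(num_chunks))
proof_size_bytes = proof_hashes * 32       # each SHA256 hash is 32 bytes

# About 20 sibling hashes to check instead of 1,000,000 chunks,
# so the whole "certificate" fits in well under a kilobyte.
print(proof_hashes, proof_size_bytes)
```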

That is what a Merkle proof does for datasets. Thank you!

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added cryptographic merkle proof generation and verification for enhanced file integrity verification
    • Manifest output now includes merkle root values and chunk size information for improved data integrity tracking
  • Tests

    • Added comprehensive test coverage for merkle root calculations, proof verification, and manifest generation


coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This pull request extends Merkle tree functionality in the utils module by introducing proof generation and verification capabilities, integrating Merkle roots into manifest generation, and establishing a new constant for chunk size. Comprehensive tests validate the deterministic behavior and correctness of Merkle root computation and proof verification workflows.

Changes

  • Merkle Utilities & Manifest Integration (openverifiablellm/utils.py): Introduced MERKLE_CHUNK_SIZE_BYTES constant; added generate_merkle_proof() and verify_merkle_proof() functions for proof construction and validation; updated compute_merkle_root() to use the constant; extended generate_manifest() with raw_merkle_root, processed_merkle_root, and chunk_size_bytes fields.

  • Test Coverage for Merkle Functionality (tests/test_util.py): Added tests verifying manifest Merkle field presence, root determinism, content sensitivity, single-chunk equivalence to SHA256, empty file handling, and proof verification (positive and negative cases).

Sequence Diagram

sequenceDiagram
    participant User
    participant ManifestGen as Manifest Generator
    participant MerkleRoot as Compute Merkle Root
    participant MerkleProof as Generate Merkle Proof
    participant MerkleVerify as Verify Merkle Proof

    User->>ManifestGen: generate_manifest(raw_path, processed_path)
    activate ManifestGen
    ManifestGen->>MerkleRoot: compute_merkle_root(raw_path)
    activate MerkleRoot
    MerkleRoot-->>ManifestGen: raw_merkle_root
    deactivate MerkleRoot
    ManifestGen->>MerkleRoot: compute_merkle_root(processed_path)
    activate MerkleRoot
    MerkleRoot-->>ManifestGen: processed_merkle_root
    deactivate MerkleRoot
    ManifestGen-->>User: manifest {raw_merkle_root, processed_merkle_root, chunk_size_bytes}
    deactivate ManifestGen

    User->>MerkleProof: generate_merkle_proof(file_path, chunk_index)
    activate MerkleProof
    MerkleProof-->>User: proof: List[Tuple[hash, bool]]
    deactivate MerkleProof

    User->>MerkleVerify: verify_merkle_proof(chunk_bytes, proof, merkle_root)
    activate MerkleVerify
    MerkleVerify-->>User: bool (valid or invalid)
    deactivate MerkleVerify

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

  • #8: Shares overlapping modifications to compute_merkle_root() and manifest integration with raw/processed Merkle root fields.

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 Merkle trees dance in perfect symmetry,
Chunks arranged with cryptographic poetry,
Proofs bloom upward to roots so true,
Verification whispers: "I vouch for you!" 🌿✨

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and accurately summarizes the main change: adding Merkle proof generation for chunk-level verification of datasets.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/L and removed size/L labels Feb 26, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 16-17: The functions that accept chunk_size (starting with
compute_merkle_root) do not validate chunk_size and can behave incorrectly for
non-positive values; add a guard at the start of each Merkle API function that
accepts chunk_size (e.g., compute_merkle_root and the other Merkle functions in
this file that take chunk_size) to raise a ValueError (or similar) when
chunk_size <= 0, with a clear message like "chunk_size must be a positive
integer", so the invalid input is rejected before any file/bytes/stream
processing occurs.
- Around line 95-117: verify_merkle_proof currently can raise ValueError when
parsing hex inputs (chunk_bytes, sibling_hex, merkle_root) or when the proof
structure is malformed; change it to fail closed by validating inputs and
catching parsing/structure errors and returning False instead of raising.
Specifically, inside verify_merkle_proof validate that proof is iterable of
pairs (sibling_hex, is_left), check merkle_root looks like a hex string of
correct length, and wrap bytes.fromhex(...) and any compute_sha256 conversions
in try/except (or pre-validate hex) so any ValueError/TypeError returns False;
keep the existing hashing/ordering logic (combined = sibling + current_hash or
current_hash + sibling) and only return True when final hash equals merkle_root,
otherwise return False on any malformed input or exception.

In `@tests/test_util.py`:
- Around line 200-212: Extend test_merkle_proof_verification to include a
negative/tamper case: after computing root via utils.compute_merkle_root and
obtaining proof via utils.generate_merkle_proof, read the original chunk as you
already do, then create a tampered copy (e.g., flip one byte or modify content)
and assert that utils.verify_merkle_proof(tampered_chunk, proof, root) returns
False; you can also add an additional assertion that altering the proof (instead
of the chunk) likewise causes utils.verify_merkle_proof to return False to
ensure verifier rejects tampered inputs.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cb91e9 and 85d8e3f.

📒 Files selected for processing (3)
  • .gitignore
  • openverifiablellm/utils.py
  • tests/test_util.py


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)

116-121: ⚠️ Potential issue | 🟠 Major

Fail closed when proof is non-iterable.

At Line 116, for step in proof: will raise TypeError for malformed input like None. The verifier should return False for invalid proof input rather than raising.

🔧 Proposed fix
 def verify_merkle_proof(
     chunk_bytes: bytes,
     proof,
     merkle_root: str
 ) -> bool:
@@
-    for step in proof:
+    try:
+        steps = iter(proof)
+    except TypeError:
+        return False
+
+    for step in steps:
         if not isinstance(step, (tuple, list)) or len(step) != 2:
             return False

         sibling_hex, is_left = step
+        if not isinstance(is_left, bool):
+            return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 116 - 121, The loop over proof can
raise a TypeError for non-iterable input (e.g., None); update the verifier to
validate proof before iterating and return False for invalid inputs: add an
early guard that checks proof is an iterable container (e.g., list/tuple or use
try/except around the iteration) and return False when it is not, then proceed
with the existing loop that inspects each step (sibling_hex, is_left); locate
the code that does "for step in proof:" in openverifiablellm/utils.py and apply
this input validation to fail closed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 116-121: The loop over proof can raise a TypeError for
non-iterable input (e.g., None); update the verifier to validate proof before
iterating and return False for invalid inputs: add an early guard that checks
proof is an iterable container (e.g., list/tuple or use try/except around the
iteration) and return False when it is not, then proceed with the existing loop
that inspects each step (sibling_hex, is_left); locate the code that does "for
step in proof:" in openverifiablellm/utils.py and apply this input validation to
fail closed.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85d8e3f and 22bbd65.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)

116-130: ⚠️ Potential issue | 🟠 Major

Fail closed for non-iterable proof and non-boolean direction flags.

At Line 116, proof=None (or any non-iterable) raises instead of returning False. Also, is_left should be validated as bool to reject malformed proof steps deterministically.

🔧 Proposed fix
 def verify_merkle_proof(
     chunk_bytes: bytes,
     proof,
     merkle_root: str
 ) -> bool:
@@
     try:
         current_hash = bytes.fromhex(compute_sha256(chunk_bytes))
         expected_root = bytes.fromhex(merkle_root)
+        proof_iter = iter(proof)
     except (TypeError, ValueError):
         return False
 
-    for step in proof:
+    for step in proof_iter:
         if not isinstance(step, (tuple, list)) or len(step) != 2:
             return False
 
         sibling_hex, is_left = step
+        if not isinstance(is_left, bool):
+            return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 116 - 130, The loop over proof
assumes proof is iterable and that step contents are well-formed; update the
start of the routine (the function iterating over proof) to first return False
if proof is None or not an iterable (e.g., use isinstance(proof,
collections.abc.Iterable) or try/except TypeError on iter(proof)), and inside
the loop validate each step: ensure step is a 2-tuple/list, sibling_hex is a hex
string convertible via bytes.fromhex, and that is_left is strictly a bool
(reject non-bool values by returning False) before building combined with
sibling + current_hash or current_hash + sibling; keep existing bytes.fromhex
exception handling but add explicit is_left type check to deterministically
reject malformed steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 116-130: The loop over proof assumes proof is iterable and that
step contents are well-formed; update the start of the routine (the function
iterating over proof) to first return False if proof is None or not an iterable
(e.g., use isinstance(proof, collections.abc.Iterable) or try/except TypeError
on iter(proof)), and inside the loop validate each step: ensure step is a
2-tuple/list, sibling_hex is a hex string convertible via bytes.fromhex, and
that is_left is strictly a bool (reject non-bool values by returning False)
before building combined with sibling + current_hash or current_hash + sibling;
keep existing bytes.fromhex exception handling but add explicit is_left type
check to deterministically reject malformed steps.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22bbd65 and 771c7a8.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@github-actions github-actions bot added size/L and removed size/L labels Feb 27, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 102-106: The verify_merkle_proof function's proof parameter is
missing a type hint; update the signature in verify_merkle_proof to add an
appropriate type (e.g., use typing.Sequence[str] or typing.List[str] if proofs
are hex/hash strings) and add the necessary import from typing (Sequence or
List) at the top of the module; if proof has a more complex structure (tuples or
dicts), use typing.Sequence[Tuple[str, str]] or typing.Sequence[Mapping[str,
Any]] accordingly to reflect the actual proof element shape.
- Around line 46-50: The function generate_merkle_proof lacks a return type
annotation; update its signature to include the return type described in the
docstring (for example, -> Tuple[List[bytes], bytes] or the exact type the
docstring specifies) so IDEs and type checkers can validate usage; modify the
def generate_merkle_proof(...) line to add the appropriate return annotation
while keeping existing parameter names (file_path, chunk_index, chunk_size) and
default MERKLE_CHUNK_SIZE_BYTES.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 771c7a8 and 9c8039e.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
openverifiablellm/utils.py (2)

6-11: 🧹 Nitpick | 🔵 Trivial

Remove duplicate Union import.

Union is imported twice from typing (lines 6 and 11). Consolidate into a single import statement.

🔧 Proposed fix
-from typing import Union
 import hashlib
 import logging
 import json
 import platform
-from typing import Union, Optional
+from typing import Union, Optional
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 6 - 11, Remove the duplicate "from
typing import Union" import: consolidate the typing imports into a single
statement that includes both Union and Optional (replace the two separate
imports at the top that reference Union with one combined import), ensuring only
one import from typing exists and there are no repeated symbols.

209-210: ⚠️ Potential issue | 🔴 Critical

Critical: All compute_sha256 calls use positional arguments but the function signature requires keyword-only arguments.

The compute_sha256 function signature (line 231) declares * making all parameters keyword-only. Lines 209-210 call it with positional arguments—passing str(raw_path) and str(processed_path) directly—which will raise TypeError: compute_sha256() takes 0 positional arguments but 1 was given at runtime.

This issue appears across multiple call sites in the file. Update all calls to use keyword arguments:

  • For byte data: compute_sha256(data=<bytes>)
  • For file paths: compute_sha256(file_path=<path>)

For lines 209-210 specifically, use file_path parameter and remove the str() conversion:

-        "raw_sha256": compute_sha256(str(raw_path)),
-        "processed_sha256": compute_sha256(str(processed_path)),
+        "raw_sha256": compute_sha256(file_path=raw_path),
+        "processed_sha256": compute_sha256(file_path=processed_path),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 209 - 210, compute_sha256 is defined
with keyword-only parameters but is being called positionally (e.g.,
compute_sha256(str(raw_path)) and compute_sha256(str(processed_path))); update
these calls (and all other positional uses) to use the correct keyword form.
Replace compute_sha256(str(raw_path)) with compute_sha256(file_path=raw_path)
and compute_sha256(str(processed_path)) with
compute_sha256(file_path=processed_path) (and similarly use
compute_sha256(data=...) where raw bytes are intended); adjust any other call
sites to use the file_path= or data= keyword to match the compute_sha256
signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 6-11: Remove the duplicate "from typing import Union" import:
consolidate the typing imports into a single statement that includes both Union
and Optional (replace the two separate imports at the top that reference Union
with one combined import), ensuring only one import from typing exists and there
are no repeated symbols.
- Around line 209-210: compute_sha256 is defined with keyword-only parameters
but is being called positionally (e.g., compute_sha256(str(raw_path)) and
compute_sha256(str(processed_path))); update these calls (and all other
positional uses) to use the correct keyword form. Replace
compute_sha256(str(raw_path)) with compute_sha256(file_path=raw_path) and
compute_sha256(str(processed_path)) with
compute_sha256(file_path=processed_path) (and similarly use
compute_sha256(data=...) where raw bytes are intended); adjust any other call
sites to use the file_path= or data= keyword to match the compute_sha256
signature.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c8039e and 5b26a43.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@github-actions github-actions bot added bug Something isn't working size/M and removed size/M labels Feb 27, 2026


Development

Successfully merging this pull request may close these issues.

Dataset verification limited to single SHA256 hash (no chunk-level integrity support)
