
feat: implement reproducibility validation pipeline in CI#5

Open
Muneerali199 wants to merge 4 commits into AOSSIE-Org:main from Muneerali199:feat/reproducibility-pipeline

Conversation

@Muneerali199 Muneerali199 commented Feb 22, 2026

Overview

OpenVerifiableLLM is designed to guarantee deterministic preprocessing and reproducible dataset fingerprints.

This PR introduces a comprehensive reproducibility validation framework that enforces bitwise-identical dataset fingerprints across:

  • Multiple executions
  • Different operating systems (Ubuntu, Windows, macOS)
  • Fresh synthetic dataset generations

This converts reproducibility from a design assumption into a continuously enforced property via automated CI.


Motivation

Determinism can silently degrade due to subtle implementation changes such as:

  • Unordered filesystem traversal
  • Non-canonical JSON serialization
  • OS-specific newline differences (LF vs CRLF)
  • Encoding inconsistencies
  • Parallel execution ordering differences
  • Dependency version changes

Without automated enforcement, these regressions are difficult to detect during code review.
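
As an illustration of one such silent regression (an example for this description, not code from the PR): two semantically identical JSON documents hash differently unless serialization is canonical.

```python
import hashlib
import json

data = {"b": 1, "a": 2}

# Raw dumps preserve insertion order, so byte output depends on construction.
raw = json.dumps({"a": 2, "b": 1})

# Canonical form: sorted keys, compact separators -> byte-stable output.
canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
canonical_reordered = json.dumps({"a": 2, "b": 1}, sort_keys=True, separators=(",", ":"))

# Same logical content, same canonical hash, regardless of key order.
assert hashlib.sha256(canonical.encode()).hexdigest() == \
       hashlib.sha256(canonical_reordered.encode()).hexdigest()

# The raw dumps differ: '{"b": 1, "a": 2}' vs '{"a": 2, "b": 1}'.
assert json.dumps(data) != raw
```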

Given that reproducibility is central to OpenVerifiableLLM’s mission, it must be continuously validated.


Changes Introduced

1. Deterministic Manifest Generation Improvements

File: openverifiablellm/pipeline.py

  • Implemented recursive directory traversal with strict relative path sorting.
  • Normalized path separators to forward slashes for cross-platform consistency.
  • Added newline normalization (CRLF → LF) before hashing.
  • Enforced canonical JSON serialization:
    • Sorted keys
    • No extraneous whitespace
    • Stable formatting
  • Ensured SHA256 root fingerprint is stable across environments.
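
The steps above can be sketched as follows. This is a minimal illustration of the approach, not the shipped implementation; the function names mirror the pipeline's exported API, but the bodies here are simplified (e.g., binary-file fallback is omitted).

```python
import hashlib
import json
from pathlib import Path


def normalize_line_endings(content: str) -> str:
    # CRLF/CR -> LF so text files hash identically on every OS.
    return content.replace("\r\n", "\n").replace("\r", "\n")


def generate_manifest(directory: Path) -> dict:
    entries = []
    # Sort posix-style relative paths for deterministic, OS-independent traversal.
    rel_paths = sorted(p.relative_to(directory).as_posix()
                       for p in directory.rglob("*") if p.is_file())
    for rel in rel_paths:
        text = normalize_line_endings((directory / rel).read_text(encoding="utf-8"))
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        entries.append({"path": rel, "sha256": digest})
    return {"files": entries}


def get_manifest_hash(manifest: dict) -> str:
    # Canonical JSON: sorted keys, no extraneous whitespace, stable formatting.
    blob = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

With these three pieces, the root fingerprint depends only on the normalized file contents and their relative paths, never on traversal order or platform line endings.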

2. Synthetic Dataset Generator

File: openverifiablellm/synthetic_data.py

  • Added deterministic synthetic dataset generator.
  • Generates predictable file structure and content.
  • Used as a CI baseline for reproducibility validation.
  • Ensures environment-independent benchmarking.
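
A deterministic generator of this kind can be as simple as the following sketch (illustrative only; the actual file names and contents in synthetic_data.py may differ). The key property is that it uses no randomness, timestamps, or environment state.

```python
from pathlib import Path


def generate_synthetic_dataset(root: Path, num_files: int = 5) -> None:
    # Fully deterministic: fixed names, fixed content, written as raw bytes
    # so no platform newline translation can sneak in.
    root.mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        content = "\n".join(f"record-{i}-{j}" for j in range(10)) + "\n"
        (root / f"data_{i:03d}.txt").write_bytes(content.encode("utf-8"))
```

Running this twice, on any machine, yields byte-identical directories, which makes it a stable CI baseline.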

3. Cross-Platform Reproducibility CI Workflow

File: .github/workflows/reproducibility.yml

Introduced a multi-layer validation pipeline:

Multi-Run Validation (Same OS)

  • Executes manifest generation 3 times.
  • Verifies identical fingerprint across runs.

Cross-OS Matrix Validation

Runs on:

  • ubuntu-latest
  • windows-latest
  • macos-latest

Artifact Hash Comparison

  • Uploads root hash artifacts from each OS.
  • Final compare-hashes job asserts strict equality.
  • Fails CI immediately if any mismatch is detected.
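
The comparison logic in the final job amounts to the following sketch (the real workflow step may be shell-based; the artifact layout `reproducible-hash-<os>/*.txt` is assumed here from the artifact naming above).

```python
import sys
from pathlib import Path


def compare_hashes(artifact_dir: Path) -> int:
    # Each OS job uploads one hash file under reproducible-hash-<os>/.
    hashes = {p.parent.name: p.read_text().strip()
              for p in sorted(artifact_dir.glob("reproducible-hash-*/*.txt"))}
    if len(set(hashes.values())) <= 1:
        print("All platform hashes match:", hashes)
        return 0
    print("Hash mismatch across platforms:", hashes)
    return 1  # non-zero exit fails the CI job immediately


if __name__ == "__main__":
    sys.exit(compare_hashes(Path(sys.argv[1])))
```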

4. Dedicated Testing Suite

File: tests/test_reproducibility.py

Added unit tests covering:

  • Multi-run determinism
  • Fresh dataset regeneration consistency
  • Newline normalization validation
  • Deterministic file ordering enforcement
  • Manifest canonical serialization integrity
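
Two of these tests can be sketched as below. The `dataset_fingerprint` helper here is a simplified stand-in for the pipeline's manifest hash (the real tests exercise the package functions); it exists only to show the shape of the assertions.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(directory: Path) -> str:
    # Simplified stand-in: hash sorted relative paths plus normalized contents.
    h = hashlib.sha256()
    for rel in sorted(p.relative_to(directory).as_posix()
                      for p in directory.rglob("*") if p.is_file()):
        data = (directory / rel).read_bytes().replace(b"\r\n", b"\n")
        h.update(rel.encode() + b"\0" + hashlib.sha256(data).digest())
    return h.hexdigest()


def test_multi_run_determinism(tmp_path: Path):
    (tmp_path / "a.txt").write_bytes(b"hello\n")
    (tmp_path / "b.txt").write_bytes(b"world\r\n")
    # Three runs over the same data must agree bit-for-bit.
    assert len({dataset_fingerprint(tmp_path) for _ in range(3)}) == 1


def test_newline_normalization(tmp_path: Path):
    (tmp_path / "crlf").mkdir()
    (tmp_path / "lf").mkdir()
    (tmp_path / "crlf" / "f.txt").write_bytes(b"line\r\n")
    (tmp_path / "lf" / "f.txt").write_bytes(b"line\n")
    # CRLF and LF variants of the same text must fingerprint identically.
    assert dataset_fingerprint(tmp_path / "crlf") == dataset_fingerprint(tmp_path / "lf")
```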

5. Local Reproducibility Validation Script

File: scripts/reproducibility_check.py

Added standalone verification utility:

Run locally using:

python scripts/reproducibility_check.py

This allows contributors to confirm that modifications do not break deterministic hashing before opening a PR.


Verification

Local Testing

Executed:

python -m pytest

All tests passed, including new reproducibility enforcement tests.

Manual Cross-Run Validation (Windows)

Output:

SUCCESS: Reproducibility confirmed across 3 runs.
FINAL ROOT HASH: 4f7350f980174ea96a8b2e469777d945dc5f3ae2eb8b0fb954a8e447b9b9b401

CI Enforcement

Workflow triggers on:

  • Push to main
  • Push to develop
  • Pull Requests targeting those branches

CI will now fail automatically if reproducibility is broken.


Impact

This PR significantly strengthens OpenVerifiableLLM by:

  • Preventing silent determinism regressions
  • Enforcing cross-platform reproducibility
  • Increasing contributor confidence
  • Enhancing long-term reliability
  • Aligning infrastructure with project philosophy

Reproducibility is now continuously validated in CI, not assumed.


Design Principles Followed

  • Deterministic file traversal
  • Canonical serialization
  • Byte-level hashing consistency
  • Cross-platform normalization
  • CI-first enforcement strategy
  • Minimal performance overhead

Future Improvements

  • Larger-scale synthetic dataset stress tests
  • Manifest diff inspection tool
  • Performance benchmarking suite
  • Parallel processing determinism validation
  • Cryptographic manifest signing layer

Conclusion

This PR establishes a foundational enforcement mechanism for deterministic dataset fingerprinting.

OpenVerifiableLLM now moves from theoretical reproducibility guarantees to continuously verified reproducibility guarantees.

Feedback and review are welcome.

Closes #6

Summary by CodeRabbit

  • New Features

    • Public API now exposes manifest generation, validation, comparison, and reproducible hashing utilities.
    • Added a command-line reproducibility test utility for generating and verifying deterministic datasets.
  • Tests

    • Added comprehensive tests for manifest reproducibility, hashing (including large/binary files), and integrity checks.
    • Added automated cross-platform reproducibility validation workflow to verify consistent hashes across OSes.


coderabbitai bot commented Feb 22, 2026

Warning

Rate limit exceeded

@Muneerali199 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 13 minutes and 58 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between b9a3aaf and 9ce1262.

📒 Files selected for processing (4)
  • openverifiablellm/__init__.py
  • openverifiablellm/pipeline.py
  • scripts/test_reproducibility.py
  • tests/test_reproducibility_pipeline.py

Walkthrough

Adds a cross-platform reproducibility CI workflow, a manifest pipeline module with hashing/normalization utilities, test scripts and test suites, a reproducibility test script, and a reproducible-hash artifact file to validate deterministic dataset manifests across OSes and runs.

Changes

Cohort / File(s) Summary
CI Workflow
.github/workflows/reproducibility-tests.yml
New GitHub Actions workflow with three jobs: dataset hashing job, cross-platform matrix (ubuntu/windows/macos) producing per-OS reproducible-hash-<os> artifacts, and a hash-comparison job that aggregates and fails on mismatches.
Core Pipeline Module
openverifiablellm/pipeline.py
New manifest pipeline: line-ending normalization, normalized-per-file SHA-256 (with fallback), deterministic manifest generation (posix paths, sorted), manifest hash computation, manifest integrity validation, and manifest comparison utilities.
Public API Exports
openverifiablellm/__init__.py
Re-exports compute_sha256 (from utils) and pipeline functions: generate_manifest, get_manifest_hash, run_pipeline, compute_normalized_sha256, normalize_line_endings, validate_manifest_integrity, compare_manifests. Updates __all__.
Reproducibility Test Script & Output
scripts/test_reproducibility.py, reproducible_hash.txt
Adds a CLI script to create deterministic test datasets, run multi-run reproducibility checks, and optionally write the computed reproducible hash; includes a checked-in reproducible_hash.txt containing the expected hash.
Unit Tests — Hashing
tests/test_dataset_hash.py
New tests covering compute_sha256: correctness, chunked processing, empty/binary/large files, FileNotFound handling, and Path vs string input parity.
Unit Tests — Reproducibility Pipeline
tests/test_reproducibility_pipeline.py
Comprehensive tests for manifest generation, deterministic ordering, stable manifest hashing, change detection, empty/large datasets, cross-platform path normalization, integrity verification, JSON serialization stability, binary file handling, and end-to-end pipeline consistency.

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions
    participant Runner as OS Runner
    participant Pipeline as Manifest Pipeline
    participant FS as Filesystem
    participant Hash as SHA-256 Service
    participant Artifact as Artifact Storage
    participant Compare as Hash Comparator

    GHA->>Runner: Start matrix job (ubuntu/windows/macos)
    Runner->>Pipeline: Run reproducibility script / pipeline
    Pipeline->>FS: Walk directory (posix-relative paths)
    FS->>Pipeline: Return file list
    Pipeline->>Hash: Normalize contents (LF), compute per-file SHA-256
    Hash->>Pipeline: Return per-file hashes
    Pipeline->>Hash: Serialize deterministic manifest JSON, compute manifest hash
    Pipeline->>Artifact: Upload `reproducible-hash-<os>` artifact
    Artifact->>Compare: Download all `reproducible-hash-*`
    Compare->>Compare: Compare hashes for bitwise equality
    alt All match
        Compare->>GHA: Success
    else Any mismatch
        Compare->>GHA: Fail CI
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I hopped through files, normalized each line,
hashes matched neatly, bitwise and fine.
From Linux to Mac and Windows bright,
reproducible wonders in plain sight.
A carrot for CI—determinism divine! 🥕

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title accurately describes the main change: implementing a reproducibility validation pipeline in CI to enforce deterministic dataset fingerprints across executions and operating systems.
  • Linked Issues Check: ✅ Passed. The PR implements all core coding requirements from issue #6: synthetic dataset support via pipeline.py utilities, multi-run validation workflow in reproducibility-tests.yml, cross-platform matrix testing (ubuntu/windows/macos), hash comparison logic, and comprehensive test coverage including reproducibility tests.
  • Out of Scope Changes Check: ✅ Passed. All changes are directly aligned with issue #6 requirements: pipeline utilities for deterministic manifest generation, GitHub Actions workflow for multi-run/multi-OS validation, reproducibility test scripts, and unit tests. No extraneous changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@Archit381 (Member) commented:

@Muneerali199 most of the utilities in openverifiablellm/pipeline.py have already been added in previous PRs.

Can you build upon the current setup and add some concrete tests to use in GitHub Actions? The current one doesn't seem that important.

…flow

- Add comprehensive test suite for dataset hashing (8 tests)
- Add pipeline integration tests (13 tests)
- Add GitHub Actions workflow with cross-platform validation
- Add CLI script for reproducibility testing
- Add pipeline utilities with manifest generation
@Muneerali199 force-pushed the feat/reproducibility-pipeline branch from a4453c1 to fe027c0 on February 27, 2026 at 09:57
@github-actions bot added the documentation and size/XL labels and removed the size/L and size/XL labels on Feb 27, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function currently only
checks listed manifest entries and lets file-missing errors propagate; update it
so it (1) safely handles missing/unreadable files by catching
FileNotFoundError/OSError from compute_normalized_sha256/compute_sha256 and
returning False instead of raising, and (2) ensures there are no extra files in
the target directory by comparing the set of manifest paths (entry["path"]) to
the set of actual file paths under directory_path (e.g., using Path.rglob or
Path.iterdir and relative paths), returning False if any extra or missing files
are found, while still performing the existing per-entry hash comparison
(recomputed_hash) in the loop.

In `@scripts/test_reproducibility.py`:
- Around line 15-33: The manifest hashing in generate_manifest uses raw
compute_sha256 on files, causing divergence from production because it skips
openverifiablellm.pipeline's normalization; update generate_manifest to read
each file, pass its contents through the pipeline normalization routine (use the
normalization helper in openverifiablellm.pipeline — e.g.,
pipeline.normalize_content or the equivalent exported function) and compute the
SHA256 over the normalized bytes/string instead of the raw file bytes, keeping
the rest of the manifest fields the same; ensure get_manifest_hash still
canonicalizes JSON with sort_keys and separators as before so overall hashing
matches production semantics.

In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: The test defines local duplicates of normalize_line_endings,
generate_manifest, and get_manifest_hash which shadow the production API; remove
these local implementations and instead import the canonical functions from
openverifiablellm.pipeline (e.g., from openverifiablellm.pipeline import
normalize_line_endings, generate_manifest, get_manifest_hash) and update any
test references to call those imported symbols so the tests exercise the real
production implementations.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cb91e9 and fe027c0.

📒 Files selected for processing (7)
  • .github/workflows/reproducibility-tests.yml
  • openverifiablellm/__init__.py
  • openverifiablellm/pipeline.py
  • reproducible_hash.txt
  • scripts/test_reproducibility.py
  • tests/test_dataset_hash.py
  • tests/test_reproducibility_pipeline.py

@github-actions github-actions bot added size/XL and removed size/XL labels Feb 27, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
tests/test_reproducibility_pipeline.py (1)

9-33: ⚠️ Potential issue | 🟠 Major

Tests should call openverifiablellm.pipeline directly, not local helper clones.

These local implementations shadow production behavior, so the suite can miss regressions in shipped code paths (including normalization/hash behavior).

Proposed fix
-import json
-import hashlib
+import json
+import hashlib
@@
-from openverifiablellm import compute_sha256
+from openverifiablellm import compute_sha256
+from openverifiablellm.pipeline import (
+    normalize_line_endings,
+    generate_manifest,
+    get_manifest_hash,
+)
@@
-def normalize_line_endings(content: str) -> str:
-    ...
-
-def generate_manifest(directory_path: Path) -> dict:
-    ...
-
-def get_manifest_hash(manifest: dict) -> str:
-    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_reproducibility_pipeline.py` around lines 9 - 33, Replace the
local test helpers with calls to the production implementations instead of
shadowing them: remove or stop using the local normalize_line_endings,
generate_manifest, and get_manifest_hash functions in tests and import them from
the package (e.g. from openverifiablellm.pipeline import normalize_line_endings,
generate_manifest, get_manifest_hash or import openverifiablellm.pipeline and
reference pipeline.generate_manifest/etc.); ensure the tests call the production
functions so normalization, hashing, and manifest generation behavior exercised
in tests matches shipped code.
openverifiablellm/pipeline.py (1)

53-63: ⚠️ Potential issue | 🟠 Major

validate_manifest_integrity can miss tampering and break its boolean contract.

It currently ignores extra files and may raise on missing/unreadable paths instead of returning False.

Proposed fix
 def validate_manifest_integrity(manifest: Dict, directory_path: Union[str, Path]) -> bool:
     dir_path = Path(directory_path)
-    for entry in manifest["files"]:
+    manifest_files = manifest.get("files", [])
+    manifest_paths = {entry.get("path") for entry in manifest_files if "path" in entry}
+    actual_paths = {
+        str(p.relative_to(dir_path)).replace("\\", "/")
+        for p in dir_path.glob("**/*")
+        if p.is_file()
+    }
+    if manifest_paths != actual_paths:
+        return False
+
+    for entry in manifest_files:
         file_path = dir_path / entry["path"]
         try:
-            recomputed_hash = compute_normalized_sha256(file_path)
-        except (UnicodeDecodeError, ValueError):
-            recomputed_hash = compute_sha256(file_path)
+            try:
+                recomputed_hash = compute_normalized_sha256(file_path)
+            except (UnicodeDecodeError, ValueError):
+                recomputed_hash = compute_sha256(file_path)
+        except (FileNotFoundError, OSError):
+            return False
         if entry["sha256"] != recomputed_hash:
             return False
     return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/pipeline.py` around lines 53 - 63, The
validate_manifest_integrity function can raise on missing/unreadable files and
currently ignores extra files; update it so it always returns a bool by (1)
verifying the set of actual files under directory_path (relative paths) matches
exactly the set of manifest["files"] paths (treat extra or missing files as
integrity failure), and (2) when iterating entries call
compute_normalized_sha256/compute_sha256 but catch any file-access errors (e.g.,
FileNotFoundError, PermissionError, UnicodeDecodeError, ValueError and any other
IO-related exceptions) and return False instead of letting exceptions propagate;
reference validate_manifest_integrity, manifest["files"], entry["path"],
compute_normalized_sha256 and compute_sha256 when making these changes.
scripts/test_reproducibility.py (1)

12-33: ⚠️ Potential issue | 🔴 Critical

Use the production pipeline API for manifest/hash generation.

At Line 22, raw compute_sha256 hashing diverges from openverifiablellm/pipeline.py semantics (normalized text hashing + canonical manifest flow). This undermines the CI reproducibility signal.

Proposed fix
-from openverifiablellm import compute_sha256
+from openverifiablellm.pipeline import (
+    generate_manifest,
+    get_manifest_hash,
+)

-def generate_manifest(directory_path: Path) -> dict:
-    ...
-
-def get_manifest_hash(manifest: dict) -> str:
-    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test_reproducibility.py` around lines 12 - 33, The current
generate_manifest and get_manifest_hash use compute_sha256 directly which
bypasses the production pipeline semantics; instead import and call the
canonical pipeline functions (e.g., use
openverifiablellm.pipeline.generate_manifest and
openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API)
rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest
to delegate to the pipeline API for normalized text hashing and canonical
manifest creation and replace get_manifest_hash to call the pipeline's
manifest-hash function so CI uses the exact production manifest/hash logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/__init__.py`:
- Around line 12-21: The __all__ export list is unsorted and triggers RUF022;
sort the entries in the __all__ list into a deterministic order (e.g.,
alphabetical) so the export declarations are stable under lint checks—update the
__all__ array containing "compute_sha256", "compute_normalized_sha256",
"compare_manifests", "generate_manifest", "get_manifest_hash",
"normalize_line_endings", "run_pipeline", and "validate_manifest_integrity" to
be in sorted order.

In `@scripts/test_reproducibility.py`:
- Around line 64-72: run_all_tests currently ignores CLI options by using
defaults; change its signature to accept runs and seed parameters (e.g., def
run_all_tests(runs: int, seed: int) -> Dict[str, bool]) and pass those values
into test_reproducibility_multiple_runs (and any other test helper it calls).
Then update the caller in main() to forward the parsed CLI --runs and --seed
into run_all_tests. Also update the other test invocations inside run_all_tests
(the calls around the 91-95 area) so they receive and propagate runs/seed as
needed to ensure CLI values are used throughout.

---

Duplicate comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function can raise on
missing/unreadable files and currently ignores extra files; update it so it
always returns a bool by (1) verifying the set of actual files under
directory_path (relative paths) matches exactly the set of manifest["files"]
paths (treat extra or missing files as integrity failure), and (2) when
iterating entries call compute_normalized_sha256/compute_sha256 but catch any
file-access errors (e.g., FileNotFoundError, PermissionError,
UnicodeDecodeError, ValueError and any other IO-related exceptions) and return
False instead of letting exceptions propagate; reference
validate_manifest_integrity, manifest["files"], entry["path"],
compute_normalized_sha256 and compute_sha256 when making these changes.

In `@scripts/test_reproducibility.py`:
- Around line 12-33: The current generate_manifest and get_manifest_hash use
compute_sha256 directly which bypasses the production pipeline semantics;
instead import and call the canonical pipeline functions (e.g., use
openverifiablellm.pipeline.generate_manifest and
openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API)
rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest
to delegate to the pipeline API for normalized text hashing and canonical
manifest creation and replace get_manifest_hash to call the pipeline's
manifest-hash function so CI uses the exact production manifest/hash logic.

In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: Replace the local test helpers with calls to the production
implementations instead of shadowing them: remove or stop using the local
normalize_line_endings, generate_manifest, and get_manifest_hash functions in
tests and import them from the package (e.g. from openverifiablellm.pipeline
import normalize_line_endings, generate_manifest, get_manifest_hash or import
openverifiablellm.pipeline and reference pipeline.generate_manifest/etc.);
ensure the tests call the production functions so normalization, hashing, and
manifest generation behavior exercised in tests matches shipped code.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe027c0 and b9a3aaf.

📒 Files selected for processing (5)
  • openverifiablellm/__init__.py
  • openverifiablellm/pipeline.py
  • scripts/test_reproducibility.py
  • tests/test_dataset_hash.py
  • tests/test_reproducibility_pipeline.py

Comment on lines 12 to 21
__all__ = [
    "compute_sha256",
    "generate_manifest",
    "get_manifest_hash",
    "run_pipeline",
    "compute_normalized_sha256",
    "normalize_line_endings",
    "validate_manifest_integrity",
    "compare_manifests",
]

⚠️ Potential issue | 🟡 Minor

Sort __all__ to keep export declarations stable under lint checks.

The current list order triggers Ruff RUF022; sort entries (or keep a deliberate deterministic order consistent with your lint config) to avoid CI noise.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 12-21: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/__init__.py` around lines 12 - 21, The __all__ export list
is unsorted and triggers RUF022; sort the entries in the __all__ list into a
deterministic order (e.g., alphabetical) so the export declarations are stable
under lint checks—update the __all__ array containing "compute_sha256",
"compute_normalized_sha256", "compare_manifests", "generate_manifest",
"get_manifest_hash", "normalize_line_endings", "run_pipeline", and
"validate_manifest_integrity" to be in sorted order.

- Sort __all__ exports alphabetically (RUF022)
- Update run_all_tests to accept runs/seed parameters
- Use production pipeline functions in test_reproducibility.py
- Fix validate_manifest_integrity to check extra/missing files and handle exceptions
- Replace local test helpers with production implementations in tests
@github-actions github-actions bot added size/XL and removed size/XL labels Feb 27, 2026
@Muneerali199 (Author) commented:

Could you review it now, @Archit381?



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add Cross-Platform Reproducibility CI Matrix (Multi-Run Determinism Validation)

2 participants