feat: implement reproducibility validation pipeline in CI#5
feat: implement reproducibility validation pipeline in CI#5Muneerali199 wants to merge 4 commits intoAOSSIE-Org:mainfrom
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (4)
WalkthroughAdds a cross-platform reproducibility CI workflow, a manifest pipeline module with hashing/normalization utilities, test scripts and test suites, a reproducibility test script, and a reproducible-hash artifact file to validate deterministic dataset manifests across OSes and runs. Changes
Sequence Diagram(s)sequenceDiagram
participant GHA as GitHub Actions
participant Runner as OS Runner
participant Pipeline as Manifest Pipeline
participant FS as Filesystem
participant Hash as SHA-256 Service
participant Artifact as Artifact Storage
participant Compare as Hash Comparator
GHA->>Runner: Start matrix job (ubuntu/windows/macos)
Runner->>Pipeline: Run reproducibility script / pipeline
Pipeline->>FS: Walk directory (posix-relative paths)
FS->>Pipeline: Return file list
Pipeline->>Hash: Normalize contents (LF), compute per-file SHA-256
Hash->>Pipeline: Return per-file hashes
Pipeline->>Hash: Serialize deterministic manifest JSON, compute manifest hash
Pipeline->>Artifact: Upload `reproducible-hash-<os>` artifact
Artifact->>Compare: Download all `reproducible-hash-*`
Compare->>Compare: Compare hashes for bitwise equality
alt All match
Compare->>GHA: Success
else Any mismatch
Compare->>GHA: Fail CI
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@Muneerali199 most of the utilities from the Can you build upon the current setup and add some concrete tests to use in github actions. Current one doesnt seem that important |
…flow - Add comprehensive test suite for dataset hashing (8 tests) - Add pipeline integration tests (13 tests) - Add GitHub Actions workflow with cross-platform validation - Add CLI script for reproducibility testing - Add pipeline utilities with manifest generation
a4453c1 to
fe027c0
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function currently only
checks listed manifest entries and lets file-missing errors propagate; update it
so it (1) safely handles missing/unreadable files by catching
FileNotFoundError/OSError from compute_normalized_sha256/compute_sha256 and
returning False instead of raising, and (2) ensures there are no extra files in
the target directory by comparing the set of manifest paths (entry["path"]) to
the set of actual file paths under directory_path (e.g., using Path.rglob or
Path.iterdir and relative paths), returning False if any extra or missing files
are found, while still performing the existing per-entry hash comparison
(recomputed_hash) in the loop.
In `@scripts/test_reproducibility.py`:
- Around line 15-33: The manifest hashing in generate_manifest uses raw
compute_sha256 on files, causing divergence from production because it skips
openverifiablellm.pipeline's normalization; update generate_manifest to read
each file, pass its contents through the pipeline normalization routine (use the
normalization helper in openverifiablellm.pipeline — e.g.,
pipeline.normalize_content or the equivalent exported function) and compute the
SHA256 over the normalized bytes/string instead of the raw file bytes, keeping
the rest of the manifest fields the same; ensure get_manifest_hash still
canonicalizes JSON with sort_keys and separators as before so overall hashing
matches production semantics.
In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: The test defines local duplicates of normalize_line_endings,
generate_manifest, and get_manifest_hash which shadow the production API; remove
these local implementations and instead import the canonical functions from
openverifiablellm.pipeline (e.g., from openverifiablellm.pipeline import
normalize_line_endings, generate_manifest, get_manifest_hash) and update any
test references to call those imported symbols so the tests exercise the real
production implementations.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
.github/workflows/reproducibility-tests.ymlopenverifiablellm/__init__.pyopenverifiablellm/pipeline.pyreproducible_hash.txtscripts/test_reproducibility.pytests/test_dataset_hash.pytests/test_reproducibility_pipeline.py
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
tests/test_reproducibility_pipeline.py (1)
9-33:⚠️ Potential issue | 🟠 MajorTests should call
openverifiablellm.pipelinedirectly, not local helper clones.These local implementations shadow production behavior, so the suite can miss regressions in shipped code paths (including normalization/hash behavior).
Proposed fix
-import json -import hashlib +import json +import hashlib @@ -from openverifiablellm import compute_sha256 +from openverifiablellm import compute_sha256 +from openverifiablellm.pipeline import ( + normalize_line_endings, + generate_manifest, + get_manifest_hash, +) @@ -def normalize_line_endings(content: str) -> str: - ... - -def generate_manifest(directory_path: Path) -> dict: - ... - -def get_manifest_hash(manifest: dict) -> str: - ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_reproducibility_pipeline.py` around lines 9 - 33, Replace the local test helpers with calls to the production implementations instead of shadowing them: remove or stop using the local normalize_line_endings, generate_manifest, and get_manifest_hash functions in tests and import them from the package (e.g. from openverifiablellm.pipeline import normalize_line_endings, generate_manifest, get_manifest_hash or import openverifiablellm.pipeline and reference pipeline.generate_manifest/etc.); ensure the tests call the production functions so normalization, hashing, and manifest generation behavior exercised in tests matches shipped code.openverifiablellm/pipeline.py (1)
53-63:⚠️ Potential issue | 🟠 Major
validate_manifest_integritycan miss tampering and break its boolean contract.It currently ignores extra files and may raise on missing/unreadable paths instead of returning
False.Proposed fix
def validate_manifest_integrity(manifest: Dict, directory_path: Union[str, Path]) -> bool: dir_path = Path(directory_path) - for entry in manifest["files"]: + manifest_files = manifest.get("files", []) + manifest_paths = {entry.get("path") for entry in manifest_files if "path" in entry} + actual_paths = { + str(p.relative_to(dir_path)).replace("\\", "/") + for p in dir_path.glob("**/*") + if p.is_file() + } + if manifest_paths != actual_paths: + return False + + for entry in manifest_files: file_path = dir_path / entry["path"] try: - recomputed_hash = compute_normalized_sha256(file_path) - except (UnicodeDecodeError, ValueError): - recomputed_hash = compute_sha256(file_path) + try: + recomputed_hash = compute_normalized_sha256(file_path) + except (UnicodeDecodeError, ValueError): + recomputed_hash = compute_sha256(file_path) + except (FileNotFoundError, OSError): + return False if entry["sha256"] != recomputed_hash: return False return True🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/pipeline.py` around lines 53 - 63, The validate_manifest_integrity function can raise on missing/unreadable files and currently ignores extra files; update it so it always returns a bool by (1) verifying the set of actual files under directory_path (relative paths) matches exactly the set of manifest["files"] paths (treat extra or missing files as integrity failure), and (2) when iterating entries call compute_normalized_sha256/compute_sha256 but catch any file-access errors (e.g., FileNotFoundError, PermissionError, UnicodeDecodeError, ValueError and any other IO-related exceptions) and return False instead of letting exceptions propagate; reference validate_manifest_integrity, manifest["files"], entry["path"], compute_normalized_sha256 and compute_sha256 when making these changes.scripts/test_reproducibility.py (1)
12-33:⚠️ Potential issue | 🔴 CriticalUse the production pipeline API for manifest/hash generation.
At Line 22, raw
compute_sha256hashing diverges fromopenverifiablellm/pipeline.pysemantics (normalized text hashing + canonical manifest flow). This undermines the CI reproducibility signal.Proposed fix
-from openverifiablellm import compute_sha256 +from openverifiablellm.pipeline import ( + generate_manifest, + get_manifest_hash, +) -def generate_manifest(directory_path: Path) -> dict: - ... - -def get_manifest_hash(manifest: dict) -> str: - ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/test_reproducibility.py` around lines 12 - 33, The current generate_manifest and get_manifest_hash use compute_sha256 directly which bypasses the production pipeline semantics; instead import and call the canonical pipeline functions (e.g., use openverifiablellm.pipeline.generate_manifest and openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API) rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest to delegate to the pipeline API for normalized text hashing and canonical manifest creation and replace get_manifest_hash to call the pipeline's manifest-hash function so CI uses the exact production manifest/hash logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/__init__.py`:
- Around line 12-21: The __all__ export list is unsorted and triggers RUF022;
sort the entries in the __all__ list into a deterministic order (e.g.,
alphabetical) so the export declarations are stable under lint checks—update the
__all__ array containing "compute_sha256", "compute_normalized_sha256",
"compare_manifests", "generate_manifest", "get_manifest_hash",
"normalize_line_endings", "run_pipeline", and "validate_manifest_integrity" to
be in sorted order.
In `@scripts/test_reproducibility.py`:
- Around line 64-72: run_all_tests currently ignores CLI options by using
defaults; change its signature to accept runs and seed parameters (e.g., def
run_all_tests(runs: int, seed: int) -> Dict[str, bool]) and pass those values
into test_reproducibility_multiple_runs (and any other test helper it calls).
Then update the caller in main() to forward the parsed CLI --runs and --seed
into run_all_tests. Also update the other test invocations inside run_all_tests
(the calls around the 91-95 area) so they receive and propagate runs/seed as
needed to ensure CLI values are used throughout.
---
Duplicate comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function can raise on
missing/unreadable files and currently ignores extra files; update it so it
always returns a bool by (1) verifying the set of actual files under
directory_path (relative paths) matches exactly the set of manifest["files"]
paths (treat extra or missing files as integrity failure), and (2) when
iterating entries call compute_normalized_sha256/compute_sha256 but catch any
file-access errors (e.g., FileNotFoundError, PermissionError,
UnicodeDecodeError, ValueError and any other IO-related exceptions) and return
False instead of letting exceptions propagate; reference
validate_manifest_integrity, manifest["files"], entry["path"],
compute_normalized_sha256 and compute_sha256 when making these changes.
In `@scripts/test_reproducibility.py`:
- Around line 12-33: The current generate_manifest and get_manifest_hash use
compute_sha256 directly which bypasses the production pipeline semantics;
instead import and call the canonical pipeline functions (e.g., use
openverifiablellm.pipeline.generate_manifest and
openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API)
rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest
to delegate to the pipeline API for normalized text hashing and canonical
manifest creation and replace get_manifest_hash to call the pipeline's
manifest-hash function so CI uses the exact production manifest/hash logic.
In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: Replace the local test helpers with calls to the production
implementations instead of shadowing them: remove or stop using the local
normalize_line_endings, generate_manifest, and get_manifest_hash functions in
tests and import them from the package (e.g. from openverifiablellm.pipeline
import normalize_line_endings, generate_manifest, get_manifest_hash or import
openverifiablellm.pipeline and reference pipeline.generate_manifest/etc.);
ensure the tests call the production functions so normalization, hashing, and
manifest generation behavior exercised in tests matches shipped code.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
openverifiablellm/__init__.pyopenverifiablellm/pipeline.pyscripts/test_reproducibility.pytests/test_dataset_hash.pytests/test_reproducibility_pipeline.py
| __all__ = [ | ||
| "compute_sha256", | ||
| "generate_manifest", | ||
| "get_manifest_hash", | ||
| "run_pipeline", | ||
| "compute_normalized_sha256", | ||
| "normalize_line_endings", | ||
| "validate_manifest_integrity", | ||
| "compare_manifests", | ||
| ] |
There was a problem hiding this comment.
Sort __all__ to keep export declarations stable under lint checks.
The current list order triggers Ruff RUF022; sort entries (or keep a deliberate deterministic order consistent with your lint config) to avoid CI noise.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 12-21: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/__init__.py` around lines 12 - 21, The __all__ export list
is unsorted and triggers RUF022; sort the entries in the __all__ list into a
deterministic order (e.g., alphabetical) so the export declarations are stable
under lint checks—update the __all__ array containing "compute_sha256",
"compute_normalized_sha256", "compare_manifests", "generate_manifest",
"get_manifest_hash", "normalize_line_endings", "run_pipeline", and
"validate_manifest_integrity" to be in sorted order.
- Sort __all__ exports alphabetically (RUF022) - Update run_all_tests to accept runs/seed parameters - Use production pipeline functions in test_reproducibility.py - Fix validate_manifest_integrity to check extra/missing files and handle exceptions - Replace local test helpers with production implementations in tests
|
sir could you review it now @Archit381 |
Overview
OpenVerifiableLLM is designed to guarantee deterministic preprocessing and reproducible dataset fingerprints.
This PR introduces a comprehensive reproducibility validation framework that enforces bitwise-identical dataset fingerprints across:
This converts reproducibility from a design assumption into a continuously enforced property via automated CI.
Motivation
Determinism can silently degrade due to subtle implementation changes such as:
LFvsCRLF)Without automated enforcement, these regressions are difficult to detect during code review.
Given that reproducibility is central to OpenVerifiableLLM’s mission, it must be continuously validated.
Changes Introduced
1. Deterministic Manifest Generation Improvements
File:
openverifiablellm/pipeline.pyCRLF → LF) before hashing.2. Synthetic Dataset Generator
File:
openverifiablellm/synthetic_data.py3. Cross-Platform Reproducibility CI Workflow
File:
.github/workflows/reproducibility.ymlIntroduced a multi-layer validation pipeline:
Multi-Run Validation (Same OS)
Cross-OS Matrix Validation
Runs on:
ubuntu-latestwindows-latestmacos-latestArtifact Hash Comparison
compare-hashesjob asserts strict equality.4. Dedicated Testing Suite
File:
tests/test_reproducibility.pyAdded unit tests covering:
5. Local Reproducibility Validation Script
File:
scripts/reproducibility_check.pyAdded standalone verification utility:
Run locally using:
python scripts/reproducibility_check.pyThis allows contributors to confirm that modifications do not break deterministic hashing before opening a PR.
Verification
Local Testing
Executed:
python -m pytestAll tests passed, including new reproducibility enforcement tests.
Manual Cross-Run Validation (Windows)
Output:
SUCCESS: Reproducibility confirmed across 3 runs.
FINAL ROOT HASH: 4f7350f980174ea96a8b2e469777d945dc5f3ae2eb8b0fb954a8e447b9b9b401
CI Enforcement
Workflow triggers on:
maindevelopCI will now fail automatically if reproducibility is broken.
Impact
This PR significantly strengthens OpenVerifiableLLM by:
Reproducibility is now mathematically validated, not assumed.
Design Principles Followed
Future Improvements
Conclusion
This PR establishes a foundational enforcement mechanism for deterministic dataset fingerprinting.
OpenVerifiableLLM now moves from theoretical reproducibility guarantees to continuously verified reproducibility guarantees.
Feedback and review are welcome.
closed #6
Summary by CodeRabbit
New Features
Tests