
docs: add Index Authority Receipts for IFC evidence#270

Open
Maurice Witten (blocksifrdev) wants to merge 3 commits into Project-Navi:main from blocksifrdev:docs/add-index-authority-receipts

Conversation

@blocksifrdev


Summary

Adds an optional CAIF-style Index Authority Receipt for ordvec benchmark evidence.

The goal is to make ordvec's index-first retrieval evidence machine-readable: quality delta, bytes/vector, latency regime, benchmark scope, limitations, fallback conditions, and a deterministic receipt hash.

Why

ordvec already has a strong index-first compute story: compressed ordinal/sign retrieval can preserve retrieval quality under stated benchmark scopes while reducing storage and latency.

This PR adds a small evidence packet and verifier so downstream systems can answer:

Is this compressed/index-first retrieval path authorized to answer before dense compute for this stated workload scope?

What this includes

  • docs/INDEX_AUTHORITY_RECEIPTS.md
  • schemas/caif/ordvec-index-authority.v0.1.schema.json
  • examples/caif/trec-covid-sign-rq2.index-authority.json
  • tools/verify_index_authority.py

What this does not do

  • Does not change Rust code
  • Does not change Cargo.toml
  • Does not add runtime dependencies
  • Does not add CI requirements
  • Does not claim new benchmark results
  • Does not add signing, key management, or deployment trust policy

Verification

python3 tools/verify_index_authority.py examples/caif/trec-covid-sign-rq2.index-authority.json

Expected output includes:

decision: ALLOW_INDEX_FIRST
quality_within_bootstrap_noise: true
storage_reduction: 10.67x
single_query_speedup: 105.66x
receipt_hash: sha256:...
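The thread does not show how receipt_hash is derived. One common way to make such a hash deterministic is to hash a canonical JSON serialization. The sketch below assumes sorted keys, compact separators, and UTF-8 encoding; `receipt_hash` here is a hypothetical helper, not the function in tools/verify_index_authority.py.

```python
import hashlib
import json


def receipt_hash(receipt: dict) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(
        receipt,
        sort_keys=True,          # stable key order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,
        allow_nan=False,         # NaN/Infinity are not valid JSON, so fail loudly
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two illustrative receipts that differ only in key order.
a = {"schema": "example", "economics": {"storage_reduction_x": 10.67}}
b = {"economics": {"storage_reduction_x": 10.67}, "schema": "example"}
print(receipt_hash(a) == receipt_hash(b))
```

Any change to a value changes the digest, while reordering keys does not.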

Scope

The example uses existing public README benchmark values and preserves the stated limitations around dataset, encoder, corpus size, batch/threading regime, HNSW comparison, and larger-corpus claims.

Framing

Benchmarks should not only report performance.

They should authorize compute paths within a defined evidence envelope.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a880e2b115

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread tools/verify_index_authority.py Outdated
Comment on lines +84 to +87
if float(economics.get("storage_reduction_x", 0)) < float(policy.get("min_storage_reduction_x", 0)):
    return "REQUIRE_DENSE_FALLBACK"

if float(economics.get("single_query_speedup_x", 0)) < float(policy.get("min_single_query_speedup_x", 0)):


P2: Recompute ratios before applying policy thresholds

When verifying a receipt whose raw bytes/latencies don’t match the derived storage_reduction_x/single_query_speedup_x, this branch authorizes based only on the supplied derived numbers. For example, a receipt can set baseline.bytes_per_vector == candidate_bytes_per_vector and equal latencies, but inflate both *_x fields above the policy thresholds and still get ALLOW_INDEX_FIRST. Since those ratios are included alongside their source values in the receipt, the verifier should recompute or at least cross-check them before using them for authorization.
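A minimal sketch of the cross-check this comment asks for. It assumes a ratio is defined as baseline divided by candidate and borrows the 0.02 tolerance that appears in a later revision quoted in this thread; `cross_check_ratio` is a hypothetical helper name.

```python
def cross_check_ratio(declared: float, baseline: float, candidate: float,
                      tolerance: float = 0.02) -> float:
    """Recompute baseline/candidate and reject a declared ratio that disagrees."""
    if baseline <= 0 or candidate <= 0:
        raise ValueError("raw measurements must be positive")
    recomputed = baseline / candidate
    if abs(declared - recomputed) > tolerance:
        raise ValueError(f"declared ratio {declared} != recomputed {recomputed:.4f}")
    return recomputed


# Equal raw values with an inflated declared ratio, as in the comment above.
try:
    cross_check_ratio(declared=10.67, baseline=128.0, candidate=128.0)
except ValueError as exc:
    print("rejected:", exc)
```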


Comment thread tools/verify_index_authority.py Outdated
args = parser.parse_args()

data = load_json(args.receipt)
errors = shape_errors(data)


P2: Reject schema-invalid receipts before computing decisions

The verifier’s only validation here is the custom shape_errors() subset, so receipts that violate the checked-in JSON schema either crash later or are still authorized. For example, omitting the schema-required decision.policy raises a traceback in compute_decision, while strings for numeric fields or extra properties are accepted even though the schema rejects them. Since this command is the documented verifier for machine-readable receipts, it should run full schema validation or mirror the required/type/additionalProperties checks before computing authorization.
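One way to mirror the required/type/additionalProperties checks without adding a dependency is a small hand-written validator. The sketch below is an assumption about how that could look. The checked-in schema is not shown in this thread, so the `ECONOMICS` field set is an illustrative subset and `check_object` is a hypothetical helper.

```python
def check_object(obj, required: dict, label: str) -> list:
    """Return error strings for missing keys, wrong types, and extra keys."""
    if not isinstance(obj, dict):
        return [f"{label} must be an object"]
    errors = []
    for key, expected in required.items():
        if key not in obj:
            errors.append(f"{label}.{key} is required")
        elif isinstance(obj[key], bool) and expected is not bool:
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{label}.{key} has the wrong type")
        elif not isinstance(obj[key], expected):
            errors.append(f"{label}.{key} has the wrong type")
    for key in obj:
        if key not in required:  # mirrors additionalProperties: false
            errors.append(f"{label}.{key} is not allowed")
    return errors


# Illustrative subset of fields named in this thread.
ECONOMICS = {
    "storage_reduction_x": (int, float),
    "single_query_speedup_x": (int, float),
}
errors = check_object({"storage_reduction_x": "10.67", "extra": 1}, ECONOMICS, "economics")
print(errors)
```

Running a check like this before compute_decision turns a missing decision.policy into a reported error rather than a traceback.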


Comment thread tools/verify_index_authority.py Outdated
Comment on lines +87 to +90
if float(economics.get("single_query_speedup_x", 0)) < float(policy.get("min_single_query_speedup_x", 0)):
    return "REQUIRE_DENSE_FALLBACK"

return "ALLOW_INDEX_FIRST"


P2: Handle the HNSW-comparison decision state

The schema and docs advertise REQUIRE_HNSW_COMPARISON as a valid decision, but compute_decision has no path that can return it; after the existing checks pass, every receipt falls through to ALLOW_INDEX_FIRST. A receipt for the documented regime where graph/ANN comparison is required will therefore always fail with a decision mismatch, so the verifier needs a policy/scope predicate for that state or the state should not be accepted as valid.


Comment thread tools/verify_index_authority.py Outdated
ifc = data["ifc"]
evidence = data["evidence"]
economics = data["economics"]
policy = data["decision"]["policy"]


P2: Require verifier-owned acceptance policy

Because the verifier reads the policy thresholds from the receipt being evaluated, a schema-valid receipt can authorize itself by lowering min_storage_reduction_x/min_single_query_speedup_x to zero or disabling the quality requirement, even when the reported speedup and storage reduction are below any meaningful bar. For an authorization verifier, these acceptance rules need to come from the verifier configuration or fixed minimums rather than the untrusted evidence packet itself.
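A sketch of the separation this comment describes, assuming the thresholds live in verifier-side configuration. The 1.25 speedup value matches the default cited later in this thread; the storage value is a placeholder, and `VERIFIER_POLICY` and `economics_pass` are hypothetical names.

```python
# Thresholds owned by the verifier. The storage value is a placeholder.
VERIFIER_POLICY = {
    "min_storage_reduction_x": 2.0,
    "min_single_query_speedup_x": 1.25,
}


def economics_pass(economics: dict, policy: dict = VERIFIER_POLICY) -> bool:
    """Apply verifier-owned thresholds and ignore any policy inside the receipt."""
    return (
        economics["storage_reduction_x"] >= policy["min_storage_reduction_x"]
        and economics["single_query_speedup_x"] >= policy["min_single_query_speedup_x"]
    )


# A receipt that tries to authorize itself by zeroing its own thresholds.
receipt = {
    "economics": {"storage_reduction_x": 1.0, "single_query_speedup_x": 1.0},
    "decision": {"policy": {"min_storage_reduction_x": 0, "min_single_query_speedup_x": 0}},
}
print(economics_pass(receipt["economics"]))
```

Because the receipt's own decision.policy is never read, lowering it has no effect on the outcome.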


Signed-off-by: blocksifrdev <maurice@blocksifr.com>
@blocksifrdev Maurice Witten (blocksifrdev) force-pushed the docs/add-index-authority-receipts branch from a880e2b to d302dd7 on June 20, 2026 14:02
@Fieldnote-Echo

Member

Thank you for your contribution! We will review the PR as soon as we have the bandwidth. 🙏🏻

@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) added the review-this label (Trigger OpenHands PR review) on Jun 20, 2026

@github-actions github-actions Bot left a comment


🟡 Acceptable — Core concept is sound, but there are gaps to address before merging.

The Index Authority Receipt pattern is a good fit for ordvec's evidence story. However, the implementation has security and completeness issues that need resolution.

Key issues:

  1. Missing schema file (documentation says it exists)
  2. Self-signed policy thresholds (verifier should own acceptance policy)
  3. Unreachable REQUIRE_HNSW_COMPARISON decision
  4. No test coverage for the verifier
  5. No CI validation that receipts stay valid

See inline comments for details.


Workflow run: https://github.com/Project-Navi/ordvec/actions/runs/27879938562

Comment thread tools/verify_index_authority.py Outdated
except Exception as e:
    die(f"cannot read receipt: {e}")

for k in ["schema","subject","baseline","ifc","evidence","economics","decision","scope","limitations"]:


🟠 Important: Missing schema validation. The PR description lists schemas/caif/ordvec-index-authority.v0.1.schema.json but this file does not exist in the PR. The verifier only checks field presence and schema string equality, not JSON Schema validation. A malformed receipt will either crash later or pass silently. Either remove the schema reference from the PR description, or add the schema file and validate receipts against it before processing, for example with jsonschema (a third-party package, so it would add a dependency for the tool) or with hand-written checks that mirror the schema.

Comment thread tools/verify_index_authority.py Outdated
if abs(econ["storage_reduction_x"] - expected_storage) > 0.02:
    die("storage_reduction_x mismatch")

expected_speedup = econ["single_query_latency_ms"]["baseline"] / econ["single_query_latency_ms"]["candidate"]


🟠 Important: Self-signed policy thresholds — The verifier reads acceptance policy (min_storage_reduction_x, min_single_query_speedup_x) from the receipt being evaluated. A receipt can authorize itself by setting min_storage_reduction_x: 0.01. The policy thresholds should be verifier-owned, not receipt-owned. Options:

  • Move thresholds to a separate verifier config file
  • Require thresholds to be above documented minimums
  • Document explicitly that this is a self-certifying receipt system, not a trusted verifier

if abs(econ["single_query_speedup_x"] - expected_speedup) > 0.02:
    die("single_query_speedup_x mismatch")

decision = "ALLOW_INDEX_FIRST"


🟠 Important: REQUIRE_HNSW_COMPARISON is unreachable — The compute_decision logic can only return ALLOW_INDEX_FIRST, REQUIRE_DENSE_FALLBACK, or DENY_UNSCOPED_CLAIM. REQUIRE_HNSW_COMPARISON is never assigned. Either:

  • Add a code path that returns it (e.g., when highly parallel threaded serving is claimed but no HNSW comparison exists)
  • Remove it from the advertised schema/decision set
  • Document why it exists as a valid decision but is not reachable in this implementation

Comment thread tools/verify_index_authority.py Outdated
print(f"single_query_speedup: {econ['single_query_speedup_x']}x")
print(f"receipt_hash: {sha(r)}")

if decision != r["decision"]["recommended"]:


🟡 Suggestion: Add test coverage — The verifier has no tests. At minimum, add tests for:

  • Valid receipt passes
  • Missing fields are rejected
  • Computed metrics that don't match declared values are rejected
  • The decision mismatch exit code (3)

Example location: tests/verify_index_authority_test.py

- `examples/caif/trec-covid-sign-rq2.index-authority.json`
- `tools/verify_index_authority.py`

## What this does not do


🟡 Documentation mismatch — The "What this includes" section lists schemas/caif/ordvec-index-authority.v0.1.schema.json but this file is not present in the PR. Either add the schema file or remove it from this list.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e501704594

if not isinstance(limitations, list):
    die("limitations must be a list")

decision = "ALLOW_INDEX_FIRST"


P2: Reject receipts where IFC is disabled

A receipt can set ifc.enabled to false or leave ifc.compute_path empty and still reach the default ALLOW_INDEX_FIRST decision because the verifier never reads the ifc object after requiring it exists. That allows a receipt which explicitly says no index-first path is enabled to verify successfully as long as the benchmark metrics pass, which undermines the authorization this tool is meant to provide.
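A sketch of the missing guard, assuming ifc.enabled is a JSON boolean and ifc.compute_path is a string (neither type is confirmed in this thread). `ifc_is_enabled` is a hypothetical helper that would run before any decision is computed.

```python
def ifc_is_enabled(ifc: dict) -> bool:
    """True only if the receipt declares an enabled, named index-first path."""
    path = ifc.get("compute_path")
    return ifc.get("enabled") is True and isinstance(path, str) and bool(path.strip())


# The compute_path value below is a made-up label for illustration.
print(ifc_is_enabled({"enabled": False, "compute_path": "sign-rq2"}))
print(ifc_is_enabled({"enabled": True, "compute_path": ""}))
```

A verifier could map a False result to a non-ALLOW decision before looking at any benchmark metrics.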


Comment thread tools/verify_index_authority.py Outdated
Comment on lines +184 to +187
has_hnsw_comparison = (
    e.get("compared_against_hnsw") is True
    or isinstance(e.get("hnsw_comparison"), dict)
)


P2: Validate HNSW comparison evidence before allowing claims

For an applies_to value containing production/parallel serving terms, an empty hnsw_comparison object or a bare compared_against_hnsw: true makes has_hnsw_comparison true and bypasses REQUIRE_HNSW_COMPARISON. That lets a receipt with no HNSW metrics or artifacts verify as ALLOW_INDEX_FIRST in exactly the policy-protected context, so this should require concrete comparison fields rather than just a marker.



quality_loss = baseline_score - candidate_score
quality_too_low = quality_loss > float(policy["max_quality_delta_loss"])
outside_bootstrap_noise = e["within_bootstrap_noise"] is not True


P2: Allow significant quality improvements

When candidate_score is higher than baseline_score but within_bootstrap_noise is false because the improvement is statistically significant, this flag still forces REQUIRE_DENSE_FALLBACK despite there being no quality loss and the max_quality_delta_loss policy passing. Only quality losses outside the allowed/noise envelope should block index-first authorization.


Comment thread tools/verify_index_authority.py Outdated
Comment on lines +59 to +61
if not isinstance(value, (int, float)) or isinstance(value, bool):
    die(f"{label}.{key} must be a number")
return float(value)


P2: Reject non-finite numeric fields

Python's JSON parser accepts NaN, and this check treats float('nan') as a valid number. If a receipt puts NaN in declared fields like storage_reduction_x or single_query_speedup_x, the later mismatch and threshold comparisons all evaluate false, so the verifier can still print verified: true with non-finite metrics.
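One place to stop these values is at parse time. `json.loads` accepts a `parse_constant` hook that is called for the literals NaN, Infinity, and -Infinity. The sketch below uses it to reject them. `load_strict` is a hypothetical name, not a function from the PR.

```python
import json


def _reject_constant(name: str):
    """Called by the json module for 'NaN', 'Infinity', and '-Infinity'."""
    raise ValueError(f"non-finite JSON constant not allowed: {name}")


def load_strict(text: str):
    """Parse JSON, refusing the non-standard non-finite literals."""
    return json.loads(text, parse_constant=_reject_constant)


try:
    load_strict('{"storage_reduction_x": NaN}')
except ValueError as exc:
    print("rejected:", exc)
```

A literal such as 1e999 still parses to inf through the ordinary float path, so a math.isfinite check on individual fields remains useful alongside this hook.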


Comment on lines +166 to +167
scope_missing = not applies_to or not does_not_claim
limitations_missing = not limitations


P2: Treat blank scope entries as missing

With require_scope and require_limitations enabled, this only checks whether the lists are truthy, so applies_to: [""], does_not_claim: [""], and limitations: [""] still authorize as a scoped claim. That lets a receipt omit any meaningful workload envelope while satisfying the default policy; validate that these lists contain non-empty string entries.
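A sketch of the stricter check, assuming every entry in the list must be a non-blank string (requiring only one non-blank entry would be an equally plausible policy). `has_meaningful_entries` is a hypothetical helper.

```python
def has_meaningful_entries(value) -> bool:
    """True for a non-empty list whose items are all non-blank strings."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, str) and item.strip() != "" for item in value)
    )


# The scope text below is illustrative, not taken from the example receipt.
print(has_meaningful_entries([""]))
print(has_meaningful_entries(["single-query retrieval on the stated corpus"]))
```

With this helper, scope_missing could be computed as the negation of the check on applies_to and does_not_claim, and limitations_missing as the negation of the check on limitations.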


Comment on lines +179 to +182
claims_parallel_or_production = any(
    marker in claims_text
    for marker in ["parallel", "threaded", "production", "prod", "serving", "online"]
)


P2: Require HNSW only for parallel claims

The policy flag is specifically require_hnsw_comparison_for_parallel_claims, and the receipt text says HNSW is required for highly parallel threaded serving claims, but these markers also trigger on any production, serving, or online scope. A receipt scoped to single-query production serving would therefore be rejected with REQUIRE_HNSW_COMPARISON even though it is not making a parallel-throughput claim.
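A sketch of a narrower predicate, assuming only the "parallel" and "threaded" markers from the original list indicate a parallel-throughput claim. Dropping the other markers and special-casing "single-threaded" are choices made for this illustration, and `claims_parallel_serving` is a hypothetical name.

```python
PARALLEL_MARKERS = ("parallel", "threaded")


def claims_parallel_serving(applies_to: list) -> bool:
    """True if the scope text makes a parallel or threaded serving claim."""
    text = " ".join(applies_to).lower()
    # "single-threaded" contains "threaded" but is not a parallel claim.
    text = text.replace("single-threaded", "")
    return any(marker in text for marker in PARALLEL_MARKERS)


print(claims_parallel_serving(["single-query production serving"]))
print(claims_parallel_serving(["highly parallel threaded serving"]))
```

Substring matching on free text remains fragile. A dedicated boolean or enumerated field in the receipt would be more robust than markers, if the schema allows one.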


@blocksifrdev

blocksifrdev commented Jun 21, 2026 via email

Author

@chatgpt-codex-connector


You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a6c18c60b

Comment on lines +251 to +254
economics_too_weak = (
    declared_storage < float(policy["min_storage_reduction_x"])
    or declared_speedup < float(policy["min_single_query_speedup_x"])
)


P2: Use recomputed ratios for threshold decisions

Fresh evidence after the ratio cross-check is that the verifier still applies policy thresholds to the receipt's declared ratios, not the recomputed expected_storage/expected_speedup. Because the mismatch tolerance is ±0.02, a receipt with true speedup 1.24 can declare 1.2501, pass the cross-check, and get ALLOW_INDEX_FIRST against the default 1.25 threshold even though the benchmark data is below policy; make the decision from the recomputed ratios or compare thresholds against them.
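The loophole in this comment can be worked through numerically. The sketch below uses the 0.02 tolerance, the 1.25 threshold, and the 1.2501 declared ratio from the comment. The raw latencies are hypothetical values chosen so that the true ratio is 1.24.

```python
TOLERANCE = 0.02      # mismatch tolerance used by the cross-check
MIN_SPEEDUP = 1.25    # default threshold cited in the comment

baseline_ms, candidate_ms = 12.4, 10.0   # hypothetical raw latencies, ratio 1.24
declared_speedup = 1.2501

recomputed_speedup = baseline_ms / candidate_ms
passes_cross_check = abs(declared_speedup - recomputed_speedup) <= TOLERANCE
weak_by_declared = declared_speedup < MIN_SPEEDUP
weak_by_recomputed = recomputed_speedup < MIN_SPEEDUP

print(passes_cross_check, weak_by_declared, weak_by_recomputed)
```

The receipt passes the cross-check and clears the threshold on its declared ratio, while the recomputed ratio falls below the threshold. Comparing the policy against recomputed_speedup closes the gap.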


Comment on lines +123 to +127
has_metric_pair = any(
    isinstance(h.get(a), (int, float))
    and isinstance(h.get(b), (int, float))
    and math.isfinite(float(h.get(a)))
    and math.isfinite(float(h.get(b)))


P2: Reject boolean HNSW metrics

Fresh evidence in the current implementation is that these isinstance(..., (int, float)) checks count booleans as numbers, so a parallel-claim receipt with a non-empty artifact plus baseline_latency_ms: true and candidate_latency_ms: false satisfies has_metric_pair and skips REQUIRE_HNSW_COMPARISON without any real HNSW measurements. Exclude bool (and apply the same numeric validation to nested latency) before treating the comparison as concrete.
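A sketch of a numeric check that excludes booleans, which is needed because bool is a subclass of int in Python. `is_finite_number` and this two-key form of `has_metric_pair` are hypothetical helpers. The key names come from the example in the comment.

```python
import math


def is_finite_number(value) -> bool:
    """True for int or finite float values, never for bool."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    return isinstance(value, float) and math.isfinite(value)


def has_metric_pair(h: dict, a: str, b: str) -> bool:
    """True only if both keys hold real measurements."""
    return is_finite_number(h.get(a)) and is_finite_number(h.get(b))


fake = {"baseline_latency_ms": True, "candidate_latency_ms": False}
print(has_metric_pair(fake, "baseline_latency_ms", "candidate_latency_ms"))
```

The same predicate could also be applied to policy thresholds and nested latency fields, so that strings such as "NaN" never reach a comparison.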


Comment on lines +168 to +170
"min_storage_reduction_x",
"min_single_query_speedup_x",
"max_quality_delta_loss",


P2: Validate policy thresholds before comparison

The policy input is only checked for key presence here, and the later float(...) conversions accept JSON strings such as "NaN"; comparisons against NaN are false, so a custom --policy with "min_storage_reduction_x": "NaN" or "max_quality_delta_loss": "NaN" can make receipts with weak economics or large quality loss verify as ALLOW_INDEX_FIRST. Validate these policy fields as finite non-boolean numbers before computing the decision.


Comment on lines +247 to +249
quality_loss = max(0.0, baseline_score - candidate_score)
outside_bootstrap_noise = e["within_bootstrap_noise"] is not True
quality_too_low = quality_loss > float(policy["max_quality_delta_loss"]) and outside_bootstrap_noise


P2: Enforce max quality-loss cap even within noise

Because this combines the loss cap and bootstrap flag with and, any receipt that sets within_bootstrap_noise to true bypasses max_quality_delta_loss entirely. For example, with the default policy a candidate score far below baseline still verifies as ALLOW_INDEX_FIRST as long as the receipt marks it within bootstrap noise; the configured maximum loss should remain a hard cap, with the noise flag only affecting losses inside that cap or a separate policy check.
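One possible reading of this suggestion, sketched below: the cap always applies, and the noise flag matters only for losses inside the cap. Treating a statistically significant loss inside the cap as blocking is an assumption of this sketch, and `quality_blocks_index_first` is a hypothetical name.

```python
def quality_blocks_index_first(baseline_score: float, candidate_score: float,
                               within_bootstrap_noise, max_quality_delta_loss: float) -> bool:
    """True if quality evidence should force a fallback decision."""
    loss = max(0.0, baseline_score - candidate_score)  # improvements count as zero loss
    if loss > max_quality_delta_loss:
        return True   # hard cap, regardless of the noise flag
    if loss > 0.0 and within_bootstrap_noise is not True:
        return True   # a loss inside the cap that is statistically significant
    return False


# A large loss marked "within noise" is still blocked by the cap.
print(quality_blocks_index_first(0.70, 0.40, True, 0.01))
# A significant improvement is not blocked.
print(quality_blocks_index_first(0.70, 0.75, False, 0.01))
```

This also covers the earlier comment about significant improvements, since an improvement has zero loss and never reaches either blocking branch.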
