[HAI] Persistent vs Sample Level Manifest protocol #1904

@ParhomEsmaeili

Clarify Context Contract and Schema Convention: Dataset-Level vs. Sample-Level Information

Short summary

Section 1.1 of the HAIG API Standardisation Proposal establishes a context contract and schema convention but is incomplete — what information is permissible at each context level, and the practical schema conventions that follow, remain to be fully resolved.

What is the idea or problem?

Section 1.1 of the HAIG API Standardisation Proposal defines two context levels, sample-level and dataset-level, and gives an initial indication of what belongs at each, but that indication is incomplete. Two related but separable questions need to be resolved:

Part 1 — Context Contract: what is permitted at each level

What can persist at Context Level 2 (dataset-level) and what must remain at Context Level 1 (sample-level) is use-case dependent — the appropriate level is not an intrinsic property of the information itself but depends on how and where it is consumed. Two examples:

  • Annotation cache: may persist in deployment, but in evaluation annotations must be released one at a time to prevent backdoor alignment with reference labels.
  • Semantic label dictionary (#1868, "Provide a well defined terminlogy to encode label meaning semantically"): if the front-end translates it into the payload, it need only exist at the sample level; if translation is deferred to the backend, the dictionary itself may need to be accessible at the dataset level.

The image_cache is an example of something unambiguously dataset-level regardless of use-case.
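
To make the use-case dependence concrete, the contract could be sketched as a lookup keyed by both the item and the scenario in which it is consumed. This is only an illustration: `ContextLevel`, `CONTRACT`, `permitted_level`, and the scenario names are hypothetical and do not come from the proposal.

```python
from enum import Enum


class ContextLevel(Enum):
    SAMPLE = 1   # Context Level 1 (sample-level)
    DATASET = 2  # Context Level 2 (dataset-level)


# Hypothetical contract table: (item, scenario) -> permitted level.
# A scenario of None means "regardless of use-case".
CONTRACT = {
    ("annotation_cache", "deployment"): ContextLevel.DATASET,
    ("annotation_cache", "evaluation"): ContextLevel.SAMPLE,
    ("semantic_label_dictionary", "frontend_translation"): ContextLevel.SAMPLE,
    # The issue says the dictionary "may" need dataset-level access here.
    ("semantic_label_dictionary", "backend_translation"): ContextLevel.DATASET,
    ("image_cache", None): ContextLevel.DATASET,
}


def permitted_level(item, scenario):
    """Return the level at which `item` may live under `scenario`."""
    # Items whose level does not depend on the use-case are checked first.
    if (item, None) in CONTRACT:
        return CONTRACT[(item, None)]
    if (item, scenario) not in CONTRACT:
        raise KeyError(f"contract does not cover {item!r} under {scenario!r}")
    return CONTRACT[(item, scenario)]
```

Keying on the pair rather than on the item alone reflects the point above: the level is a property of how the information is consumed, not of the information itself.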

Part 2 — Schema Convention: what is actually included and how

The schema is a reflection of the context contract — given what is permitted, it declares what is included. However, practical convention may deviate from the natural context level of information for reasons of simplicity. For example, a channel-to-acquisition-protocol mapping may be dataset-level by nature but passed at the sample level to minimise overhead.
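
One way to keep such deviations visible is for the schema to record, for each field, both its natural level and the level at which it is actually passed. The sketch below assumes hypothetical names (`SchemaField`, `SCHEMA`, `channel_protocol_map`, `deviations`); none of them are defined by the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaField:
    name: str
    natural_level: str   # "sample" or "dataset": where the information belongs by nature
    declared_level: str  # "sample" or "dataset": where the schema actually passes it


# Hypothetical schema with one field that follows its natural level and one that does not.
SCHEMA = [
    SchemaField("image_cache", natural_level="dataset", declared_level="dataset"),
    # Dataset-level by nature, but passed per sample to minimise overhead.
    SchemaField("channel_protocol_map", natural_level="dataset", declared_level="sample"),
]


def deviations(schema):
    """Return the names of fields passed at a level other than their natural one."""
    return [f.name for f in schema if f.natural_level != f.declared_level]
```

Listing the deviations explicitly would let reviewers check that each one is a deliberate simplification rather than an oversight.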

Why does it matter?

Without a complete context contract, the schema cannot be correctly specified, and downstream architectural decisions — such as what requires persistent storage vs. stateless per-request passing — cannot be made reliably. This affects algorithm designers, front-end engineers, and evaluation framework maintainers.

Any context, examples, or references?

How would you like to be involved?

  • I can contribute code or documentation
  • I can test or provide feedback
  • I want to follow the discussion
  • I have other ideas or expertise to share: _____________
