Clarify Context Contract and Schema Convention: Dataset-Level vs. Sample-Level Information
Short summary
Section 1.1 of the HAIG API Standardisation Proposal establishes a context contract and schema convention but is incomplete — what information is permissible at each context level, and the practical schema conventions that follow, remain to be fully resolved.
What is the idea or problem?
Section 1.1 of the HAIG API Standardisation Proposal defines two context levels — sample-level and dataset-level — and provides an initial indication of what belongs at each, but this is not complete. Two related but separable questions need to be resolved:
Part 1 — Context Contract: what is permitted at each level
What can persist at Context Level 2 (dataset-level) and what must remain at Context Level 1 (sample-level) is use-case dependent — the appropriate level is not an intrinsic property of the information itself but depends on how and where it is consumed. Two examples:
Annotation cache: may persist at the dataset level in deployment, but in evaluation its annotations must be released one at a time to prevent backdoor alignment with reference labels
Semantic label dictionary (see "Provide a well defined terminlogy to encode label meaning semantically", #1868): if the front-end translates labels into the payload, the dictionary need only exist at the sample level; if translation is deferred to the backend, the dictionary itself may need to be accessible at the dataset level
The image_cache is an example of something unambiguously dataset-level regardless of use-case.
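To make the use-case dependence concrete, here is a minimal sketch of how a context contract could be written down as a lookup table. All names here (`Level`, `UseCase`, `CONTRACT`, `permitted_level`) are hypothetical and not taken from the proposal, and the entries only restate the annotation cache and image_cache examples above. The semantic label dictionary is left out because its level depends on where translation happens rather than on deployment vs. evaluation, so it would probably need a different key.

```python
from enum import Enum


class Level(Enum):
    SAMPLE = 1   # Context Level 1 (sample-level)
    DATASET = 2  # Context Level 2 (dataset-level)


class UseCase(Enum):
    DEPLOYMENT = "deployment"
    EVALUATION = "evaluation"


# Hypothetical contract table: (item, use case) -> highest level at which
# the item may persist. Entries mirror the examples in this issue only.
CONTRACT = {
    ("annotation_cache", UseCase.DEPLOYMENT): Level.DATASET,
    ("annotation_cache", UseCase.EVALUATION): Level.SAMPLE,
    ("image_cache", UseCase.DEPLOYMENT): Level.DATASET,
    ("image_cache", UseCase.EVALUATION): Level.DATASET,
}


def permitted_level(item: str, use_case: UseCase) -> Level:
    """Return the level at which `item` may persist for `use_case`."""
    try:
        return CONTRACT[(item, use_case)]
    except KeyError:
        # An unresolved entry is treated as an error rather than guessed.
        raise KeyError(
            f"no contract entry for {item!r} under {use_case.value}"
        ) from None
```

Keying the table on both the item and the use case reflects the point that the level is not an intrinsic property of the information; an item with the same value in every row, like image_cache, is the unambiguous case.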
Part 2 — Schema Convention: what is actually included and how
The schema is a reflection of the context contract — given what is permitted, it declares what is included. However, practical convention may deviate from the natural context level of information for reasons of simplicity. For example, a channel-to-acquisition-protocol mapping may be dataset-level by nature but passed at the sample level to minimise overhead.
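As an illustration of that deviation, the sketch below shows one way a schema could declare the channel-to-acquisition-protocol mapping on the sample payload while still allowing it at the dataset level. The class and field names (`DatasetContext`, `SampleContext`, `image_cache_uri`, `channel_protocols`) and the fallback order are assumptions made for this example, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class DatasetContext:
    """Hypothetical Context Level 2 record (persists across samples)."""
    image_cache_uri: str
    # Dataset-level by nature, but optional here because the convention
    # may choose to pass it per sample instead.
    channel_protocols: Optional[Dict[str, str]] = None


@dataclass(frozen=True)
class SampleContext:
    """Hypothetical Context Level 1 record (one per request)."""
    sample_id: str
    # Passed with each sample so a stateless backend needs no lookup.
    channel_protocols: Dict[str, str] = field(default_factory=dict)


def resolve_channel_protocols(
    sample: SampleContext, dataset: Optional[DatasetContext] = None
) -> Dict[str, str]:
    """Prefer the per-sample mapping; fall back to the dataset-level one."""
    if sample.channel_protocols:
        return dict(sample.channel_protocols)
    if dataset is not None and dataset.channel_protocols:
        return dict(dataset.channel_protocols)
    return {}
```

With this layout a backend that only ever sees `SampleContext` can still recover the mapping, at the cost of repeating it in every request.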
Why does it matter?
Without a complete context contract, the schema cannot be correctly specified, and downstream architectural decisions — such as what requires persistent storage vs. stateless per-request passing — cannot be made reliably. This affects algorithm designers, front-end engineers, and evaluation framework maintainers.
Any context, examples, or references?
HAIG API Standardisation Proposal, Section 1.1 — Context and schema contract