Skip to content

Conversation

@robbat2
Copy link
Contributor

@robbat2 robbat2 commented Jan 10, 2026

Closes: #234
Co-authored-by: Codex codex@openai.com
Generated-with: OpenAI Codex CLI (partial)
Signed-off-by: Robin H. Johnson rjohnson@coreweave.com

@robbat2 robbat2 force-pushed the rjohnson/http-fetch-validate branch from 050993e to eca24c0 Compare January 10, 2026 16:04
@robbat2
Copy link
Contributor Author

robbat2 commented Jan 10, 2026

Interesting heisenbug failure - the URL is fine, but it had a 404 when the test ran?

435/451 Test #435: JSONSchema.ci.fail_bundle_http_non_schema_verbose .............................***Failed    0.60 sec
2,3c2,5
< error: HTTP 404
<   at https://schemas.sourcemeta.com/self/v1/api/schemas/stats/jsonschema/2020-12/schema
---
> error: The JSON document is not a valid JSON Schema
>   at identifier https://schemas.sourcemeta.com/self/v1/api/schemas/stats/jsonschema/2020-12/schema
>   at file path /tmp/tmp.psSZLWYWpc/schema.json
>   at location "/allOf/0/$ref"

        Start 439: JSONSchema.ci.fail_validate_http_non_schema_verbose
436/451 Test #431: JSONSchema.ci.pass_bundle_http ................................................   Passed    0.74 sec

@jviotti
Copy link
Member

jviotti commented Jan 13, 2026

Interesting feature! Thanks for sending it. I'll book some time today to properly review it 🙏🏻

Copy link
Member

@jviotti jviotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea. I think the way Codex/you are implementing it is fighting the current machinery for iterating on schema arguments. I think we need for make for_each_json transparently work with both URLs or file paths. The src/input.h file abstracts a way a lot of the logic, including line number gathering, reading, etc so that the actual commands never have to worry about it.

for_each_json will loop and pass specific positional arguments to handle_json_entry. It feels to me that such handle_json_entry needs to take either a path or a URI (maybe just a string?) and then internally check: if it is a path and it exists and it is a file, do X, if it is a directory, do Y, if it is an HTTP/S URL, do Z.

I think if we do all of that there, we very elegantly gain URL support in every applicable command.

@jviotti
Copy link
Member

jviotti commented Jan 14, 2026

On the transient CI failure, yeah, we've been getting that quite a bit recently. I'm on it right now!

@jviotti
Copy link
Member

jviotti commented Jan 14, 2026

If you merge main back in, the transient CI issue should be gone!

@robbat2 robbat2 force-pushed the rjohnson/http-fetch-validate branch 2 times, most recently from 6bf416d to 3d93e71 Compare January 14, 2026 20:35
@robbat2 robbat2 requested a review from jviotti January 14, 2026 20:40
Closes: sourcemeta#234
Co-authored-by: Codex <codex@openai.com>
Generated-with: OpenAI Codex CLI (partial)
Signed-off-by: Robin H. Johnson <rjohnson@coreweave.com>
@robbat2 robbat2 force-pushed the rjohnson/http-fetch-validate branch from 3d93e71 to 7ced116 Compare January 14, 2026 20:43
@robbat2
Copy link
Contributor Author

robbat2 commented Jan 14, 2026

couple more CI bumps from the code but ready again; nothing transient this time

Copy link
Member

@jviotti jviotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. I also wonder if there is a way to split up this PR so it's easy to merge the core logic first and avoid git conflicts, but not sure what that would be 🤔

-> const std::filesystem::path & {
if (!this->path.has_value()) {
std::ostringstream error;
error << "The `" << command << "` command does not support HTTP inputs";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we never dynamically construct strings for exceptions. Instead, use one of the existing error classes that would be handled in src/error.h or create a new exception class there

}

[[nodiscard]] auto base_uri() const -> std::string {
if (sourcemeta::jsonschema::is_http_url(this->first)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (sourcemeta::jsonschema::is_http_url(this->first)) {
if (is_http_url(this->first)) {

You are already in the sourcemeta::jsonschema:: namespace, so you can omit it in every case.

return this->path.value();
}

[[nodiscard]] auto base_uri() const -> std::string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is called base_uri but it doesn't return a base URI, but the whole file URI given the path?


struct InputJSON {
std::filesystem::path first;
std::string first;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the thing here is that the first might be either a path or a URI, maybe a std::variant expresses this better

const auto path_without_query_or_fragment =
sourcemeta::jsonschema::uri_path_without_query_or_fragment(
entry_identifier);
if (path_without_query_or_fragment.ends_with(".jsonl")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond just the extension, I think we need to look at the content type mime header?

{ "name": "foo", "kind": "example" }
EOF

"$1" validate "http://127.0.0.1:$PORT/schema.json" "$TMP/instance.json" --http \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now wondering if a command invocation to an HTTP URL should assume --http. I think it makes sense?

"$1" validate "http://127.0.0.1:$PORT/schema.json?ignored=1" "$TMP/instance.json" \
--http 2>"$TMP/stderr.txt"

cat << EOF > "$TMP/expected.txt"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we run all of these that present no output with --verbose and actually capture BOTH stdout and stderr? i.e. if for whatever reason this command printed non-sense to stdout, we would never know

SCHEMA_PATH="$(cd "$TMP" && pwd)/schema.json"
SCHEMA_URI="file://$SCHEMA_PATH"

"$1" lint "$SCHEMA_URI" --json >"$TMP/output.json" 2>&1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a _json suffix to the test file as this is only covering the JSON part?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get this test, but I'm wondering now if it is a bit confusing. I think if any input is an HTTP URL, we should assume --http, as that's clearly the user's intent?

EOF

diff "$TMP/output.txt" "$TMP/expected.txt"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like various test files have this empty line here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extend validate/metaschema to directly take a URL to a schema to validate

2 participants