Add binary output format RFC#58

Open
sims1253 wants to merge 3 commits into stan-dev:master from sims1253:master

Conversation

@sims1253 sims1253 commented Mar 9, 2026

No description provided.

@jgabry
Member

jgabry commented Mar 9, 2026

Member

@WardBrian WardBrian left a comment

Thanks so much @sims1253!

A bunch of comments to start. Note that if I could wave a wand and get exactly what you wrote, I'd take it! But I do think there are some minor improvements possible.


2. Should stanbin become the default output format in a future CmdStan release, or remain opt-in indefinitely?

3. Is the trailing metadata section (raw CSV comment text) the right long-term metadata representation, or should v1 adopt a structured format (e.g., key-value pairs) from the start?
Member

My thought is to either start with structured key-value metadata or cut it entirely, but I'm mostly ambivalent (cmdstanpy can just keep reading this information from the JSON version CmdStan can provide).

Collaborator

imo this is also a good opportunity to have the metadata go to a separate file completely. So we would write a little JSON file with all the metadata.

Author

Curious if there is some kind of consensus here. I thought having it as one file was desired. 2 files would actually be easier to implement :D My first draft of this actually used a separate file for the metadata.

Member

We already have an argument called save_cmdstan_config which creates a json file equivalent of the opening comments. The adaptation metadata is also saved elsewhere, so it’s really only timing that is still available only as a comment, and we’d also like to change that (stan-dev/stan#3340)

@sims1253
Author

Just a quick question re. the process: There are a few things that I would just count as oversights in the proposal that I would simply fix/adapt with a new commit. Is there anything to consider there or do I just push a new commit?

@WardBrian
Member

Yep, until the PR is merged it is open for modification via normal commits during the discussions

@SteveBronder
Collaborator

(One minor note after addressing @WardBrian's comments.) For Markdown, it's nice to start each sentence on a new line so that reviewers can comment on individual sentences more easily. Markdown only puts a true \n newline into the rendered document if there is a blank line between the lines.

i.e. this will be rendered on the same line

The quick brown fox.
Jumps over the lazy dog.

but this will be rendered with a newline

The quick brown fox.

Jumps over the lazy dog.

You should be able to just do a regex find and replace to replace `. ` (period, space) with `.\n` (period, newline).

Collaborator

@SteveBronder SteveBronder left a comment

I like it! imo the only questions from me, besides below, are things we will resolve during the PR

Comment on lines +147 to +159
| Offset | Size | Type | Description |
|--------|------|------|-------------|
| 0 | 8 | char[8] | Magic: `"STANBIN\0"` |
| 8 | 4 | uint32 | Version (`1`) |
| 12 | 4 | uint32 | Flags (`0` in v1; reserved for extensions) |
| 16 | 8 | uint64 | Number of rows (draws) |
| 24 | 8 | uint64 | Number of columns (parameters) |
| 32 | 4 | uint32 | Data section offset in bytes (`64 + names_size` in v1) |
| 36 | 4 | uint32 | Names section size in bytes |
| 40 | 4 | uint32 | Layout parameter (`0` = row-major in v1; non-zero values reserved for extensions such as chunking) |
| 44 | 8 | uint64 | Metadata section offset (`0` if file not yet finalized) |
| 52 | 8 | uint64 | Metadata section size in bytes |
| 60 | 4 | reserved | Reserved for future use |

Collaborator

A few minor things here.

  1. Do we expect the version to need a uint32? We could probably just use a uint8 here, since I don't think we will exceed 255 versions.
  2. If we are treating flags like a bitset, then I think we can remove Layout and specify it as a flag bit. I think making the flags field a uint64 would also be nice, since that gives us 64 options to choose from in the future.
  3. I think the metadata section size should just be a uint32. If the size is in bytes, I don't think we will ever go over 4 GB of metadata.
Suggested change
| Offset (bytes) | Size (bytes) | Type | Description |
|--------|------|------|-------------|
| 0 | 8 | char[8] | Magic: `"STANBIN\0"` |
| 8 | 8 | uint64 | Flags (see flags below) |
| 16 | 8 | uint64 | Number of rows (draws) |
| 24 | 8 | uint64 | Number of columns (parameters) |
| 32 | 4 | uint32 | Data section offset in bytes (`64 + names_size` in v1) |
| 36 | 4 | uint32 | Names section size in bytes |
| 40 | 8 | uint64 | Metadata section offset (`0` if file not yet finalized) |
| 48 | 4 | uint32 | Metadata section size in bytes |
| 52 | 1 | uint8 | Version (`1`) |
| 53 | 11 | reserved | Reserved for future use |

Flags (one per bit):
`STAN_CHUNKING_FORMAT`: Specifies that the data is in Stan's chunking format
...


Author

I incorporated the metadata-size change to uint32. I left the current version/flags/layout split in place for now because it still reads a bit more directly in the RFC, but I’m happy to revisit that if there is a stronger preference to collapse layout into flags.
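
To make the layout under discussion concrete, here is a minimal sketch that encodes the 64-byte header exactly as laid out in the table quoted at the top of this thread (that is, before the metadata-size change to uint32). It is an illustration only, not the RFC's code: `StanbinHeader`, `put_le`, and `encode_header` are hypothetical names, and little-endian byte order is an assumption, since the quoted table does not state one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Field values of the header, following the quoted table (hypothetical struct).
struct StanbinHeader {
  std::uint32_t version = 1;
  std::uint32_t flags = 0;
  std::uint64_t num_rows = 0;
  std::uint64_t num_cols = 0;
  std::uint32_t data_offset = 64;  // 64 + names_size in v1
  std::uint32_t names_size = 0;
  std::uint32_t layout = 0;        // 0 = row-major
  std::uint64_t metadata_offset = 0;
  std::uint64_t metadata_size = 0;
};

// Store an unsigned integer at `offset`, least significant byte first.
// ASSUMPTION: little-endian; the quoted table does not specify byte order.
template <typename T>
void put_le(std::array<unsigned char, 64>& buf, std::size_t offset, T value) {
  assert(offset + sizeof(T) <= buf.size());
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    buf[offset + i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
  }
}

// Serialize field by field so struct padding never leaks into the file.
std::array<unsigned char, 64> encode_header(const StanbinHeader& h) {
  std::array<unsigned char, 64> buf{};    // zero-filled: reserved bytes 60..63 stay 0
  std::memcpy(buf.data(), "STANBIN", 8);  // 7 characters plus the terminating '\0'
  put_le(buf, 8, h.version);
  put_le(buf, 12, h.flags);
  put_le(buf, 16, h.num_rows);
  put_le(buf, 24, h.num_cols);
  put_le(buf, 32, h.data_offset);
  put_le(buf, 36, h.names_size);
  put_le(buf, 40, h.layout);
  put_le(buf, 44, h.metadata_offset);
  put_le(buf, 52, h.metadata_size);
  return buf;
}
```

Writing each field at an explicit offset, rather than dumping a struct, keeps the on-disk layout independent of compiler padding, which matters here because the 8-byte metadata offset sits at byte 44 and is not 8-byte aligned.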

Comment on lines +260 to +262
write_row(state);
++num_rows_;
update_rows_in_header(num_rows_);

Collaborator

I would have write_row return a flag so that we only update the number of rows and the header if the row was written successfully.

Suggested change
const bool write_success = write_row(state);
if (write_success) {
  ++num_rows_;
  update_rows_in_header(num_rows_);
}
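
A minimal, self-contained sketch of the suggestion above, assuming the writer sits on top of a `std::ostream`. The names `write_row` and `num_rows_` come from the quoted snippet; `RowWriter`, `append`, and `demo_rows_written` are hypothetical, and the header update is left as a comment rather than implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical writer: counts a row only if it reached the stream.
class RowWriter {
 public:
  RowWriter(std::ostream& stream, std::size_t num_cols)
      : stream_(stream), num_cols_(num_cols) {}

  // Returns true only if the stream is still good after the write.
  bool write_row(const std::vector<double>& state) {
    assert(state.size() == num_cols_);
    stream_.write(reinterpret_cast<const char*>(state.data()),
                  static_cast<std::streamsize>(state.size() * sizeof(double)));
    return stream_.good();
  }

  // Bump the row count only after a successful write.
  bool append(const std::vector<double>& state) {
    if (!write_row(state)) {
      return false;
    }
    ++num_rows_;
    // A real writer would call update_rows_in_header(num_rows_) here.
    return true;
  }

  std::uint64_t num_rows() const { return num_rows_; }

 private:
  std::ostream& stream_;
  std::size_t num_cols_;
  std::uint64_t num_rows_ = 0;
};

// Example: one good row, then a simulated I/O failure.
std::uint64_t demo_rows_written() {
  std::ostringstream out;
  RowWriter writer(out, 3);
  writer.append({1.0, 2.0, 3.0});
  out.setstate(std::ios::badbit);  // simulate a failed stream
  writer.append({4.0, 5.0, 6.0});  // fails, so it is not counted
  return writer.num_rows();
}
```

With this shape, the row count in the header can never run ahead of the data actually on disk, which is the property a reader of a partially written file depends on.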

Comment on lines +266 to +267
write_metadata();
rewrite_header();

Collaborator

Same as above I'd have each operation return a flag so you know if writing was successful or not. You could also have an enum that you return for different error codes.
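
As a rough sketch of the enum variant, assuming the finalize step appends the metadata and then rewrites the header at the start of the stream: `WriteStatus`, `finalize`, and `demo_finalize` are hypothetical names, and the string arguments stand in for whatever `write_metadata()` and `rewrite_header()` would actually emit.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical status codes, one per step that can fail.
enum class WriteStatus { ok, metadata_failed, header_failed };

// Append the metadata, then overwrite the header at offset 0.
// Stops at the first failing step and reports which one it was.
WriteStatus finalize(std::ostream& stream, const std::string& metadata,
                     const std::string& header) {
  assert(!header.empty());
  stream.write(metadata.data(), static_cast<std::streamsize>(metadata.size()));
  if (!stream) {
    return WriteStatus::metadata_failed;
  }
  stream.seekp(0);
  stream.write(header.data(), static_cast<std::streamsize>(header.size()));
  if (!stream) {
    return WriteStatus::header_failed;
  }
  return WriteStatus::ok;
}

// Example: a 4-byte placeholder header, some data, then finalize.
std::string demo_finalize() {
  std::ostringstream out;
  out << "0000" << "data";
  const WriteStatus status = finalize(out, "meta", "HEAD");
  return status == WriteStatus::ok ? out.str() : std::string();
}
```

Distinguishing the two failure cases lets a caller tell a file whose metadata is missing from one whose header still holds the not-yet-finalized placeholder.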

Comment on lines +280 to +281
stream_.write(reinterpret_cast<const char*>(state.data()),
state.size() * sizeof(double));

Collaborator

Suggested change
stream_.write(reinterpret_cast<const std::byte*>(state.data()),
state.size() * sizeof(double));

(generally I just like this and think it is more standard now)


@ahartikainen

Hi, I wanted to add a comment on the external metadata issue. I like the idea of having 1 file, but I think external metadata would be a much more flexible way of handling it.

We could borrow an idea from the Zarr world, where multiple files are stored in an uncompressed zip. I think you can stream data to the zip file and append to it too.

On the reading side, accessing the metadata would be easy.

And this would still keep the file count as 1.

@sims1253
Author

I pushed a revision addressing the straightforward spec fixes from the review. I left a few things out for now as I wasn't sure which way would be the best choice. Happy to adjust things further.
Re. the example code feedback, I meant the code mainly as illustrations. I fixed some obvious oversights but I think it might make more sense to keep implementation details light here?

How do you prefer handling of resolved comments? Should I resolve them if I think I addressed them or should it be up to the authors?

@WardBrian
Member

I generally reply or react with a 👍🏻 and let the original author resolve

@WardBrian
Member

WardBrian commented Mar 12, 2026

> I fixed some obvious oversights but I think it might make more sense to keep implementation details light here?

Yeah, I think some questions like "should the header be updated after every single row is written" should be answered, but things like "should we use char or std::byte" are probably fine to leave until later (sorry @SteveBronder!)...

Member

@WardBrian WardBrian left a comment

Noticed one small thing in the revisions but I'm otherwise quite happy, modulo the ongoing discussion of metadata

@SteveBronder
Collaborator

> but things like "should we use char or std::byte" are probably fine to leave until later (sorry @SteveBronder!)...

No, that's fine! I was just commenting as I saw things. That is why I said most of my questions are things we would handle in the PR review process.

Co-authored-by: Brian Ward <brianmward99@gmail.com>