4 changes: 4 additions & 0 deletions Rakefile
@@ -160,6 +160,10 @@ task :templates do
sh "#{ruby} templates/template.rb include/rbs/ast.h"
sh "#{ruby} templates/template.rb src/ast.c"

sh "#{ruby} templates/template.rb include/rbs/serialize.h"
sh "#{ruby} templates/template.rb src/serialize.c"
sh "#{ruby} templates/template.rb lib/rbs/wasm/serialization_schema.rb"

# Format the generated files
Rake::Task["format:c"].invoke
end
80 changes: 80 additions & 0 deletions docs/wasm_serialization.md
@@ -0,0 +1,80 @@
# RBS AST binary serialization

This document describes the binary format used to move a parsed RBS AST out of
the parser and into Ruby objects without going through the Ruby C API. It exists
so that RBS can run on Ruby implementations that cannot load the C extension
(notably JRuby): the parser runs inside WebAssembly, serializes the result with
this format, and the host rebuilds `RBS::AST` objects in pure Ruby.

The encoder (`rbs_serialize_node`, `src/serialize.c`) and the schema that drives
the decoder (`RBS::WASM::SerializationSchema`, `lib/rbs/wasm/serialization_schema.rb`)
are both generated from `config.yml`, so they always agree. The decoder itself
is `RBS::WASM::Deserializer`.

## Conventions

- All multi-byte integers are **little-endian**.
- `u8` and `u32` are unsigned; `i32` is signed.
- `str` is a `u32` byte length followed by that many raw bytes (no terminator).
- Decoded values mirror exactly what `ast_translation.c` produces, including
string encodings: string and integer literal nodes are UTF-8, while comments,
annotations and symbols use the source buffer's encoding.
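
The conventions above map directly onto `String#unpack1` directives. The following is a minimal sketch of a reader for these primitives; `ByteReader` and its method names are hypothetical and are not taken from `RBS::WASM::Deserializer`.

```ruby
# Hypothetical cursor over a serialized buffer (illustration only).
class ByteReader
  def initialize(bytes)
    @bytes = bytes.b # binary copy, so offsets are always byte offsets
    @pos = 0
  end

  def u8
    take(1).unpack1("C")
  end

  def u32
    take(4).unpack1("V") # 32-bit unsigned, little-endian
  end

  def i32
    take(4).unpack1("l<") # 32-bit signed, little-endian
  end

  # `str`: a u32 byte length followed by that many raw bytes, no terminator.
  # The caller chooses the encoding (UTF-8 or the source buffer's encoding).
  def str(encoding = Encoding::UTF_8)
    take(u32).force_encoding(encoding)
  end

  private

  def take(count)
    chunk = @bytes.byteslice(@pos, count)
    raise EOFError, "truncated buffer" if chunk.nil? || chunk.bytesize < count
    @pos += count
    chunk
  end
end

reader = ByteReader.new([7, 3, "foo", -1].pack("CVa*l<"))
reader.u8  # => 7
reader.str # => "foo"
reader.i32 # => -1
```

Raising on a short read is a choice made for this sketch; the document does not say how the real decoder treats truncated input.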

## Nodes

Every node begins with a `u8` **tag**:

- `0` — a NULL node (`nil` on the Ruby side).
- `1..N` — a node type, in the order they appear in `SerializationSchema::SCHEMA`.
- `SYMBOL_TAG` (`N + 1`) — an interned symbol, followed by `str` (the symbol's
bytes). Decoded with `String#to_sym`.
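
As an illustration of the tag ranges, the sketch below classifies a tag byte. `DEMO_NODE_TYPES` is a made-up three-entry stand-in for `SerializationSchema::SCHEMA`, whose real contents and order are generated from `config.yml` and are not reproduced here.

```ruby
# Placeholder schema: the real list and its order come from SCHEMA.
DEMO_NODE_TYPES = %w[RBS::AST::Bool RBS::AST::Integer RBS::AST::String].freeze
DEMO_NULL_TAG = 0
DEMO_SYMBOL_TAG = DEMO_NODE_TYPES.size + 1 # N + 1

# Returns [kind, node_type_name] for a tag byte.
def classify_tag(tag)
  case tag
  when DEMO_NULL_TAG then [:null, nil]
  when 1..DEMO_NODE_TYPES.size then [:node, DEMO_NODE_TYPES[tag - 1]]
  when DEMO_SYMBOL_TAG then [:symbol, nil]
  else raise ArgumentError, "unknown tag #{tag}"
  end
end

classify_tag(2) # => [:node, "RBS::AST::Integer"]
```

Because tags are positions in the schema, encoder and decoder need to be built from the same schema for the tags to line up.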

A few node types are encoded specially, matching their bespoke handling in
`ast_translation.c`:

| Node | Payload after tag | Decoded as |
| --- | --- | --- |
| `RBS::AST::Bool` | `u8` | `true` / `false` |
| `RBS::AST::Integer` | `str` | `String#to_i` |
| `RBS::AST::String` | `str` | the string (UTF-8) |
| `RBS::Types::Record::FieldType` | node, then `u8` | `[type, required]` |
| `RBS::Signature` | node-list, then node-list | `[directives, declarations]` |
| `RBS::Namespace` | node-list, then `u8` | `RBS::Namespace[path, absolute]` |
| `RBS::TypeName` | node, then node | `RBS::TypeName[namespace, name]` |
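
A rough sketch of how the three scalar rows of this table could be decoded once the tag has been consumed. The helper names are hypothetical, and treating any non-zero `u8` as `true` is an assumption; the table only states that the payload is a `u8`.

```ruby
require "stringio"

def read_u8(io)
  io.read(1).unpack1("C")
end

def read_str(io)
  length = io.read(4).unpack1("V") # u32 byte length, little-endian
  io.read(length).force_encoding(Encoding::UTF_8)
end

# RBS::AST::Bool payload: a single u8 (assumed: non-zero means true).
def decode_bool_payload(io)
  read_u8(io) != 0
end

# RBS::AST::Integer payload: the literal's text, converted with String#to_i.
def decode_integer_payload(io)
  read_str(io).to_i
end

# RBS::AST::String payload: the string itself, as UTF-8.
def decode_string_payload(io)
  read_str(io)
end

io = StringIO.new([1, 5, "1_000", 2, "hi"].pack("CVa*Va*"))
decode_bool_payload(io)    # => true
decode_integer_payload(io) # => 1000
decode_string_payload(io)  # => "hi"
```

Sending the integer as text and converting with `String#to_i` keeps the wire format free of any fixed integer width.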

Every other node is encoded generically:

1. If the node exposes a location, its **base location** is written (see below),
followed by one location range per declared child, in order.
2. Each field is written in declaration order, encoded by its type (see below).

The decoder constructs `Klass.new(location:, **fields)` (omitting `location:`
for nodes that do not expose one). For `Class`, `Module`, `Interface`,
`TypeAlias` and `MethodType`, `RBS::AST::TypeParam.resolve_variables` is applied
to `type_params` first, exactly as the C translation does.
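
The generic path can be sketched as a single pass over the declared fields followed by a keyword-argument constructor call. `DemoNode`, `DEMO_FIELDS`, `read_field` and `build_node` are hypothetical names; the real classes and field lists come from the schema, and the `TypeParam.resolve_variables` step is left out.

```ruby
require "stringio"

# Hypothetical node class and field declarations (illustration only).
DemoNode = Struct.new(:name, :optional, :location, keyword_init: true)
DEMO_FIELDS = [[:name, :rbs_string], [:optional, :bool]].freeze

# Decodes one field; only two field types are covered in this sketch.
def read_field(io, type)
  case type
  when :bool
    io.read(1).unpack1("C") != 0
  when :rbs_string
    length = io.read(4).unpack1("V")
    io.read(length).force_encoding(Encoding::UTF_8)
  else
    raise ArgumentError, "field type #{type} is not covered by this sketch"
  end
end

def build_node(io, klass, field_decls, location: nil, exposes_location: true)
  # Fields are written in declaration order, so they are read in that order.
  fields = field_decls.to_h { |name, type| [name, read_field(io, type)] }
  exposes_location ? klass.new(location: location, **fields) : klass.new(**fields)
end

io = StringIO.new([3, "foo", 1].pack("Va*C"))
node = build_node(io, DemoNode, DEMO_FIELDS, location: :demo_location)
node.name     # => "foo"
node.optional # => true
```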

## Fields

| Field type | Encoding |
| --- | --- |
| node (`rbs_node`, `rbs_type_name`, `rbs_ast_comment`, `rbs_ast_symbol`, ...) | a node (recursive; NULL allowed) |
| `rbs_node_list` | `u32` count, then that many nodes |
| `rbs_hash` | `u32` count, then count × (key node, value node) |
| `rbs_string` | `str` (source encoding) |
| `bool` | `u8` |
| enum | `u8` index into the enum's values (see `SCHEMA`) |
| `rbs_location_range` | a location range |
| `rbs_location_range_list` | `u32` count, then that many location ranges |
| `rbs_attr_ivar_name` | `u8` tag: `0` → `nil`, `1` → `false`, `2` → followed by a `str` that is decoded as a symbol |
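
A few of these rows, sketched as reader helpers under the conventions stated earlier. All names are hypothetical, and `DEMO_VISIBILITY` is an invented enum used only to show index-based decoding; the real enum values are listed in `SCHEMA`.

```ruby
require "stringio"

def read_u8(io)
  io.read(1).unpack1("C")
end

def read_u32(io)
  io.read(4).unpack1("V")
end

def read_str(io)
  io.read(read_u32(io)).force_encoding(Encoding::UTF_8)
end

# Counted list: a u32 count, then that many items decoded by the block.
def read_list(io)
  Array.new(read_u32(io)) { yield io }
end

# Invented enum values for illustration.
DEMO_VISIBILITY = %i[unspecified public private].freeze

# enum: a u8 index into the enum's values.
def read_enum(io, values)
  values.fetch(read_u8(io))
end

# rbs_attr_ivar_name: 0 is nil, 1 is false, 2 is a str turned into a symbol.
def read_attr_ivar_name(io)
  case (tag = read_u8(io))
  when 0 then nil
  when 1 then false
  when 2 then read_str(io).to_sym
  else raise ArgumentError, "unknown ivar-name tag #{tag}"
  end
end

io = StringIO.new([2, 2, 4, "@foo", 2, 1, 0].pack("CCVa*VCC"))
read_enum(io, DEMO_VISIBILITY)   # => :private
read_attr_ivar_name(io)          # => :@foo
read_list(io) { |s| read_u8(s) } # => [1, 0]
```

The same `read_list` shape would serve `rbs_node_list` and `rbs_location_range_list`, with the block decoding a node or a location range respectively.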

## Location ranges

A location range begins with a `u8` presence flag:

- `0` — null range (`nil`, or a node with no location).
- `1` — followed by `i32` start and `i32` end **character** positions.

The base location and child ranges together let the decoder rebuild an
`RBS::Location` (with its required/optional children) through the public
`RBS::Location` API, so the same decoder works whether `RBS::Location` is backed
by the C extension or a pure-Ruby implementation.
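
A minimal sketch of reading one location range; `read_location_range` is a hypothetical helper, and rejecting flags other than `0` and `1` is a choice made here, not something the format description requires.

```ruby
require "stringio"

# Returns nil for a null range, otherwise [start_pos, end_pos] as
# character positions.
def read_location_range(io)
  case (flag = io.read(1).unpack1("C"))
  when 0 then nil
  when 1 then io.read(8).unpack("l<l<") # two i32 values, little-endian
  else raise ArgumentError, "unexpected presence flag #{flag}"
  end
end

io = StringIO.new([0, 1, 4, 9].pack("CCl<l<"))
read_location_range(io) # => nil
read_location_range(io) # => [4, 9]
```

The resulting pairs are what a decoder would then hand to the public `RBS::Location` API to attach the required and optional children.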
130 changes: 130 additions & 0 deletions ext/rbs_extension/main.c
@@ -2,6 +2,7 @@
#include "rbs/util/rbs_assert.h"
#include "rbs/util/rbs_allocator.h"
#include "rbs/util/rbs_constant_pool.h"
#include "rbs/serialize.h"
#include "ast_translation.h"
#include "legacy_location.h"
#include "rbs_string_bridging.h"
@@ -290,6 +291,132 @@ static VALUE rbsparser_parse_signature(VALUE self, VALUE buffer, VALUE start_pos
return result;
}

// Serialize a parsed node into a binary Ruby string using the same encoder the
// WebAssembly build uses. These `_*_to_bytes` entry points exist so the
// round-trip (parse -> serialize -> deserialize) can be exercised on CRuby,
// where it can be compared against the direct C -> Ruby translation.
static VALUE serialized_node_to_string(rbs_parser_t *parser, rbs_node_t *node) {
rbs_string_t bytes = rbs_serialize_node(parser->allocator, &parser->constant_pool, node);
return rb_str_new(bytes.start, (long) rbs_string_len(bytes));
}

static VALUE parse_type_to_bytes_try(VALUE a) {
struct parse_type_arg *arg = (struct parse_type_arg *) a;
rbs_parser_t *parser = arg->parser;

if (parser->next_token.type == pEOF) {
return Qnil;
}

rbs_node_t *type = NULL;
rbs_parse_type(parser, &type, RTEST(arg->void_allowed), RTEST(arg->self_allowed), RTEST(arg->classish_allowed));

raise_error_if_any(parser, arg->buffer);

if (RB_TEST(arg->require_eof)) {
rbs_parser_advance(parser);
if (parser->current_token.type != pEOF) {
rbs_parser_set_error(parser, parser->current_token, true, "expected a token `%s`", rbs_token_type_str(pEOF));
raise_error(parser->error, arg->buffer);
}
}

return serialized_node_to_string(parser, type);
}

static VALUE rbsparser_parse_type_to_bytes(VALUE self, VALUE buffer, VALUE start_pos, VALUE end_pos, VALUE variables, VALUE require_eof, VALUE void_allowed, VALUE self_allowed, VALUE classish_allowed) {
VALUE string = rb_funcall(buffer, rb_intern("content"), 0);
StringValue(string);
rb_encoding *encoding = rb_enc_get(string);

rbs_parser_t *parser = alloc_parser_from_buffer(buffer, FIX2INT(start_pos), FIX2INT(end_pos));
declare_type_variables(parser, variables, buffer);
struct parse_type_arg arg = {
.buffer = buffer,
.encoding = encoding,
.parser = parser,
.require_eof = require_eof,
.void_allowed = void_allowed,
.self_allowed = self_allowed,
.classish_allowed = classish_allowed
};

VALUE result = rb_ensure(parse_type_to_bytes_try, (VALUE) &arg, ensure_free_parser, (VALUE) parser);

RB_GC_GUARD(string);

return result;
}

static VALUE parse_method_type_to_bytes_try(VALUE a) {
struct parse_method_type_arg *arg = (struct parse_method_type_arg *) a;
rbs_parser_t *parser = arg->parser;

if (parser->next_token.type == pEOF) {
return Qnil;
}

rbs_method_type_t *method_type = NULL;
rbs_parse_method_type(parser, &method_type, RB_TEST(arg->require_eof), true);

raise_error_if_any(parser, arg->buffer);

return serialized_node_to_string(parser, (rbs_node_t *) method_type);
}

static VALUE rbsparser_parse_method_type_to_bytes(VALUE self, VALUE buffer, VALUE start_pos, VALUE end_pos, VALUE variables, VALUE require_eof) {
VALUE string = rb_funcall(buffer, rb_intern("content"), 0);
StringValue(string);
rb_encoding *encoding = rb_enc_get(string);

rbs_parser_t *parser = alloc_parser_from_buffer(buffer, FIX2INT(start_pos), FIX2INT(end_pos));
declare_type_variables(parser, variables, buffer);
struct parse_method_type_arg arg = {
.buffer = buffer,
.encoding = encoding,
.parser = parser,
.require_eof = require_eof
};

VALUE result = rb_ensure(parse_method_type_to_bytes_try, (VALUE) &arg, ensure_free_parser, (VALUE) parser);

RB_GC_GUARD(string);

return result;
}

static VALUE parse_signature_to_bytes_try(VALUE a) {
struct parse_signature_arg *arg = (struct parse_signature_arg *) a;
rbs_parser_t *parser = arg->parser;

rbs_signature_t *signature = NULL;
rbs_parse_signature(parser, &signature);

raise_error_if_any(parser, arg->buffer);

return serialized_node_to_string(parser, (rbs_node_t *) signature);
}

static VALUE rbsparser_parse_signature_to_bytes(VALUE self, VALUE buffer, VALUE start_pos, VALUE end_pos) {
VALUE string = rb_funcall(buffer, rb_intern("content"), 0);
StringValue(string);
rb_encoding *encoding = rb_enc_get(string);

rbs_parser_t *parser = alloc_parser_from_buffer(buffer, FIX2INT(start_pos), FIX2INT(end_pos));
struct parse_signature_arg arg = {
.buffer = buffer,
.encoding = encoding,
.parser = parser,
.require_eof = false
};

VALUE result = rb_ensure(parse_signature_to_bytes_try, (VALUE) &arg, ensure_free_parser, (VALUE) parser);

RB_GC_GUARD(string);

return result;
}

struct parse_type_params_arg {
VALUE buffer;
rb_encoding *encoding;
@@ -462,6 +589,9 @@ void rbs__init_parser(void) {
rb_define_singleton_method(RBS_Parser, "_parse_type", rbsparser_parse_type, 8);
rb_define_singleton_method(RBS_Parser, "_parse_method_type", rbsparser_parse_method_type, 5);
rb_define_singleton_method(RBS_Parser, "_parse_signature", rbsparser_parse_signature, 3);
rb_define_singleton_method(RBS_Parser, "_parse_type_to_bytes", rbsparser_parse_type_to_bytes, 8);
rb_define_singleton_method(RBS_Parser, "_parse_method_type_to_bytes", rbsparser_parse_method_type_to_bytes, 5);
rb_define_singleton_method(RBS_Parser, "_parse_signature_to_bytes", rbsparser_parse_signature_to_bytes, 3);
rb_define_singleton_method(RBS_Parser, "_parse_type_params", rbsparser_parse_type_params, 4);
rb_define_singleton_method(RBS_Parser, "_parse_inline_leading_annotation", rbsparser_parse_inline_leading_annotation, 4);
rb_define_singleton_method(RBS_Parser, "_parse_inline_trailing_annotation", rbsparser_parse_inline_trailing_annotation, 4);
33 changes: 33 additions & 0 deletions include/rbs/serialize.h
@@ -0,0 +1,33 @@
/*----------------------------------------------------------------------------*/
/* This file is generated by the templates/template.rb script and should not */
/* be modified manually. */
/* To change the template see */
/* templates/include/rbs/serialize.h.erb */
/*----------------------------------------------------------------------------*/

#ifndef RBS__SERIALIZE_H
#define RBS__SERIALIZE_H

#include "rbs/ast.h"
#include "rbs/string.h"
#include "rbs/util/rbs_allocator.h"
#include "rbs/util/rbs_constant_pool.h"

/**
* Serialize a parsed AST node into a compact, portable binary buffer.
*
* The format is consumed by RBS::WASM::Deserializer on the Ruby side, which
* rebuilds the same `RBS::AST` objects that the C extension would have built
* directly. This is what lets RBS run on Ruby implementations that cannot load
* the C extension (notably JRuby): the parser runs inside WebAssembly, produces
* this buffer, and the host reconstructs the tree in pure Ruby.
*
* The buffer is allocated from `allocator`, so its lifetime is tied to that
* allocator. `constant_pool` must be the pool the node was parsed with; it is
* used to resolve interned symbol/identifier ids back into their bytes.
*
* See `docs/wasm_serialization.md` for the wire format.
*/
rbs_string_t rbs_serialize_node(rbs_allocator_t *allocator, rbs_constant_pool_t *constant_pool, rbs_node_t *node);

#endif