
Add MergeDataSegments pass #8647

Draft
LegionMammal978 wants to merge 4 commits into WebAssembly:main from LegionMammal978:merge-data-segments

LegionMammal978 commented Apr 27, 2026

Recently, I was writing a WASM module by hand and used a number of small individual data segments to store strings. I checked whether wasm-opt could combine these adjacent data segments, but found that it has no such pass. So I've implemented a new MergeDataSegments pass that merges active data segments that are overlapping, adjacent, or near-adjacent, in order to save both the space of storing multiple data-segment headers and the time of processing them during instantiation.
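To make the idea concrete, here is a minimal sketch of merging overlapping, adjacent, and near-adjacent segments. It is not the code in this PR: it assumes a single zero-initialized memory and constant offsets only, and the names `merge_segments` and `max_gap` are made up for illustration.

```python
def merge_segments(segments, max_gap=0):
    """Merge constant-offset active segments of one zero-initialized memory.

    `segments` is a list of (offset, bytes) pairs in declaration order.
    `max_gap` is a hypothetical knob: the largest run of untouched bytes
    that may be bridged with zeros to join two neighbouring segments.
    """
    # Replay the segments in declaration order so later bytes win on overlap.
    written = {}
    for offset, data in segments:
        for i, byte in enumerate(data):
            written[offset + i] = byte

    merged = []
    start = prev = None
    buf = bytearray()
    for addr in sorted(written):
        if start is not None and addr - prev - 1 <= max_gap:
            # Bridge a small gap with zeros; only valid because the memory
            # is assumed to start out zeroed.
            buf.extend(bytes(addr - prev - 1))
        else:
            if start is not None:
                merged.append((start, bytes(buf)))
            start, buf = addr, bytearray()
        buf.append(written[addr])
        prev = addr
    if start is not None:
        merged.append((start, bytes(buf)))
    return merged


segments = [(0, b"abc"), (3, b"def"), (2, b"XY"), (10, b"z")]
print(merge_segments(segments))             # adjacent and overlapping only
print(merge_segments(segments, max_gap=4))  # also bridges the 4-byte gap
```

The sketch ignores zero-length segments, non-constant offsets, and bounds checks.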

It is designed to be as aggressive as possible, fully supporting multiple memories and accounting for non-constant-offset segments. Meanwhile, unless TNH (--traps-never-happen) is enabled, it is also designed to carefully replicate the original module's behavior w.r.t. out-of-bounds traps during instantiation: the goal is that there should be no observable difference in the output, short of unreliable tricks like reading a partially-instantiated SharedArrayBuffer.
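As a rough illustration of why the trap behavior needs care, the following sketch simulates instantiation of active segments and compares two segment lists. It is a simplified model, not the PR's logic: it uses one memory sized in bytes rather than pages, constant offsets, and a zeroed buffer. `instantiate` and `same_behavior` are hypothetical names.

```python
def instantiate(memory_size, segments):
    """Apply active segments in order; return (trapped, memory bytes)."""
    memory = bytearray(memory_size)  # assumption: memory starts zeroed
    for offset, data in segments:
        # Bounds are checked before any byte of the segment is written.
        if offset + len(data) > memory_size:
            # Earlier segments stay written; later ones are never applied.
            return True, bytes(memory)
        memory[offset:offset + len(data)] = data
    return False, bytes(memory)


def same_behavior(memory_size, before, after):
    """True if both segment lists trap alike and leave the same bytes."""
    return instantiate(memory_size, before) == instantiate(memory_size, after)


original = [(0, b"ab"), (2, b"cd")]
merged = [(0, b"abcd")]
print(same_behavior(8, original, merged))  # both fit in the memory
print(same_behavior(3, original, merged))  # second segment is out of bounds
```

In the second call both versions trap, but the original has already written its first segment while the merged one has written nothing. Comparing memory contents after a trap is stricter than necessary when nothing can observe the memory afterwards; it matters mainly when the host can still see it.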

In principle, this functionality might have been included in the existing MemoryPacking pass, but I believe that it makes sense to separate the primary functionalities of splitting vs. merging data segments, which require different forms of tracking. In that sense, I see MergeDataSegments as complementing the MemoryPacking pass. For instance, MemoryPacking requires that its input has no overlapping data segments, a property that MergeDataSegments is often able to ensure in its output.

Some implementation notes:

  • Following the behavior of MemoryPacking, it reads TNH as asserting traps never happen during instantiation, so in particular data segments are never out of bounds. Also, it only considers TNH and not --ignore-implicit-traps, following MemoryPacking.
  • The pass detects certain cases when a data segment is necessarily out-of-bounds: in that case, it simply emits the offending data segment last and drops all remaining data segments. In principle, since the module cannot be fully instantiated, it could be even more aggressive and replace every function body with unreachable, etc., but this is a small edge case.
  • Following MemoryPacking, this pass assumes that unless a memory is imported, it is zero-initialized, and its initial and maximum sizes are exactly as declared. It also assumes that during data initialization, the memory cannot be modified by anything other than the declared data segments. It is my understanding that these assumptions are not affected by open-world vs. closed-world.
  • The pass tries to be careful about integer limits, with only one small edge case remaining: as far as I can tell from the WASM spec, a 64-bit memory is permitted to be exactly 2^64 bytes long, yet the pass only handles memories up to 2^64-1 bytes long. But in general, Binaryen doesn't seem to be designed in a way that would allow that last byte.
  • Overall, the pass modifies the module by rewriting its data segments, then modifying data-segment indices in function bodies. In the latter case, I assume that ReFinalize() is not needed, since the instructions and stack arguments are unchanged, only their indices are modified. Similarly, I mark requiresNonNullableLocalFixups() as false. In principle, the behavior of the memory instructions on active data segments can be simplified, but I figure that it's better to leave that to the more extensive modifications of MemoryPacking.
  • The active-segment threshold of MemoryPacking does not always match the size heuristic of MergeDataSegments, so if run in alternation in certain cases, they could fight over splitting vs. merging the same two segments.
  • Obviously, this pass has a lot of edge cases that need testing, but I'm not sure where the tests should be placed (test/passes/? test/lit/passes/?), nor how exactly they are formatted. I'm especially unsure how to test behavior around the MAX_SEG_SIZE.
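The trade-off between header savings and gap bytes mentioned above can be estimated from the binary encoding. The sketch below is not the size heuristic used by this PR (which is not spelled out here). It assumes active segments in memory 0 with an `i32.const` offset, and `segment_encoded_size` and `merged_is_smaller` are hypothetical helper names.

```python
def uleb128_len(value):
    """Number of bytes in the unsigned LEB128 encoding of value."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def sleb128_len(value):
    """Number of bytes in the signed LEB128 encoding of value."""
    n = 1
    while not (-64 <= value < 64):
        value >>= 7
        n += 1
    return n


def segment_encoded_size(offset, length):
    """Encoded size of an active segment in memory 0 with a constant offset."""
    flag = 1                                  # 0x00: active, memory index 0
    offset_expr = 1 + sleb128_len(offset) + 1  # i32.const <offset> end
    return flag + offset_expr + uleb128_len(length) + length


def merged_is_smaller(a, b):
    """a and b are (offset, length) pairs, with a ending at or before b starts.

    Merging drops one header but stores the gap between them as zero bytes.
    """
    separate = segment_encoded_size(*a) + segment_encoded_size(*b)
    merged_length = b[0] + b[1] - a[0]
    return segment_encoded_size(a[0], merged_length) < separate


print(merged_is_smaller((0, 3), (7, 3)))  # 4-byte gap
print(merged_is_smaller((0, 3), (8, 3)))  # 5-byte gap
```

Under these assumptions a small segment header costs about five bytes, so bridging a gap of that size or more no longer pays off in file size. Any threshold another pass uses for splitting would need to agree with this break-even point to avoid the two passes undoing each other.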

LegionMammal978 requested a review from a team as a code owner April 27, 2026 16:30
LegionMammal978 requested review from kripken and removed request for a team April 27, 2026 16:30

MaxGraey (Contributor) commented Apr 27, 2026

It looks like all of this was created using an LLM? You should add lit tests (test/lit/passes/<some-pass-name>.wast) that demonstrate how the data segments are merged and that check the edge cases. You also need to run the fuzz tests for at least 5-7 hours.

LegionMammal978 (Author) commented

No, I wrote everything in this PR myself; I'm not a big fan of LLMs' coding style. (Indeed, I asked an LLM to review my merge functions, and it kept wanting to add all sorts of extraneous steps.)

I think I see how to write the lit tests (create the file with the inputs, run scripts/update_lit_tests.py, and double-check that the outputs are correct). But how do I run the fuzz tests?

MaxGraey (Contributor) commented

> But how do I run the fuzz tests?

Run it from the repository root:

./scripts/fuzz_opt.py

LegionMammal978 marked this pull request as draft April 27, 2026 20:27