Skip to content

fix(math): split multi-char m:r text per character to match Word (SD-2632)#2875

Open
caio-pizzol wants to merge 4 commits intomainfrom
caio/sd-2632-math-multi-char-operator-strings-render-as-mi-instead-of
Open

fix(math): split multi-char m:r text per character to match Word (SD-2632)#2875
caio-pizzol wants to merge 4 commits intomainfrom
caio/sd-2632-math-multi-char-operator-strings-render-as-mi-instead-of

Conversation

@caio-pizzol
Copy link
Copy Markdown
Contributor

Word's OMML→MathML converter classifies each character in an m:r run individually — consecutive digits group into one <mn>, each operator character becomes its own <mo>, everything else becomes its own <mi>. SuperDoc was emitting the full run text as a single element, so →∞ came out as <mi>→∞</mi> and lost operator spacing and semantics.

  • tokenizeMathText does per-char classification with digit grouping (one optional decimal point). convertMathRun returns a DocumentFragment when a run splits, so siblings flow directly into the parent <mrow> with no extra wrapper.
  • m:fName is the documented exception — multi-letter function names like sin/lim stay whole. convertFunction routes m:r children through a new convertMathRunWhole helper, and collapseFunctionNameBases re-merges the base slot of any structural element Word nested inside m:fName (e.g. m:limLow<munder>).
  • MathObjectConverter's return widens from Element|null to Node|null so the fragment return works without casting. All 19 other converters already return Element, which is assignable.
  • Drops U+221E (∞) from OPERATOR_CHARS — it's a constant per Word, not an operator.

Verified against Word's own OMML2MML.XSL across 15+ input shapes — byte-identical output for bare math, and matching whole-name behavior inside m:fName.

Rejected: narrowing the split to "only when the run contains an operator" (Option B in investigation). It would have fixed the reported →∞ case without cascading, but left SuperDoc rendering <mi>sin</mi> vs Word <mi>s</mi><mi>i</mi><mi>n</mi> for bare function names — and downstream rendering issues (operator spacing inside the run) would keep surfacing. Chose full parity.
Review: the collapseFunctionNameBases pass is the subtlest part — it walks the post-split tree and re-merges <mi> siblings in structural bases. Confirm the BASE_BEARING_ELEMENTS set covers every shape Word nests under m:fName that we've actually seen. Also worth checking that a same-variant requirement is correct (currently only merges when all siblings share the same mathvariant).
Verified: 172/172 unit tests, 79/79 behavior tests, type-check clean. Pre-existing 3 m:groupChr tests updated — their munder.querySelector('mo') descendant-search now picks up split operators inside the base expression, so they were re-scoped to :scope > mo (the group character, which is what they were actually testing).

…2632)

Word's OMML2MML.XSL classifies each character in an m:r run individually —
digits group into a single <mn> (with one optional decimal point between
digits), operator characters each become their own <mo>, and everything
else becomes its own <mi>. SuperDoc was emitting the entire run text as
one element, so runs like "→∞" or "x+1" rendered as a single <mi>, losing
operator spacing and semantics.

tokenizeMathText implements the per-character classification. convertMathRun
returns a single Element for one-atom runs and a DocumentFragment when
multiple atoms are emitted, so siblings flow directly into the parent's
<mrow> without an extra wrapper.

m:fName is the documented exception — Word keeps multi-letter function
names like "sin" or "lim" as one <mi> inside the function-name slot.
convertFunction routes m:r children through convertMathRunWhole (no
splitting), and a new collapseFunctionNameBases pass re-merges the base
slot of any structural MathML element (munder/mover/msub/…) that Word
nested inside m:fName — without this, "lim" inside m:limLow would
incorrectly split to three <mi>.

Also drops U+221E (∞) from OPERATOR_CHARS — it's a mathematical constant
per Word's XSL, not an operator.

MathObjectConverter's return type widens from Element|null to Node|null
so convertMathRun can return a DocumentFragment. All other converters
already return Element, which is assignable to Node — no other changes.

Verified against real Word-native fixtures: `→∞` in the limit-tests
fixture case 1 now renders as <mi>n</mi><mo>→</mo><mi>∞</mi> (matches
Word OMML2MML.XSL byte-for-byte), and nested limits keep their function
names intact.

Ref ECMA-376 §22.1.2.116, Annex L.6.1.13, §22.1.2.58.
@linear
Copy link
Copy Markdown

linear bot commented Apr 20, 2026

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9ceec55257

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/layout-engine/painters/dom/src/features/math/converters/math-run.ts Outdated
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

caio-pizzol and others added 3 commits April 20, 2026 18:18
…pts (SD-2632)

Follow-up to /review feedback backed by fresh Word OMML2MML.XSL evidence:

- `convertMathRunAsFunctionName` (renamed from `convertMathRunWhole` since
  "whole" no longer fits) now groups consecutive non-digit / non-operator
  characters into one <mi> while still splitting digits and operators. Word's
  XSL for `<m:fName><m:r>log_2</m:r></m:fName>` produces
  `<mi>log</mi><mo>_</mo><mn>2</mn>` — not `<mi>l</mi><mi>o</mi><mi>g</mi>…`.
- `BASE_BEARING_ELEMENTS` gains `mmultiscripts` — Word emits it when an
  `m:sPre` sits inside `m:fName`; our base-collapse pass needs to know to
  merge the first-child <mi> run.
- CONTRIBUTING.md now documents the widened `Node | null` return type.

Tests added:
- Direct `tokenizeMathText` edge cases: `.5` / `5.` / `1.2.3` / `2x+1` /
  consecutive operators / empty / standalone ∞.
- m:fName mixed-content: `log_2` stays `<mi>log</mi><mo>_</mo><mn>2</mn>`.
- Base collapse inside nested `m:sSub` under `m:fName`.
- Base collapse inside nested `m:sPre` (mmultiscripts) under `m:fName`.
- Behavior test tightened to pin the full 3-atom sequence for `n→∞`.

Disputed during review and deferred with evidence:
- Opus claim that standalone `m:limLow` with "lim" base regresses to italic:
  Word XSL itself splits "lim" per-char in that shape (with or without
  m:sty=p), so our output matches Word.
- Codex claim that Arabic-Indic digits should be `<mn>`: Word XSL also
  classifies them as `<mi>`, so our behavior matches.
- Non-BMP surrogate-pair support: edge case in extreme mathematical
  alphanumerics; Word XSL itself errored on U+1D465. Separate ticket worth.
Addresses Codex bot review on PR #2875. Astral-plane mathematical
alphanumerics (e.g. U+1D465 mathematical italic x, U+1D7D9
mathematical double-struck 1) are UTF-16 surrogate pairs. Walking
text by code unit split them into two half-pair <mi> atoms with
invalid content. `codePointUnitLength` returns 2 when the current
position starts a surrogate pair so tokenizeMathText and
tokenizeFunctionNameText step across the full code point.
@caio-pizzol caio-pizzol enabled auto-merge April 20, 2026 21:54
@caio-pizzol caio-pizzol added this pull request to the merge queue Apr 20, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants