fix(math): split multi-char m:r text per character to match Word (SD-2632)#2875
Open
caio-pizzol wants to merge 4 commits intomainfrom
Open
fix(math): split multi-char m:r text per character to match Word (SD-2632)#2875caio-pizzol wants to merge 4 commits intomainfrom
caio-pizzol wants to merge 4 commits intomainfrom
Conversation
…2632) Word's OMML2MML.XSL classifies each character in an m:r run individually — digits group into a single <mn> (with one optional decimal point between digits), operator characters each become their own <mo>, and everything else becomes its own <mi>. SuperDoc was emitting the entire run text as one element, so runs like "→∞" or "x+1" rendered as a single <mi>, losing operator spacing and semantics. tokenizeMathText implements the per-character classification. convertMathRun returns a single Element for one-atom runs and a DocumentFragment when multiple atoms are emitted, so siblings flow directly into the parent's <mrow> without an extra wrapper. m:fName is the documented exception — Word keeps multi-letter function names like "sin" or "lim" as one <mi> inside the function-name slot. convertFunction routes m:r children through convertMathRunWhole (no splitting), and a new collapseFunctionNameBases pass re-merges the base slot of any structural MathML element (munder/mover/msub/…) that Word nested inside m:fName — without this, "lim" inside m:limLow would incorrectly split to three <mi>. Also drops U+221E (∞) from OPERATOR_CHARS — it's a mathematical constant per Word's XSL, not an operator. MathObjectConverter's return type widens from Element|null to Node|null so convertMathRun can return a DocumentFragment. All other converters already return Element, which is assignable to Node — no other changes. Verified against real Word-native fixtures: `→∞` in the limit-tests fixture case 1 now renders as <mi>n</mi><mo>→</mo><mi>∞</mi> (matches Word OMML2MML.XSL byte-for-byte), and nested limits keep their function names intact. Ref ECMA-376 §22.1.2.116, Annex L.6.1.13, §22.1.2.58.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9ceec55257
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…pts (SD-2632) Follow-up to /review feedback backed by fresh Word OMML2MML.XSL evidence: - `convertMathRunAsFunctionName` (renamed from `convertMathRunWhole` since "whole" no longer fits) now groups consecutive non-digit / non-operator characters into one <mi> while still splitting digits and operators. Word's XSL for `<m:fName><m:r>log_2</m:r></m:fName>` produces `<mi>log</mi><mo>_</mo><mn>2</mn>` — not `<mi>l</mi><mi>o</mi><mi>g</mi>…`. - `BASE_BEARING_ELEMENTS` gains `mmultiscripts` — Word emits it when an `m:sPre` sits inside `m:fName`; our base-collapse pass needs to know to merge the first-child <mi> run. - CONTRIBUTING.md now documents the widened `Node | null` return type. Tests added: - Direct `tokenizeMathText` edge cases: `.5` / `5.` / `1.2.3` / `2x+1` / consecutive operators / empty / standalone ∞. - m:fName mixed-content: `log_2` stays `<mi>log</mi><mo>_</mo><mn>2</mn>`. - Base collapse inside nested `m:sSub` under `m:fName`. - Base collapse inside nested `m:sPre` (mmultiscripts) under `m:fName`. - Behavior test tightened to pin the full 3-atom sequence for `n→∞`. Disputed during review and deferred with evidence: - Opus claim that standalone `m:limLow` with "lim" base regresses to italic: Word XSL itself splits "lim" per-char in that shape (with or without m:sty=p), so our output matches Word. - Codex claim that Arabic-Indic digits should be `<mn>`: Word XSL also classifies them as `<mi>`, so our behavior matches. - Non-BMP surrogate-pair support: edge case in extreme mathematical alphanumerics; Word XSL itself errored on U+1D465. Separate ticket worth.
Addresses Codex bot review on PR #2875. Astral-plane mathematical alphanumerics (e.g. U+1D465 mathematical italic x, U+1D7D9 mathematical double-struck 1) are UTF-16 surrogate pairs. Walking text by code unit split them into two half-pair <mi> atoms with invalid content. `codePointUnitLength` returns 2 when the current position starts a surrogate pair so tokenizeMathText and tokenizeFunctionNameText step across the full code point.
…s-render-as-mi-instead-of
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Word's OMML→MathML converter classifies each character in an
m:rrun individually — consecutive digits group into one<mn>, each operator character becomes its own<mo>, everything else becomes its own<mi>. SuperDoc was emitting the full run text as a single element, so→∞came out as<mi>→∞</mi>and lost operator spacing and semantics.tokenizeMathTextdoes per-char classification with digit grouping (one optional decimal point).convertMathRunreturns aDocumentFragmentwhen a run splits, so siblings flow directly into the parent<mrow>with no extra wrapper.m:fNameis the documented exception — multi-letter function names likesin/limstay whole.convertFunctionroutesm:rchildren through a newconvertMathRunWholehelper, andcollapseFunctionNameBasesre-merges the base slot of any structural element Word nested insidem:fName(e.g.m:limLow→<munder>).MathObjectConverter's return widens fromElement|nulltoNode|nullso the fragment return works without casting. All 19 other converters already returnElement, which is assignable.U+221E(∞) fromOPERATOR_CHARS— it's a constant per Word, not an operator.Verified against Word's own
OMML2MML.XSLacross 15+ input shapes — byte-identical output for bare math, and matching whole-name behavior insidem:fName.Rejected: narrowing the split to "only when the run contains an operator" (Option B in investigation). It would have fixed the reported
→∞case without cascading, but left SuperDoc rendering<mi>sin</mi>vs Word<mi>s</mi><mi>i</mi><mi>n</mi>for bare function names — and downstream rendering issues (operator spacing inside the run) would keep surfacing. Chose full parity.Review: the
collapseFunctionNameBasespass is the subtlest part — it walks the post-split tree and re-merges<mi>siblings in structural bases. Confirm theBASE_BEARING_ELEMENTSset covers every shape Word nests underm:fNamethat we've actually seen. Also worth checking that a same-variant requirement is correct (currently only merges when all siblings share the same mathvariant).Verified: 172/172 unit tests, 79/79 behavior tests, type-check clean. Pre-existing 3
m:groupChrtests updated — theirmunder.querySelector('mo')descendant-search now picks up split operators inside the base expression, so they were re-scoped to:scope > mo(the group character, which is what they were actually testing).