add UNICODE_CHARACTER_TO_ASCII and CHARACTER_TO_NAME tables by mmatera · Pull Request #169 · Mathics3/Mathics3-scanner

mmatera · 2026-03-29T16:09:18Z

This PR adds the tables we need to use in Mathics-core to implement the ASCII encodings.

mmatera · 2026-03-29T16:10:07Z

mathics_scanner/data/named-characters.yml

 CapitalDifferentialD:
  amslatex: '\mathbb{D}'
-  ascii: "d"
+  ascii: "D"


This is the symbol used in path integrals, with a capital D in the differential.

mmatera · 2026-03-29T16:10:19Z

mathics_scanner/data/named-characters.yml

 # \u2146
 DifferentialD:
  amslatex: '\,d'
+  ascii: "d"


This was missing.
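As a rough illustration of how `ascii` fields like the two above could feed a Unicode-to-ASCII table, here is a minimal sketch. It is not the code in this PR: the `NAMED_CHARACTERS` dict is a hand-written stand-in for the YAML data, and the `unicode-equivalent` key and the code points chosen are assumptions made for the example.

```python
# Hypothetical, hand-written stand-in for a few named-characters.yml entries.
# The "unicode-equivalent" key and its values are assumptions for illustration.
NAMED_CHARACTERS = {
    "CapitalDifferentialD": {"ascii": "D", "unicode-equivalent": "\U0001d437"},
    "DifferentialD": {"ascii": "d", "unicode-equivalent": "\U0001d451"},
    "Integral": {"unicode-equivalent": "\u222b"},  # no ASCII fallback listed
}


def build_unicode_to_ascii(named_characters):
    """Map each character that has an ``ascii`` field to that ASCII text."""
    table = {}
    for fields in named_characters.values():
        char = fields.get("unicode-equivalent")
        ascii_text = fields.get("ascii")
        # Skip entries that lack either a character or an ASCII fallback.
        if char is not None and ascii_text is not None:
            table[char] = ascii_text
    return table


unicode_to_ascii = build_unicode_to_ascii(NAMED_CHARACTERS)
```

With this stand-in data, only the two entries that carry an `ascii` field end up in the table, which is why a missing field such as the one added for `DifferentialD` matters.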

rocky · 2026-03-29T21:46:20Z

mathics_scanner/characters.py

+# TODO: add WL characters to CHARACTER_TO_NAME. For example, "\uF74C" in WMA is named as
+# \[DifferentialD]. Here we are using "\U0001D451" for that name, because is a character
+# we can print with standard fonts. The problem with this approach is that the map
+# would not be invertible anymore. 


I've been thinking about this general topic, and here is my current thinking.

I am pretty sure that Mathics3's $CharacterEncoding for UTF-8 is different from WMA's. In particular, WMA has some definitions in the user-defined (private-use) Unicode space that we don't follow. I think there are also other symbols where we've decided that the glyph used by WMA doesn't look right.

We could make those two align and change Mathics3's default $CharacterEncoding value to something else.

Related to this, I see that WMA defines seven encodings: Mathematica1 to Mathematica7. I suppose we could do something like this, too.

There seems to be a desire to have an encoding that does not map a symbol like ASCII "x" to more than one letter or named character. Strictly speaking, this is impossible to accomplish with a single-byte code such as ASCII (7-bit) or any of the ISO 8859 character encodings (8-bit), because there are far more named characters than available code values.

For reference, UTF-8 is not a fixed 8-bit code; it is a variable-length code with a minimum of 8 bits. That's why it is possible to give all WMA named characters distinct values. (Whether we do, or want to, is a different story.)
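The difference between a single-byte code and UTF-8 can be checked directly in Python. The sketch below only uses the standard `str.encode` method; the sample labels and the helper name `fits_in_one_byte` are made up for the example.

```python
# Sample characters: a plain ASCII letter, the integral sign, the private-use
# code point mentioned in this thread, and the mathematical italic small d.
samples = {
    "x": "x",
    "Integral": "\u222b",
    "private-use F74C": "\uf74c",
    "math italic d": "\U0001d451",
}

# UTF-8 uses between one and four bytes per character.
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}


def fits_in_one_byte(ch, encoding):
    """Return True if ``ch`` can be encoded as a single byte in ``encoding``."""
    try:
        return len(ch.encode(encoding)) == 1
    except UnicodeEncodeError:
        return False


for label, ch in samples.items():
    print(label, utf8_lengths[label], fits_in_one_byte(ch, "latin-1"))
```

Only "x" fits in one byte under `latin-1` (ISO 8859-1) or `ascii`; the other three characters need three or four bytes in UTF-8 and cannot be represented at all in the single-byte codes.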

Previously, I had said that WMA's CharacterEncoding is equivalent to the older concept of a "code page" that was popular in computers. I now realize this is not quite right, because a code page encodes each character in a fixed number of bytes (usually one, but sometimes two). CharacterEncoding does this, but it also handles Unicode and possibly other variable-length encodings (if that is a thing).

So, coming back to things to think about and do (and when to do them):

1. What should our default $CharacterEncoding be? If we call it UTF-8, what kind of UTF-8? What we use now, or what WMA uses? If it is WMA's, what do we call what we do now?
2. Does WMA's UTF-8 define different character codes for all of the named characters? We have seen that operators can have several representations, or that two operators can map to the same symbol. For example, the multiplication operator can be either a space or some other symbol. And both Optional and Pattern operators map to "?".

rocky · 2026-03-29T22:04:33Z

@mmatera: I looked briefly at the CI failures and verified that it is not the case that master is broken. A simple suggestion is to just try the changes to named-characters.yml, and then come back to the Python changes in mathics_scanner/characters.py. (And if there is a problem changing the YAML for these two entries, that's interesting too.)

mmatera · 2026-03-30T02:31:21Z

> @mmatera: I looked briefly at the CI failures and verified that it is not the case that master is broken. A simple suggestion is to just try the changes to named-characters.yml, and then come back to the Python changes in mathics_scanner/characters.py. (And if there is a problem changing the YAML for these two entries, that's interesting too.)

Changes in YAML files were included in #170, and it seems they pass. The problem seems to be in the Python code, but I still do not understand why.

rocky · 2026-03-30T09:01:55Z

>> @mmatera: I looked briefly at the CI failures and verified that it is not the case that master is broken. A simple suggestion is to just try the changes to named-characters.yml, and then come back to the Python changes in mathics_scanner/characters.py. (And if there is a problem changing the YAML for these two entries, that's interesting too.)
>
> Changes in YAML files were included in #170, and it seems they pass. The problem seems to be in the Python code, but I still do not understand why.

When I looked briefly, no problem jumped out. I suspect it has to do with the data produced.


mmatera · 2026-04-02T18:32:40Z

@rocky, after doing some tests, I verified that the Mathics3-doctest workflow fails even if I comment out the new code. Maybe there are some changes in setuptools that we have not noticed.


mmatera · 2026-04-03T14:30:33Z

@rocky, now it seems this is ready. Any comments?

rocky · 2026-04-03T14:42:36Z

> @rocky, now it seems this is ready. Any comments?

This is fine for now.

However, we haven't resolved how we want to address the larger issues:

1. What does `$CharacterEncoding="UTF-8"` mean in Mathics3? (Is it the same as in WMA, or is it different?)
2. If it is different, then what `$CharacterEncoding` value is used by WMA (with user-space Unicode)? If it is the same, then what `$CharacterEncoding` value is used for the common Unicode set that most people have installed (no WMA user-space encodings)?

The answer to these questions is probably going to influence these tables.

However, we should probably address this after the release, and then we'll have another API-incompatible (breaking) release.

mmatera · 2026-04-03T16:22:16Z

>> @rocky, now it seems this is ready. Any comments?
>
> This is fine for now.
>
> However, we haven't resolved how we want to address the larger issues:
> 1. What does `$CharacterEncoding="UTF-8"` mean in Mathics3? (Is it the same or is it different from WMA)

For now, I would identify the internal representation with UTF-8 and with a slightly modified version of WMA's default character set, where \[Integral] is U+222B while \[DifferentialD] is U+1D451 instead of U+F74C. I would also like the parser to accept both characters (U+1D451 and U+F74C) and treat them as \[DifferentialD] in expressions.

Notice that when WMA produces text to share, it uses the "reversible" ASCII form (\[DifferentialD]).
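To make the trade-off concrete, here is a minimal sketch of a reader that accepts both code points for the same name. The tables and function names are hypothetical and are not the ones added in this PR. It shows that the named ASCII form stays the same for both inputs, while the character-level map can no longer be inverted.

```python
# Hypothetical table: both code points are read as the same named character.
CHAR_TO_NAME = {
    "\uf74c": "DifferentialD",      # private-use code point used by WMA
    "\U0001d451": "DifferentialD",  # printable with standard fonts
    "\u222b": "Integral",
}

# Going back, only one code point per name can be chosen.
NAME_TO_CHAR = {"DifferentialD": "\U0001d451", "Integral": "\u222b"}


def to_named_form(text):
    """Spell every known special character as \\[Name]; keep the rest."""
    return "".join(
        f"\\[{CHAR_TO_NAME[c]}]" if c in CHAR_TO_NAME else c for c in text
    )


def normalize(text):
    """Replace every known special character by its preferred code point."""
    return "".join(
        NAME_TO_CHAR[CHAR_TO_NAME[c]] if c in CHAR_TO_NAME else c for c in text
    )


wma_text = "\u222bF[x]\uf74cx"
our_text = "\u222bF[x]\U0001d451x"

print(to_named_form(wma_text))           # same named form for both inputs
print(normalize(wma_text) == our_text)   # the private-use d is rewritten
```

After `normalize`, the original private-use character cannot be recovered, which is the loss of invertibility mentioned in the TODO comment in `characters.py`.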

> 2. If it is different, then what `$CharacterEncoding` value is used by WMA (with user-space Unicode)? If it is the same, then what `$CharacterEncoding` value is used for the common Unicode set that most people have installed (no WMA user space encodings)?

In the version I have,

In[1]:= $CharacterEncoding                                                      

Out[1]= UTF-8

In[2]:= "\[Integral]F[x]\[DifferentialD]x"                                      

Out[2]= ∫F[x]x

> The answer to these questions is probably going to influence these tables.
>
> However, we should probably address this after release, and we'll have another API incompatible breaking release.

These are different things: one is the API, and the other is how we internally represent some special characters. We can change the second while keeping the API untouched.

rocky · 2026-04-03T16:33:39Z

>>> @rocky, now it seems this is ready. Any comments?
>>
>> This is fine for now.
>> However, we haven't resolved how we want to address the larger issues:
>> 1. What does `$CharacterEncoding="UTF-8"` mean in Mathics3? (Is it the same or is it different from WMA)
>
> For now, I would identify the internal representation with UTF-8 and with a slightly modified version of WMA's default character set, where \[Integral] is U+222B while \[DifferentialD] is U+1D451 instead of U+F74C. I would also like the parser to accept both characters (U+1D451 and U+F74C) and treat them as \[DifferentialD] in expressions.

This was addressed a while back. If not, open an issue.

> Notice that when WMA produces text to share, it uses the "reversible" ASCII form (\[DifferentialD]).

This is a bit vague, and I fear it is masking some sort of misunderstanding.

>> 2. If it is different, then what `$CharacterEncoding` value is used by WMA (with user-space Unicode)? If it is the same, then what `$CharacterEncoding` value is used for the common Unicode set that most people have installed (no WMA user space encodings)?
>
> In the version I have,
>
> In[1]:= $CharacterEncoding
>
> Out[1]= UTF-8
>
> In[2]:= "\[Integral]F[x]\[DifferentialD]x"
>
> Out[2]= ∫F[x]x
>
>> The answer to these questions is probably going to influence these tables.
>>
>> However, we should probably address this after release, and we'll have another API incompatible breaking release.
>
> These are different things: one is the API, and the other is how we internally represent some special characters. We can change the second while keeping the API untouched.

There is still a basic misunderstanding. There is no "internal character representation". The internal representation is a parse tree and then a Mathics3 {S,M}-expression. Turning expressions into a string is determined by the Form and the CharacterEncoding.

mmatera · 2026-04-03T16:56:07Z

By "internal representation" I mean the internal representation of a character. For example, if you input \[Integral] inside a string expression, it is translated to the code point U+222B (decimal 8747), regardless of the encoding:

In[1]:= $CharacterEncoding                                                      

Out[1]= UTF-8

In[2]:= s="\[Integral]"                                                         

Out[2]= ∫

In[3]:= ToCharacterCode[s]                                                      

Out[3]= {8747}

In[4]:= $CharacterEncoding="ASCII"                                              

Out[4]= ASCII

In[5]:= s2="\[Integral]"                                                        

Out[5]= \[Integral]

In[6]:= ToCharacterCode[s2]                                                     

Out[6]= {8747}

It works this way until we reach the "render" step, in which that character is converted into the sequence of characters that best matches what \[Integral] represents in the current encoding.
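The behaviour in the session above can be imitated in a few lines of Python. This is only a sketch of the idea, not the Mathics3 implementation: the one-entry `CHARACTER_TO_NAME` dict and the `render` function are hypothetical, and the fallback for unnamed characters uses Python's own `backslashreplace` handler rather than any WMA convention.

```python
# Hypothetical one-entry table; the real tables are much larger.
CHARACTER_TO_NAME = {"\u222b": "Integral"}


def render(text, encoding):
    """Render ``text`` for ``encoding``, spelling out what cannot be encoded."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)
            pieces.append(ch)  # representable as-is in this encoding
        except UnicodeEncodeError:
            name = CHARACTER_TO_NAME.get(ch)
            if name is not None:
                pieces.append(f"\\[{name}]")
            else:
                # Assumption: fall back to Python's escape form for the rest.
                escaped = ch.encode(encoding, "backslashreplace")
                pieces.append(escaped.decode(encoding))
    return "".join(pieces)


s = "\u222b"
print(ord(s))              # the stored code point does not depend on encoding
print(render(s, "utf-8"))
print(render(s, "ascii"))
```

The stored string is the same in both cases; only the output of the render step changes with the encoding.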

add UNICODE_CHARACTER_TO_ASCII and CHARACTER_TO_NAME tables

7ed9e9f


using the variable

582a161


mmatera added 10 commits April 2, 2026 14:16

Merge remote-tracking branch 'origin/master' into add_UNICODE_CHARACTER_TO_ASCII_and_CHARACTER_TO_NAME_tables

517d49c

black

44ba4f3

add operator-to-unicode and operator-to-ascii just if the tables are already loaded.

cd524c9

install mathics3 in pyodide workflow

5bf07a4

no isolation in mathics3-doctest workflow

aad225f

no build isolation for mathics3-doctst

95d18ee

undo changes

b01e703

try what happens if we comment out the new code...

aed4e80

try what happens if we comment out the new code...

eaa7a03

resinstate changes

7024b8d

Merge remote-tracking branch 'origin/master' into add_UNICODE_CHARACTER_TO_ASCII_and_CHARACTER_TO_NAME_tables

5fc366d

mmatera merged commit 32c98ad into master Apr 3, 2026
12 checks passed

mmatera deleted the add_UNICODE_CHARACTER_TO_ASCII_and_CHARACTER_TO_NAME_tables branch April 3, 2026 16:22
