add UNICODE_CHARACTER_TO_ASCII and CHARACTER_TO_NAME tables #169

Merged
mmatera merged 13 commits into master from add_UNICODE_CHARACTER_TO_ASCII_and_CHARACTER_TO_NAME_tables on Apr 3, 2026

Contributor

@mmatera mmatera commented Mar 29, 2026

This PR adds the tables we need to use in Mathics-core to implement the ASCII encodings.

  CapitalDifferentialD:
    amslatex: '\mathbb{D}'
-   ascii: "d"
+   ascii: "D"
Contributor Author


This is the symbol used in path integrals, with a capital D in the differential.

# \u2146
DifferentialD:
  amslatex: '\,d'
  ascii: "d"
Contributor Author


This was missing.

# TODO: add WL characters to CHARACTER_TO_NAME. For example, "\uF74C" in WMA is named
# \[DifferentialD]. Here we are using "\U0001D451" for that name, because it is a character
# we can print with standard fonts. The problem with this approach is that the map
# would not be invertible anymore.
Member

@rocky rocky Mar 29, 2026


I've been thinking about this general topic, and here is my current thinking.

I am pretty sure that Mathics3's $CharacterEncoding for UTF-8 is different from WMA's. In particular, WMA has some definitions in the user-defined (private-use) Unicode space that we don't follow. I think there are also other symbols where we've decided that the glyph used by WMA doesn't look right.

We could make those two align and change Mathics3's default $CharacterEncoding value to something else.

Related to this, I see that WMA defines seven encodings: Mathematica1 to Mathematica7. I suppose we could do something like this, too.

There seems to be a desire to have an encoding that does not map a symbol like ASCII "x" to more than one letter or named character. Strictly speaking, this is impossible to accomplish with a 7-bit code like ASCII, or with any of the ISO 8859 character encodings, which are defined to be 8-bit representations.

For reference, UTF-8 is not a fixed 8-bit code; it is a variable-length code that uses one to four 8-bit bytes per code point. That's why it is possible to give all WMA named characters distinct values. (Whether we do, or want to, is a different story.)
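
To make the size argument concrete, here is a small Python check (an illustrative sketch, not code from this PR) of how many bytes UTF-8 needs for the code points discussed in this thread, and of the fact that ASCII cannot encode most of them directly. The helper names `utf8_length` and `fits_in_ascii` are made up for this example.

```python
# Code points mentioned in this thread; the labels are only for display.
SAMPLES = {
    "x (ASCII letter)": "x",
    "U+222B INTEGRAL": "\u222b",
    "U+F74C (private-use code point WMA names \\[DifferentialD])": "\uf74c",
    "U+1D451 MATHEMATICAL ITALIC SMALL D": "\U0001d451",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def fits_in_ascii(ch: str) -> bool:
    """Return True if the character can be encoded as plain ASCII."""
    try:
        ch.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


for label, ch in SAMPLES.items():
    print(f"{label}: {utf8_length(ch)} byte(s), ASCII-encodable: {fits_in_ascii(ch)}")
```

Only "x" fits in ASCII; the other three need three or four bytes in UTF-8, which is why a single-byte encoding has to fall back to multi-character spellings for them.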

Previously, I had said that WMA's CharacterEncoding is equivalent to the older concept of a "Code Page" that was popular in computers. I now realize this is not quite right, because a code page uses a fixed number of bytes per character (usually one, but sometimes two). CharacterEncoding does this, but it also handles Unicode and possibly other variable-length encodings (if that is a thing).

So, coming back to things to think about and do (and when to do them):

- What should our default $CharacterEncoding be? If we call it UTF-8, what kind of UTF-8? What we use now, or what WMA uses? If it is WMA's, what do we call what we do now?
- Does WMA's UTF-8 define different character codes for all of the named characters? We have seen that operators can have several representations, or that two operators can map to the same symbol. For example, the multiplication operator can be either a space or some other symbol. And both Optional and Pattern operators map to "?".

Member

rocky commented Mar 29, 2026

@mmatera: I looked briefly at the CI failures and verified that it is not the case that master is broken. A simple suggestion is to just try the changes to named-characters.yml, and then come back to the Python changes in mathics_scanner/characters.py. (And if there is a problem changing the YAML for these two entries, that's interesting too.)

Contributor Author

mmatera commented Mar 30, 2026

> @mmatera: I looked briefly at the CI failures and verified that it is not the case that master is broken. A simple suggestion is to just try the changes to named-characters.yml, and then come back to the Python changes in mathics_scanner/characters.py. (And if there is a problem changing the YAML for these two entries, that's interesting too.)

Changes in YAML files were included in #170, and it seems they pass. The problem seems to be in the Python code, but I still do not understand why.

Member

rocky commented Mar 30, 2026

> > @mmatera: I looked briefly at the CI failures and verified that it is not the case that master is broken. A simple suggestion is to just try the changes to named-characters.yml, and then come back to the Python changes in mathics_scanner/characters.py. (And if there is a problem changing the YAML for these two entries, that's interesting too.)

> Changes in YAML files were included in #170, and it seems they pass. The problem seems to be in the Python code, but I still do not understand why.

When I looked at it briefly, no problem jumped out. I suspect it has to do with the data produced.

Contributor Author

mmatera commented Apr 2, 2026

@rocky, after doing some tests, I verified that the Mathics3-doctest workflow fails even if I comment out the new code. Maybe it is caused by some change in setuptools that we have not noticed.

Contributor Author

mmatera commented Apr 3, 2026

@rocky, now it seems this is ready. Any comments?

Member

rocky commented Apr 3, 2026

> @rocky, now it seems this is ready. Any comments?

This is fine for now.

However, we haven't resolved how we want to address the larger issues:

  1. What does $CharacterEncoding="UTF-8" mean in Mathics3? (Is it the same as in WMA, or different?)
  2. If it is different, then what $CharacterEncoding value is used for WMA's variant (with user-space Unicode)? If it is the same, then what $CharacterEncoding value is used for the common Unicode set that most people have installed (no WMA user-space encodings)?

The answer to these questions is probably going to influence these tables.

However, we should probably address this after the release, and then we'll have another API-incompatible, breaking release.

Contributor Author

mmatera commented Apr 3, 2026

> > @rocky, now it seems this is ready. Any comments?

> This is fine for now.

> However, we haven't resolved how we want to address the larger issues:

> 1. What does `$CharacterEncoding="UTF-8"` mean in Mathics3? (Is it the same as in WMA, or different?)

For now, I would identify the internal representation with UTF-8 and with the slightly modified WMA default charset, where \[Integral] is U+222B while \[DifferentialD] is U+1D451 instead of U+F74C. I would also like the parser to accept both characters (U+1D451 and U+F74C) and treat them as \[DifferentialD] in expressions.
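
The idea of accepting both code points could be sketched roughly as follows. This is only an illustration: `CHARACTER_TO_NAME_SKETCH`, `NAME_TO_CHARACTER_SKETCH`, and `normalize` are hypothetical names, not the actual tables or functions in `mathics_scanner`. It also shows why the character-to-name map stops being invertible once two characters share a name, so the preferred character has to be stated separately.

```python
# Hypothetical tables for illustration; the real mathics_scanner tables may differ.
CHARACTER_TO_NAME_SKETCH = {
    "\u222b": "Integral",
    "\uf74c": "DifferentialD",      # WMA's private-use code point
    "\U0001d451": "DifferentialD",  # printable with standard fonts
}

# Because two characters share a name, the preferred character per name
# must be stated explicitly rather than derived by inverting the map above.
NAME_TO_CHARACTER_SKETCH = {
    "Integral": "\u222b",
    "DifferentialD": "\U0001d451",
}


def normalize(text: str) -> str:
    """Replace every known character by the preferred character for its name."""
    result = []
    for ch in text:
        name = CHARACTER_TO_NAME_SKETCH.get(ch)
        result.append(NAME_TO_CHARACTER_SKETCH[name] if name else ch)
    return "".join(result)


# Both spellings of the differential end up as the same string.
assert normalize("\u222bF[x]\uf74cx") == normalize("\u222bF[x]\U0001d451x")

# Naive inversion silently keeps only one character per name.
inverted = {name: ch for ch, name in CHARACTER_TO_NAME_SKETCH.items()}
assert len(inverted) < len(CHARACTER_TO_NAME_SKETCH)
```

Keeping a separate name-to-character table is one possible design; another would be to mark one entry per name as canonical inside a single table.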

Notice that when WMA produces text to share, it uses the "reversible" ASCII form (\[DifferentialD]).

> 2. If it is different, then what `$CharacterEncoding` value is used for WMA's variant (with user-space Unicode)? If it is the same, then what `$CharacterEncoding` value is used for the common Unicode set that most people have installed (no WMA user-space encodings)?

In the version I have,

In[1]:= $CharacterEncoding                                                      

Out[1]= UTF-8

In[2]:= "\[Integral]F[x]\[DifferentialD]x"                                      

Out[2]= ∫F[x]x

> The answer to these questions is probably going to influence these tables.

> However, we should probably address this after the release, and then we'll have another API-incompatible, breaking release.

These are different things: one is the API, and the other is how we internally represent some special characters. We can change the second while keeping the API untouched.

@mmatera mmatera merged commit 32c98ad into master Apr 3, 2026
12 checks passed
@mmatera mmatera deleted the add_UNICODE_CHARACTER_TO_ASCII_and_CHARACTER_TO_NAME_tables branch April 3, 2026 16:22
Member

rocky commented Apr 3, 2026

> > > @rocky, now it seems this is ready. Any comments?

> > This is fine for now.
> > However, we haven't resolved how we want to address the larger issues:

> > 1. What does `$CharacterEncoding="UTF-8"` mean in Mathics3? (Is it the same as in WMA, or different?)

> For now, I would identify the internal representation with UTF-8 and with the slightly modified WMA default charset, where \[Integral] is U+222B while \[DifferentialD] is U+1D451 instead of U+F74C. I would also like the parser to accept both characters (U+1D451 and U+F74C) and treat them as \[DifferentialD] in expressions.

This was addressed a while back. If not, open an issue.

> Notice that when WMA produces text to share, it uses the "reversible" ASCII form (\[DifferentialD]).

This is a bit vague, and I fear it is masking some sort of misunderstanding.

> > 2. If it is different, then what `$CharacterEncoding` value is used for WMA's variant (with user-space Unicode)? If it is the same, then what `$CharacterEncoding` value is used for the common Unicode set that most people have installed (no WMA user-space encodings)?

> In the version I have,

> In[1]:= $CharacterEncoding

> Out[1]= UTF-8

> In[2]:= "\[Integral]F[x]\[DifferentialD]x"

> Out[2]= ∫F[x]x

> > The answer to these questions is probably going to influence these tables.
> > However, we should probably address this after the release, and then we'll have another API-incompatible, breaking release.

> These are different things: one is the API, and the other is how we internally represent some special characters. We can change the second while keeping the API untouched.

There is still a basic misunderstanding. There is no "internal character representation". The internal representation is a parse tree and then a Mathics3 {S,M}-expression. How an expression is turned into a string is determined by the Form and the CharacterEncoding.

Contributor Author

mmatera commented Apr 3, 2026

By "internal representation" I mean the internal representation of a character. For example, if you input \[Integral] inside a string expression, it is translated, regardless of the encoding, to U+222B:

In[1]:= $CharacterEncoding                                                      

Out[1]= UTF-8

In[2]:= s="\[Integral]"                                                         

Out[2]= ∫

In[3]:= ToCharacterCode[s]                                                      

Out[3]= {8747}

In[4]:= $CharacterEncoding="ASCII"                                              

Out[4]= ASCII

In[5]:= s2="\[Integral]"                                                        

Out[5]= \[Integral]

In[6]:= ToCharacterCode[s2]                                                     

Out[6]= {8747}

It works this way until we reach the "render" step, in which that character is converted into the sequence of characters that best matches what \[Integral] represents in the current encoding.
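
A minimal sketch of that render step, under stated assumptions: the table below is a hypothetical stand-in for what a character-to-name table might contain, and `to_character_code` and `render` are made-up helpers, not Mathics3 or WMA functions. The string keeps its code points (so a `ToCharacterCode`-like view is the same under both encodings), and only the output text depends on the encoding.

```python
# Hypothetical stand-in table; the real CHARACTER_TO_NAME contents may differ.
CHARACTER_TO_NAME_SKETCH = {"\u222b": "Integral"}


def to_character_code(text: str) -> list:
    """Return the code points of the string, independent of any encoding."""
    return [ord(ch) for ch in text]


def render(text: str, encoding: str) -> str:
    """Produce output text; only "ASCII" gets special handling in this sketch."""
    if encoding != "ASCII":
        return text
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
        elif ch in CHARACTER_TO_NAME_SKETCH:
            # Fall back to the named-character form, e.g. \[Integral].
            pieces.append(f"\\[{CHARACTER_TO_NAME_SKETCH[ch]}]")
        else:
            # No name known: use a Python-style escape as a last resort.
            pieces.append(ch.encode("ascii", "backslashreplace").decode("ascii"))
    return "".join(pieces)


s = "\u222b"
print(to_character_code(s))  # [8747]
print(render(s, "UTF-8"))
print(render(s, "ASCII"))    # \[Integral]
```

The character code stays 8747 in both cases; the encoding only changes what `render` emits.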
