Conversation
| CapitalDifferentialD: | ||
| amslatex: '\mathbb{D}' | ||
| ascii: "d" | ||
| ascii: "D" |
This is the symbol used in path integrals, with a capital D in the differential.
| # \u2146 | ||
| DifferentialD: | ||
| amslatex: '\,d' | ||
| ascii: "d" |
mathics_scanner/characters.py
| # TODO: add WL characters to CHARACTER_TO_NAME. For example, "\uF74C" in WMA is named | ||
| # \[DifferentialD]. Here we are using "\U0001D451" for that name, because it is a character | ||
| # we can print with standard fonts. The problem with this approach is that the map | ||
| # would not be invertible anymore. |
I've been thinking about this general topic, and here is my current thinking.
I am pretty sure that Mathics3's $CharacterEncoding for UTF-8 is different from WMA's. In particular, WMA has some definitions in the user-defined Unicode space that we don't follow. I think there are also other symbols where we've decided that the glyph used by WMA doesn't look right.
We could make those two align and change Mathics3's default $CharacterEncoding value to something else.
Related to this, I see that WMA defines seven encodings: Mathematica1 to Mathematica7. I suppose we could do something like this, too.
There seems to be a desire to have an encoding that does not map a symbol like ASCII "x" to more than one letter or named character. Strictly speaking, this is impossible to accomplish with an 8-bit code, like ASCII, or any of the ISO8859 character encodings, which are defined to be 8-bit representations.
For reference, UTF-8 is not a fixed 8-bit code; it is a variable-length code that uses a minimum of 8 bits per character. That's why it is possible to give all WMA named characters distinct values. (Whether we do, or want to, is a different story.)
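To make the variable-length point concrete, here is a minimal Python sketch. It only uses the standard `str.encode` method; the helper names `utf8_length` and `fits_in_latin1` are made up for this illustration, and the code points are the ones mentioned elsewhere in this thread.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def fits_in_latin1(ch: str) -> bool:
    """Return True if an 8-bit code (ISO 8859-1 here) can represent the character."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Code points taken from this discussion; the labels are only for display.
samples = {
    "ASCII x": "x",
    "U+2146": "\u2146",
    "U+F74C (private use area)": "\uF74C",
    "U+1D451": "\U0001D451",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s) in UTF-8, "
          f"fits in an 8-bit code: {fits_in_latin1(ch)}")
```

An 8-bit code has at most 256 slots, so it cannot give every named character its own value, while UTF-8 simply spends more bytes on the code points that need them.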
Previously, I had said that WMA's CharacterEncoding is equivalent to the older concept of a "Code Page" that was popular in computers. I now realize this is not quite right, because a code page encodes each character in a fixed number of bytes (usually one, but sometimes two). CharacterEncoding does this, but it also handles Unicode and possibly other variable-length encodings (if that is a thing).
So, coming back to things to think about and do (and when to do them):
What should our default $CharacterEncoding be? If we call it UTF-8, what kind of UTF-8? What we use now, or what WMA uses? If it is WMA, what do we call what we do now?
Does WMA's UTF-8 define different character codes for all of the named characters? We have seen that operators can have several representations, or that two operators can map to the same symbol. For example, the multiplication operator can be either a space or some other symbol. And both Optional and Pattern operators map to "?".
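One way to check the "two names, one symbol" problem is to look for collisions in a name-to-ASCII table. The sketch below is only an illustration: `NAME_TO_ASCII` is a toy table, not the real data from named-characters.yml, and `find_collisions` is a hypothetical helper. The shared "d" entry mirrors what the CapitalDifferentialD entry appears to have looked like before this change.

```python
from collections import defaultdict

# Toy table for illustration only; the real entries live in named-characters.yml.
NAME_TO_ASCII = {
    "DifferentialD": "d",
    "CapitalDifferentialD": "d",  # assumed old value; this PR uses "D"
}


def find_collisions(table: dict) -> dict:
    """Group names by ASCII form and keep only the forms shared by several names."""
    names_by_ascii = defaultdict(list)
    for name, ascii_form in table.items():
        names_by_ascii[ascii_form].append(name)
    return {
        ascii_form: names
        for ascii_form, names in names_by_ascii.items()
        if len(names) > 1
    }


# Any ASCII form listed here cannot be mapped back to a unique name.
print(find_collisions(NAME_TO_ASCII))

# With distinct ASCII forms, the table becomes invertible again.
fixed = dict(NAME_TO_ASCII, CapitalDifferentialD="D")
print(find_collisions(fixed))
```

A check like this could run over the real tables to list every place where the ASCII mapping is not invertible.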
|
@mmatera: I looked briefly at the CI failures and verified that it is not the case that master is broken. A simple suggestion is to just try the changes to named-characters.yml, and then come back to the Python changes in |
Changes in YAML files were included in #170, and it seems they pass. The problem seems to be in the Python code, but I still do not understand why. |
When I looked at it briefly, no problem jumped out. I suspect it has to do with the data produced. |
…ER_TO_ASCII_and_CHARACTER_TO_NAME_tables
|
@rocky, after doing some tests, I verified that the Mathics3-doctest workflow fails even if I comment out the new code. Maybe there were some changes in setuptools that we have not noticed. |
…ER_TO_ASCII_and_CHARACTER_TO_NAME_tables
|
@rocky, now it seems this is ready. Any comments? |
This is fine for now. However, we haven't resolved how we want to address the larger issues:
The answer to these questions is probably going to influence these tables. However, we should probably address this after the release, and then we'll have another API-incompatible release. |
For now, I would identify the internal representation with UTF-8 and with the slightly modified WMA default charset. Notice that when WMA produces text to share, it uses the "reversible" ASCII form (
In the version I have,
These are different things: one is the API, and the other is how internally we represent some special characters. We can change the second, keeping the API untouched. |
This was addressed a while back. If not, open an issue.
This is a bit vague, and I fear it is masking some sort of misunderstanding.
There still is a basic misunderstanding. There is no "internal character representation". The internal representation is a Parse tree and then a Mathics3 {S,M}-expression. Turning expressions into a string is determined by the Form and the CharacterEncoding. |
|
By "internal representation" I mean the internal representation of a character. Let's say, if you input And this works this way until we reach the "render" step, in which that character is converted into the sequence of characters that best matches what |
This PR adds the tables we need to use in Mathics-core to implement the ASCII encodings.
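As a rough sketch of how such tables might be used at the render step (not the actual Mathics-core code), the function below rewrites a string into plain ASCII. `TO_ASCII`, `TO_NAME` and `render_ascii` are hypothetical names with toy data; the fallback uses the `\[Name]` form mentioned earlier in this thread.

```python
# Hypothetical tables with toy entries; the real ones are generated from the YAML data.
TO_ASCII = {"\u2146": "d"}    # character -> ASCII equivalent
TO_NAME = {"\u03b1": "Alpha"}  # character -> WL character name


def render_ascii(text: str) -> str:
    """Rewrite text using only ASCII characters, one input character at a time."""
    pieces = []
    for ch in text:
        if ch.isascii():
            pieces.append(ch)                    # already ASCII: keep as is
        elif ch in TO_ASCII:
            pieces.append(TO_ASCII[ch])          # preferred ASCII equivalent
        elif ch in TO_NAME:
            pieces.append(f"\\[{TO_NAME[ch]}]")  # fall back to the \[Name] form
        else:
            raise ValueError(f"no ASCII form known for U+{ord(ch):04X}")
    return "".join(pieces)


print(render_ascii("f \u2146x"))
print(render_ascii("\u03b1 + 1"))
```

Trying the ASCII equivalent first and the `\[Name]` form second is a design choice of this sketch; the opposite order would keep the output reversible at the cost of readability.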