From c123f4dee14312061b65cde7cf71133c50820137 Mon Sep 17 00:00:00 2001
From: Brian Ward
Date: Fri, 27 Aug 2021 12:27:14 -0400
Subject: [PATCH 1/3] Update encoding notices for string literals

---
 src/functions-reference/void_functions.Rmd |  5 ++---
 src/reference-manual/encoding.Rmd          | 12 ++++++++++++
 src/reference-manual/statements.Rmd        | 14 ++++----------
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/functions-reference/void_functions.Rmd b/src/functions-reference/void_functions.Rmd
index 78969b56e..f2a7eadb3 100644
--- a/src/functions-reference/void_functions.Rmd
+++ b/src/functions-reference/void_functions.Rmd
@@ -39,8 +39,7 @@ Print the values denoted by the arguments x1 through xN on the output
 message stream. There are no spaces between items in the print, but a
 line feed (LF; Unicode U+000A; C++ literal `'\n'`) is inserted at the
 end of the printed line. The types `T1` through `TN` can be any of
-Stan's built-in numerical types or double quoted strings of ASCII
-characters.
+Stan's built-in numerical types or double quoted strings of characters.
 
 ## Reject statement
 
@@ -60,5 +59,5 @@ arguments x1 through xN on the output message stream. There are no
 spaces between items in the print, but a line feed (LF; Unicode
 U+000A; C++ literal `'\n'`) is inserted at the end of the printed
 line. The types `T1` through `TN` can be any of Stan's built-in
-numerical types or double quoted strings of ASCII characters.
+numerical types or double quoted strings of characters.
 
diff --git a/src/reference-manual/encoding.Rmd b/src/reference-manual/encoding.Rmd
index bcf2ed855..4828f68dd 100644
--- a/src/reference-manual/encoding.Rmd
+++ b/src/reference-manual/encoding.Rmd
@@ -25,3 +25,15 @@ is convenient.
 Any content after a block comment open sequence in ASCII (`/*`) up
 to the closing block comment (`*/`) is ignored, and thus may
 also be written in whatever character set is convenient.
+
+## String literals
+
+String literals are escaped according to the C++ standard,
+meaning that non-ASCII characters in a `print` or `reject`
+statement should properly be displayed if your terminal supports
+the encoding used in the input. This has been tested with UTF-8
+encoded characters on a compliant terminal, and may not work under
+other conditions.
+
+The recommended character encoding for portable code that should
+display properly on all systems is still ASCII.
\ No newline at end of file
diff --git a/src/reference-manual/statements.Rmd b/src/reference-manual/statements.Rmd
index 56e965fa0..5f2f39418 100644
--- a/src/reference-manual/statements.Rmd
+++ b/src/reference-manual/statements.Rmd
@@ -1321,17 +1321,11 @@ step, and the `generated quantities` block once per iteration.
 
 String literals begin and end with a double quote character
 (`"`). The characters between the double quote characters may be
-the space character or any visible ASCII character, with the exception
-of the backslash character (`\`) and double quote character
-(`"`). The full list of visible ASCII characters is as follows,
-
-```
-a b c d e f g h i j k l m n o p q r s t u v w x y z
-A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
-0 1 2 3 4 5 6 7 8 9 0 { } [ ] ( ) < >
-~ @ # $ ` ^ & * _ ' - + = | / ! ? . , ; :
-```
+any character, with the exception of the double quote character.
+Characters outside the ASCII character set will be escaped and
+passed to C++ as encoded. The behavior of these strings may depend
+on your interface's encoding settings.
 
 ### Debug by `print` {-}
 
From 2f02f0a7effb8e71919fccce3bc95dddf6ceb5d2 Mon Sep 17 00:00:00 2001
From: Brian Ward
Date: Mon, 30 Aug 2021 10:15:41 -0400
Subject: [PATCH 2/3] Be more clear on character/byte distinction

---
 src/functions-reference/void_functions.Rmd |  6 +++---
 src/reference-manual/encoding.Rmd          | 14 +++++++------
 src/reference-manual/statements.Rmd        | 23 +++++++++++-----------
 3 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/src/functions-reference/void_functions.Rmd b/src/functions-reference/void_functions.Rmd
index f2a7eadb3..b55d5e97f 100644
--- a/src/functions-reference/void_functions.Rmd
+++ b/src/functions-reference/void_functions.Rmd
@@ -39,7 +39,8 @@ Print the values denoted by the arguments x1 through xN on the output
 message stream. There are no spaces between items in the print, but a
 line feed (LF; Unicode U+000A; C++ literal `'\n'`) is inserted at the
 end of the printed line. The types `T1` through `TN` can be any of
-Stan's built-in numerical types or double quoted strings of characters.
+Stan's built-in numerical types or double quoted strings of characters
+(bytes).
 
 ## Reject statement
 
@@ -59,5 +60,4 @@ arguments x1 through xN on the output message stream. There are no
 spaces between items in the print, but a line feed (LF; Unicode
 U+000A; C++ literal `'\n'`) is inserted at the end of the printed
 line. The types `T1` through `TN` can be any of Stan's built-in
-numerical types or double quoted strings of characters.
-
+numerical types or double quoted strings of characters (bytes).
diff --git a/src/reference-manual/encoding.Rmd b/src/reference-manual/encoding.Rmd
index 4828f68dd..6dd9d59ae 100644
--- a/src/reference-manual/encoding.Rmd
+++ b/src/reference-manual/encoding.Rmd
@@ -28,12 +28,14 @@ also be written in whatever character set is convenient.
 
 ## String literals
 
-String literals are escaped according to the C++ standard,
-meaning that non-ASCII characters in a `print` or `reject`
-statement should properly be displayed if your terminal supports
-the encoding used in the input. This has been tested with UTF-8
-encoded characters on a compliant terminal, and may not work under
-other conditions.
+String literals are escaped according to the C++ standard.
+In particular, this means that bytes outside of the ASCII character
+range in a `print` or `reject` statement should properly be displayed
+if your terminal supports the encoding used in the input. In other
+words, Stan simply preserves any string of bytes between two double
+quotes (`"`) when passing to C++. On compliant terminals, this allows
+the use of glyphs and other characters from encodings such as UTF-8 that
+fall outside the ASCII-compatible range.
 
 The recommended character encoding for portable code that should
 display properly on all systems is still ASCII.
\ No newline at end of file
diff --git a/src/reference-manual/statements.Rmd b/src/reference-manual/statements.Rmd
index 5f2f39418..81b44c2b9 100644
--- a/src/reference-manual/statements.Rmd
+++ b/src/reference-manual/statements.Rmd
@@ -211,14 +211,14 @@ statements of the forms listed in the table above.
 The compound form is legal whenever the corresponding long form would
 be legal and it has the same effect.*
 
- operation | compound | unfolded
-:-----------|:------------|:-------------
-addition | `x += y` | `x = x + y`
-subtraction | `x -= y` | `x = x - y`
-multiplication | `x *= y` | `x = x * y`
-division | `x /= y` | `x = x / y`
-elementwise multiplication | `x .*= y` | `x = x .* y`
-elementwise division | `x ./= y` | `x = x ./ y`
+ | operation                  | compound  | unfolded     |
+ | :------------------------- | :-------- | :----------- |
+ | addition                   | `x += y`  | `x = x + y`  |
+ | subtraction                | `x -= y`  | `x = x - y`  |
+ | multiplication             | `x *= y`  | `x = x * y`  |
+ | division                   | `x /= y`  | `x = x / y`  |
+ | elementwise multiplication | `x .*= y` | `x = x .* y` |
+ | elementwise division       | `x ./= y` | `x = x ./ y` |
 
 ## Increment log density {#increment-log-prob.section}
 
@@ -1323,9 +1323,10 @@ String literals begin and end with a double quote character
 (`"`). The characters between the double quote characters may be
 any character, with the exception of the double quote character.
-Characters outside the ASCII character set will be escaped and
-passed to C++ as encoded. The behavior of these strings may depend
-on your interface's encoding settings.
+Bytes with values greater than 127 (outside the ASCII character set)
+appearing in string literals will be escaped and passed to C++.
+The behavior of these strings may depend on your interface's encoding
+settings.
 
 ### Debug by `print` {-}
 
From b00f29c2607b182f84db8823bf829f7fc57ebd51 Mon Sep 17 00:00:00 2001
From: Brian Ward
Date: Tue, 31 Aug 2021 11:20:45 -0400
Subject: [PATCH 3/3] Changes per review

---
 src/reference-manual/encoding.Rmd   | 24 +++++++++++++-----------
 src/reference-manual/statements.Rmd | 13 ++++++++-----
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/src/reference-manual/encoding.Rmd b/src/reference-manual/encoding.Rmd
index 6dd9d59ae..5ae2a6bb9 100644
--- a/src/reference-manual/encoding.Rmd
+++ b/src/reference-manual/encoding.Rmd
@@ -28,14 +28,16 @@ also be written in whatever character set is convenient.
 
 ## String literals
 
-String literals are escaped according to the C++ standard.
-In particular, this means that bytes outside of the ASCII character
-range in a `print` or `reject` statement should properly be displayed
-if your terminal supports the encoding used in the input. In other
-words, Stan simply preserves any string of bytes between two double
-quotes (`"`) when passing to C++. On compliant terminals, this allows
-the use of glyphs and other characters from encodings such as UTF-8 that
-fall outside the ASCII-compatible range.
-
-The recommended character encoding for portable code that should
-display properly on all systems is still ASCII.
\ No newline at end of file
+The raw byte sequence within a string literal is escaped according
+to the C++ standard. In particular, this means that UTF-8 encoded
+strings are supported, however they are not tested for invalid byte
+sequences. A `print` or `reject` statement should properly display
+Unicode characters if your terminal supports the encoding used in the
+input. In other words, Stan simply preserves any string of bytes between
+two double quotes (`"`) when passing to C++. On compliant terminals,
+this allows the use of glyphs and other characters from encodings such as
+UTF-8 that fall outside the ASCII-compatible range.
+
+ASCII is the recommended encoding for maximum portability, because it encodes
+the ASCII characters (Unicode code points 0--127) using the same sequence of
+bytes as the UTF-8 encoding of Unicode and common ISO-8859 extensions of Latin.
\ No newline at end of file
diff --git a/src/reference-manual/statements.Rmd b/src/reference-manual/statements.Rmd
index 81b44c2b9..f31824a9e 100644
--- a/src/reference-manual/statements.Rmd
+++ b/src/reference-manual/statements.Rmd
@@ -1321,12 +1321,15 @@ step, and the `generated quantities` block once per iteration.
 
 String literals begin and end with a double quote character
 (`"`). The characters between the double quote characters may be
-any character, with the exception of the double quote character.
+any byte sequence, with the exception of the double quote character.
+
+The Stan interfaces preserve the byte sequences which they receive.
+The encoding of these byte sequences as characters and their rendering
+as glyphs will be handled by whatever display mechanism is being used to
+monitor Stan's output (e.g., a terminal, a Jupyter notebook, RStudio, etc.).
+Stan does not enforce a character encoding for strings, and no attempt is
+made to validate the bytes as legal ASCII, UTF-8, etc.
-Bytes with values greater than 127 (outside the ASCII character set)
-appearing in string literals will be escaped and passed to C++.
-The behavior of these strings may depend on your interface's encoding
-settings.
 
 ### Debug by `print` {-}