Support octal escape sequences. #100

Open
kaos wants to merge 1 commit into arithy:main from kaos:octal_escape

Conversation

kaos commented Jun 15, 2026

I've also verified this change by looking at the generated parser rule code.

Diff for parser.c when using the octal character class before/after this change:

    PCC_DEBUG(ctx->auxil, PCC_DBG_EVALUATE, "UPPER_LETTER", ctx->level, chunk->pos, ctx->buffer.p + chunk->pos, ctx->buffer.n - chunk->pos);
    ctx->level++;
    {
        int u;
        const size_t n = pcc_get_char_as_utf32(ctx, &u);
        if (n == 0) goto L0000;
        if (!(
-            u == 0x000031 ||
-            u == 0x000030 ||
-            (u >= 0x000031 && u <= 0x000031) ||
-            u == 0x000033 ||
-            u == 0x000032
+            (u >= 0x000041 && u <= 0x00005a)
        )) goto L0000;
        ctx->cur += n;
    }
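To make the effect of the diff easier to see, the two conditions can be read as standalone predicates. This is only an illustrative sketch: the function names `matches_before` and `matches_after` are made up here, and the generated parser inlines the condition rather than calling a helper. The values are consistent with a class such as `[\101-\132]`, but that is an assumption, since the rule source is not shown above.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition from the removed (-) lines: accepts only the ASCII digit
 * characters '0' (0x30) through '3' (0x33), i.e. the digits of the
 * escape were treated as literal characters. */
bool matches_before(int u) {
    assert(u >= 0); /* code points are non-negative */
    return u == 0x000031 ||
           u == 0x000030 ||
           (u >= 0x000031 && u <= 0x000031) ||
           u == 0x000033 ||
           u == 0x000032;
}

/* Condition from the added (+) line: accepts 'A' (0x41) through 'Z' (0x5a). */
bool matches_after(int u) {
    assert(u >= 0);
    return u >= 0x000041 && u <= 0x00005a;
}
```

With these, `matches_before('1')` holds while `matches_before('A')` does not, and the reverse is true for `matches_after`.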

@arithy arithy closed this Jun 15, 2026
@arithy arithy reopened this Jun 15, 2026
arithy commented Jun 15, 2026

Owner

Hi @kaos, thank you very much for your contribution.
I want to accept this PR. However, it fails the test null.d.
So, I would be thankful if you could fix it.
(Initially I misread your implementation and closed the PR once..., sorry.)

kaos commented Jun 16, 2026

Author

@arithy

Hi, certainly!

I thought I ran all the tests before submitting, but I obviously failed to do so 🤦🏽

Looking at the null.d test, there are escape sequences of the form \0123. In the docs you say you support ANSI C escape codes, and the best freely available draft of the standard I could find, https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf, states in section 6.4.4.4, paragraph 7:

Each octal or hexadecimal escape sequence is the longest sequence of characters that can
constitute the escape sequence.

So the previous escape should be parsed into two characters: \012 and 3. Before this change, it was parsed into four: \0, 1, 2 and 3.
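The longest-match rule can be sketched as a small decoder. This is a minimal illustration, not PackCC's actual implementation: `decode_octal_escape` is a hypothetical helper, and it assumes the ISO C grammar in which an octal escape holds one to three octal digits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not taken from PackCC): decodes one octal escape
 * at the start of s. Consumes the backslash plus as many octal digits as
 * are present, up to three. Stores the value in *out and returns the
 * number of bytes consumed, or 0 if s does not start with an octal escape. */
size_t decode_octal_escape(const char *s, unsigned *out) {
    assert(s != NULL && out != NULL);
    if (s[0] != '\\') return 0;
    unsigned value = 0;
    size_t i = 1;
    while (i <= 3 && s[i] >= '0' && s[i] <= '7') {
        value = value * 8 + (unsigned)(s[i] - '0');
        i++;
    }
    if (i == 1) return 0; /* no octal digit followed the backslash */
    *out = value;
    return i;
}
```

Applied to the five characters `\0123`, this consumes `\012` (value 10, a line feed) and leaves the `3` to be read as an ordinary character, which is the two-character reading described above.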

It would seem that this is a breaking change, but one that follows the ISO C specification. Not sure how you would prefer to resolve this?

(I'll take this opportunity to note that PackCC also seems to enforce the use of two hex digits for \x escape sequences, which made me raise an eyebrow too, since \x5 is a valid escape according to ISO C.)
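For comparison, here is a minimal sketch of the ISO C reading of `\x` escapes, where the escape takes as many hex digits as follow. `decode_hex_escape` is a hypothetical helper written for this comment, not PackCC code, and it does no range check on the decoded value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not taken from PackCC): decodes "\x" followed by
 * one or more hex digits, consuming every hex digit present (ISO C style).
 * Stores the value in *out and returns the number of bytes consumed,
 * or 0 if s does not start with a hex escape. No overflow check is done. */
size_t decode_hex_escape(const char *s, unsigned long *out) {
    assert(s != NULL && out != NULL);
    if (s[0] != '\\' || s[1] != 'x') return 0;
    unsigned long value = 0;
    size_t i = 2;
    while (isxdigit((unsigned char)s[i])) {
        int c = tolower((unsigned char)s[i]);
        value = value * 16 + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
        i++;
    }
    if (i == 2) return 0; /* "\x" with no hex digit is not a valid escape */
    *out = value;
    return i;
}
```

Under this reading `\x5` decodes to 5, while `\x00123` would be a single escape with value 0x123. With the two-digit form that PackCC seems to use, `\x00123` is instead a NUL followed by the characters `123`.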

This diff to the tests makes them pass again (by adding an x0 at the affected places):

--- a/tests/null.d/input.peg
+++ b/tests/null.d/input.peg
@@ -6,8 +6,8 @@ CHAR_CLASS_0 <- "char_class_0_a:" [abc\0-!123]+ { printf("CHAR_CLASS_0_A\n"); }
 CHAR_CLASS_1 <- "char_class_1_a:" [abc\0]+ { printf("CHAR_CLASS_1_A\n"); } / "char_class_1_b:" [abc\x00]+ { printf("CHAR_CLASS_1_B\n"); }
 CHAR_CLASS_2 <- "char_class_2_a:" [\0-!]+ { printf("CHAR_CLASS_2_A\n"); } / "char_class_2_b:" [\x00-!]+ { printf("CHAR_CLASS_2_B\n"); }
 CHAR_CLASS_3 <- "char_class_3_a:" [\0]+ { printf("CHAR_CLASS_3_A\n"); } / "char_class_3_b:" [\x00]+ { printf("CHAR_CLASS_3_B\n"); }
-STRING_0 <- "string_0_a:" "abc\0123" { printf("STRING_0_A\n"); } / "string_0_b:" "abc\x00123" { printf("STRING_0_B\n"); }
+STRING_0 <- "string_0_a:" "abc\x00123" { printf("STRING_0_A\n"); } / "string_0_b:" "abc\x00123" { printf("STRING_0_B\n"); }
 STRING_1 <- "string_1_a:" "abc\0" { printf("STRING_1_A\n"); } / "string_1_b:" "abc\x00" { printf("STRING_1_B\n"); }
-STRING_2 <- "string_2_a:" "\0123" { printf("STRING_2_A\n"); } / "string_2_b:" "\x00123" { printf("STRING_2_B\n"); }
+STRING_2 <- "string_2_a:" "\x00123" { printf("STRING_2_A\n"); } / "string_2_b:" "\x00123" { printf("STRING_2_B\n"); }
 STRING_3 <- "string_3_a:" "\0" { printf("STRING_3_A\n"); } / "string_3_b:" "\x00" { printf("STRING_3_B\n"); }
-CAPTURED <- "captured_a:" < CHAR_CLASS_0 "xyz\0123" > "|" $1 { printf("CAPTURED_A\n"); } / "captured_b:" < CHAR_CLASS_0 "xyz\x00123" > "|" $2 { printf("CAPTURED_B\n"); }
+CAPTURED <- "captured_a:" < CHAR_CLASS_0 "xyz\x00123" > "|" $1 { printf("CAPTURED_A\n"); } / "captured_b:" < CHAR_CLASS_0 "xyz\x00123" > "|" $2 { printf("CAPTURED_B\n"); }
