Skip to content

Ambiguity in ref grammar #6342

@geoffromer

Description

@geoffromer

Summary of issue:

We currently use the ref token as an introducer for three language constructs with very different semantics:

  1. ref binding patterns, as in (ref x: i32, ref y: i32).
  2. Declaring reference components of a function return form, such as fn F() -> (ref X, ref Y);
  3. Marking arguments to ref parameters, such as F(ref x, ref y).

These aren't ambiguous in an absolute sense, but disambiguating them seems to require 3 tokens of lookahead (vs. the current maximum of 2), and may require unbounded lookahead under arguably-plausible future evolution.

Details:

We currently use the ref token as an introducer for three semantically different language constructs:

  1. A ref binding pattern, e.g. ref x: i32. This can only occur in a pattern context.
  2. A ref return form, e.g. -> (i32, ref i32). Currently, this can only appear in the right-hand side of -> in a function declaration, but Generic across forms #5389 proposes to also allow it in a form literal (e.g. form(ref i32)). As of this writing, the proposal doesn't allow it to appear below the top level of a form literal (e.g. form((i32, ref i32)) isn't allowed), but I'm not sure how durable that restriction will be.
  3. A ref argument tag, e.g. F(x, ref y). At present, it can only appear as an element of a function call's explicit argument list (or an element of a tuple or struct literal in such a position, recursively, as with (2)), and for semantic reasons it currently can't appear in a valid return type expression, but at the 10-30 open discussion @zygoloid argued that we shouldn't bake either of those assumptions into our implementation strategy, and should be prepared to support code like fn I() -> (ref a, ref b) * 3;, where both refs are argument tags for the * operator.

(1) can be distinguished from (2) using syntactic context, because pattern contexts and return type expressions are always marked by an introducer (if that's not the case, we may have an even bigger problem with var). I believe (1) can be distinguished from (3) using at most 3 tokens of lookahead, where the worst case is distinguishing e.g. the binding pattern ref each x: i32 from the expression pattern ref each x. I don't think that's a problem for the toolchain, but it could be a problem for tools implemented using a parser generator with one token of lookahead (such as yacc and its descendants).

As best I can tell, (2) and (3) are not formally ambiguous, but if we want code like fn I() -> (ref a, ref b) * 3; to parse, I don't think they can be distinguished with any fixed amount of lookahead. For example, disambiguating code like fn F() -> (((ref x,),),) * 2 requires lookahead proportional to the number of parentheses.

I see a couple of options that only resolve the (1)/(3) ambiguity:

  • Accept 3 tokens of lookahead in the toolchain, and hope that yacc-based tools can muddle through by tweaking the structure of the grammar.
  • Lex an each-name "each identifier" as a single token, so that we only need 2 tokens of lookahead.
  • Require expression patterns to have an introducer (perhaps == as a unary prefix?). This would probably have substantial knock-on benefits for the implementation of the toolchain (and probably other tools).

I also see a couple of options that only resolve the (2)/(3) ambiguity:

  • Disallow (3) in return type expressions, and/or in the operands of an operator. These are both the status quo for the language design, so we would just be affirming that the toolchain implementation can proceed on the assumption that one or both of them will not change.
  • Accept the ambiguity. For the toolchain, that would mean representing (2) and (3) with the same kind of parse node, and disambiguating during checking. This would probably require some kind of two-phase checking for function return type expressions (in effect, unbounded lookahead, but during checking rather than parsing). It's possible we could rationalize this as part of the two-pass checking that we already do for patterns, since the return type expression behaves like a parameter pattern in some respects. I'm far from confident in that, though; the division of labor between the two passes is very different in the two cases.

Finally, I see one option that resolves both ambiguities simultaneously: change (3) to use a new keyword. My tentative suggestion would be out, e.g. F(x, out y). mut is more appealing in some ways, but it would be misleading in cases where the corresponding parameter has a const type, so I think it would only be viable if we don't require reference-to-const arguments to be tagged. obj or even addr might also be possibilities.

Any other information that you want to share?

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    blocking workIndicates someone's work is currently blocked on the issue being resolved.leads questionA question for the leads team

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions