-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Description
Summary of issue:
We currently use the ref token as an introducer for three language constructs with very different semantics:
refbinding patterns, as in(ref x: i32, ref y: i32).- Declaring reference components of a function return form, such as
fn F() -> (ref X, ref Y); - Marking arguments to
refparameters, such asF(ref x, ref y).
These aren't ambiguous in an absolute sense, but disambiguating them seems to require 3 tokens of lookahead (vs. the current maximum of 2), and may require unbounded lookahead under arguably-plausible future evolution.
Details:
We currently use the ref token as an introducer for three semantically different language constructs:
- A
refbinding pattern, e.g.ref x: i32. This can only occur in a pattern context. - A
refreturn form, e.g.-> (i32, ref i32). Currently, this can only appear in the right-hand side of->in a function declaration, but Generic across forms #5389 proposes to also allow it in a form literal (e.g.form(ref i32)). As of this writing, the proposal doesn't allow it to appear below the top level of a form literal (e.g.form((i32, ref i32))isn't allowed), but I'm not sure how durable that restriction will be. - A
refargument tag, e.g.F(x, ref y). At present, it can only appear as an element of a function call's explicit argument list (or an element of a tuple or struct literal in such a position, recursively, as with (2)), and for semantic reasons it currently can't appear in a valid return type expression, but at the 10-30 open discussion @zygoloid argued that we shouldn't bake either of those assumptions into our implementation strategy, and should be prepared to support code likefn I() -> (ref a, ref b) * 3;, where bothrefs are argument tags for the*operator.
(1) can be distinguished from (2) using syntactic context, because pattern contexts and return type expressions are always marked by an introducer (if that's not the case, we may have an even bigger problem with var). I believe (1) can be distinguished from (3) using at most 3 tokens of lookahead, where the worst case is distinguishing e.g. the binding pattern ref each x: i32 from the expression pattern ref each x. I don't think that's a problem for the toolchain, but it could be a problem for tools implemented using a parser generator with one token of lookahead (such as yacc and its descendants).
As best I can tell, (2) and (3) are not formally ambiguous, but if we want code like fn I() -> (ref a, ref b) * 3; to parse, I don't think they can be distinguished with any fixed amount of lookahead. For example, disambiguating code like fn F() -> (((ref x,),),) * 2 requires lookahead proportional to the number of parentheses.
I see a couple of options that only resolve the (1)/(3) ambiguity:
- Accept 3 tokens of lookahead in the toolchain, and hope that yacc-based tools can muddle through by tweaking the structure of the grammar.
- Lex an each-name "
eachidentifier" as a single token, so that we only need 2 tokens of lookahead. - Require expression patterns to have an introducer (perhaps
==as a unary prefix?). This would probably have substantial knock-on benefits for the implementation of the toolchain (and probably other tools).
I also see a couple of options that only resolve the (2)/(3) ambiguity:
- Disallow (3) in return type expressions, and/or in the operands of an operator. These are both the status quo for the language design, so we would just be affirming that the toolchain implementation can proceed on the assumption that one or both of them will not change.
- Accept the ambiguity. For the toolchain, that would mean representing (2) and (3) with the same kind of parse node, and disambiguating during checking. This would probably require some kind of two-phase checking for function return type expressions (in effect, unbounded lookahead, but during checking rather than parsing). It's possible we could rationalize this as part of the two-pass checking that we already do for patterns, since the return type expression behaves like a parameter pattern in some respects. I'm far from confident in that, though; the division of labor between the two passes is very different in the two cases.
Finally, I see one option that resolves both ambiguities simultaneously: change (3) to use a new keyword. My tentative suggestion would be out, e.g. F(x, out y). mut is more appealing in some ways, but it would be misleading in cases where the corresponding parameter has a const type, so I think it would only be viable if we don't require reference-to-const arguments to be tagged. obj or even addr might also be possibilities.
Any other information that you want to share?
No response