62 changes: 35 additions & 27 deletions docs/design/features/OnStackReplacement.md
@@ -237,15 +237,23 @@ pseudocode:
```
Patchpoint: // each assigned a dense set of IDs

if (++counter[ppID] > threshold) call PatchpointHelper(ppID)
if (++counter[ppID] > threshold)
{
var continuation = PatchpointHelper(ppID);
jmp continuation;
}
```
The helper can use the return address to determine which patchpoint is making
the request. To keep overheads manageable, we might instead want to down-count
and pass the counter address to the helper.
The helper can use the return address to determine which patchpoint is making the request.
The return address is also used in case we should continue without transitioning into an OSR method.
To keep overheads manageable, we might instead want to down-count and pass the counter address to the helper.
```
Patchpoint: // each assigned a dense set of IDs

if (--counter[ppID] <= 0) call PatchpointHelper(ppID, &counter[ppID])
if (--counter[ppID] <= 0)
{
var continuation = PatchpointHelper(ppID, &counter[ppID]);
jmp continuation;
}
```
The helper logic would be similar to the following:
```
@@ -259,20 +267,20 @@ PatchpointHelper(int ppID, int* counter)
case Unknown:
*counter = initialThreshold;
SetState(s, Active);
return;
return patchpointSite + <size of jmp>;

case Active:
*counter = checkThreshold;
SetState(s, Pending);
RequestAlternative(ppID);
return;
return patchpointSite + <size of jmp>;

case Pending:
*counter = checkThreshold;
return;
return patchpointSite + <size of jmp>;

case Ready:
Transition(...); // does not return
return <address of alternative>;
}
}
```
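The helper's state machine can be modeled in C++ roughly as follows. This is a minimal sketch, not the runtime's implementation: the `Patchpoint` struct, its fields, and the threshold values are hypothetical names chosen for illustration, and addresses are plain integers. It only shows that every path now returns a continuation address, either the Tier0 resume point or the entry of the alternative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the per-patchpoint state from the pseudocode above.
enum class PatchpointState { Unknown, Active, Pending, Ready };

struct Patchpoint
{
    PatchpointState state = PatchpointState::Unknown;
    std::uintptr_t resumeAddress = 0;  // Tier0 address following the helper call and jmp
    std::uintptr_t osrEntry = 0;       // entry of the alternative, valid once Ready
    bool alternativeRequested = false; // stands in for RequestAlternative(ppID)
};

// Assumed values; the source does not specify the thresholds.
constexpr int kInitialThreshold = 1000;
constexpr int kCheckThreshold = 100;

// Returns the address the Tier0 code should jump to.
std::uintptr_t PatchpointHelper(Patchpoint& pp, int* counter)
{
    switch (pp.state)
    {
        case PatchpointState::Unknown:
            *counter = kInitialThreshold;
            pp.state = PatchpointState::Active;
            return pp.resumeAddress;

        case PatchpointState::Active:
            *counter = kCheckThreshold;
            pp.state = PatchpointState::Pending;
            pp.alternativeRequested = true;
            return pp.resumeAddress;

        case PatchpointState::Pending:
            *counter = kCheckThreshold;
            return pp.resumeAddress;

        case PatchpointState::Ready:
            return pp.osrEntry; // transition: continue in the OSR method
    }
    return pp.resumeAddress; // unreachable; keeps compilers quiet
}
```

In this model the caller always performs the same `jmp` on the returned value, so the Tier0 code does not need to distinguish "no transition" from "transition".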
@@ -477,15 +485,13 @@ this is to just leave the original frame in place, and have the OSR frame

#### 3.4.1 Transition Implementation

The original method conditionally calls to the patchpoint helper at
patchpoints. The helper will return if there is no transition.
The original method conditionally calls the patchpoint helper at patchpoints.
The helper returns a continuation address.
If transition is desired, this is the address of the alternative version.
Otherwise, it is the address in the tier0 code that follows the patchpoint helper call and jump instruction.

For a transition, the helper will capture context and virtually unwind itself
and the original method from the stack to recover callee-save register values
live into the original method and then restore the callee FP and SP values into
the context (preserving the original method frame); then set the context IP to
the OSR method entry and restore context. OSR method will incorporate the
original method frame as part of its frame.
After transitioning, the OSR method will incorporate the original method frame as part of its frame.
This incorporation is slightly different between x64 and other targets. See below for more details.

## 4 Complications

@@ -658,23 +664,24 @@ prolog and duplicates its saves, and then a subsequent "shrink wrapped" prolog

#### Implementation

Callee-saves are currently handled slightly differently on x64
than they are on arm64:
* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame. The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses. The OSR method then restores this entire set on exit, with a single stack pointer adjustment. See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
* for arm64, the virtual unwind done by the runtime restores the Tier0 callee saves, so the OSR method saves and restores the full set of callee saves it uses, and then does a second stack pointer adjustment to pop the Tier0 frame.
Eventually we will revise arm64 to behave more like x64.
* float callee-saves are handled separately for tier0 and OSR methods; there is opportunity here to also share save space as we do for x64 integer registers,
but this might also lead to needlessly large tier0 frames.
Callee-saves are currently handled differently on x64 than they are on other targets:
* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame.
The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses.
The OSR method then restores this entire set on exit, with a single stack pointer adjustment.
See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
* for other targets the OSR method first restores the full set of callee saves saved by the tier0 version.
Its used callee saves are then saved and restored from the OSR part of the stack frame, in the same way as any normal prolog.
* For x64 we disallow the use of float callee-saves in the tier0 method.
This avoids the need for special restore logic for float callee saves in the OSR method.
For other platforms the handling of float callee saves falls out naturally together with the integer register handling.
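
The integer callee-save rules above can be expressed as set arithmetic on register masks. The following is a sketch under stated assumptions, not JIT code: `RegMask`, `OsrSavePlan`, `PlanX64`, and `PlanOther` are hypothetical names, and each bit stands for one integer callee-saved register.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding: one bit per integer callee-saved register.
using RegMask = std::uint32_t;

struct OsrSavePlan
{
    RegMask prologRestores; // reloaded from the Tier0 frame by the OSR prolog
    RegMask prologSaves;    // saved by the OSR prolog
    RegMask epilogRestores; // restored by the OSR epilog
};

// x64: Tier0 and OSR share the save area pre-reserved in the Tier0 frame.
// The OSR method saves only what Tier0 did not, and restores the whole set on exit.
OsrSavePlan PlanX64(RegMask tier0Saved, RegMask osrUsed)
{
    return {0, osrUsed & ~tier0Saved, tier0Saved | osrUsed};
}

// Other targets: the OSR prolog first restores everything Tier0 saved, then
// saves and restores its own used set in the OSR part of the frame as usual.
OsrSavePlan PlanOther(RegMask tier0Saved, RegMask osrUsed)
{
    return {tier0Saved, osrUsed, osrUsed};
}
```

For example, if Tier0 saved registers `0x3` and the OSR method uses `0x6`, the x64 plan saves only `0x4` in the prolog and restores `0x7` in the epilog, while the plan for other targets restores `0x3` up front and then treats `0x6` like any normal prolog would.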

You might think the runtime helper would need to carefully save all the register state
on entry, but that's not the case. Because the original method is un-optimized,
there isn't any live IL state in registers across the call to the patchpoint
helper&mdash;all the live IL state for the method is on the original
frame&mdash;so the argument and caller-save registers are dead at the
patchpoint. Thus the only part of the register state that is significant for ongoing
computation is the callee-saves, which are recovered via virtual unwind, and the
frame and stack pointers of the original method, which are likewise recovered by
virtual unwind.
computation is the callee-saves and frame and stack pointers.

If we were to support patchpoints in optimized code things would be more
complicated.
@@ -803,6 +810,7 @@ G_M6138_IG04: ;; bbWeight=0.01
488D4DF0 lea rcx, bword ptr [rbp-10H] // &patchpointCounter
BA06000000 mov edx, 6 // ilOffset
E808CA465F call CORINFO_HELP_PATCHPOINT
jmp rax

G_M6138_IG05:
8B45FC mov eax, dword ptr [rbp-04H]
20 changes: 13 additions & 7 deletions docs/design/features/OsrDetailsAndDebugging.md
@@ -281,7 +281,15 @@ But often much of the Tier0 frame is effectively dead after the transition and e

The OSR prolog is conceptually similar to a normal method prolog, with a few key differences.

When an OSR method is entered, all callee-save registers have the values they had when the Tier0 method was called, but the values in argument registers are unknown (and almost certainly not the args passed to the Tier0 method). The OSR method must initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame. This happens in `genEnregisterOSRArgsAndLocals`.
An OSR method is entered via a jump from the tier0 method.
This means callee save registers used by the tier0 method may require special handling:
- On x64, the OSR method keeps the original values of the callee saves in the tier0 frame.
They will be restored directly by the epilog, meaning that no instructions are needed.
- For other targets the callee saves used by tier0 are restored in the prolog, and they are then saved again in the OSR frame as normal.
The above happens in `genOSRHandleTier0CalleeSavedRegistersAndFrame`.

The OSR method must also initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame.
This happens in `genEnregisterOSRArgsAndLocals`.

If the OSR method needs to report a generics context it uses the Tier0 frame slot; we ensure this is possible by forcing a Tier0 method with patchpoints to always report its generics context.

@@ -309,7 +317,7 @@ OSR funclets are more or less normal funclets.

#### OSR Unwind Info

On x64 the prolog unwind includes a phantom SP adjustment at offset 0 for the Tier0 frame.
The prolog unwind includes a phantom SP adjustment at offset 0 for the Tier0 frame.

As noted above the two SP adjusts in the x64 epilog are currently causing problems if we try and unwind in the epilog. Unwinding in the prolog and method body seems to work correctly; the unwind codes properly describe what needs to be done.

@@ -323,12 +331,11 @@ OSR GC info is standard. The only unusual aspect is that some special offsets (g

### Execution of an OSR Method

OSR methods are never called directly; they can only be invoked by `CORINFO_HELP_PATCHPOINT` when called from a Tier0 method with patchpoints.
OSR methods are never called directly; they can only be invoked by jump from a Tier0 method with patchpoints.

On x64, to preserve proper stack alignment, the runtime helper will "push" a phantom return address on the stack (x64 methods assume SP is aligned 8 mod 16 on entry). This is not necessary on arm64 as calls do not push to the stack.
On x64, to preserve proper stack alignment, the prolog will "push" a phantom return address on the stack (x64 methods assume SP is aligned 8 mod 16 on entry). This is not necessary on arm64 as calls do not push to the stack.
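
The alignment arithmetic behind the phantom return address can be checked with a small sketch. It assumes that SP is 16-byte aligned in the Tier0 method body at the point of the jump, as the x64 calling conventions require at call sites; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kSlotSize = 8; // size of a return address on x64

// x64 methods expect SP to be 8 mod 16 on entry, because the caller's
// 16-byte-aligned SP is decremented by the pushed return address.
bool IsAlignedForMethodEntry(std::uint64_t sp)
{
    return sp % 16 == 8;
}

// A jump pushes nothing, so the OSR prolog accounts for one phantom slot.
std::uint64_t SpAfterPhantomReturnAddress(std::uint64_t sp)
{
    return sp - kSlotSize;
}
```

With a Tier0 body SP of `0x7ff0`, the value is 0 mod 16 and so does not match the entry expectation; after the phantom slot it becomes `0x7fe8`, which is 8 mod 16.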

When the OSR method returns, it cleans up both its own stack and the
Tier0 method stack.
When the OSR method returns, it cleans up both its own stack and the Tier0 method stack.

Note if a Tier0 method is recursive and has loops there can be some interesting dynamics. After a sufficient amount of looping an OSR method will be created, and the currently active Tier0 instance will transition to the OSR method. When the OSR method makes a recursive call, it will invoke the Tier0 method, which will then fairly quickly transition to the OSR version just created.

@@ -474,7 +481,6 @@ to spend considerable time in OSR methods (e.g., the all-in-`Main` benchmark).

Generally speaking the performance of an OSR method should be comparable to the equivalent Tier1 method. In practice we see variations of +/- 20% or so. There are a number of reasons for this:
* OSR methods are often a subset of the full Tier1 method, and in many cases just comprise one loop. The JIT can often generate much better code for a single loop in isolation than a single loop in a more complex method.
* A few optimizations are disabled in OSR methods, notably struct promotion.
* OSR methods may only see fractional PGO data (as parts of the Tier0 method may not have executed yet). The JIT doesn't cope very well yet with this sort of partial PGO coverage.

### Impact on BenchmarkDotNet Results
10 changes: 2 additions & 8 deletions src/coreclr/jit/emitxarch.cpp
@@ -9108,10 +9108,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, BasicBlock* dst, regNu
emitTotalIGjmps++;
#endif

// Set the relocation flags - these give hint to zap to perform
// relocation of the specified 32bit address.
//
// Note the relocation flags influence the size estimate.
// Set reloc flags for AOT purposes. This also affects emitInsSizeAM below.
id->idSetRelocFlags(attr);

UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));
@@ -9170,10 +9167,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, insGroup* dst, regNumb
emitTotalIGjmps++;
#endif

// Set the relocation flags - these give hint to zap to perform
// relocation of the specified 32bit address.
//
// Note the relocation flags influence the size estimate.
// Set reloc flags for AOT purposes. This also affects emitInsSizeAM below.
id->idSetRelocFlags(attr);

UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));