From f3e4d29b7e696ba20b9f5ad0eddea26e6d2078cc Mon Sep 17 00:00:00 2001
From: Jakob Botsch Nielsen
Date: Fri, 24 Apr 2026 14:31:56 +0200
Subject: [PATCH 1/4] Update OSR docs

---
 docs/design/features/OnStackReplacement.md    | 57 ++++++++++---------
 .../design/features/OsrDetailsAndDebugging.md | 17 +++---
 src/coreclr/jit/emitxarch.cpp                 | 10 +---
 3 files changed, 43 insertions(+), 41 deletions(-)

diff --git a/docs/design/features/OnStackReplacement.md b/docs/design/features/OnStackReplacement.md
index db202a22666e40..853d6b5722231a 100644
--- a/docs/design/features/OnStackReplacement.md
+++ b/docs/design/features/OnStackReplacement.md
@@ -237,15 +237,23 @@ pseudocode:
 ```
 Patchpoint: // each assigned a dense set of IDs

-    if (++counter[ppID] > threshold) call PatchpointHelper(ppID)
+    if (++counter[ppID] > threshold)
+    {
+        var continuation = PatchpointHelper(ppID);
+        jmp continuation;
+    }
 ```
-The helper can use the return address to determine which patchpoint is making
-the request. To keep overheads manageable, we might instead want to down-count
-and pass the counter address to the helper.
+The helper can use the return address to determine which patchpoint is making the request.
+The return address is also used in case we should continue without transitioning into an OSR method.
+To keep overheads manageable, we might instead want to down-count and pass the counter address to the helper.
 ```
 Patchpoint: // each assigned a dense set of IDs

-    if (--counter[ppID] <= 0) call PatchpointHelper(ppID, &counter[ppID])
+    if (--counter[ppID] <= 0)
+    {
+        var continuation = PatchpointHelper(ppID, &counter[ppID]);
+        jmp continuation;
+    }
 ```
 The helper logic would be similar to the following:
 ```
@@ -259,20 +267,20 @@ PatchpointHelper(int ppID, int* counter)
         case Unknown:
             *counter = initialThreshold;
             SetState(s, Active);
-            return;
+            return patchpointSite + ;

         case Active:
             *counter = checkThreshold;
             SetState(s, Pending);
             RequestAlternative(ppID);
-            return;
+            return patchpointSite + ;

         case Pending:
             *counter = checkThreshold;
-            return;
+            return patchpointSite + ;

         case Ready:
-            Transition(...); // does not return
+            return ;
     }
 }
 ```
@@ -477,15 +485,13 @@ this is to just leave the original frame in place, and have the OSR frame

 #### 3.4.1 Transition Implementation

-The original method conditionally calls to the patchpoint helper at
-patchpoints. The helper will return if there is no transition.
+The original method conditionally calls to the patchpoint helper at patchpoints.
+The helper returns a continuation address.
+If transition is desired, this is the address of the alternative version.
+Otherwise, it is the address in the tier0 code that follows the patchpoint helper call and jump instruction after.

-For a transition, the helper will capture context and virtually unwind itself
-and the original method from the stack to recover callee-save register values
-live into the original method and then restore the callee FP and SP values into
-the context (preserving the original method frame); then set the context IP to
-the OSR method entry and restore context. OSR method will incorporate the
-original method frame as part of its frame.
+After transitioning the OSR method will incorporate the original method frame as part of its frame.
+This incorporation is slightly different between x64 and other targets. See below for more details.

 ## 4 Complications

@@ -659,12 +665,12 @@ prolog and duplicates its saves, and then a subsequent "shrink wrapped" prolog
 #### Implementation

 Callee-saves are currently handled sightly differently on x64
-than it is on arm64:
-* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame. The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses. THe OSR method then restores this entire set on exit, with a single stack pointer adjustment. See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
-* for arm64, the virtual unwind done by the runtime restores the Tier0 callee saves, so the OSR method saves and restores the full set of callee saves it uses, and then does a second stack pointer adjustment to pop the Tier0 frame.
-Eventually we will revise arm64 to behave more like x64.
-* float callee-saves are handled separately for tier0 and OSR methods; there is opportunity here to also share save space as we do for x64 integer registers,
-but this might also lead to needlessly large tier0 frames.
+than it is on other targets:
+* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame. The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses. The OSR method then restores this entire set on exit, with a single stack pointer adjustment. See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
+* for other targets the OSR method first restores the full set of callee saves saved by the tier0 version.
+  Its used callee saves are then saved and restored from the OSR part of the stack frame.
+* For x64 we disallow the use of float callee-saves in the tier0 method.
+  This avoids the need for special restore logic for float callee saves in the OSR method.

 You might think the runtime helper would need to carefully save all the register
 state on entry, but that's not the case. Because the original method is un-optimized,
@@ -672,9 +678,7 @@ there isn't any live IL state in registers across the call to the patchpoint
 helper—all the live IL state for the method is on the original frame—so the
 argument and caller-save registers are dead at the patchpoint.
 Thus only part of register state that is significant for ongoing
-computation is the callee-saves, which are recovered via virtual unwind, and the
-frame and stack pointers of the original method, which are likewise recovered by
-virtual unwind.
+computation is the callee-saves and frame and stack pointers.

 If we were to support patchpoints in optimized code things would be more
 complicated.
@@ -803,6 +807,7 @@ G_M6138_IG04: ;; bbWeight=0.01
        488D4DF0             lea      rcx, bword ptr [rbp-10H] // &patchpointCounter
        BA06000000           mov      edx, 6 // ilOffset
        E808CA465F           call     CORINFO_HELP_PATCHPOINT
+                            jmp      rax

 G_M6138_IG05:
        8B45FC               mov      eax, dword ptr [rbp-04H]
diff --git a/docs/design/features/OsrDetailsAndDebugging.md b/docs/design/features/OsrDetailsAndDebugging.md
index e1080fbc8bd792..64b08060ff4a7a 100644
--- a/docs/design/features/OsrDetailsAndDebugging.md
+++ b/docs/design/features/OsrDetailsAndDebugging.md
@@ -281,7 +281,12 @@ But often much of the Tier0 frame is effectively dead after the transition and e

 The OSR prolog is conceptually similar to a normal method prolog, with a few key difference.

-When an OSR method is entered, all callee-save registers have the values they had when the Tier0 method was called, but the values in argument registers are unknown (and almost certainly not the args passed to the Tier0 method). The OSR method must initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame. This happens in `genEnregisterOSRArgsAndLocals`.
+An OSR method is entered via a jump from the tier0 method. At this point callee save registers used by the tier0 method may require handling:
+- On x64, the OSR method keeps the original values of the callee saves in the tier0 frame. They will be restored directly by the epilog.
+- For other targets the callee saves used by tier0 are restored in the prolog, and they are then saved again in the OSR frame as normal.
+The above happens in `genOSRHandleTier0CalleeSavedRegistersAndFrame`.
+
+The OSR method must also initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame. This happens in `genEnregisterOSRArgsAndLocals`.

 If the OSR method needs to report a generics context it uses the Tier0 frame slot; we ensure this is possible by forcing a Tier0 method with patchpoints to always report its generics context.

@@ -309,7 +314,7 @@ OSR funclets are more or less normal funclets.

 #### OSR Unwind Info

-On x64 the prolog unwind includes a phantom SP adjustment at offset 0 for the Tier0 frame.
+The prolog unwind includes a phantom SP adjustment at offset 0 for the Tier0 frame.

 As noted above the two SP adjusts in the x64 epilog are currently causing problems if we try and unwind in the epilog. Unwinding in the prolog and method body seems to work correctly; the unwind codes properly describe what needs to be done.

@@ -323,12 +328,11 @@ OSR GC info is standard. The only unusual aspect is that some special offsets (g

 ### Execution of an OSR Method

-OSR methods are never called directly; they can only be invoked by `CORINFO_HELP_PATCHPOINT` when called from a Tier0 method with patchpoints.
+OSR methods are never called directly; they can only be invoked by jump from a Tier0 method with patchpoints.

-On x64, to preserve proper stack alignment, the runtime helper will "push" a phantom return address on the stack (x64 methods assume SP is aligned 8 mod 16 on entry). This is not necessary on arm64 as calls do not push to the stack.
+On x64, to preserve proper stack alignment, the prolog will "push" a phantom return address on the stack (x64 methods assume SP is aligned 8 mod 16 on entry). This is not necessary on arm64 as calls do not push to the stack.

-When the OSR method returns, it cleans up both its own stack and the
-Tier0 method stack.
+When the OSR method returns, it cleans up both its own stack and the Tier0 method stack.

 Note if a Tier0 method is recursive and has loops there can be some interesting dynamics.
 After a sufficient amount of looping an OSR method will be created, and the currently active Tier0 instance will transition to the OSR method. When the OSR method makes a recursive call, it will invoke the Tier0 method, which will then fairly quickly transition to the OSR version just created.
@@ -474,7 +478,6 @@ to spend considerable time in OSR methods (e.g., the all-in-`Main` benchmark).

 Generally speaking the performance of an OSR method should be comparable to the equivalent Tier1 method. In practice we see variations of +/- 20% or so. There are a number or reasons for this:
 * OSR methods are often a subset of the full Tier1 method, and in many cases just comprise one loop. The JIT can often generate much better code for a single loop in isolation than a single loop in a more complex method.
-* A few optimizations are disabled in OSR methods, notably struct promotion.
 * OSR methods may only see fractional PGO data (as parts of the Tier0 method may not have executed yet). The JIT doesn't cope very well yet with this sort of partial PGO coverage.

 ### Impact on BenchmarkDotNet Results
diff --git a/src/coreclr/jit/emitxarch.cpp b/src/coreclr/jit/emitxarch.cpp
index c4ddbfd26bc6d9..056ffbd232ad8d 100644
--- a/src/coreclr/jit/emitxarch.cpp
+++ b/src/coreclr/jit/emitxarch.cpp
@@ -9108,10 +9108,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, BasicBlock* dst, regNu
     emitTotalIGjmps++;
 #endif

-    // Set the relocation flags - these give hint to zap to perform
-    // relocation of the specified 32bit address.
-    //
-    // Note the relocation flags influence the size estimate.
+    // Set reloc flags for AOT purposes
     id->idSetRelocFlags(attr);

     UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));
@@ -9170,10 +9167,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, insGroup* dst, regNumb
     emitTotalIGjmps++;
 #endif

-    // Set the relocation flags - these give hint to zap to perform
-    // relocation of the specified 32bit address.
-    //
-    // Note the relocation flags influence the size estimate.
+    // Set reloc flags for AOT purposes
     id->idSetRelocFlags(attr);

     UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));

From 591cf3c90b74e3a11a7a87dd605704dd28617f65 Mon Sep 17 00:00:00 2001
From: Jakob Botsch Nielsen
Date: Fri, 24 Apr 2026 14:38:56 +0200
Subject: [PATCH 2/4] Adjustments

---
 docs/design/features/OnStackReplacement.md     | 13 ++++++++-----
 docs/design/features/OsrDetailsAndDebugging.md | 11 +++++++----
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/docs/design/features/OnStackReplacement.md b/docs/design/features/OnStackReplacement.md
index 853d6b5722231a..ba8e7b8a1e4629 100644
--- a/docs/design/features/OnStackReplacement.md
+++ b/docs/design/features/OnStackReplacement.md
@@ -488,7 +488,7 @@ this is to just leave the original frame in place, and have the OSR frame
 The original method conditionally calls to the patchpoint helper at patchpoints.
 The helper returns a continuation address.
 If transition is desired, this is the address of the alternative version.
-Otherwise, it is the address in the tier0 code that follows the patchpoint helper call and jump instruction after.
+Otherwise, it is the address in the tier0 code that follows the patchpoint helper call and jump instruction.

 After transitioning the OSR method will incorporate the original method frame as part of its frame.
 This incorporation is slightly different between x64 and other targets. See below for more details.
@@ -664,13 +664,16 @@ prolog and duplicates its saves, and then a subsequent "shrink wrapped" prolog

 #### Implementation

-Callee-saves are currently handled sightly differently on x64
-than it is on other targets:
-* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame. The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses. The OSR method then restores this entire set on exit, with a single stack pointer adjustment. See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
+Callee-saves are currently handled sightly differently on x64 than it is on other targets:
+* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame.
+  The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses.
+  The OSR method then restores this entire set on exit, with a single stack pointer adjustment.
+  See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
 * for other targets the OSR method first restores the full set of callee saves saved by the tier0 version.
-  Its used callee saves are then saved and restored from the OSR part of the stack frame.
+  Its used callee saves are then saved and restored from the OSR part of the stack frame, in the same way as any normal prolog.
 * For x64 we disallow the use of float callee-saves in the tier0 method.
   This avoids the need for special restore logic for float callee saves in the OSR method.
+  For other platforms the handling of callee saves falls out naturally from the integer register handling.

 You might think the runtime helper would need to carefully save all the register
 state on entry, but that's not the case. Because the original method is un-optimized,
diff --git a/docs/design/features/OsrDetailsAndDebugging.md b/docs/design/features/OsrDetailsAndDebugging.md
index 64b08060ff4a7a..262a0d435524cd 100644
--- a/docs/design/features/OsrDetailsAndDebugging.md
+++ b/docs/design/features/OsrDetailsAndDebugging.md
@@ -281,12 +281,15 @@ But often much of the Tier0 frame is effectively dead after the transition and e

 The OSR prolog is conceptually similar to a normal method prolog, with a few key difference.

-An OSR method is entered via a jump from the tier0 method. At this point callee save registers used by the tier0 method may require handling:
-- On x64, the OSR method keeps the original values of the callee saves in the tier0 frame. They will be restored directly by the epilog.
+An OSR method is entered via a jump from the tier0 method.
+This means callee save registers used by the tier0 method may require special handling:
+- On x64, the OSR method keeps the original values of the callee saves in the tier0 frame.
+  They will be restored directly by the epilog, meaning that no instructions are needed.
 - For other targets the callee saves used by tier0 are restored in the prolog, and they are then saved again in the OSR frame as normal.
-The above happens in `genOSRHandleTier0CalleeSavedRegistersAndFrame`.
+  The above happens in `genOSRHandleTier0CalleeSavedRegistersAndFrame`.

-The OSR method must also initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame. This happens in `genEnregisterOSRArgsAndLocals`.
+The OSR method must also initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame.
+This happens in `genEnregisterOSRArgsAndLocals`.

 If the OSR method needs to report a generics context it uses the Tier0 frame slot; we ensure this is possible by forcing a Tier0 method with patchpoints to always report its generics context.

From 06def895c09e57f1a85cf54a01cb7c9169765151 Mon Sep 17 00:00:00 2001
From: Jakob Botsch Nielsen
Date: Fri, 24 Apr 2026 14:54:29 +0200
Subject: [PATCH 3/4] Feedback

---
 src/coreclr/jit/emitxarch.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/coreclr/jit/emitxarch.cpp b/src/coreclr/jit/emitxarch.cpp
index 056ffbd232ad8d..efb370d05378bb 100644
--- a/src/coreclr/jit/emitxarch.cpp
+++ b/src/coreclr/jit/emitxarch.cpp
@@ -9108,7 +9108,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, BasicBlock* dst, regNu
     emitTotalIGjmps++;
 #endif

-    // Set reloc flags for AOT purposes
+    // Set reloc flags for AOT purposes. This also affects emitInsSizeAM below.
     id->idSetRelocFlags(attr);

     UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));
@@ -9167,7 +9167,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, insGroup* dst, regNumb
     emitTotalIGjmps++;
 #endif

-    // Set reloc flags for AOT purposes
+    // Set reloc flags for AOT purposes. This also affects emitInsSizeAM below.
     id->idSetRelocFlags(attr);

     UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));

From 4d73b2343c4629ff96636a28498758fa85f05416 Mon Sep 17 00:00:00 2001
From: Jakob Botsch Nielsen
Date: Fri, 24 Apr 2026 14:56:13 +0200
Subject: [PATCH 4/4] Feedback

---
 docs/design/features/OnStackReplacement.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/design/features/OnStackReplacement.md b/docs/design/features/OnStackReplacement.md
index ba8e7b8a1e4629..d8b08f4a1edeaf 100644
--- a/docs/design/features/OnStackReplacement.md
+++ b/docs/design/features/OnStackReplacement.md
@@ -664,7 +664,7 @@ prolog and duplicates its saves, and then a subsequent "shrink wrapped" prolog

 #### Implementation

-Callee-saves are currently handled sightly differently on x64 than it is on other targets:
+Callee-saves are currently handled differently on x64 than it is on other targets:
 * on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame.
   The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses.
   The OSR method then restores this entire set on exit, with a single stack pointer adjustment.
@@ -673,7 +673,7 @@ Callee-saves are currently handled sightly differently on x64 than it is on othe
   Its used callee saves are then saved and restored from the OSR part of the stack frame, in the same way as any normal prolog.
 * For x64 we disallow the use of float callee-saves in the tier0 method.
   This avoids the need for special restore logic for float callee saves in the OSR method.
-  For other platforms the handling of callee saves falls out naturally from the integer register handling.
+  For other platforms the handling of callee saves falls out naturally together with the integer register handling.

 You might think the runtime helper would need to carefully save all the register
 state on entry, but that's not the case. Because the original method is un-optimized,
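
Reviewer sketch, not part of the series: the continuation-returning helper described in the updated pseudocode of patch 1 can be modeled in a few lines of C++. This is a minimal sketch under stated assumptions. `Patchpoint`, `HitPatchpoint`, the two threshold constants, and the numeric addresses are hypothetical stand-ins, and returning an integer stands in for the tier0 code's `jmp` to the address the helper returns. It is not the runtime's implementation of `CORINFO_HELP_PATCHPOINT`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the per-patchpoint state from the doc's pseudocode.
enum class PPState { Unknown, Active, Pending, Ready };

struct Patchpoint {
    PPState        state = PPState::Unknown;
    int            counter = 0;        // down-counter decremented by tier0 code
    std::uintptr_t resumeAddress = 0;  // tier0 address just past the call and jmp
    std::uintptr_t osrEntry = 0;       // OSR method entry, once one exists
};

// Assumed values, picked only to keep the example short.
constexpr int kInitialThreshold = 2;
constexpr int kCheckThreshold = 1;

// Returns the continuation address: either the tier0 resume point or,
// once an OSR method is ready, the OSR method entry.
std::uintptr_t PatchpointHelper(Patchpoint& pp) {
    switch (pp.state) {
    case PPState::Unknown:
        pp.counter = kInitialThreshold;
        pp.state = PPState::Active;
        return pp.resumeAddress;
    case PPState::Active:
        pp.counter = kCheckThreshold;
        pp.state = PPState::Pending;  // a real helper would request the OSR jit here
        return pp.resumeAddress;
    case PPState::Pending:
        pp.counter = kCheckThreshold;
        return pp.resumeAddress;
    case PPState::Ready:
        assert(pp.osrEntry != 0);     // Ready implies an OSR method was published
        return pp.osrEntry;
    }
    return pp.resumeAddress;          // not reached; keeps compilers quiet
}

// Models one execution of the patchpoint sequence in tier0 code:
// decrement, conditionally call the helper, then "jump" to the result.
std::uintptr_t HitPatchpoint(Patchpoint& pp) {
    if (--pp.counter <= 0) {
        return PatchpointHelper(pp);
    }
    return pp.resumeAddress;
}
```

Calling `HitPatchpoint` repeatedly on a fresh `Patchpoint` walks it through Unknown, Active, and Pending while always resuming in tier0 code. Once the state is set to Ready, the same call yields the OSR entry address instead, which is the behavior the `jmp rax` after the helper call relies on.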