dotnet · jakobbotsch · Apr 24, 2026
diff --git a/docs/design/features/OnStackReplacement.md b/docs/design/features/OnStackReplacement.md
@@ -237,15 +237,23 @@ pseudocode:
 ```
 Patchpoint:   // each assigned a dense set of IDs
 
-       if (++counter[ppID] > threshold) call PatchpointHelper(ppID)
+       if (++counter[ppID] > threshold)
+       {
+          var continuation = PatchpointHelper(ppID);
+          jmp continuation;
+       }
 ```
-The helper can use the return address to determine which patchpoint is making
-the request. To keep overheads manageable, we might instead want to down-count
-and pass the counter address to the helper.
+The helper can use the return address to determine which patchpoint is making the request.
+The return address is also used in case we should continue without transitioning into an OSR method.
+To keep overheads manageable, we might instead want to down-count and pass the counter address to the helper.
 ```
 Patchpoint:   // each assigned a dense set of IDs
 
-       if (--counter[ppID] <= 0) call PatchpointHelper(ppID, &counter[ppID])
+       if (--counter[ppID] <= 0)
+       {
+           var continuation = PatchpointHelper(ppID, &counter[ppID]);
+           jmp continuation;
+       }
 ```
 The helper logic would be similar to the following:
 ```
@@ -259,20 +267,20 @@ PatchpointHelper(int ppID, int* counter)
       case Unknown:
         *counter = initialThreshold;
         SetState(s, Active);
-        return;
+        return patchpointSite + <size of jmp>;
 
       case Active:
         *counter = checkThreshold;
         SetState(s, Pending);
         RequestAlternative(ppID);
-        return;
+        return patchpointSite + <size of jmp>;
 
       case Pending:
         *counter = checkThreshold;
-        return;
+        return patchpointSite + <size of jmp>;
 
       case Ready:
-         Transition(...); // does not return
+         return <address of alternative>;
      }
 }
 ```
@@ -477,15 +485,13 @@ this is to just leave the original frame in place, and have the OSR frame
 
 #### 3.4.1 Transition Implementation
 
-The original method conditionally calls to the patchpoint helper at
-patchpoints. The helper will return if there is no transition.
+The original method conditionally calls the patchpoint helper at patchpoints.
+The helper returns a continuation address.
+If a transition is desired, this is the address of the alternative version.
+Otherwise, it is the address in the tier0 code that follows the patchpoint helper call and jump instruction.
 
-For a transition, the helper will capture context and virtually unwind itself
-and the original method from the stack to recover callee-save register values
-live into the original method and then restore the callee FP and SP values into
-the context (preserving the original method frame); then set the context IP to
-the OSR method entry and restore context. OSR method will incorporate the
-original method frame as part of its frame.
+After transitioning, the OSR method will incorporate the original method frame as part of its frame.
+This incorporation is slightly different between x64 and other targets. See below for more details.
 
 ## 4 Complications
 
@@ -658,23 +664,24 @@ prolog and duplicates its saves, and then a subsequent "shrink wrapped" prolog
 
 #### Implementation
 
-Callee-saves are currently handled sightly differently on x64
-than it is on arm64:
-* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame. The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses. THe OSR method then restores this entire set on exit, with a single stack pointer adjustment. See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
-* for arm64, the virtual unwind done by the runtime restores the Tier0 callee saves, so the OSR method saves and restores the full set of callee saves it uses, and then does a second stack pointer adjustment to pop the Tier0 frame.
-Eventually we will revise arm64 to behave more like x64.
-* float callee-saves are handled separately for tier0 and OSR methods; there is opportunity here to also share save space as we do for x64 integer registers,
-but this might also lead to needlessly large tier0 frames.
+Callee-saves are currently handled differently on x64 than they are on other targets:
+* on x64, all the integer callee saves are saved in space pre-reserved in the Tier0 frame.
+  The Tier0 method saves whatever subset it uses, and the OSR method saves any additional callee saves it uses.
+  The OSR method then restores this entire set on exit, with a single stack pointer adjustment.
+  See [OSR x64 Epilog Redesign](https://github.com/dotnet/runtime/blob/main/docs/design/features/OSRX64EpilogRedesign.md) and the pull request [revise approach for x64 OSR epilogs](https://github.com/dotnet/runtime/pull/65609) for details.
+* for other targets the OSR method first restores the full set of callee saves saved by the tier0 version.
+  Its used callee saves are then saved and restored from the OSR part of the stack frame, in the same way as any normal prolog.
+* For x64 we disallow the use of float callee-saves in the tier0 method.
+  This avoids the need for special restore logic for float callee saves in the OSR method.
+  For other platforms the handling of float callee saves falls out naturally together with the integer register handling.
 
 You might think the runtime helper would need to carefully save all the register state
 on entry, but that's not the case. Because the original method is un-optimized,
 there isn't any live IL state in registers across the call to the patchpoint
 helper&mdash;all the live IL state for the method is on the original
 frame&mdash;so the argument and caller-save registers are dead at the
 patchpoint. Thus only part of register state that is significant for ongoing
-computation is the callee-saves, which are recovered via virtual unwind, and the
-frame and stack pointers of the original method, which are likewise recovered by
-virtual unwind.
+computation is the callee-saves and frame and stack pointers.
 
 If we were to support patchpoints in optimized code things would be more
 complicated.
@@ -803,6 +810,7 @@ G_M6138_IG04:           ;; bbWeight=0.01
        488D4DF0             lea      rcx, bword ptr [rbp-10H]    // &patchpointCounter
        BA06000000           mov      edx, 6                      // ilOffset
        E808CA465F           call     CORINFO_HELP_PATCHPOINT
+                            jmp      rax
 
 G_M6138_IG05:
        8B45FC               mov      eax, dword ptr [rbp-04H]
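
The continuation-returning helper described in the pseudocode above can be modeled in a few lines of C++. This is only a sketch for reasoning about the state machine, not the runtime's implementation: `PatchpointInfo`, `kInitialThreshold`, `kCheckThreshold`, and the `alternative` field are hypothetical names, the threshold values are made up, and `patchpointSite` is assumed to be the helper's return address (the address of the `jmp rax` that follows the call). The 2-byte jump size matches the x64 encoding of `jmp rax` (`FF E0`).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the per-patchpoint state from the pseudocode.
enum class State { Unknown, Active, Pending, Ready };

struct PatchpointInfo
{
    State          state       = State::Unknown;
    std::uintptr_t alternative = 0; // entry address of the OSR method, valid once Ready
};

// Illustrative values only; the real thresholds are runtime configuration.
constexpr int            kInitialThreshold = 1000;
constexpr int            kCheckThreshold   = 100;
constexpr std::uintptr_t kJmpSize          = 2; // `jmp rax` is encoded as FF E0

// Returns the address the tier0 code should jump to after the helper call.
// `patchpointSite` is assumed to be the address of the jump that follows the call.
std::uintptr_t PatchpointHelper(PatchpointInfo& info, int* counter, std::uintptr_t patchpointSite)
{
    switch (info.state)
    {
        case State::Unknown:
            *counter   = kInitialThreshold;
            info.state = State::Active;
            return patchpointSite + kJmpSize; // resume tier0 code after the jump

        case State::Active:
            *counter   = kCheckThreshold;
            info.state = State::Pending;
            // A real helper would request the OSR method (RequestAlternative) here.
            return patchpointSite + kJmpSize;

        case State::Pending:
            *counter = kCheckThreshold;
            return patchpointSite + kJmpSize;

        case State::Ready:
            return info.alternative; // transition: jump into the OSR method
    }
    return patchpointSite + kJmpSize; // unreachable; keeps compilers quiet
}
```

In every state except `Ready` the returned address lands just past the jump, so the tier0 code falls through to its normal continuation; only `Ready` redirects control to the OSR method.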

diff --git a/docs/design/features/OsrDetailsAndDebugging.md b/docs/design/features/OsrDetailsAndDebugging.md
@@ -281,7 +281,15 @@ But often much of the Tier0 frame is effectively dead after the transition and e
 
 The OSR prolog is conceptually similar to a normal method prolog, with a few key difference.
 
-When an OSR method is entered, all callee-save registers have the values they had when the Tier0 method was called, but the values in argument registers are unknown (and almost certainly not the args passed to the Tier0 method). The OSR method must initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame. This happens in `genEnregisterOSRArgsAndLocals`.
+An OSR method is entered via a jump from the tier0 method.
+This means callee save registers used by the tier0 method may require special handling:
+- On x64, the OSR method keeps the original values of the callee saves in the tier0 frame.
+  They will be restored directly by the epilog, meaning that no extra instructions are needed in the prolog.
+- For other targets the callee saves used by tier0 are restored in the prolog, and they are then saved again in the OSR frame as normal.
+  The above happens in `genOSRHandleTier0CalleeSavedRegistersAndFrame`.
+
+The OSR method must also initialize any live-in enregistered args or locals from the corresponding slots on the Tier0 frame.
+This happens in `genEnregisterOSRArgsAndLocals`.
 
 If the OSR method needs to report a generics context it uses the Tier0 frame slot; we ensure this is possible by forcing a Tier0 method with patchpoints to always report its generics context.
 
@@ -309,7 +317,7 @@ OSR funclets are more or less normal funclets.
 
 #### OSR Unwind Info
 
-On x64 the prolog unwind includes a phantom SP adjustment at offset 0 for the Tier0 frame.
+The prolog unwind includes a phantom SP adjustment at offset 0 for the Tier0 frame.
 
 As noted above the two SP adjusts in the x64 epilog are currently causing problems if we try and unwind in the epilog. Unwinding in the prolog and method body seems to work correctly; the unwind codes properly describe what needs to be done.
 
@@ -323,12 +331,11 @@ OSR GC info is standard. The only unusual aspect is that some special offsets (g
 
 ### Execution of an OSR Method
 
-OSR methods are never called directly; they can only be invoked by `CORINFO_HELP_PATCHPOINT` when called from a Tier0 method with patchpoints.
+OSR methods are never called directly; they can only be invoked by a jump from a Tier0 method with patchpoints.
 
-On x64, to preserve proper stack alignment, the runtime helper will "push" a phantom return address on the stack (x64 methods assume SP is aligned 8 mod 16 on entry). This is not necessary on arm64 as calls do not push to the stack.
+On x64, to preserve proper stack alignment, the prolog will "push" a phantom return address on the stack (x64 methods assume SP is aligned 8 mod 16 on entry). This is not necessary on arm64 as calls do not push to the stack.
 
-When the OSR method returns, it cleans up both its own stack and the
-Tier0 method stack.
+When the OSR method returns, it cleans up both its own stack and the Tier0 method stack.
 
 Note if a Tier0 method is recursive and has loops there can be some interesting dynamics. After a sufficient amount of looping an OSR method will be created, and the currently active Tier0 instance will transition to the OSR method. When the OSR method makes a recursive call, it will invoke the Tier0 method, which will then fairly quickly transition to the OSR version just created.
 
@@ -474,7 +481,6 @@ to spend considerable time in OSR methods (e.g., the all-in-`Main` benchmark).
 
 Generally speaking the performance of an OSR method should be comparable to the equivalent Tier1 method. In practice we see variations of +/- 20% or so. There are a number or reasons for this:
 * OSR methods are often a subset of the full Tier1 method, and in many cases just comprise one loop. The JIT can often generate much better code for a single loop in isolation than a single loop in a more complex method.
-* A few optimizations are disabled in OSR methods, notably struct promotion.
 * OSR methods may only see fractional PGO data (as parts of the Tier0 method may not have executed yet). The JIT doesn't cope very well yet with this sort of partial PGO coverage.
 
 ### Impact on BenchmarkDotNet Results

diff --git a/src/coreclr/jit/emitxarch.cpp b/src/coreclr/jit/emitxarch.cpp
@@ -9108,10 +9108,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, BasicBlock* dst, regNu
     emitTotalIGjmps++;
 #endif
 
-    // Set the relocation flags - these give hint to zap to perform
-    // relocation of the specified 32bit address.
-    //
-    // Note the relocation flags influence the size estimate.
+    // Set reloc flags for AOT purposes. This also affects emitInsSizeAM below.
     id->idSetRelocFlags(attr);
 
     UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));
@@ -9170,10 +9167,7 @@ void emitter::emitIns_R_L(instruction ins, emitAttr attr, insGroup* dst, regNumb
     emitTotalIGjmps++;
 #endif
 
-    // Set the relocation flags - these give hint to zap to perform
-    // relocation of the specified 32bit address.
-    //
-    // Note the relocation flags influence the size estimate.
+    // Set reloc flags for AOT purposes. This also affects emitInsSizeAM below.
     id->idSetRelocFlags(attr);
 
     UNATIVE_OFFSET sz = emitInsSizeAM(id, insCodeRM(ins));