Skip to content

Refactor stack#52

Merged
CppCXY merged 7 commits into
mainfrom
refactor-stack
Jun 5, 2026
Merged

Refactor stack#52
CppCXY merged 7 commits into
mainfrom
refactor-stack

Conversation

@CppCXY

@CppCXY CppCXY commented Jun 5, 2026

Copy link
Copy Markdown
Owner

No description provided.

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Reasoning Details We need to analyze the code changes in this large diff. The changes introduce a new type `StkId` representing a stack slot pointer, replacing raw `*mut LuaValue` and `LuaValuePtr`. The modifications are across multiple files: gc, lua_value, lua_vm/call_info, execute_loop, helper, metamethod, lua_state, mod, and a new file stk_id.rs.

The goal of the PR seems to be performance improvement by caching base_stk in CallInfo to avoid repeated arithmetic and unsafe pointers in hot loops.

Potential issues to review:

  1. Safety and correctness of StkId's get_ref method: It returns &'static LuaValue which is unsound if the pointer is temporary or mutable aliasing occurs. In C++, StkId is just a pointer, but Rust requires careful lifetime management. The &'static lifetime is a lie. This could cause undefined behavior if the underlying stack is reallocated while a reference is held. Looking at usages: get_ref in upvalue.rs (new) and helper.rs for tointegerns, tonumberns etc. These functions take &LuaValue and are called with the result of get_ref(). The &'static lifetime allows these references to escape the current scope, which is dangerous because the stack might move. The original code used raw pointers and unsafe blocks, but those were local. Now get_ref returns a reference with no lifetime constraints. This is a major concern.

  2. StkId's hvalue method: Returns &'static LuaRawTable. Same lifetime issue. If the table is moved or GC'd, this reference becomes dangling. In usage, it's used immediately and not stored, so might be safe, but unsound nonetheless.

  3. Upvalue changes: LuaUpvalue::v now stores StkId instead of *mut LuaValue. The get_value, set_value methods use self.v.get() etc. The is_open method checks pointer equality with &self.closed_value. get_v_stk_id returns StkId. The update_stack_ptr now takes StkId. The removal of is_valid_tt debug checks is a loss of safety.

  4. Removal of old unsafe helper functions: psetivalue, psetfltvalue, setbfvalue, setbtvalue, setobj2s, stack_ref, stack_mut_ref, etc. are now replaced by StkId methods. This centralizes unsafe code but may hide bugs.

  5. CallInfo::base_stk: Cached pointer that must be updated when stack reallocates. In fix_call_info_base_stk and fix_open_upvalue_pointers, the code updates all pointers. However, there is also a macro updatetrap that reloads base_stk = ci.base_stk; after every reallocation. But during the loop, base_stk might become stale if the stack reallocation happens inside a helper function (like finishget_fallback) that doesn't go through updatetrap. The code seems to rely on savestate! and updatetrap! to handle reallocation. But after some operations like check_gc, which can trigger GC and stack shrink/grow, the base_stk might be invalid. Need to ensure that every path that can cause stack reallocation leads to reloading base_stk.

  6. StkId::from_stack: Takes sp: *mut LuaValue and index. In many places, sp is obtained from lua_state.stack_mut().as_mut_ptr(). But if the stack is reallocated after that call, the sp is stale. In CallInfo::new_lua and new_c, the sp is from stack_mut().as_mut_ptr() at creation time, which is fine but must be updated on reallocation. In fix_call_info_base_stk, it's done. However, inside the execute loop, after a call to precall or return, the loop reloads base_stk from ci.base_stk. But the ci.base_stk itself must be valid. If stack reallocation happens inside precall, then fix_call_info_base_stk is called? Actually precall in call.rs calls lua_state.adjust_stack_for_call() which may reallocate and call fix_open_upvalue_pointers and fix_call_info_base_stk? Looking at adjust_stack_for_call (not shown) but likely calls fix functions. However, after precall, in execute_loop.rs, the loop reloads base = ci.base and base_stk = ci.base_stk. That's fine if ci.base_stk is updated. But note that inside the loop, after savestate! and updatetrap!, base_stk is reloaded. However, some paths like SetTable do savestate!(); then finishset_fallback(...) which might trigger GC and stack moves, but the updatetrap! will reload base_stk. The macro updatetrap is called after savestate and return Err etc. But in many opcodes, they call savestate!(); then a function that might yield or error, and then updatetrap!();. That seems okay.

  7. In SetTable fast path: The code reads ra_val as unsafe { *ra_ptr } where ra_ptr = ra_stk.as_const_ptr(). This is a temporary copy, used for GC barrier. That's fine.

  8. StkId::get_ref unsoundness: This is the most critical. It's used in:

    • LuaUpvalue::get_value_ref() – returns &'static LuaValue. This is used in GetTabUp opcode: let upval_value = upvalue_ptr.as_ref().data.get_value_ref(); and then let key = k_val(constants, instr.get_c()); and later table.impl_table.get_shortstr_into(next_key, dest.as_ptr()). The upval_value is a reference, but it's used only temporarily. However, if the underlying stack slot moves between getting the reference and using it, it's UB. Since the stack doesn't move inside this opcode (no GC), it might be safe but still unsound.
    • In helper.rs for tointegerns and tonumberns calls: They take &LuaValue and are called on base_stk.offset(b).get_ref(). Those functions are called within macros. The references are temporary, so safe only if the stack doesn't move during the expression.
    • ObjLen and other helpers use dest_stk_id.get_ref()? Not directly, they use StkId methods.

    The &'static lifetime is incorrect. It should be tied to the lifetime of the stack or at least non-static. But since StkId is Copy and doesn't have a lifetime parameter, it's impossible to return a reference with proper lifetime. The correct approach is either to keep using raw pointers and unsafe blocks or to have StkId carry a lifetime parameter (e.g., StkId<'a>) to bind to the stack's lifetime. That would be heavyweight but safe. Given the PR's intent to reduce unsafe, this seems a step backward.

  9. StkId::set_integer and other set methods: They directly modify the stack via raw pointer. That's fine.

  10. pk_val function: Creates a StkId from a constants pointer. This is used in opcodes like LoadK, arithK, etc. The constants are static within the chunk, so that's safe.

  11. Missing debug_assert in upvalue get_value: Removed the validity checks. That's a loss of debugging.

  12. offset_of_stk_id in LuaState: Used to compute stack index for mark_pending_finish. This is safe but assumed stack pointer is within the current stack.

  13. In return1_with_hook and return0_with_hook: They are called with ci.base + instr.get_a() as usize (raw index) not StkId? Actually in the new code, they are called with ci.base + instr.get_a() as usize as before. That's fine.

  14. updatetrap macro now reloads base_stk: Good.

  15. In Finishget and self_shortstr_index_chain_fast: They now take StkId for dest_reg. The old code used setobj2s(lua_state, dest_reg, &value). Now dest_stk_id.write(&value). That's fine.

  16. call_tm_res_into: Now takes StkId. Inside, it writes result to dest_stk_id.write(&result). That's fine.

  17. StkId implements Copy: Fine, but the raw pointer is copied.

  18. Potential issue with StkId::offset: It uses self.0.add(reg). If self.0 is null (from null()), then offset will produce an invalid pointer. This is used in CallInfo::default which sets base_stk: StkId::null(). Then if that default is used before initialization, it could cause UB. But default is likely used only for initialization before actual call, so it should be overwritten. Ensure no code accesses base_stk from a default CI.

  19. StkId::is_valid: Only checks non-null. That's fine for the upvalue fix_closed_ptr.

  20. LuaUpvalue::fix_closed_ptr: Changed from if self.v.is_null() to if !self.v.is_valid(). But originally the null check was for open upvalues? Actually for closed upvalues, v is set to StkId::null(), so is_valid returns false, then it sets v to point to closed_value. Then is_open checks pointer equality. That's correct.

  21. The release method in GC changed to release_ci: Just renaming.

Overall, the biggest concern is the unsound &'static lifetime in StkId::get_ref and StkId::hvalue. This could lead to undefined behavior if the stack moves while a reference is held. The original code used unsafe blocks but kept the reference local. Now the reference has no lifetime bound, making it easy to misuse.

Other concerns: potential for stale base_stk after stack reallocation within helpers that don't go through updatetrap. Need to verify all paths.

Also, the removal of debug assertions reduces safety checking.

I'll provide a review listing these issues with suggestions.

Code Review: Introduction of StkId Type

The PR introduces a new StkId wrapper around *mut LuaValue to centralise raw pointer operations in the VM loop, replacing many scattered unsafe blocks and helper functions. While this improves code organisation, several safety and correctness issues must be addressed before merging.


Critical: Unsound &'static Lifetime in StkId::get_ref and hvalue

Problem:
StkId::get_ref and hvalue return references with lifetime 'static (e.g., &'static LuaValue). This is unsound because the underlying pointer may become invalid (e.g., after stack reallocation, GC movement, or aliasing).

pub fn get_ref(self) -> &'static LuaValue {
    unsafe { &*self.0 }
}

Risk:

  • A reference stored in a local variable may still be used after the stack pointer changes, causing undefined behaviour (use-after-free of a moved allocation).
  • These references are used in upvalue.rs (get_value_ref), helper.rs (finishget_fallback etc.), and through macro expansions. While current usage is temporary, the type system no longer prevents misuse.

Recommendation:
Either:

  • Add a lifetime parameter to StkId (e.g., StkId<'a>), tying it to the stack's lifetime. This forces all callers to prove the stack is unchanged.
  • Remove the unsafe methods get_ref and hvalue and keep them as raw-pointer‑based internal helpers. Force callers to pass raw pointers in unsafe blocks where they can manually verify lifetime.
  • At minimum, mark get_ref and hvalue as unsafe and document the safety invariants ("the stack must not be mutated or moved while the reference is live").

Potential: Stale base_stk After Stack Reallocation

Problem:
The new ci.base_stk is a cached StkId that should mirror the actual stack base. It is updated in:

  • fix_call_info_base_stk (after explicit stack grow/shrink)
  • updatetrap macro (reloads from ci.base_stk after operations that might reallocate)

However, some helper functions (e.g., finishset_fallback, bin_tm_fallback) can trigger GC (which shrinks/grows the stack) without going through updatetrap. Example in GetTable:

savestate!();
finishget_fallback(lua_state, ci, rb.get_ref(), rc.get_ref(), dest)?;
updatetrap!();

Here, finishget_fallback may yield or call GC. After savestate!(), updatetrap reloads base_stk from ci.base_stkbut if the stack moved, ci.base_stk itself is stale until fix_call_info_base_stk is called. The loop then uses a wrong pointer for subsequent opcodes.

Risk:

  • Dereferencing stale base_stk can access freed memory or cause data corruption.
  • The current code only calls fix_call_info_base_stk in adjust_stack_size (stack grow) but not after every GC operation. Many GC cycles do not resize the stack, but some do.

Recommendation:

  • Ensure that every code path that can trigger stack reallocation updates ci.base_stk directly (not just via fix_call_info_base_stk).
  • Alternatively, remove the cache and compute base_stk from ci.base and lua_state.stack().as_mut_ptr() at each use (like the old code), accepting the cost of a pointer fetch.
  • If caching is essential, add a global generation counter (incremented on stack reallocation) and assert base_stk is consistent.

Removal of Sanity Checks (🔍 Debug Safety)

Problem:
The old code had extensive debug_assert! in LuaUpvalue::get_value and set_value to detect null/dangling pointers. These are removed in favour of calling StkId::get() without checks.

Impact:

  • In debug builds, these checks caught misuses (e.g., using an upvalue after the stack frame was freed).
  • The is_valid_tt method was also removed; its validation is lost.

Recommendation:

  • Keep the debug assertions inside StkId::get() or StkId::write() methods, gated by debug_assert!. Example:
pub fn get(self) -> LuaValue {
    debug_assert!(!self.0.is_null(), "StkId::get on null pointer");
    unsafe { *self.0 }
}
  • Optionally add a debug check that the pointer lies within the current stack bounds.

StkId::offset on Null or Uninitialised Pointer

Problem:
CallInfo::default() sets base_stk: StkId::null(). If any code accidentally uses this default base_stk before it is properly initialised (e.g., during error recovery), calling .offset() on it will produce an invalid pointer.

Risk:

  • Undefined behaviour on first use if default CI is read without being set first.

Recommendation:

  • Keep base_stk as StkId::null() in default, but add a debug_assert! in StkId::offset that the pointer is non‑null:
pub fn offset(self, reg: usize) -> Self {
    debug_assert!(!self.0.is_null(), "offset on null StkId");
    unsafe { Self(self.0.add(reg)) }
}

Incorrect Handling of CIST_CLSRET in Return

Problem:
In the Return opcode, when CIST_CLSRET is set and close_all returns Ok(()), the code resets trap manually but does not reload base_stk. After closing to‑be‑closed variables, the stack might have been reallocated (if close_all triggers GC). The base_stk variable still holds the old pointer, causing subsequent reads/writes to be invalid.

Impact:

  • Use‑after‑free or data corruption in long‑running scripts with TBC variables.

Recommendation:

  • After close_all succeeds and before continuing the loop (e.g., after break?), reload base_stk from ci.base_stk. This should be done in the same fashion as updatetrap but without the hook‑trap logic.
  • Consider calling updatetrap!() after each close_all success, since updatetrap already reloads base_stk.

Unused Macro setobj2s Removal

Problem:
setobj2s is removed, but some old code in comments or future merges might still rely on it. This is minor.

Recommendation:

  • Ensure all callers are migrated; the new StkId::set and StkId::write replace it.

Summary of Required Changes

  1. Fix the &'static soundness hole in StkId. Minimum: make get_ref and hvalue unsafe. Better: add a lifetime.
  2. Guarantee base_stk freshness after every operation that can reallocate the stack. Either remove the cache or update it in every helper that might trigger GC.
  3. Restore debug assertions inside StkId methods for null/stale pointers.
  4. Add null‑check assertion to StkId::offset.
  5. Fix Return code to reload base_stk after close_all succeeds.

These changes are necessary for the PR to be safe. Parts of the refactor (renaming release to release_ci, reducing macro duplication) are fine, but the core pointer‑caching mechanism introduces subtle unsoundness that must be resolved.

@CppCXY CppCXY merged commit 17ce1e2 into main Jun 5, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant