feat(script): add SCRIPT KILL and lua-time-limit support #3505

Open
xtco3o wants to merge 3 commits into apache:unstable from xtco3o:dev

xtco3o commented May 28, 2026

References:

This PR adds support for the lua-time-limit configuration option and the SCRIPT KILL command to interrupt long-running Lua scripts.

Changes and Technical Rationale:

Lua Hook with JIT Execution

  • Register LuaMaskCountHook via lua_sethook to check running time every 100,000 instructions.
  • Pass -DLUAJIT_ENABLE_CHECKHOOK to luajit compilation flags (cmake/luajit.cmake). This allows the LuaJIT VM to check hooks inside compiled machine code without disabling the JIT engine, which preserves execution performance.
  • Avoid linker crash on macOS Xcode 15/16 by adding -Wl,-no_deduplicate to linker flags when luajit is enabled (CMakeLists.txt).
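
To make the count-hook idea concrete, here is a minimal sketch that uses only the C++ standard library. It is a toy model, not the PR's code: `ToyScriptCtx`, `ToyCountHook`, and `RunToyVm` are hypothetical names, the "VM" is just a loop, and the clock is injected so the behavior is deterministic. It assumes a limit of 0 means "no limit".

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

using ToyClock = std::chrono::steady_clock;

// Hypothetical stand-in for per-script state; not the PR's ScriptRunCtx.
struct ToyScriptCtx {
  ToyClock::time_point start;
  std::chrono::milliseconds time_limit{0};  // assumption: 0 means "no limit"
  bool timed_out = false;
  uint64_t hook_calls = 0;
};

// Body of the count hook: a cheap elapsed-time comparison.
inline void ToyCountHook(ToyScriptCtx &ctx, ToyClock::time_point now) {
  ++ctx.hook_calls;
  if (ctx.time_limit.count() > 0 && now - ctx.start >= ctx.time_limit) {
    ctx.timed_out = true;
  }
}

// Toy "VM": executes up to `total` instructions and calls the hook once every
// `interval` instructions. Returns how many instructions ran before stopping.
inline uint64_t RunToyVm(ToyScriptCtx &ctx, uint64_t total, uint64_t interval,
                         const std::function<ToyClock::time_point()> &now) {
  assert(interval > 0);
  for (uint64_t executed = 1; executed <= total; ++executed) {
    if (executed % interval == 0) {
      ToyCountHook(ctx, now());
      if (ctx.timed_out) return executed;  // stop at the first timed-out check
    }
  }
  return total;
}
```

With an interval of 100,000, the time check costs one clock read per 100,000 instructions, so a timeout is noticed at most one interval late.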

Lock Bypass for SCRIPT KILL

  • Bypass WorkConcurrencyGuard and WorkExclusivityGuard for SCRIPT KILL and SHUTDOWN (src/server/redis_connection.cc). If these commands required database locks, they would queue behind the blocked script and never execute. Bypassing guards allows SCRIPT KILL to run concurrently on another worker thread to set the is_killed flag.
  • Intercept incoming commands and return a BUSY error while a running script has exceeded lua-time-limit.
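
The reason the bypass matters can be sketched with a plain atomic flag. This is an illustrative model under stated assumptions, not the PR's implementation: `ToyRunningScript`, `RequestKill`, and `RunUntilKilled` are hypothetical names, and the `at_hook` callback stands in for "another worker thread ran in the meantime".

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical model of a running script; the PR's real context differs.
struct ToyRunningScript {
  std::atomic<bool> is_killed{false};
};

// What SCRIPT KILL needs to do in this model: flip a flag. No database lock
// is taken, so the call cannot queue behind the script that holds one.
inline void RequestKill(ToyRunningScript &script) {
  script.is_killed.store(true, std::memory_order_release);
}

// Script side: poll the flag at each hook point and stop once it is set.
// Returns the hook index at which the script stopped.
inline uint64_t RunUntilKilled(ToyRunningScript &script, uint64_t max_hooks,
                               const std::function<void(uint64_t)> &at_hook) {
  assert(max_hooks > 0);
  for (uint64_t i = 1; i <= max_hooks; ++i) {
    at_hook(i);  // simulated activity from another worker
    if (script.is_killed.load(std::memory_order_acquire)) return i;
  }
  return max_hooks;
}
```

If `RequestKill` had to acquire the same guard the script holds, the flag would never be set and the loop would run to completion.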

macOS Listener Socket Sharing

  • Share the listener socket among worker threads via dup() on macOS instead of binding separate sockets (src/server/worker.cc). On macOS, SO_REUSEPORT routes connections to the worker that bound the socket. If that worker is blocked by a script, new connections queue in the backlog. Sharing the socket allows idle workers to accept connections and process SCRIPT KILL.
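
A rough sketch of the sharing step, using the POSIX `dup()` call. `ShareListenFd` is a hypothetical helper, and the choice to hand the original descriptor to the first worker is an assumption made for illustration; the PR's actual ownership rules in src/server/worker.cc may differ.

```cpp
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <vector>

// Returns one descriptor per worker, all referring to the same open socket.
// Assumption: the first worker reuses `listen_fd`, the rest get dup() copies.
// On failure, closes the copies made so far and returns an empty vector.
inline std::vector<int> ShareListenFd(int listen_fd, std::size_t workers) {
  assert(listen_fd >= 0);
  std::vector<int> fds;
  if (workers == 0) return fds;
  fds.push_back(listen_fd);
  for (std::size_t i = 1; i < workers; ++i) {
    int copy = dup(listen_fd);
    if (copy < 0) {
      for (std::size_t j = 1; j < fds.size(); ++j) close(fds[j]);
      return {};
    }
    fds.push_back(copy);
  }
  return fds;
}
```

Because every descriptor refers to the same listening socket, any idle worker can accept the next connection, which is what lets SCRIPT KILL reach the server while one worker is stuck in a script.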

Lock-Free Fast Path for Timeout Check

  • Track running scripts in a list in Server (src/server/server.h, src/server/server.cc).
  • Add running_script_count_ atomic counter. This allows an O(1) lock-free check to bypass timeout evaluations on the command path when no scripts are running.
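
The fast path can be illustrated with a small registry that pairs a mutex-protected list with an atomic counter. This is a sketch under assumptions: `ToyScriptRegistry` and its methods are hypothetical, and scripts are represented by plain integer ids rather than the PR's context objects.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical registry; mirrors the idea of a list plus an atomic counter.
class ToyScriptRegistry {
 public:
  using Handle = std::list<int>::iterator;

  Handle Register(int script_id) {
    std::lock_guard<std::mutex> lock(mu_);
    running_.push_front(script_id);
    running_count_.fetch_add(1, std::memory_order_release);
    return running_.begin();
  }

  void Unregister(Handle handle) {
    std::lock_guard<std::mutex> lock(mu_);
    assert(!running_.empty());
    running_.erase(handle);
    running_count_.fetch_sub(1, std::memory_order_release);
  }

  // Command path: a single atomic load, no mutex. Only when this returns
  // true does the caller need to take the lock and inspect the list.
  bool NeedsTimeoutCheck() const {
    return running_count_.load(std::memory_order_acquire) > 0;
  }

  // Slow path: takes the lock to read the list itself.
  std::size_t RunningCount() {
    std::lock_guard<std::mutex> lock(mu_);
    return running_.size();
  }

 private:
  std::mutex mu_;
  std::list<int> running_;
  std::atomic<std::size_t> running_count_{0};
};
```

In the common case where no script is running, every command pays for one atomic load and never touches the mutex.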

SSL Output Flushing in Event Loop

  • Implement Worker::PollEventLoop to run the event loop and flush buffers (src/server/worker.cc).
  • Detect SSL connections using bufferevent_openssl_get_ssl before writing to avoid writing unencrypted bytes to TLS sockets.

Tests

  • Add integration tests for SCRIPT KILL and lua-time-limit (tests/gocase/unit/scripting/scripting_test.go).
  • Modify test framework directory creation on macOS to handle slashes in test names (tests/gocase/util/server.go).

This commit merges all changes on the current dev branch relative to unstable into a single commit. It includes the lua-time-limit configuration option, the SCRIPT KILL command for safely interrupting long-running scripts, connection guard optimizations, recursive event handling in the worker event loop, and macOS TCP listener socket sharing.
Contributor

Copilot AI left a comment

Pull request overview

Adds Redis-compatible SCRIPT KILL command and a lua-time-limit configuration to interrupt long-running Lua scripts. The implementation registers a Lua count hook that detects timeouts and disconnected clients, tracks running script contexts on the Server, lets SCRIPT KILL/SHUTDOWN bypass concurrency locks, and (on macOS) shares listener fds across workers via dup() so idle workers can still accept connections while a worker is blocked in a script. The PR also enables LUAJIT_ENABLE_CHECKHOOK so hooks fire from JIT-compiled code and adds Worker::PollEventLoop() to drain output buffers (TLS-aware) during long script execution.

Changes:

  • Add ScriptRunCtxGuard + LuaMaskCountHook to enforce timeout/kill semantics, with Server-side running-script registry and ScriptKill() returning NOTBUSY/UNKILLABLE/OK.
  • Intercept commands with a BUSY reply when a script timed out and bypass concurrency guards for SCRIPT KILL/SHUTDOWN; add Worker::PollEventLoop for periodic event/output flushing, including TLS handling.
  • macOS: share listener fds across workers and pass -DLUAJIT_ENABLE_CHECKHOOK / -Wl,-no_deduplicate build flags; add Go tests and sanitize test name slashes for temp dirs.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/storage/scripting.h | Adds LuaMaskCountHook decl and timing/kill fields to ScriptRunCtx. |
| src/storage/scripting.cc | Implements ScriptRunCtxGuard, LuaMaskCountHook, KillScript, write-dirty tracking, and wires guard into EvalGenericCommand/FunctionCall. |
| src/server/server.h / server.cc | Adds running-script registry, ScriptKill, BUSY flag, kills scripts on shutdown, and shares macOS listener fds when constructing/adjusting workers. |
| src/server/worker.h / worker.cc | New ctor taking shared TCP listen fds, tracks tcp_listen_fds_, adds PollEventLoop with TLS-aware flushing. |
| src/server/redis_connection.cc | Async-close handling while command runs, OnRead reentrancy guard, BUSY interception, and lock bypass for SCRIPT KILL/SHUTDOWN. |
| src/commands/cmd_script.cc | Adds SCRIPT KILL subcommand and allows it on replicas. |
| src/config/config.h / config.cc | Adds mutable lua-time-limit integer config. |
| kvrocks.conf | Documents lua-time-limit. |
| CMakeLists.txt | Adds Apple+LuaJIT linker flag and ENABLE_LUAJIT compile definition. |
| cmake/luajit.cmake | Switches to XCFLAGS, adds LUAJIT_ENABLE_CHECKHOOK, preserves LuaJIT's own CFLAGS. |
| tests/gocase/unit/scripting/scripting_test.go | Adds SCRIPT KILL, BUSY, UNKILLABLE, and disconnected-client tests. |
| tests/gocase/util/server.go | Sanitizes / in t.Name() for the temp directory pattern. |

Comment thread src/server/redis_connection.cc Outdated
Comment on lines +457 to +467
bool is_script_kill = (util::EqualICase(cmd_tokens.front(), "script") && cmd_tokens.size() >= 2 &&
                       util::EqualICase(cmd_tokens[1], "kill"));
bool is_shutdown = util::EqualICase(cmd_tokens.front(), "shutdown");

if (srv_->IsScriptTimedOut()) {
  if (!is_script_kill && !is_shutdown) {
    Reply(redis::Error({Status::RedisErrorNoPrefix,
                        "BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE."}));
    continue;
  }
}
Author

Fixed. I've tightened the matching condition to check for exactly two tokens (cmd_tokens.size() == 2) for SCRIPT KILL commands. Furthermore, I moved the IsScriptTimedOut() check to occur after the authentication and namespace checks, but still before acquiring any concurrency/exclusivity guards, preventing unauthenticated clients from bypassing the BUSY check.
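
The ordering described in this reply can be sketched as a small classifier. It is a simplified model under assumptions: `ToyOutcome`, `EqualsIgnoreCase`, and `Classify` are hypothetical names, the namespace check is omitted, and authentication is reduced to a boolean.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class ToyOutcome { kRejectNoAuth, kRejectBusy, kRunWithoutGuards, kRunWithGuards };

// ASCII case-insensitive comparison (hypothetical helper for this sketch).
inline bool EqualsIgnoreCase(const std::string &a, const std::string &b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i]))) {
      return false;
    }
  }
  return true;
}

// Order of checks: authentication first, then the BUSY gate, then guards.
inline ToyOutcome Classify(const std::vector<std::string> &tokens, bool authenticated,
                           bool script_timed_out) {
  assert(!tokens.empty());
  // Exactly two tokens, so "SCRIPT KILL extra" is not treated as SCRIPT KILL.
  const bool is_script_kill = tokens.size() == 2 && EqualsIgnoreCase(tokens[0], "script") &&
                              EqualsIgnoreCase(tokens[1], "kill");
  const bool is_shutdown = EqualsIgnoreCase(tokens[0], "shutdown");

  if (!authenticated) return ToyOutcome::kRejectNoAuth;
  if (script_timed_out && !is_script_kill && !is_shutdown) return ToyOutcome::kRejectBusy;
  if (is_script_kill || is_shutdown) return ToyOutcome::kRunWithoutGuards;
  return ToyOutcome::kRunWithGuards;
}
```

Placing the authentication check first means an unauthenticated client is rejected before the BUSY gate or the guard bypass is ever considered.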

Comment thread src/server/worker.cc
Comment on lines +621 to +625
} else {
  auto *output = bufferevent_get_output(bev);
  if (evbuffer_get_length(output) > 0) {
    evbuffer_write(output, conn->GetFD());
  }
Author

Fixed. I've updated the logic to check the return value of evbuffer_write and log non-recoverable errors (i.e. those other than EAGAIN, EWOULDBLOCK, and EINTR) to prevent silent failures. Bypassing write watermarks, callbacks, and rate limiting is acceptable here because client bufferevents in Kvrocks do not utilize these features.
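
The error classification described here might look roughly like the following. `ToyWriteResult` and `ClassifyWrite` are hypothetical names; the sketch only assumes that the write call returns -1 on failure and that `errno` is captured immediately afterwards.

```cpp
#include <cassert>
#include <cerrno>

enum class ToyWriteResult { kOk, kRetryLater, kFatal };

// `written` is the return value of a write-style call (-1 on error);
// `err` is the errno value captured right after that call.
inline ToyWriteResult ClassifyWrite(int written, int err) {
  assert(written >= -1);
  if (written >= 0) return ToyWriteResult::kOk;
  // Transient conditions: the socket is not writable yet, or the call was
  // interrupted by a signal. Retrying later is expected to succeed.
  if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR) {
    return ToyWriteResult::kRetryLater;
  }
  return ToyWriteResult::kFatal;  // anything else is worth logging
}
```

Only the `kFatal` branch would be logged, which keeps ordinary back-pressure on a non-blocking socket out of the logs.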

Comment thread src/storage/scripting.cc
};

static void KillScript(lua_State *lua) {
  lua_sethook(lua, LuaMaskCountHook, LUA_MASKLINE, 0);
Author

Clarified with comments. We keep the LUA_MASKLINE mask active on the hook so that if a script attempts to catch the raised error using pcall or xpcall to continue running, the hook will immediately trigger again on the next line and re-raise the error. This ensures the script is terminated. Added inline documentation to clarify this intent.
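
The re-raise behavior can be modeled with C++ exceptions standing in for Lua errors and a try/catch standing in for pcall. This is only an analogy under assumptions, not Lua semantics in full: `ToyVm`, `ToyLineHook`, `ToyScriptKilled`, and `RunStubbornScript` are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>

struct ToyScriptKilled : std::runtime_error {
  ToyScriptKilled() : std::runtime_error("script killed") {}
};

struct ToyVm {
  bool killed = false;
  int errors_swallowed = 0;
  int lines_run = 0;
};

// Line hook: once the script is marked killed, every new line raises again.
inline void ToyLineHook(ToyVm &vm) {
  ++vm.lines_run;
  if (vm.killed) throw ToyScriptKilled();
}

// A script that wraps its body in a pcall-like try/catch and then carries on.
inline void RunStubbornScript(ToyVm &vm) {
  try {
    ToyLineHook(vm);  // line inside the protected call
  } catch (const ToyScriptKilled &) {
    ++vm.errors_swallowed;  // the script "handles" the error and continues
  }
  ToyLineHook(vm);     // next line, outside the protected call: raises again
  assert(!vm.killed);  // only reachable when the script was never killed
}
```

The first error is swallowed by the protected call, but the hook fires again on the following line, where nothing catches it, so the script still terminates.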

Comment thread src/server/server.cc
Comment on lines +1924 to +1925
bool Server::IsScriptTimedOut() const { return is_script_timeout_.load(std::memory_order_relaxed); }

Author

Fixed. Added ReevaluateScriptTimeout() method to Server and registered a configuration callback for lua-time-limit in config.cc. When lua-time-limit is changed via CONFIG SET, ReevaluateScriptTimeout() is called to dynamically update is_script_timeout_. Also adjusted the default value of lua-time-limit to 5000 in both config.cc and kvrocks.conf.
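
One way to picture the re-evaluation is a server object that recomputes its cached flag whenever the limit or the elapsed time changes. This is a sketch under assumptions, not the PR's `ReevaluateScriptTimeout()`: `ToyServer` is hypothetical, the limit is assumed to be in milliseconds, a non-positive limit is assumed to disable the timeout, and elapsed time is injected rather than measured.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

class ToyServer {
 public:
  // Injected elapsed running time of the current script (for determinism).
  void SetScriptElapsedMs(int64_t elapsed_ms) {
    assert(elapsed_ms >= 0);
    elapsed_ms_ = elapsed_ms;
    Reevaluate();
  }

  // Stand-in for the configuration callback fired by CONFIG SET.
  void SetLuaTimeLimitMs(int64_t limit_ms) {
    limit_ms_ = limit_ms;
    Reevaluate();
  }

  bool IsScriptTimedOut() const { return timed_out_.load(std::memory_order_relaxed); }

 private:
  // Recompute the cached flag from the current limit and elapsed time.
  void Reevaluate() {
    const bool timed_out = limit_ms_ > 0 && elapsed_ms_ >= limit_ms_;
    timed_out_.store(timed_out, std::memory_order_relaxed);
  }

  int64_t limit_ms_ = 5000;  // default value mentioned in this thread
  int64_t elapsed_ms_ = 0;
  std::atomic<bool> timed_out_{false};
};
```

Without the callback, lowering the limit below the current running time would leave the flag stale until the next hook tick; recomputing on change keeps the BUSY gate consistent with the new setting.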

Member

jihuayu commented May 28, 2026

Hi @xtco3o Thanks for your PR. Please read our AI guidelines at https://kvrocks.apache.org/community/contributing#guidelines-for-ai-assisted-contributions.
If you have used AI tools in your development, please let us know. The use of AI tools is welcome; this allows us to understand your workflow and share some best practices with you.

This is a highly technical PR. I’ll need to study it carefully.

Comment thread src/storage/scripting.cc

if (should_poll) {
  auto *worker = script_run_ctx->conn->Owner();
  worker->PollEventLoop();
Member

I find this poll a bit strange. Are you sure it won’t introduce any security issues?
