<feature>[kvm]: add kvmagent auto-restart window config #3863
zstack-robot-2 wants to merge 2 commits into 5.5.16
Add KVMAGENT_AUTO_RESTART_WINDOW (kvm.kvmagent.autorestart.window) so that the automatic kvmagent restart triggered by the physical-memory hard-limit alarm only fires within a configured daily time window.

- Format: HH:MM-HH:MM, 24-hour clock, server local time, e.g. 02:00-04:00.
- Cross-midnight windows are supported, e.g. 22:00-02:00.
- An empty default means always allowed, which preserves the existing behavior on upgrade.

The gate is added in processKvmagentPhysicalMemUsageAbnormal(), between the existing hard-limit check and the RestartKvmAgentMsg send. The "no running task on host" check inside restartKvmAgentOnHost is unchanged, so the final guard is: in-window AND over-hardlimit AND no-host-tasks.

A GlobalConfigValidatorExtensionPoint is registered in start() to reject malformed values inline, following the pattern used by RESERVED_MEMORY_CAPACITY (no try/catch wrappers).

Includes unit tests for the window-membership function covering empty/null, a normal window, the half-open boundary, cross-midnight, and the 00:00-23:59 edge case.

Resolves: ZSTAC-84618
Change-Id: I872bfe96fe30cb83dec21d40157bb315966978ba
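A minimal sketch of the kind of inline check such a validator could run, written without try/catch in the spirit described above. The class and method names (`AutoRestartWindowValidator`, `validateWindow`) and the choice of `IllegalArgumentException` are assumptions; the PR's actual validator body and the exception type ZStack expects are not shown on this page.

```java
import java.util.regex.Pattern;

// Hypothetical validator body; the real one is registered in KVMHostFactory#start().
final class AutoRestartWindowValidator {
    // Two-digit 24-hour clock, 00:00 through 23:59.
    private static final Pattern HH_MM = Pattern.compile("([01]\\d|2[0-3]):[0-5]\\d");

    static void validateWindow(String value) {
        if (value == null || value.trim().isEmpty()) {
            return; // empty means "always allowed"
        }
        // limit -1 keeps trailing empty parts, so "02:00-" yields two parts and fails the pattern
        String[] parts = value.trim().split("-", -1);
        if (parts.length != 2
                || !HH_MM.matcher(parts[0]).matches()
                || !HH_MM.matcher(parts[1]).matches()) {
            throw new IllegalArgumentException(String.format(
                    "invalid kvm.kvmagent.autorestart.window [%s], expected HH:MM-HH:MM", value));
        }
    }
}
```

Because a strict pattern is checked up front, every accepted value is also one that `LocalTime.parse` can read, so the runtime window check never sees a malformed string that came through this path.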
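Walkthrough: This PR adds time-window control to the kvmagent auto-restart triggered by the physical-memory hard limit, introducing a new global config parameter.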
Estimated code review effort: 3 (Moderate), about 25 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java`:
- Around lines 152-153: The GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW currently
uses the bare `@GlobalConfigValidation` which enforces notNull=true/notEmpty=true
and thus prevents using null/empty to mean "no time window"; update the
annotation on KVMAGENT_AUTO_RESTART_WINDOW to explicitly allow null/empty (e.g.,
set notNull=false and notEmpty=false) so users can clear the configured window
to mean "always allowed", and ensure any downstream validation logic that reads
KVMAGENT_AUTO_RESTART_WINDOW honors null/empty as "no restriction".
In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java`:
- Around lines 479-490: The method isNowInAutoRestartWindow currently calls
LocalTime.parse on configValue parts without validation, which can throw and
break alert flows for dirty config; update isNowInAutoRestartWindow to validate
parts.length == 2 and wrap parsing in a try-catch that catches
DateTimeParseException/RuntimeException, emit a logger.warn including the
offending configValue and exception, and return false (treat as outside window)
on any parse/validation failure so malformed values do not throw.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 688e3a12-d6ba-4cba-a18d-4750fa67950d
Files ignored due to path filters (3):
- conf/globalConfig/kvm.xml is excluded by !**/*.xml
- plugin/kvm/pom.xml is excluded by !**/*.xml
- test/src/test/resources/globalConfig/kvm.xml is excluded by !**/*.xml
Files selected for processing (3):
- plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java
- plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
- plugin/kvm/src/test/java/org/zstack/kvm/TestKVMAutoRestartWindow.java
```java
@GlobalConfigValidation
public static GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW = new GlobalConfig(CATEGORY, "kvmagent.autorestart.window");
```
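The empty-value semantics conflict with the annotation's default validation.
Using a bare @GlobalConfigValidation here inherits the defaults notNull=true/notEmpty=true, which conflicts with the requirement that an empty value means "no time restriction"; users also cannot reset a configured window back to "always allowed".
Suggested fix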
```diff
- @GlobalConfigValidation
- public static GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW = new GlobalConfig(CATEGORY, "kvmagent.autorestart.window");
+ @GlobalConfigValidation(notNull = false, notEmpty = false)
+ @GlobalConfigDef(
+         defaultValue = "",
+         description = "daily local-time window for kvmagent auto restart, format HH:MM-HH:MM; empty means always allowed"
+ )
+ public static GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW = new GlobalConfig(CATEGORY, "kvmagent.autorestart.window");
```
```java
static boolean isNowInAutoRestartWindow(String configValue, LocalTime now) {
    // Empty or unset window: auto-restart is always allowed.
    if (configValue == null || configValue.trim().isEmpty()) {
        return true;
    }
    String[] parts = configValue.trim().split("-");
    LocalTime start = LocalTime.parse(parts[0]);
    LocalTime end = LocalTime.parse(parts[1]);
    // Same-day window: half-open interval [start, end).
    if (start.isBefore(end)) {
        return !now.isBefore(start) && now.isBefore(end);
    }
    // Cross-midnight window such as 22:00-02:00: [start, midnight) or [midnight, end).
    return !now.isBefore(start) || now.isBefore(end);
}
```
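The behavior on the cases listed in the PR description can be checked with a self-contained copy of the function above. `AutoRestartWindowDemo` is a hypothetical class name, and the `println` calls stand in for the PR's actual unit test, which is not shown on this page.

```java
import java.time.LocalTime;

// Hypothetical standalone copy of isNowInAutoRestartWindow so its edge cases
// can be checked without the rest of KVMHostFactory.
final class AutoRestartWindowDemo {
    static boolean isNowInAutoRestartWindow(String configValue, LocalTime now) {
        if (configValue == null || configValue.trim().isEmpty()) {
            return true;
        }
        String[] parts = configValue.trim().split("-");
        LocalTime start = LocalTime.parse(parts[0]);
        LocalTime end = LocalTime.parse(parts[1]);
        if (start.isBefore(end)) {
            return !now.isBefore(start) && now.isBefore(end);
        }
        return !now.isBefore(start) || now.isBefore(end);
    }

    public static void main(String[] args) {
        // Empty/null: no restriction.
        System.out.println(isNowInAutoRestartWindow(null, LocalTime.of(12, 0)));            // true
        // Half-open boundary: start is included, end is excluded.
        System.out.println(isNowInAutoRestartWindow("02:00-04:00", LocalTime.of(2, 0)));    // true
        System.out.println(isNowInAutoRestartWindow("02:00-04:00", LocalTime.of(4, 0)));    // false
        // Cross-midnight window.
        System.out.println(isNowInAutoRestartWindow("22:00-02:00", LocalTime.of(23, 30)));  // true
        System.out.println(isNowInAutoRestartWindow("22:00-02:00", LocalTime.of(12, 0)));   // false
        // 00:00-23:59 covers the whole day except its last minute.
        System.out.println(isNowInAutoRestartWindow("00:00-23:59", LocalTime.of(23, 59)));  // false
        // start == end falls into the cross-midnight branch, which is then always true.
        System.out.println(isNowInAutoRestartWindow("02:00-02:00", LocalTime.of(12, 0)));   // true
    }
}
```

The last case is a property of this implementation rather than something the PR description states: when start equals end, `!now.isBefore(start) || now.isBefore(start)` holds for every `now`.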
Analysis chain
Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== Check whether the window parsing has no fallback for bad input =="
rg -n -A20 -B5 'static boolean isNowInAutoRestartWindow|LocalTime\.parse|split\("-"\)' plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
echo
echo "== Check whether the tests cover invalid config input =="
rg -n 'inWindow\(|assert(True|False)|Exception' plugin/kvm/src/test/java/org/zstack/kvm/TestKVMAutoRestartWindow.java
rg -n '25:00|aa:bb|invalid|malformed|02:00-02:00|02:00-' plugin/kvm/src/test/java/org/zstack/kvm/TestKVMAutoRestartWindow.java || true
```
Repository: MatheMatrix/zstack
Length of output: 5207
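Add error handling so that dirty config cannot break the alarm-handling flow.
The current code parses the time strings directly, with no error handling. The global config validator stops new invalid values from being written to the database, but it cannot protect against:
- dirty data that already existed before the validator was added
- data modified directly in the database, bypassing the API
- invalid data introduced by migration scripts or manual edits
If such dirty data reaches LocalTime.parse() in production, it throws a runtime exception and aborts alarm handling, which hurts stability. On a parse failure, log a warning and treat the time as outside the window (return false).
Suggested fix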
```diff
 static boolean isNowInAutoRestartWindow(String configValue, LocalTime now) {
     if (configValue == null || configValue.trim().isEmpty()) {
         return true;
     }
-    String[] parts = configValue.trim().split("-");
-    LocalTime start = LocalTime.parse(parts[0]);
-    LocalTime end = LocalTime.parse(parts[1]);
+    String[] parts = configValue.trim().split("-");
+    if (parts.length != 2) {
+        logger.warn(String.format("invalid auto-restart window config[%s], treat as out-of-window", configValue));
+        return false;
+    }
+
+    final LocalTime start;
+    final LocalTime end;
+    try {
+        start = LocalTime.parse(parts[0].trim());
+        end = LocalTime.parse(parts[1].trim());
+    } catch (Exception e) {
+        logger.warn(String.format("invalid auto-restart window config[%s], treat as out-of-window", configValue), e);
+        return false;
+    }
+
     if (start.isBefore(end)) {
         return !now.isBefore(start) && now.isBefore(end);
     }
     return !now.isBefore(start) || now.isBefore(end);
 }
```
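The suggested fix can be exercised in isolation. This sketch replaces ZStack's logger with `java.util.logging.Logger` and narrows the catch to `DateTimeParseException` (what `LocalTime.parse` throws for unparseable text); both are choices made only so the snippet compiles and runs on its own, not part of the reviewer's patch. `SafeAutoRestartWindow` is a hypothetical class name.

```java
import java.time.LocalTime;
import java.time.format.DateTimeParseException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Standalone sketch of the hardened parser; java.util.logging stands in for ZStack's logger.
final class SafeAutoRestartWindow {
    private static final Logger logger = Logger.getLogger(SafeAutoRestartWindow.class.getName());

    static boolean isNowInAutoRestartWindow(String configValue, LocalTime now) {
        if (configValue == null || configValue.trim().isEmpty()) {
            return true; // no window configured: always allowed
        }
        String[] parts = configValue.trim().split("-");
        if (parts.length != 2) {
            // e.g. "02:00" or "02:00-04:00-06:00": fail closed instead of throwing
            logger.warning(String.format("invalid auto-restart window config[%s], treat as out-of-window", configValue));
            return false;
        }
        final LocalTime start;
        final LocalTime end;
        try {
            start = LocalTime.parse(parts[0].trim());
            end = LocalTime.parse(parts[1].trim());
        } catch (DateTimeParseException e) {
            logger.log(Level.WARNING,
                    String.format("invalid auto-restart window config[%s], treat as out-of-window", configValue), e);
            return false;
        }
        if (start.isBefore(end)) {
            return !now.isBefore(start) && now.isBefore(end); // same-day window [start, end)
        }
        return !now.isBefore(start) || now.isBefore(end);      // cross-midnight window
    }
}
```

The `parts.length` check runs before any parsing, so the only failure left for the `try` block is unparseable time text; the per-part `trim()` also lets values such as `" 02:00 - 04:00 "` through.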
The empty default left the time gate disabled by accident; ship a sane maintenance window as the default so that auto-restart only fires during low-traffic hours out of the box. Operators who want 24/7 restarts can clear the value.

Resolves: ZSTAC-84618
Change-Id: I807bc32502780525015a8beffdb06ca6dbe7792f
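Background
ZStack already restarts kvmagent automatically when its physical memory exceeds the hardlimit (triggered by processKvmagentPhysicalMemUsageAbnormal). However, there is currently no time constraint, so a restart can fire at any time, including during business peaks. This MR adds a time-window config so that automatic restarts are only allowed inside the window. Resolves: ZSTAC-84618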
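Changes
New global config
- kvm.kvmagent.autorestart.window: HH:MM-HH:MM, 24-hour clock, server local time, e.g. 02:00-04:00 or 22:00-02:00 (cross-midnight).
Trigger logic
processKvmagentPhysicalMemUsageAbnormal adds a time-window gate after the existing "over hardlimit" check and before RestartKvmAgentMsg is sent. The "restart only when no task is running" check (noRunningTaskOnHost) in the KVMHost.restartKvmAgentOnHost chain is unchanged, so the final guard is: (in window) ∧ (over hardlimit) ∧ (no tasks on host). When a restart is skipped, an INFO log containing the hostUuid and the window string is written to help operators troubleshoot. kvmagent reports the alarm once every 30 minutes, so once the window opens, the next alarm triggers the restart naturally; the worst-case delay is about 30 minutes.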
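The "worst case about 30 minutes" estimate can be checked with a small model. This sketch assumes alarms arrive on a fixed 30-minute grid with an arbitrary phase, a simplification of the kvmagent reporting described above; `AlarmCadence` and `firstAlarmInWindow` are hypothetical names.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.Optional;

// Hypothetical model: alarms fire every `period`, starting at `firstAlarm` and wrapping at midnight.
final class AlarmCadence {
    static Optional<LocalTime> firstAlarmInWindow(LocalTime firstAlarm, Duration period,
                                                  LocalTime start, LocalTime end) {
        long periodMinutes = period.toMinutes();
        if (periodMinutes <= 0) {
            throw new IllegalArgumentException("period must be at least one minute");
        }
        long alarmsPerDay = Duration.ofDays(1).toMinutes() / periodMinutes;
        LocalTime t = firstAlarm;
        for (long i = 0; i < alarmsPerDay; i++) {
            // Same half-open, cross-midnight-aware test as isNowInAutoRestartWindow.
            boolean inWindow = start.isBefore(end)
                    ? !t.isBefore(start) && t.isBefore(end)
                    : !t.isBefore(start) || t.isBefore(end);
            if (inWindow) {
                return Optional.of(t);
            }
            t = t.plus(period); // LocalTime arithmetic wraps around midnight
        }
        return Optional.empty(); // no alarm of the day lands inside the window
    }
}
```

With alarms at 01:59, 02:29, … and a 02:00-04:00 window, the first in-window alarm is 02:29, a 29-minute delay. Under the same model, a window shorter than the reporting interval (for example 02:00-02:10 with alarms at :15 and :45) is never hit, and the method returns an empty `Optional`.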
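Config validation
KVMHostFactory#start() registers a GlobalConfigValidatorExtensionPoint that validates values inline (in the style of RESERVED_MEMORY_CAPACITY, without try/catch wrappers) and rejects values that do not match HH:MM-HH:MM.
Scope of impact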
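- Management-node side (plugin/kvm); the kvmagent Python side is not involved.
- RestartKvmAgentMsg sent for operator-initiated restarts is not restricted by the window.
Testing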
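New unit test
TestKVMAutoRestartWindow (5 cases, all passing): empty/null, normal window, half-open boundary, cross-midnight, and the 00:00-23:59 boundary (the whole day minus one minute).
File changes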
- conf/globalConfig/kvm.xml
- test/src/test/resources/globalConfig/kvm.xml
- plugin/kvm/.../KVMGlobalConfig.java: KVMAGENT_AUTO_RESTART_WINDOW
- plugin/kvm/.../KVMHostFactory.java: isNowInAutoRestartWindow
- plugin/kvm/pom.xml
- plugin/kvm/src/test/.../TestKVMAutoRestartWindow.java

GlobalConfigImpact: added kvm.kvmagent.autorestart.window

sync from gitlab !9738