Skip to content

<feature>[kvm]: add kvmagent auto-restart window config#3863

Open
zstack-robot-2 wants to merge 2 commits into5.5.16from
sync/jin.ma/feat/kvmagent-autorestart-window
Open

<feature>[kvm]: add kvmagent auto-restart window config#3863
zstack-robot-2 wants to merge 2 commits into5.5.16from
sync/jin.ma/feat/kvmagent-autorestart-window

Conversation

@zstack-robot-2
Copy link
Copy Markdown
Collaborator

背景

ZStack 已有当 kvmagent 物理内存超过 hardlimit 时自动重启的机制(由 processKvmagentPhysicalMemUsageAbnormal 触发)。但目前没有时间约束,可能在业务高峰随时触发重启。本 MR 增加一个时间窗口配置,仅在窗口内允许自动重启。

Resolves: ZSTAC-84618

变更

新增全局配置 kvm.kvmagent.autorestart.window

  • 格式:HH:MM-HH:MM 24 小时制,服务器本地时间,例如 02:00-04:00
  • 支持跨午夜窗口,例如 22:00-02:00
  • 默认空字符串 = 任意时间允许(与升级前行为一致)

触发逻辑

processKvmagentPhysicalMemUsageAbnormal 在原有"超过 hardlimit"判断之后、发送 RestartKvmAgentMsg 之前,新增一个时间窗口关卡:

1. memoryUsage > hardlimit?           (已有)
2. 当前时间在窗口内?                   (新增)
3. 发送 RestartKvmAgentMsg            (已有)

KVMHost.restartKvmAgentOnHost 链中的"无运行任务才重启" (noRunningTaskOnHost) 保持不变,所以最终守卫是:(在窗口内) ∧ (超 hardlimit) ∧ (host 无任务)

跳过时打 INFO 日志(含 hostUuid 和窗口字符串),方便运维排查。kvmagent 端 30 分钟一次告警上报,进入窗口后下一波告警自然触发重启,最坏延迟约 30 分钟。

配置校验

KVMHostFactory#start() 注册 GlobalConfigValidatorExtensionPoint,inline 校验(仿 RESERVED_MEMORY_CAPACITY 风格,不用 try/catch 包装),拒绝以下非法值:

  • 格式不是 HH:MM-HH:MM
  • 小时 > 23 或分钟 > 59
  • 起止时间相同(零长度窗口)

影响范围

  • 仅 Java 改动(plugin/kvm),不涉及 kvmagent Python 端
  • 仅影响告警触发的自动重启路径;显式 RestartKvmAgentMsg(运维主动重启)不受窗口限制
  • 无 DB schema 改动,无 Flyway 迁移
  • 默认值空 → 老环境零感知升级

测试

新增单元测试 TestKVMAutoRestartWindow(5 个用例,全部通过):

  • 空/null/空白值 → 任意时间允许
  • 普通窗口成员判断 + 起点包含 / 终点排除
  • 跨午夜窗口(前半段、后半段、外部、边界)
  • 00:00-23:59 全天减一分钟边界
mvn test -pl plugin/kvm -am -Dtest=TestKVMAutoRestartWindow
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0

文件变更

文件 变更
conf/globalConfig/kvm.xml 新增配置项 + 默认空
test/src/test/resources/globalConfig/kvm.xml 同上(测试镜像)
plugin/kvm/.../KVMGlobalConfig.java 新增常量 KVMAGENT_AUTO_RESTART_WINDOW
plugin/kvm/.../KVMHostFactory.java gate + validator + isNowInAutoRestartWindow
plugin/kvm/pom.xml 加 junit test scope 依赖
plugin/kvm/src/test/.../TestKVMAutoRestartWindow.java 新增单元测试

GlobalConfigImpact: 新增 kvm.kvmagent.autorestart.window

sync from gitlab !9738

Add KVMAGENT_AUTO_RESTART_WINDOW (kvm.kvmagent.autorestart.window) so
the automatic kvmagent restart triggered by the physical-memory
hard-limit alarm only fires within a configured daily time window.

Format: HH:MM-HH:MM in 24-hour server local time, e.g. 02:00-04:00.
Cross-midnight windows are supported, e.g. 22:00-02:00. Empty
default means always allowed (preserves existing behavior on upgrade).

The gate is added in processKvmagentPhysicalMemUsageAbnormal()
between the existing hard-limit check and the RestartKvmAgentMsg send.
The 'no running task on host' check inside restartKvmAgentOnHost is
unchanged, so the final guard is: in-window AND over-hardlimit AND
no-host-tasks.

A GlobalConfigValidatorExtensionPoint is registered in start() to
reject malformed values inline, following the pattern used by
RESERVED_MEMORY_CAPACITY (no try/catch wrappers).

Includes unit tests for the window-membership function covering
empty/null, normal window, half-open boundary, cross-midnight, and
00:00-23:59 edge cases.

Resolves: ZSTAC-84618

Change-Id: I872bfe96fe30cb83dec21d40157bb315966978ba
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (2)
  • conf/globalConfig/kvm.xml is excluded by !**/*.xml
  • test/src/test/resources/globalConfig/kvm.xml is excluded by !**/*.xml

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 886ab73b-7f8c-4860-958a-6bfbe441a5ed

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

该PR为KVM代理的物理内存硬限制自动重启功能添加了时间窗口控制。新增全局配置参数kvmagent.autorestart.window,允许在指定时间范围内执行自动重启操作,支持跨午夜的时间段配置。

Changes

Cohort / File(s) Summary
全局配置新增
plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java
新增KVMAGENT_AUTO_RESTART_WINDOW全局配置字段,键名为kvmagent.autorestart.window,用于定义自动重启的时间窗口。
时间窗口逻辑实现
plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
实现isNowInAutoRestartWindow方法,用于检查当前时间是否在配置的窗口内。支持HH:MM-HH:MM格式,包含跨午夜时间段处理和全局配置验证器注册。
单元测试覆盖
plugin/kvm/src/test/java/org/zstack/kvm/TestKVMAutoRestartWindow.java
新增5个测试方法,覆盖空值处理、同天时间范围、跨午夜范围、边界条件等场景。

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 时间之窗现身来,
自动重启有了界,
午夜之前莫惊扰,
配置灵活又周全,
测试覆盖真周密!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题遵循了指定的格式 [scope]: ,且为55个字符,符合72字符以内的要求,清晰描述了主要变更(添加kvmagent自动重启窗口配置)。
Description check ✅ Passed PR描述详细说明了背景、变更、触发逻辑、配置校验、影响范围、测试和文件变更,与代码改动的内容高度关联且信息完整。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/jin.ma/feat/kvmagent-autorestart-window

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java`:
- Around line 152-153: The GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW currently
uses the bare `@GlobalConfigValidation` which enforces notNull=true/notEmpty=true
and thus prevents using null/empty to mean "no time window"; update the
annotation on KVMAGENT_AUTO_RESTART_WINDOW to explicitly allow null/empty (e.g.,
set notNull=false and notEmpty=false) so users can clear the configured window
to mean "always allowed", and ensure any downstream validation logic that reads
KVMAGENT_AUTO_RESTART_WINDOW honors null/empty as "no restriction".

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java`:
- Around line 479-490: The method isNowInAutoRestartWindow currently calls
LocalTime.parse on configValue parts without validation, which can throw and
break alert flows for dirty config; update isNowInAutoRestartWindow to validate
parts.length == 2 and wrap parsing in a try-catch that catches
DateTimeParseException/RuntimeException, emit a logger.warn including the
offending configValue and exception, and return false (treat as outside window)
on any parse/validation failure so malformed values do not throw.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 688e3a12-d6ba-4cba-a18d-4750fa67950d

📥 Commits

Reviewing files that changed from the base of the PR and between e6cdf84 and c7f332b.

⛔ Files ignored due to path filters (3)
  • conf/globalConfig/kvm.xml is excluded by !**/*.xml
  • plugin/kvm/pom.xml is excluded by !**/*.xml
  • test/src/test/resources/globalConfig/kvm.xml is excluded by !**/*.xml
📒 Files selected for processing (3)
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
  • plugin/kvm/src/test/java/org/zstack/kvm/TestKVMAutoRestartWindow.java

Comment on lines +152 to +153
@GlobalConfigValidation
public static GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW = new GlobalConfig(CATEGORY, "kvmagent.autorestart.window");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

空值语义与当前注解默认校验存在冲突。

这里使用裸 @GlobalConfigValidation 会沿用默认 notNull=true/notEmpty=true,与“空值表示不限时段”的需求冲突;用户也无法把已配置窗口恢复为“始终允许”。

💡 建议修复
-    `@GlobalConfigValidation`
-    public static GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW = new GlobalConfig(CATEGORY, "kvmagent.autorestart.window");
+    `@GlobalConfigValidation`(notNull = false, notEmpty = false)
+    `@GlobalConfigDef`(
+            defaultValue = "",
+            description = "daily local-time window for kvmagent auto restart, format HH:MM-HH:MM; empty means always allowed"
+    )
+    public static GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW = new GlobalConfig(CATEGORY, "kvmagent.autorestart.window");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java` around lines
152 - 153, The GlobalConfig KVMAGENT_AUTO_RESTART_WINDOW currently uses the bare
`@GlobalConfigValidation` which enforces notNull=true/notEmpty=true and thus
prevents using null/empty to mean "no time window"; update the annotation on
KVMAGENT_AUTO_RESTART_WINDOW to explicitly allow null/empty (e.g., set
notNull=false and notEmpty=false) so users can clear the configured window to
mean "always allowed", and ensure any downstream validation logic that reads
KVMAGENT_AUTO_RESTART_WINDOW honors null/empty as "no restriction".

Comment on lines +479 to +490
static boolean isNowInAutoRestartWindow(String configValue, LocalTime now) {
if (configValue == null || configValue.trim().isEmpty()) {
return true;
}
String[] parts = configValue.trim().split("-");
LocalTime start = LocalTime.parse(parts[0]);
LocalTime end = LocalTime.parse(parts[1]);
if (start.isBefore(end)) {
return !now.isBefore(start) && now.isBefore(end);
}
return !now.isBefore(start) || now.isBefore(end);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== 检查窗口解析实现是否存在无兜底解析 =="
rg -n -A20 -B5 'static boolean isNowInAutoRestartWindow|LocalTime\.parse|split\("-"\)' plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java

echo
echo "== 检查测试是否覆盖非法配置输入场景 =="
rg -n 'inWindow\(|assert(True|False)|Exception' plugin/kvm/src/test/java/org/zstack/kvm/TestKVMAutoRestartWindow.java
rg -n '25:00|aa:bb|invalid|malformed|02:00-02:00|02:00-' plugin/kvm/src/test/java/org/zstack/kvm/TestKVMAutoRestartWindow.java || true

Repository: MatheMatrix/zstack

Length of output: 5207


添加错误处理防止脏配置导致告警流程中断。

当前代码直接解析时间字符串而无任何错误处理;虽然全局配置验证器可阻止新的非法值写入数据库,但无法防护:

  • 验证器添加前已存在的历史脏数据
  • 绕过 API 直接修改数据库的数据
  • 迁移脚本或手工编辑引入的非法数据

若这些脏数据在生产环境触发 LocalTime.parse(),会抛出运行时异常并中断告警处理流程,影响稳定性。建议在解析失败时记录 warn 并按窗口外处理(返回 false)。

建议修复
     static boolean isNowInAutoRestartWindow(String configValue, LocalTime now) {
         if (configValue == null || configValue.trim().isEmpty()) {
             return true;
         }
-        String[] parts = configValue.trim().split("-");
-        LocalTime start = LocalTime.parse(parts[0]);
-        LocalTime end = LocalTime.parse(parts[1]);
+        String[] parts = configValue.trim().split("-");
+        if (parts.length != 2) {
+            logger.warn(String.format("invalid auto-restart window config[%s], treat as out-of-window", configValue));
+            return false;
+        }
+
+        final LocalTime start;
+        final LocalTime end;
+        try {
+            start = LocalTime.parse(parts[0].trim());
+            end = LocalTime.parse(parts[1].trim());
+        } catch (Exception e) {
+            logger.warn(String.format("invalid auto-restart window config[%s], treat as out-of-window", configValue), e);
+            return false;
+        }
+
         if (start.isBefore(end)) {
             return !now.isBefore(start) && now.isBefore(end);
         }
         return !now.isBefore(start) || now.isBefore(end);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java` around lines 479
- 490, The method isNowInAutoRestartWindow currently calls LocalTime.parse on
configValue parts without validation, which can throw and break alert flows for
dirty config; update isNowInAutoRestartWindow to validate parts.length == 2 and
wrap parsing in a try-catch that catches
DateTimeParseException/RuntimeException, emit a logger.warn including the
offending configValue and exception, and return false (treat as outside window)
on any parse/validation failure so malformed values do not throw.

Empty default disabled the time gate by accident; ship a sane
maintenance window so auto-restart only fires during low-traffic
hours out of the box. Operators who want 24/7 restart can clear
the value.

Resolves: ZSTAC-84618

Change-Id: I807bc32502780525015a8beffdb06ca6dbe7792f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants