Main reduce hp#1142
Conversation
Previously hugepages were only increased when below the required amount. Now the value is set whenever current != required, allowing reduction when the node has more hugepages allocated than needed. Kubelet restart in the storage init job is also triggered on any change, not just increases.
When sn restart is called with a new max_lvol, the sn_config_file on the
node was not updated and hugepages were not adjusted to match the new
requirement.
- Add update_node_config API endpoint (docker + k8s) that updates
max_lvol and huge_page_memory for the matching NUMA node in
sn_config_file, matched by socket + SSD PCI intersection
- Add update_node_config client method in SNodeClient
- In _restart_storage_node_impl, call update_node_config before
set_hugepages() only when max_lvol actually changed (lvol_changed flag
captures the delta before snode.max_lvol is overwritten)
- Fix hugepage drift bug in set_hugepages_if_needed: store the adjusted
kernel total (required) instead of raw hugepages_needed so that
user_delta = current - prev correctly reflects only manual changes
When a user adds or removes hugepages manually between deploys, the delta is captured correctly for the first restart but then silently lost on the second: prev_sb advances to the new required (which already absorbed the delta), so user_delta becomes 0 and the baseline wins � reverting to the pre-change count. Fix by writing the updated user reservation (baseline + delta) back to hugepages_baseline_nodeN whenever user_delta != 0. The baseline then reflects the user's intended permanent allocation, and each subsequent restart computes the correct total without needing a manual re-add.
/tmp is subject to periodic cleanup by systemd-tmpfiles-clean (default 10-day TTL), which could silently delete hugepages_sb_node* and hugepages_baseline_node* between deployments without a reboot, causing the baseline tracking to reset incorrectly. /var/run (/run) is a tmpfs that clears only on reboot � the correct behaviour for these files � and is never touched by tmpfiles cleanup. Updates the constant, docker volume mount, k8s hostPath mounts, and the init job shell script.
mxsrc
left a comment
There was a problem hiding this comment.
Largely this looks good. A few comments:
I'd like to use a consistent naming of numa_node. Instead of node or numa. "node" in this codebase commonly refers to storage nodes, I feel we should avoid any kind of ambiguity.
As for the general design, I think the hugepages config/state file(s) could be simpler represented in a json file, that would make things a bit more self-descriptive, allow to validate the values, and make the logic more readable. Not sure you have the capacity for that sort of change right now though.
Failures in storing the hugepage allocation are ignored, I feel they should be exposed to the caller for them to decide how to handle them.
What's the difference between the new update_node_config endpoint and the old apply_config? It seems like there is overlap, at least concerning the intent of both. Can they be merged?
| max_lvol: int | ||
| huge_page_memory: int | ||
| numa: int | ||
| ssd_list: list |
There was a problem hiding this comment.
This can be typed better I feel. We should at least use List[str] for the ssd_list. In other places of the API we use ssd_pcie, does this list also contain a list of PCIe addresses? Then I'd argue we should name this consistenly. For the other values we should at least use util.Unsigned as the values cannot be negative. It might also make sense to make the values optional to allow partial updates.
| SIMPLY_BLOCK_DOCKER_IMAGE=simplyblock/simplyblock:main | ||
| SIMPLY_BLOCK_SPDK_ULTRA_IMAGE=simplyblock/spdk:main-latest |
| @api.post('/update_node_config', responses={ | ||
| 200: {'content': {'application/json': {'schema': utils.response_schema({ | ||
| 'type': 'boolean' | ||
| })}}}, | ||
| }) | ||
| def update_node_config(body: UpdateNodeConfigParams): | ||
| node_info = core_utils.load_config(constants.NODES_CONFIG_FILE) | ||
| if not node_info.get("nodes"): | ||
| return utils.get_response(False, "Config not found") | ||
|
|
||
| ssd_set = set(body.ssd_list) | ||
| matched = False | ||
| for node_config in node_info["nodes"]: | ||
| if node_config["socket"] != body.numa: | ||
| continue | ||
| if not ssd_set.intersection(set(node_config.get("ssd_pcis", []))): | ||
| continue | ||
| node_config["max_lvol"] = body.max_lvol | ||
| node_config["huge_page_memory"] = body.huge_page_memory | ||
| matched = True | ||
| break | ||
|
|
||
| if not matched: | ||
| return utils.get_response(False, "No matching node found for given numa and ssd_list") | ||
|
|
||
| core_utils.store_config_file(node_info, constants.NODES_CONFIG_FILE) | ||
| return utils.get_response(True) | ||
|
|
There was a problem hiding this comment.
This duplicates the docker endpoint, we should define it once and reuse it. Check the bottom of the file for how that's done.
No description provided.