feat: node-exporter into vhd build#7704

Open
chmill-zz wants to merge 8 commits into main from
nodeexportershift

@chmill-zz
Contributor

What this PR does / why we need it:

This PR adds node-exporter to the VHD build by default. At the end of the VHD build we run systemctl disable on the service so that it gets a fresh start during CSE, which lets the node-exporter-startup script run and gather the node-specific details it needs.

Which issue(s) this PR fixes:

Fixes #

@chmill-zz chmill-zz changed the title from "Nodeexportershift" to "feat: node-exporter into vhd build" Jan 22, 2026
@github-actions github-actions bot added the "components" label (This pull request updates cached components on Linux or Windows VHDs) Jan 22, 2026
@chmill-zz chmill-zz force-pushed the nodeexportershift branch 2 times, most recently from 68a2704 to d494735 Compare January 23, 2026 21:59
@@ -0,0 +1,5 @@
tls_server_config:
  cert_file: "/etc/kubernetes/certs/kubeletserver.crt"
Contributor

These paths aren't necessarily correct: they depend on whether kubelet serving certificate rotation is enabled. When it's disabled, these paths are correct. When it's enabled, both cert_file and key_file should point to /var/lib/kubelet/pki/kubelet-server-current.pem.

Contributor Author

Interesting. This is copy-pasted from the aks-vm-extension repo and matches what is on my node today. I don't see anywhere that it's modified; it's just a static file.

root@aks-sys-41317600-vmss000000:/etc/node-exporter.d# cat web-config.yml
tls_server_config:
  cert_file: "/etc/kubernetes/certs/kubeletserver.crt"
  key_file: "/etc/kubernetes/certs/kubeletserver.key"
  client_auth_type: "RequireAndVerifyClientCert"
  client_ca_file: "/etc/kubernetes/certs/ca.crt"

I think we could address this in node-exporter-startup.sh: check whether /var/lib/kubelet/pki/kubelet-server-current.pem exists and, if it does, use it. If not, use /etc/kubernetes/certs/kubeletserver.crt.

Contributor

@cameronmeissner cameronmeissner Jan 27, 2026

That's tricky: kubelet-server-current.pem won't exist until kubelet requests it from the control plane, which happens after CSE exits. We can, though, check IMDS during CSE to see whether the feature itself will be enabled (like we currently do in configureKubeletServing).

Contributor Author

Alrighty. Adjusted the service to wait for kubelet. The startup script now checks the IMDS cache file first; if that's not around, it calls IMDS itself. As a last resort, it checks whether /var/lib/kubelet/pki/kubelet-server-current.pem exists, in case everything else said no for some reason.
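
The decision order discussed in this thread could be sketched roughly as below. This is not the code in the PR: the function name select_kubelet_serving_certs is hypothetical, and how the rotation flag is derived (IMDS cache file first, then a live IMDS call) is assumed to happen before the call. The sketch only covers picking the cert and key paths once that answer is known.

```shell
#!/usr/bin/env bash
# Sketch only (hypothetical helper, not taken from the PR).
# $1: "true" if kubelet serving cert rotation is enabled, anything else otherwise.
#     Deriving this from the IMDS cache file or a live IMDS call is assumed
#     to have happened already.
# $2-$4: optional path overrides; the defaults are the paths named in the thread.
# Prints "cert_path:key_path" on success, returns 1 if no usable certs are found.
select_kubelet_serving_certs() {
    local rotation_enabled="$1"
    local rotated_pem="${2:-/var/lib/kubelet/pki/kubelet-server-current.pem}"
    local static_crt="${3:-/etc/kubernetes/certs/kubeletserver.crt}"
    local static_key="${4:-/etc/kubernetes/certs/kubeletserver.key}"

    if [ "$rotation_enabled" = "true" ]; then
        # Rotation on: cert and key share one PEM file, which may not exist
        # yet if kubelet has not requested it from the control plane.
        echo "${rotated_pem}:${rotated_pem}"
    elif [ -f "$rotated_pem" ]; then
        # Last resort: IMDS said no, but the rotated cert is on disk anyway.
        echo "${rotated_pem}:${rotated_pem}"
    elif [ -f "$static_crt" ] && [ -f "$static_key" ]; then
        echo "${static_crt}:${static_key}"
    else
        return 1
    fi
}
```

A caller could split the result with `IFS=':' read -r CERT_FILE KEY_FILE <<< "$(select_kubelet_serving_certs "$rotation_enabled")"`, where rotation_enabled is whatever the IMDS check produced.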

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 90 changed files in this pull request and generated 5 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 90 changed files in this pull request and generated 3 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 90 changed files in this pull request and generated 5 comments.

Comment on lines +113 to +117
    version_info=$(node_exporter_extract_package_version "${package_json}" "ubuntu" "current")
fi

IFS=':' read -r NODE_EXPORTER_VERSION NODE_EXPORTER_REVISION NODE_EXPORTER_UBUNTU_VERSION <<< "${version_info}"

Copilot AI Feb 18, 2026

If jq/sed parsing ever fails (unexpected version format, missing JSON path, etc.), version_info can be empty and the subsequent IFS split will set NODE_EXPORTER_VERSION/REVISION to empty strings, leading to invalid download URLs/paths. Add a sanity check after computing version_info / after the split to fail fast with a clear error when parsing didn’t produce the expected fields.
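
A fail-fast check of the kind suggested here might look like the following. It is a sketch under assumptions: the wrapper name parse_node_exporter_version_info is hypothetical, and it treats the third field as optional because the comment only calls out the version and revision as the values that break download URLs.

```shell
#!/usr/bin/env bash
# Sketch only (hypothetical wrapper, not taken from the PR).
# Splits "version:revision:ubuntu_version" and refuses to continue when the
# version or revision is missing, rather than building URLs from empty strings.
parse_node_exporter_version_info() {
    local version_info="$1"
    local version revision ubuntu_version

    IFS=':' read -r version revision ubuntu_version <<< "${version_info}"

    if [ -z "${version}" ] || [ -z "${revision}" ]; then
        echo "ERROR: could not parse node-exporter version info: '${version_info}'" >&2
        return 1
    fi

    # Same variable names as the snippet under review.
    NODE_EXPORTER_VERSION="${version}"
    NODE_EXPORTER_REVISION="${revision}"
    NODE_EXPORTER_UBUNTU_VERSION="${ubuntu_version}"
}
```

In the install script this would be called as `parse_node_exporter_version_info "${version_info}" || exit 1`, with the exit code left to whatever convention the surrounding script already uses.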

Comment on lines +340 to +342
# Skip for OSGuard, Flatcar, Kata, and Mariner (only AzureLinux 3.0 gets node-exporter)
if ! { isAzureLinuxOSGuard "$OS" "$OS_VARIANT" || isFlatcar "$OS" || grep -q "kata" <<< "$FEATURE_FLAGS" || isMariner "$OS"; }; then
    cpAndMode $NODE_EXPORTER_STARTUP_SRC $NODE_EXPORTER_STARTUP_DEST 755
Copilot AI Feb 18, 2026

The inline comment says "only AzureLinux 3.0 gets node-exporter", but this block also runs on Ubuntu (it only skips OSGuard/Flatcar/Kata/Mariner). Please update the comment to match the actual install/copy behavior so future readers don't assume Ubuntu is excluded.

Comment on lines +1496 to +1499
# Skip check for OS variants that don't have node-exporter, but verify the skip file is NOT present
# Mariner/CBLMariner is skipped - only AzureLinux 3.0 gets node-exporter
if [ "$os_sku" = "AzureLinuxOSGuard" ] || [ "$os_sku" = "Flatcar" ] || [ "$os_sku" = "CBLMariner" ] || echo "$FEATURE_FLAGS" | grep -q "kata"; then
    local skip_file_check="/etc/node-exporter.d/skip_vhd_node_exporter"
Copilot AI Feb 18, 2026

The comment says Mariner/CBLMariner is skipped and "only AzureLinux 3.0 gets node-exporter", but this test (and the VHD build scripts) also expect node-exporter on Ubuntu. Please correct the comment to reflect the actual supported OSes to avoid confusion when adjusting the skip logic later.

Comment on lines +480 to +483
# Skip for Flatcar, OSGuard, Kata, and Mariner (we only build AzureLinuxV3 now, mariner entry removed from components.json)
if isFlatcar "$OS" || isAzureLinuxOSGuard "$OS" "$OS_VARIANT" || [ "${IS_KATA}" = "true" ] || [ "$OS" = "$MARINER_OS_NAME" ]; then
    echo "Skipping node-exporter installation for ${OS} ${OS_VARIANT:-default} (IS_KATA=${IS_KATA})"
else
Copilot AI Feb 18, 2026

This comment implies that node-exporter is only for AzureLinuxV3 and that the Mariner entry was removed, but the "node-exporter" component is also defined for Ubuntu in components.json and is installed on Ubuntu builds. Please update the comment so it doesn't contradict the actual behavior.

Comment on lines +78 to +109
    echo "WARNING: No kubelet serving certs found after ${WAIT_TIMEOUT}s, node-exporter will run without TLS. Restart the service after certs are available to enable TLS."
fi

# Configure TLS if we found valid cert paths
if [ -n "$CERT_FILE" ] && [ -n "$KEY_FILE" ]; then
    cat > "$TLS_CONFIG_PATH" <<EOF
tls_server_config:
  cert_file: "$CERT_FILE"
  key_file: "$KEY_FILE"
  client_auth_type: "RequireAndVerifyClientCert"
  client_ca_file: "/etc/kubernetes/certs/ca.crt"
EOF
    TLS_CONFIG_ARG="--web.config.file=${TLS_CONFIG_PATH}"
fi

ARGS=(
    --web.listen-address="${NODE_IP}:19100"
    --no-collector.wifi
    --no-collector.hwmon
    --collector.cpu.info
    --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|run/containerd/.+|var/lib/docker/.+|var/lib/kubelet/.+)($|/)"
    --collector.netclass.ignored-devices="^(azv.*|veth.*|[a-f0-9]{15})$"
    --collector.netclass.netlink
    --collector.netdev.device-exclude="^(azv.*|veth.*|[a-f0-9]{15})$"
    --no-collector.arp.netlink
)

if [ -n "$TLS_CONFIG_ARG" ]; then
    ARGS+=("$TLS_CONFIG_ARG")
fi

exec /opt/bin/node-exporter "${ARGS[@]}"
Copilot AI Feb 18, 2026

node-exporter-startup.sh falls back to starting node-exporter without any TLS or client authentication if kubelet serving certs are not found within the wait timeout. In that case, node-exporter listens on ${NODE_IP}:19100 with plaintext HTTP, exposing detailed node metrics to any client that can reach that IP/port (e.g., pods or VNet peers), which enables host reconnaissance and information disclosure. To avoid this, consider failing or delaying service startup until TLS certs are available (or binding only to localhost in the fallback) so that node-exporter is never exposed unauthenticated on the network.
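
One of the options mentioned above, binding only to localhost when no TLS config was written, could be sketched like this. The helper name node_exporter_listen_address is hypothetical and not part of the PR; port 19100 and the TLS_CONFIG_ARG variable are taken from the snippet under review.

```shell
#!/usr/bin/env bash
# Sketch only (hypothetical helper, not taken from the PR).
# $1: node IP the exporter would normally bind to.
# $2: the --web.config.file argument, empty when no kubelet serving certs were found.
# Prints the address to pass to --web.listen-address.
node_exporter_listen_address() {
    local node_ip="$1"
    local tls_config_arg="$2"

    if [ -n "$tls_config_arg" ]; then
        # TLS and client cert auth are configured: safe to bind to the node IP.
        echo "${node_ip}:19100"
    else
        # No certs: keep metrics reachable from the node itself only, so pods
        # and VNet peers cannot scrape them over plaintext HTTP.
        echo "127.0.0.1:19100"
    fi
}
```

The ARGS entry would then become `--web.listen-address="$(node_exporter_listen_address "$NODE_IP" "$TLS_CONFIG_ARG")"`. The other option raised in the comment, exiting non-zero so the service retries once certs exist, avoids serving plaintext at all but depends on the unit's restart settings, which are not shown in this excerpt.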

Labels

components This pull request updates cached components on Linux or Windows VHDs

3 participants