fix(vmcp): detect pod volume drift via hash annotation by syf2211 · Pull Request #5643 · stacklok/toolhive

syf2211 · 2026-06-25T16:40:15Z

Summary

Add hash-annotation-based drift detection for operator-managed pod volumes and volumeMounts in VirtualMCPServer reconciler, mirroring the existing imagePullSecretsNeedsUpdate pattern.

Motivation

Fixes #5619. deploymentNeedsUpdate compared container args/env and config checksums but never volumes or volumeMounts. Changes that only affect volumes (e.g. rotating an auth-server signing key secret ref from keys-v1 to keys-v2) were silently missed — the operator reported no drift and pods kept stale mounts.

Changes

Add buildPodVolumesForVmcp shared by the deploy builder and drift check
Add podVolumesHash with canonical source fingerprinting (secret/configmap names + items; ignores K8s-defaulted fields like defaultMode)
Store hash in podVolumesHashAnnotation on Deployment metadata
Add podVolumesNeedsUpdate to the deploymentNeedsUpdate chain
Regression tests for hash helper and auth-server signing key secret ref rotation

Tests

go test ./cmd/thv-operator/controllers/ -run 'TestPodVolumesHash|TestDeploymentForVirtualMCPServer_PodVolumes|TestBuildDeploymentMetadataForVmcp|TestVirtualMCPServerDeploymentNeedsUpdate' -count=1
go test ./cmd/thv-operator/controllers/ -count=1

All tests pass.

Notes

PodTemplateSpec volume edits remain covered by the existing podTemplateSpecNeedsUpdate raw-JSON hash
Secret data rotation with the same ref is out of scope (same as env secretKeyRef behavior)
Existing Deployments without the annotation will get one rollout on upgrade (expected)

Mirror the imagePullSecrets hash-annotation pattern so changes to operator-managed volumes and volumeMounts (auth-server signing keys, CA-bundle refs, telemetry CA bundles) trigger a Deployment rollout. Adds buildPodVolumesForVmcp shared by the deploy builder and drift check, podVolumesHash with canonical source fingerprinting that ignores K8s-defaulted fields like defaultMode, and regression tests.

codecov · 2026-06-25T20:45:11Z

Codecov Report

❌ Patch coverage is 67.02128% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.25%. Comparing base (74d3523) to head (e641710).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
...perator/controllers/virtualmcpserver_deployment.go	72.72%	16 Missing and 5 partials ⚠️
...perator/controllers/virtualmcpserver_controller.go	41.17%	6 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5643      +/-   ##
==========================================
- Coverage   70.33%   70.25%   -0.08%     
==========================================
  Files         648      648              
  Lines       66011    66089      +78     
==========================================
+ Hits        46430    46433       +3     
- Misses      16225    16306      +81     
+ Partials     3356     3350       -6

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ChrisJBurns

Multi-Agent Consensus Review

Agents consulted: Kubernetes operator patterns, Go code quality, test coverage, general code quality

Consensus Summary

#	Finding	Consensus	Severity	Action
F1	Silent error discard in `buildDeploymentMetadataForVmcp`	8/10	MEDIUM	Fix
F2	Double computation of `buildPodVolumesForVmcp` per deployment build	9/10	MEDIUM	Fix
F3	API calls inside drift-detection chain (`podVolumesNeedsUpdate`)	8/10	MEDIUM	Discuss
F4	Annotation removal blocked by `MergeAnnotations` — latent update loop	7/10	MEDIUM	Discuss
F5	"Expected update" test cases trigger drift for the wrong reason	8/10	LOW	Fix
F6	`volumeSourceFingerprint` default branch misses backing-ref changes	7/10	LOW	Discuss
F7	`EmptyDir` fingerprint ignores `Medium` and `SizeLimit`	7/10	LOW	Discuss

Overall

The design is correct and well-chosen for this codebase. The hash-annotation pattern mirrors imagePullRefsHashAnnotation exactly, the extraction of buildPodVolumesForVmcp as a shared builder used by both the deploy path and the drift check is exactly the right factoring, and the end-to-end regression test covers the primary failure mode from #5619.

F1 and F2 are the most impactful items before merge. The buildDeploymentMetadataForVmcp now calls buildPodVolumesForVmcp (which can make real API calls via listMCPServerEntriesAsMap) without logging errors when it fails — this violates the operator rule that advisory enrichment "may log-and-continue with a comment saying why". The fix is a one-liner. F2 is a structural issue: volumes are computed twice per deployment build (once directly in deploymentForVirtualMCPServer, once inside buildDeploymentMetadataForVmcp) introducing a redundant API call and a narrow TOCTOU window where the annotation hash could diverge from the volumes actually written to the spec. Computing volumes once and passing the result (or pre-computed hash) into buildDeploymentMetadataForVmcp resolves both.

F3 and F4 are pattern-level observations worth tracking as follow-up issues rather than blocking this PR — they reflect structural constraints that pre-exist this change and are replicated faithfully. F5 is a low-effort test hygiene fix: the three "expected update" table cases still use make(map[string]string) for Annotations, so podVolumesNeedsUpdate fires before the test's intended drift check is reached; updating them to expectedDeploymentAnnotations as a base and mutating only the declared field isolates each case correctly.

Generated with Claude Code

ChrisJBurns · 2026-06-26T13:53:51Z

+	// refs, CA-bundle refs, telemetry CA bundles) without comparing live
+	// PodSpec.Volumes, which trips on K8s-defaulted fields (see #5619).
+	if volumeMounts, volumes, err := r.buildPodVolumesForVmcp(ctx, vmcp, telemetryCfg, typedWorkloads); err == nil {
+		if hash, err := podVolumesHash(volumeMounts, volumes); err == nil && hash != "" {


[F1 — MEDIUM] Silent error discard in buildDeploymentMetadataForVmcp (Consensus: 8/10)

Both errors (buildPodVolumesForVmcp and podVolumesHash) are silently discarded with no log call. Unlike imagePullSecretsHash (pure, cannot fail transiently), buildPodVolumesForVmcp makes real API calls (listMCPServerEntriesAsMap) that can fail transiently. When they do, the annotation is never set, and podVolumesNeedsUpdate returns true on every reconcile until the annotation is written — a quiet update loop for the duration of the outage. The operator rule allows advisory enrichment to skip but requires a log call with a comment explaining why skipping is safe.

Suggested change

if hash, err := podVolumesHash(volumeMounts, volumes); err == nil && hash != "" {

if volumeMounts, volumes, err := r.buildPodVolumesForVmcp(ctx, vmcp, telemetryCfg, typedWorkloads); err != nil {

// Advisory enrichment — skipping is safe: podVolumesNeedsUpdate will return true

// until the annotation is set, causing at most one extra rollout per transient error.

log.FromContext(ctx).Error(err, "Failed to compute pod volumes hash annotation, skipping")

} else if hash, err := podVolumesHash(volumeMounts, volumes); err != nil {

log.FromContext(ctx).Error(err, "Failed to hash pod volumes, skipping annotation")

} else if hash != "" {

deploymentAnnotations[podVolumesHashAnnotation] = hash

}

Raised by: Go code quality agent, General code quality agent

ChrisJBurns · 2026-06-26T13:53:51Z

@@ -147,7 +155,7 @@ func (r *VirtualMCPServerReconciler) deploymentForVirtualMCPServer(

 	// Build deployment components using helper functions
 	args := r.buildContainerArgsForVmcp(vmcp)


[F2 — MEDIUM] Double computation of buildPodVolumesForVmcp per deployment build (Consensus: 9/10)

buildPodVolumesForVmcp is called here (line 157) to populate the pod spec, and then again inside buildDeploymentMetadataForVmcp (called a few lines below) to compute the hash annotation. Since buildPodVolumesForVmcp calls buildCABundleVolumesForEntries → listMCPServerEntriesAsMap (a Kubernetes API List), every deployment build now makes that API call twice with identical inputs. There's also a narrow TOCTOU window: if an MCPServerEntry list changes between the two calls, the annotation hash won't match the volumes actually written to the spec.

Suggested approach: compute volumes once here, pass the pre-computed slices (or just the pre-computed hash) into buildDeploymentMetadataForVmcp so both uses come from the same result.

Raised by: Kubernetes operator patterns agent, Go code quality agent, General code quality agent

ChrisJBurns · 2026-06-26T13:53:51Z

+		return true
+	}
+
+	volumeMounts, volumes, err := r.buildPodVolumesForVmcp(ctx, vmcp, telemetryCfg, typedWorkloads)


[F3 — MEDIUM] API calls inside the drift-detection chain (Consensus: 8/10)

Every other function in the deploymentNeedsUpdate chain (containerNeedsUpdate, imagePullSecretsNeedsUpdate, podTemplateSpecNeedsUpdate) is a pure function of already-fetched state. podVolumesNeedsUpdate calls buildPodVolumesForVmcp here, which transitively makes API calls (listMCPServerEntriesAsMap, potentially GetOIDCConfigForServer). This breaks the "read once, compute, converge" pattern from the operator rules and adds per-reconcile API load proportional to OIDC and MCPServerEntry configuration.

The data is already available from earlier reconcile steps — typedWorkloads and telemetryCfg already flow through. The remaining hidden call is the OIDC config Get inside buildVolumesForVmcp. Pre-fetching it in the main reconcile loop (as is done for telemetryCfg) and threading it as a parameter would make this function pure. Consider tracking as a follow-up issue.

Raised by: Kubernetes operator patterns agent, General code quality agent

ChrisJBurns · 2026-06-26T13:53:51Z

+
+	_, present := deployment.Annotations[podVolumesHashAnnotation]
+	if expectedHash == "" {
+		return present


[F4 — MEDIUM] Annotation removal blocked by MergeAnnotations — latent update loop (Consensus: 7/10)

When expectedHash == "" (no volumes), this returns present — signalling drift if the annotation exists on the live deployment. However, the update path uses MergeAnnotations(newDeployment.Annotations, deployment.Annotations) where deployment.Annotations acts as an override (takes precedence for keys not in newDeployment.Annotations). If newDeployment.Annotations omits the key (because the hash is empty), the live annotation is preserved through the merge and never removed — so drift is detected again on the next reconcile. Same latent bug exists for imagePullRefsHashAnnotation.

In practice this is unreachable today because buildVolumesForVmcp always adds the vmcp-config volume, so the hash is never empty. But it's worth a defensive comment here and a fix in the update path to explicitly delete the key when the new deployment omits it. Consider tracking alongside the existing imagePullRefsHashAnnotation issue.

Raised by: Kubernetes operator patterns agent

ChrisJBurns · 2026-06-26T13:53:51Z

 				ObjectMeta: metav1.ObjectMeta{
 					Labels:      labelsForVirtualMCPServer(vmcp.Name),
-					Annotations: make(map[string]string),
+					Annotations: expectedDeploymentAnnotations,


[F5 — LOW] "Expected update" test cases trigger drift for the wrong reason (Consensus: 8/10)

This line correctly fixes the "no changes — no update needed" case to use expectedDeploymentAnnotations. However, the other expectedUpdate: true table cases ("deployment metadata changed", "pod template metadata changed", "container changed") still use Annotations: make(map[string]string) — which lacks podVolumesHashAnnotation. With the new drift check wired in, podVolumesNeedsUpdate fires first and returns true before the test's declared drift cause is even reached. Tests still pass (since expectedUpdate: true), but each test is now proving the wrong thing.

Update the sibling expectedUpdate: true cases to also use expectedDeploymentAnnotations as the base, then mutate only the field the test declares (e.g., wrong image, wrong labels). This ensures each case isolates and exercises exactly the drift check it claims to test.

Raised by: Test coverage agent, Kubernetes operator patterns agent, General code quality agent

ChrisJBurns · 2026-06-26T13:53:51Z

+		return src
+	case vol.EmptyDir != nil:
+		return "emptydir"
+	default:


[F6 — LOW] volumeSourceFingerprint default branch misses backing-ref changes for unhandled volume types (Consensus: 7/10)

The default returns "unknown:" + vol.Name, capturing identity (name) but not the backing resource reference. If a new volume type is added to buildPodVolumesForVmcp without a corresponding case here — e.g. a Projected volume combining multiple secrets — a change to that volume's backing refs won't change the fingerprint and drift will be silently missed. Given this file's history with silent drift bugs (#5616, #5619), consider adding a defensive comment:

Suggested change

default:

default:

// WARNING: This volume type is not fingerprinted by backing resource ref.

// Adding a new volume source to buildPodVolumesForVmcp MUST add a corresponding

// case here; this fallback detects the volume by name only and will not detect

// backing-ref changes (e.g. rotating a secret name).

return "unknown:" + vol.Name

Raised by: Kubernetes operator patterns agent, General code quality agent

ChrisJBurns · 2026-06-26T13:53:51Z

+			src += ":" + strings.Join(items, ",")
+		}
+		return src
+	case vol.EmptyDir != nil:


[F7 — LOW] EmptyDir fingerprint ignores Medium and SizeLimit (Consensus: 7/10)

Two EmptyDir volumes with different Medium (e.g. StorageMediumMemory vs default) or SizeLimit would produce identical fingerprints if they share a name. Since volume names are unique per pod, the name in the hash payload still distinguishes different EmptyDirs — but a change to an existing volume's Medium or SizeLimit (without a name change) would go undetected. This is inconsistent with the specificity applied to ConfigMap and Secret fingerprints.

Suggested change

case vol.EmptyDir != nil:

case vol.EmptyDir != nil:

sizeStr := "0"

if vol.EmptyDir.SizeLimit != nil {

sizeStr = vol.EmptyDir.SizeLimit.String()

}

return fmt.Sprintf("emptydir:medium=%s:limit=%s", vol.EmptyDir.Medium, sizeStr)

Raised by: Go code quality agent, General code quality agent

syf2211 requested review from ChrisJBurns, JAORMX, blkt, jerm-dro, jhrozek, rdimitrov, reyortiz3 and tgrunnagle as code owners June 25, 2026 16:40

github-actions Bot added the size/M Medium PR: 300-599 lines changed label Jun 25, 2026

ChrisJBurns approved these changes Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(vmcp): detect pod volume drift via hash annotation#5643

fix(vmcp): detect pod volume drift via hash annotation#5643
syf2211 wants to merge 1 commit into
stacklok:mainfrom
syf2211:fix/5619-volume-drift-detection

syf2211 commented Jun 25, 2026

Uh oh!

codecov Bot commented Jun 25, 2026

Uh oh!

ChrisJBurns left a comment

Uh oh!

ChrisJBurns Jun 26, 2026

Uh oh!

ChrisJBurns Jun 26, 2026

Uh oh!

ChrisJBurns Jun 26, 2026

Uh oh!

ChrisJBurns Jun 26, 2026

Uh oh!

ChrisJBurns Jun 26, 2026

Uh oh!

ChrisJBurns Jun 26, 2026

Uh oh!

ChrisJBurns Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-		if hash, err := podVolumesHash(volumeMounts, volumes); err == nil && hash != "" {
+	if volumeMounts, volumes, err := r.buildPodVolumesForVmcp(ctx, vmcp, telemetryCfg, typedWorkloads); err != nil {
+		// Advisory enrichment — skipping is safe: podVolumesNeedsUpdate will return true
+		// until the annotation is set, causing at most one extra rollout per transient error.
+		log.FromContext(ctx).Error(err, "Failed to compute pod volumes hash annotation, skipping")
+	} else if hash, err := podVolumesHash(volumeMounts, volumes); err != nil {
+		log.FromContext(ctx).Error(err, "Failed to hash pod volumes, skipping annotation")
+	} else if hash != "" {
+		deploymentAnnotations[podVolumesHashAnnotation] = hash
+	}

		@@ -147,7 +155,7 @@ func (r *VirtualMCPServerReconciler) deploymentForVirtualMCPServer(

		// Build deployment components using helper functions
		args := r.buildContainerArgsForVmcp(vmcp)

-	default:
+	default:
+		// WARNING: This volume type is not fingerprinted by backing resource ref.
+		// Adding a new volume source to buildPodVolumesForVmcp MUST add a corresponding
+		// case here; this fallback detects the volume by name only and will not detect
+		// backing-ref changes (e.g. rotating a secret name).
+		return "unknown:" + vol.Name

-	case vol.EmptyDir != nil:
+	case vol.EmptyDir != nil:
+		sizeStr := "0"
+		if vol.EmptyDir.SizeLimit != nil {
+			sizeStr = vol.EmptyDir.SizeLimit.String()
+		}
+		return fmt.Sprintf("emptydir:medium=%s:limit=%s", vol.EmptyDir.Medium, sizeStr)

Uh oh!

Conversation

syf2211 commented Jun 25, 2026

Summary

Motivation

Changes

Tests

Notes

Uh oh!

codecov Bot commented Jun 25, 2026

Codecov Report

Uh oh!

ChrisJBurns left a comment

Choose a reason for hiding this comment

Multi-Agent Consensus Review

Consensus Summary

Overall

Uh oh!

ChrisJBurns Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants