Draft: minimal stuck GCSFuse pod retry prototype #285

Closed
Volv-G wants to merge 1 commit into master from piforge/stuck-gcsfuse-pod-retry-prototype

Volv-G commented Jun 24, 2026

Collaborator

Summary

Draft PR to gauge the team's opinion on a minimal stuck-pod retry prototype. This is intentionally narrow and not ready to merge without team review/feedback.

This prototype:

  • Adds a pure classifier for a specific standalone Kubernetes Pod setup failure observed in Oasis.
  • Retries by deleting the stuck standalone Pod and creating one replacement Pod from a sanitized copy of the current Pod template.
  • Persists minimal retry metadata in serialized launcher data under launcher_data["kubernetes"]["pod_retry"].
  • Lets the existing refresh/orchestrator path persist the replacement Pod identity and continue treating the execution as pending/running.
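As a rough illustration of the "sanitized copy" and retry-metadata steps above, here is a minimal sketch that works on plain dict manifests. The function names, the list of stripped fields, and the keys inside `pod_retry` (`attempts`, `previous_pod_names`, `current_pod_name`) are assumptions for illustration only; the PR states just that metadata lives under `launcher_data["kubernetes"]["pod_retry"]`.

```python
import copy

# Assumed set of server-populated metadata fields to drop before re-creating
# the Pod. The PR does not enumerate them; this list is illustrative.
_SERVER_SET_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "managedFields",
    "selfLink",
    "generation",
    "deletionTimestamp",
    "deletionGracePeriodSeconds",
)


def sanitize_pod_for_recreate(pod: dict, new_name: str) -> dict:
    """Return a copy of `pod` that can be submitted as a new Pod (hypothetical helper)."""
    replacement = copy.deepcopy(pod)
    metadata = replacement.setdefault("metadata", {})
    for key in _SERVER_SET_METADATA:
        metadata.pop(key, None)
    metadata["name"] = new_name
    # Status belongs to the old Pod and is never accepted on create.
    replacement.pop("status", None)
    # Assumption: drop the node binding so the scheduler is free to place the
    # replacement on a different node.
    replacement.get("spec", {}).pop("nodeName", None)
    return replacement


def record_retry(launcher_data: dict, old_name: str, new_name: str) -> dict:
    """Return updated launcher data with hypothetical retry bookkeeping."""
    updated = copy.deepcopy(launcher_data)
    retry = updated.setdefault("kubernetes", {}).setdefault("pod_retry", {})
    retry["attempts"] = retry.get("attempts", 0) + 1
    retry["previous_pod_names"] = retry.get("previous_pod_names", []) + [old_name]
    retry["current_pod_name"] = new_name
    return updated
```

Both helpers return new objects instead of mutating their inputs, which keeps them easy to unit test without a cluster.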

Safety constraints / scope

  • Standalone Pods only: the classifier rejects any Pod with ownerReferences, so Job retry behavior is unchanged.
  • Narrow allowlist only:
    • Pod phase Pending
    • main container waiting reason CreateContainerConfigError
    • waiting message contains both failed to prepare subPath and gcsfuse
    • Pod age is at least 5 minutes
    • main container has not started, has not restarted, and has no previous running/terminated state
  • Max retry attempts defaults to 1.
  • No retry for user-code Failed/Succeeded states.
  • After retry exhaustion, the same classified stuck state maps to ERROR, so the existing orchestrator stops leaving the execution pending forever.
  • Logs caveat: deleting the stuck Pod may make live Pod logs unavailable unless cluster logging already collected them; the prototype logs a warning before deletion.
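The allowlist above can be sketched as a pure function over a Pod manifest dict. This is not the PR's code: the function names, the main container name `"main"`, and the `"none"`/`"retry"`/`"error"` return values are hypothetical, and the manifest is assumed to use the Kubernetes JSON field names (`containerStatuses`, `lastState`, `restartCount`).

```python
from datetime import datetime, timedelta, timezone

MIN_STUCK_AGE = timedelta(minutes=5)  # "Pod age is at least 5 minutes"
MAX_RETRY_ATTEMPTS = 1  # "Max retry attempts defaults to 1"


def is_stuck_gcsfuse_pod(pod: dict, now: datetime, main_container: str = "main") -> bool:
    """Return True only if every allowlist condition holds (hypothetical sketch)."""
    metadata = pod.get("metadata", {})
    status = pod.get("status", {})
    if metadata.get("ownerReferences"):
        return False  # standalone Pods only; Job-owned Pods are left alone
    if status.get("phase") != "Pending":
        return False
    created = metadata.get("creationTimestamp")
    if not created:
        return False
    created_at = datetime.fromisoformat(created.replace("Z", "+00:00"))
    if now - created_at < MIN_STUCK_AGE:
        return False
    for cs in status.get("containerStatuses") or []:
        if cs.get("name") != main_container:
            continue
        waiting = (cs.get("state") or {}).get("waiting") or {}
        message = waiting.get("message") or ""
        return (
            waiting.get("reason") == "CreateContainerConfigError"
            and "failed to prepare subPath" in message
            and "gcsfuse" in message
            and not cs.get("started")
            and cs.get("restartCount", 0) == 0
            and not cs.get("lastState")  # no previous running/terminated state
        )
    return False  # main container status not reported yet


def next_action(pod: dict, attempts: int, now: datetime) -> str:
    """Map the classification and attempt count to 'none', 'retry', or 'error'."""
    if not is_stuck_gcsfuse_pod(pod, now):
        return "none"
    return "retry" if attempts < MAX_RETRY_ATTEMPTS else "error"
```

Passing `now` in as an argument (for example `datetime.now(timezone.utc)`) keeps the classifier free of clock and cluster access, which is what makes it testable as a pure function.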

Validation

cd backend
PYTHONPATH=. uv run --frozen pytest tests/test_kubernetes_pod_retry.py -q
PYTHONPATH=. uv run --frozen black --check --target-version py310 cloud_pipelines_backend/launchers/kubernetes_launchers.py tests/test_kubernetes_pod_retry.py

Results:

  • 4 passed
  • black check passed

Notes

This PR is opened against TangleML/tangle because the actual changed code lives in the nested backend repository/submodule used by Shopify/oasis-backend.

Assisted-By: devx/ba25adb1-db34-47af-bfef-ecf43bc6627a

Volv-G closed this Jun 25, 2026