
Fix KubernetesJobTrigger hang for parallelism > completions case (#64867)#65058

Open
holmuk wants to merge 1 commit into apache:main from holmuk:bugfix/kubernetes-job-task-competition

Conversation

@holmuk
Contributor

@holmuk holmuk commented Apr 11, 2026

Closes #64867


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Cursor

This PR resolves the hanging Running state issue in KubernetesJobOperator / KubernetesJobTrigger for deferrable=True / do_xcom_push=True.

Problem description

The trigger waits for container completion for every pod name from a precomputed snapshot (pod_names) before checking the final Job status. That snapshot is built from pod discovery tied to parallelism, not to actual successful completions.

Example (parallelism=2, completions=1):

  • Airflow creates a Job
  • Kubernetes starts 2 pods
  • One pod succeeds
  • Job becomes Complete (completions=1 reached)
  • The second pod may never reach the expected terminal state
  • KubernetesJobTrigger keeps waiting on the second pod and does not reach Job-status evaluation, so the task can remain Running/Deferred forever.

Proposed fix: Task completion should be driven by Job terminal status (Complete / Failed), which already reflects completions:

  • Make Job status the primary completion condition.
  • Collect XCom/logs only as best-effort from pods that actually finished and are still readable.
  • Do not block task finalization on missing/non-terminal pods from the initial snapshot.

What does this PR do?

Updates logic for KubernetesJobTrigger:

  • The waiting flow is now job-first: completion is driven by final Job status, not by requiring all pods from the initial snapshot to finish.
  • XCom collection is now best-effort: results are collected only from pods that are available and successfully processed.
  • 404 for missing/deleted pods is handled as skip instead of failing the trigger.
  • The previous unbounded pod-first waits were removed: container waits are now bounded and periodically re-check whether the Job has already completed.
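The bounded, job-aware container wait described above might look like the following sketch (names such as `wait_container` and `job_done` are assumptions; the `asyncio.shield` keeps one underlying wait task alive across polling ticks instead of recreating it):

```python
import asyncio

async def wait_container_or_job_done(wait_container, job_done, poll_interval=5.0):
    """Wait for the container, but re-check Job status every poll_interval.

    Hypothetical sketch: a single wait task is created once and shielded,
    so each timeout only bounds one polling tick rather than restarting
    the underlying wait.
    """
    task = asyncio.ensure_future(wait_container())
    try:
        while True:
            if await job_done():
                return "job_done"      # Job is terminal: stop waiting on the pod
            try:
                await asyncio.wait_for(asyncio.shield(task), timeout=poll_interval)
                return "container_done"
            except asyncio.TimeoutError:
                continue               # bounded tick expired: re-check the Job
    finally:
        if not task.done():
            task.cancel()
```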

Regression tests for #64867

  • Trigger regression tests (triggers/test_job.py)

    • test_run_completes_when_job_is_done_even_if_some_snapshot_pods_never_complete: verifies that the trigger does not hang when a pod from the initial snapshot never reaches a terminal state after the Job is already complete.

    • test_run_skips_deleted_snapshot_pod_and_completes_when_job_is_done: verifies that the trigger handles stale snapshot pods gracefully by skipping 404 Not Found pods and still finishing successfully with the available XCom results.

    • test_run_collects_later_pod_xcom_best_effort_after_job_done: verifies post-completion best-effort behavior: once the Job is already complete, the trigger continues processing remaining snapshot pods, skips per-pod extraction failures, and still returns XCom from pods that can be read.

  • Operator regression test (operators/test_job.py)

    • test_execute_complete_supports_partial_xcom_results: verifies that execute_complete correctly handles partial xcom_result payloads (fewer XCom entries than the initial pod snapshot), which is expected in parallelism > completions scenarios.

Additional tests for new code

  • test_wait_until_container_state_or_job_done_does_not_restart_wait_task: Copilot noted that a naive implementation of the waiting loop may misbehave on slow clusters because the wait coroutine is constantly recreated and retried. The test validates that wait_method is not recreated on every polling tick on a slow cluster.

Behavior change

  • Task finalization is now Job-driven.
  • xcom_result may be partial (fewer entries than initial pod_names) and this is expected.
  • Missing pods (404) do not fail task completion.

Risks

  • With very small poll_interval values, the new bounded wait loop may generate extra timeout/cancel/retry iterations while waiting for pod container states. This does not fail the task by itself (it is expected retry behavior), but it can increase polling overhead and log noise until the Job reaches a terminal state.
  • Best-effort post-job XCom no longer fails the task on per-pod extraction errors (e.g. RBAC/network). To keep this observable, the trigger now emits warnings and a summary with counters (succeeded, skipped_missing, timed_out, failed_other).
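The best-effort collection with the summary counters could be sketched like this (the `ApiException` stand-in and the `extract_xcom` callable are assumptions standing in for the Kubernetes client and the trigger's per-pod extraction logic):

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException in this sketch."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def collect_xcom_best_effort(pod_names, extract_xcom, log=print):
    """Collect XCom from whichever snapshot pods are still readable.

    Per-pod failures are counted rather than failing the task; 404 means
    the pod is gone (expected for parallelism > completions) and is skipped.
    """
    results = []
    summary = {"succeeded": 0, "skipped_missing": 0, "timed_out": 0, "failed_other": 0}
    for name in pod_names:
        try:
            results.append(extract_xcom(name))
            summary["succeeded"] += 1
        except ApiException as exc:
            if exc.status == 404:
                summary["skipped_missing"] += 1
            else:
                summary["failed_other"] += 1
                log(f"best-effort XCom extraction failed for pod {name}: {exc}")
        except TimeoutError:
            summary["timed_out"] += 1
            log(f"best-effort XCom extraction timed out for pod {name}")
    return results, summary
```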

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.


Copilot AI left a comment


Pull request overview

This PR fixes a deferrable KubernetesJobOperator / KubernetesJobTrigger hang when Kubernetes parallelism > completions by making trigger completion primarily driven by the Job’s terminal state (Complete/Failed) rather than waiting for every pod from an initial “snapshot” to reach a terminal state. It also adds regression tests to cover the reported scenario (#64867).

Changes:

  • Reworks KubernetesJobTrigger.run() to wait for Job completion concurrently and collect XCom from pods on a best-effort basis (skipping missing/deleted pods).
  • Adds regression tests to ensure the trigger doesn’t hang when some snapshot pods never complete or are deleted.
  • Adds an operator regression test verifying execute_complete tolerates partial XCom results.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/job.py: Changes trigger control flow to be job-first and makes XCom extraction best-effort without blocking task finalization.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_job.py: Adds async regression tests for parallelism > completions pod snapshot edge cases and updates the job polling assertion.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py: Adds a regression test ensuring execute_complete handles partial XCom payload lists.

@holmuk force-pushed the bugfix/kubernetes-job-task-competition branch from 475be15 to 55ab20b on April 11, 2026 at 18:36
Contributor

@jscheffl jscheffl left a comment


Looks good to me but I am not really an expert with K8s Jobs, so I have a hard time judging details of the fix. Looking for a second maintainer review.


Labels

area:providers, provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)


Development

Successfully merging this pull request may close these issues.

KubernetesJobOperator task stuck in Running state when parallelism > completions

3 participants