
Fix KubernetesJobTrigger hang for parallelism > completions case (#64867)#65058

Open
holmuk wants to merge 1 commit into apache:main from holmuk:bugfix/kubernetes-job-task-competition

Conversation

@holmuk
Contributor

@holmuk holmuk commented Apr 11, 2026

Closes #64867


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Cursor

This PR resolves the hanging Running state issue in KubernetesJobOperator / KubernetesJobTrigger for deferrable=True / do_xcom_push=True.

Problem description

The trigger waits for container completion for every pod name from a precomputed snapshot (pod_names) before checking the final Job status. That snapshot is built from pod discovery tied to parallelism, not to actual successful completions.

Example (parallelism=2, completions=1):

  • Airflow creates a Job
  • Kubernetes starts 2 pods
  • One pod succeeds
  • Job becomes Complete (completions=1 reached)
  • The second pod may never reach the expected terminal state
  • KubernetesJobTrigger keeps waiting on the second pod and does not reach Job-status evaluation, so the task can remain Running/Deferred forever.

Proposed fix: Task completion should be driven by Job terminal status (Complete / Failed), which already reflects completions:

  • Make Job status the primary completion condition.
  • Collect XCom/logs only as best-effort from pods that actually finished and are still readable.
  • Do not block task finalization on missing/non-terminal pods from the initial snapshot.

What does this PR do?

Updates logic for KubernetesJobTrigger:

  • The waiting flow is now job-first: completion is driven by final Job status, not by requiring all pods from the initial snapshot to finish.
  • XCom collection is now best-effort: results are collected only from pods that are available and successfully processed.
  • 404 for missing/deleted pods is handled as skip instead of failing the trigger.
  • The previous unbounded pod-first waits were removed: container waits are now bounded and periodically re-check whether the Job has already completed.
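The bounded, job-aware container wait described above might look like the following sketch (names such as `wait_container` and `job_done` are assumptions; the `asyncio.shield` keeps one underlying wait task alive across polling ticks instead of recreating it):

```python
import asyncio

async def wait_container_or_job_done(wait_container, job_done, poll_interval=5.0):
    """Wait for the container, but re-check Job status every poll_interval.

    Hypothetical sketch: a single wait task is created once and shielded,
    so each timeout only bounds one polling tick rather than restarting
    the underlying wait.
    """
    task = asyncio.ensure_future(wait_container())
    try:
        while True:
            if await job_done():
                return "job_done"      # Job is terminal: stop waiting on the pod
            try:
                await asyncio.wait_for(asyncio.shield(task), timeout=poll_interval)
                return "container_done"
            except asyncio.TimeoutError:
                continue               # bounded tick expired: re-check the Job
    finally:
        if not task.done():
            task.cancel()
```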

Regression tests for #64867

  • Trigger regression tests (triggers/test_job.py)

    • test_run_completes_when_job_is_done_even_if_some_snapshot_pods_never_complete: verifies that the trigger does not hang when a pod from the initial snapshot never reaches a terminal state after the Job is already complete.

    • test_run_skips_deleted_snapshot_pod_and_completes_when_job_is_done: verifies that the trigger handles stale snapshot pods gracefully by skipping 404 Not Found pods and still finishing successfully with the available XCom results.

    • test_run_collects_later_pod_xcom_best_effort_after_job_done: verifies post-completion best-effort behavior: once the Job is already complete, the trigger continues processing remaining snapshot pods, skips per-pod extraction failures, and still returns XCom from pods that can be read.

  • Operator regression test (operators/test_job.py)

    • test_execute_complete_supports_partial_xcom_results: verifies that execute_complete correctly handles partial xcom_result payloads (fewer XCom entries than the initial pod snapshot), which is expected in parallelism > completions scenarios.

Additional tests for new code

  • test_wait_until_container_state_or_job_done_does_not_restart_wait_task: Copilot noted that a naive implementation of the waiting loop may misbehave on slow clusters because the wait coroutine is constantly recreated and retried. The test validates that wait_method is not recreated on every polling tick on a slow cluster.

Behavior change

  • Task finalization is now Job-driven.
  • xcom_result may be partial (fewer entries than initial pod_names) and this is expected.
  • Missing pods (404) do not fail task completion.

Risks

  • With very small poll_interval values, the new bounded wait loop may generate extra timeout/cancel/retry iterations while waiting for pod container states. This does not fail the task by itself (it is expected retry behavior), but it can increase polling overhead and log noise until the Job reaches a terminal state.
  • Best-effort post-job XCom no longer fails the task on per-pod extraction errors (e.g. RBAC/network). To keep this observable, the trigger now emits warnings and a summary with counters (succeeded, skipped_missing, timed_out, failed_other).
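The best-effort collection with the summary counters could be sketched like this (the `ApiException` stand-in and the `extract_xcom` callable are assumptions standing in for the Kubernetes client and the trigger's per-pod extraction logic):

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException in this sketch."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def collect_xcom_best_effort(pod_names, extract_xcom, log=print):
    """Collect XCom from whichever snapshot pods are still readable.

    Per-pod failures are counted rather than failing the task; 404 means
    the pod is gone (expected for parallelism > completions) and is skipped.
    """
    results = []
    summary = {"succeeded": 0, "skipped_missing": 0, "timed_out": 0, "failed_other": 0}
    for name in pod_names:
        try:
            results.append(extract_xcom(name))
            summary["succeeded"] += 1
        except ApiException as exc:
            if exc.status == 404:
                summary["skipped_missing"] += 1
            else:
                summary["failed_other"] += 1
                log(f"best-effort XCom extraction failed for pod {name}: {exc}")
        except TimeoutError:
            summary["timed_out"] += 1
            log(f"best-effort XCom extraction timed out for pod {name}")
    return results, summary
```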

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.


Copilot AI left a comment


Pull request overview

This PR fixes a deferrable KubernetesJobOperator / KubernetesJobTrigger hang when Kubernetes parallelism > completions by making trigger completion primarily driven by the Job’s terminal state (Complete/Failed) rather than waiting for every pod from an initial “snapshot” to reach a terminal state. It also adds regression tests to cover the reported scenario (#64867).

Changes:

  • Reworks KubernetesJobTrigger.run() to wait for Job completion concurrently and collect XCom from pods on a best-effort basis (skipping missing/deleted pods).
  • Adds regression tests to ensure the trigger doesn’t hang when some snapshot pods never complete or are deleted.
  • Adds an operator regression test verifying execute_complete tolerates partial XCom results.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/job.py: Changes trigger control flow to be job-first and makes XCom extraction best-effort without blocking task finalization.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_job.py: Adds async regression tests for parallelism > completions pod snapshot edge cases and updates the job polling assertion.
  • providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py: Adds a regression test ensuring execute_complete handles partial XCom payload lists.

@holmuk force-pushed the bugfix/kubernetes-job-task-competition branch from 475be15 to 55ab20b on April 11, 2026 at 18:36
Contributor

@jscheffl jscheffl left a comment


Looks good to me but I am not really an expert with K8s Jobs, so I have a hard time judging details of the fix. Looking for a second maintainer review.


Labels

area:providers, provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)


Development

Successfully merging this pull request may close these issues.

KubernetesJobOperator task stuck in Running state when parallelism > completions

3 participants