[SPARK-57615][CONNECT][TESTS] Wait for the Connect server before creating a ResourceProfile in test_profile_before_sc_for_connect by HyukjinKwon · Pull Request #56676 · apache/spark

HyukjinKwon · 2026-06-22T19:07:01Z

What changes were proposed in this pull request?

test_profile_before_sc_for_connect creates a ResourceProfile over Spark Connect immediately
after SparkSession.builder.remote(...).getOrCreate(). This PR makes the test wait for the Connect
server to be ready before doing so, using the existing pyspark.testing.eventually helper to retry a
trivial job until it succeeds:

from pyspark.testing.utils import eventually

def _server_ready() -> bool:
    spark.range(1).count()
    return True

eventually(timeout=120, expected_exceptions=(Exception,))(_server_ready)()
rp.id
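The retry-until-ready idea above can be sketched without Spark. This is a minimal, standard-library-only illustration: `retry_until_ready` and `FakeServer` are hypothetical names invented here, and the code is not the implementation of `pyspark.testing.utils.eventually`, whose exact behavior may differ.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_until_ready(
    timeout: float,
    expected_exceptions: Tuple[Type[BaseException], ...],
    interval: float = 0.1,
) -> Callable[[Callable[[], T]], Callable[[], T]]:
    """Hypothetical stand-in for an eventually-style helper (not PySpark's code)."""

    def decorator(probe: Callable[[], T]) -> Callable[[], T]:
        def wrapper() -> T:
            deadline = time.monotonic() + timeout
            while True:
                try:
                    return probe()
                except expected_exceptions:
                    # Re-raise once the deadline has passed; otherwise wait and retry.
                    if time.monotonic() >= deadline:
                        raise
                    time.sleep(interval)

        return wrapper

    return decorator


class FakeServer:
    """Hypothetical server that rejects the first few commands it receives."""

    def __init__(self, failures_before_ready: int) -> None:
        self.remaining = failures_before_ready
        self.calls = 0

    def run_trivial_job(self) -> int:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("server not ready yet")
        return 1


server = FakeServer(failures_before_ready=2)
probe = retry_until_ready(
    timeout=5, expected_exceptions=(ConnectionError,), interval=0.01
)(lambda: server.run_trivial_job() == 1)
ready = probe()
```

In the test itself, the probe corresponds to `spark.range(1).count()`, and the `ResourceProfile` is only touched after the probe has succeeded once.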

Why are the changes needed?

The scheduled "Build / Python-only, Connect-only (Python 3.11)" build runs this test in its
Run tests (local-cluster) step, where the server is started with
start-connect-server.sh --master "local-cluster[2, 4, 1024]". That script returns before the
local-cluster SparkContext is fully initialized, so the first command(s) issued against it can
fail server-side. test_connect_resources is the first test in that step, so it races server
startup and fails intermittently (~60% of runs), observed as a bare java.lang.AssertionError on
rp.id, or SparkConnectGrpcException: Application error processing RPC on the first job. When the
cluster happens to be ready, the test passes (~22-77s). Waiting for readiness first makes it
deterministic.

This is a test-only stabilization. The underlying server behavior (an internal error leaking on a
very-early command before the context is ready) is a separate, deeper robustness concern and is not
addressed here.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the scheduled workflow (build_python_connect.yml) on a fork. The Connect-only build is green
end-to-end, including the previously-flaky local-cluster step, on two consecutive runs:

https://github.com/HyukjinKwon/spark/actions/runs/27969109059 (attempt 1 and attempt 2: both
Run tests (local) and Run tests (local-cluster) green)

The default build_and_test on this branch is also green:
https://github.com/HyukjinKwon/spark/actions/runs/27973120689

Note: the Connect-only build's Run tests (local) step also requires the import fix in #56644
(SPARK-57598); the validation runs above were performed on a branch carrying both changes so the
local-cluster step is reached. This PR contains only the test_connect_resources.py change.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

…ting a ResourceProfile in test_profile_before_sc_for_connect

test_profile_before_sc_for_connect creates a ResourceProfile over Spark Connect right after SparkSession.builder.remote(...).getOrCreate(). In the Connect-only local-cluster build, start-connect-server.sh returns before the local-cluster SparkContext is fully initialized, so the first commands can fail server-side (observed as a bare java.lang.AssertionError or an 'Application error processing RPC' on the first job). The test is the first to run in that step and fast-fails about 60% of the time.

Wait for the server to be ready (a trivial job succeeds via testing.eventually) before creating the ResourceProfile.

Co-authored-by: Isaac

gaogaotiantian · 2026-06-22T20:09:14Z

+            spark.range(1).count()
+            return True
+
+        eventually(timeout=120, expected_exceptions=(Exception,))(_server_ready)()


nit: if we have a more explicit exception to catch, it might be better than having Exception here.

Good point - narrowed it to PySparkException (the base class of the SparkConnect* exceptions the not-yet-ready server raises) instead of bare Exception.

I kept it at the PySparkException level rather than a specific SparkConnect* subclass for two reasons: (1) importing the connect-specific exception at module top-level would require grpc and break test collection on grpc-less builds, and (2) the not-ready server can surface the race as a few different SparkConnect* subclasses, so narrowing further risks not retrying the very error we're guarding against. Also moved the eventually import up to the module-level imports.
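The trade-off between catching a bare `Exception`, a shared base class, and a single subclass can be illustrated with a small sketch. All class and function names below are hypothetical stand-ins (with `ClientError` playing the role of `PySparkException`); nothing here is PySpark's actual exception hierarchy.

```python
from typing import List, Tuple, Type


class ClientError(Exception):
    """Hypothetical base class, playing the role of PySparkException."""


class RpcError(ClientError):
    """Hypothetical subclass for an RPC-level failure."""


class SessionError(ClientError):
    """Hypothetical subclass for a session-level failure."""


def attempts_until_success(
    failures: List[Exception],
    retry_on: Tuple[Type[Exception], ...],
) -> int:
    """Raise each queued failure in turn, retrying only those matching retry_on.

    Returns the number of attempts made, including the final successful one.
    Any failure that does not match retry_on propagates to the caller.
    """
    pending = list(failures)
    attempts = 0
    while True:
        attempts += 1
        try:
            if pending:
                raise pending.pop(0)
            return attempts
        except retry_on:
            continue


# Catching the base class retries both startup-style failures.
base_attempts = attempts_until_success(
    [RpcError("early RPC"), SessionError("early session")], (ClientError,)
)
```

Under these assumptions, retrying on the base class covers every subclass the not-yet-ready server might raise, while an unrelated error such as a `ValueError` from a genuine test bug still fails immediately instead of being retried until the timeout.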

BTW, this is more for debugging; we have yet to identify why it is hanging.

Oh, so the dumped stack trace is not enough? pystack worked, right?

It runs a separate server, so it is easier if the output is written to the console. Anyway, this fix is both for debugging and an actual fix :-). Should be fine.

…on to PySparkException and hoist eventually import

- Move 'from pyspark.testing.utils import eventually' to the module-level imports.
- Catch PySparkException (the base of the SparkConnect* exceptions the not-yet-ready server raises) instead of bare Exception. Kept at the PySparkException level rather than a specific SparkConnect* subclass to avoid a grpc-dependent top-level import (which would break test collection on grpc-less builds) and to not over-narrow the retry condition.

HyukjinKwon · 2026-06-23T01:32:56Z

Merged to master and branch-4.x.

…ing a ResourceProfile in test_profile_before_sc_for_connect

Closes #56676 from HyukjinKwon/ci-fix/connect-rp.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit 3a5d616)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>

gaogaotiantian reviewed Jun 22, 2026

gaogaotiantian approved these changes Jun 22, 2026

HyukjinKwon force-pushed the ci-fix/connect-rp branch from 6e123b6 to 890546b on June 23, 2026 00:03

HyukjinKwon closed this in 3a5d616 Jun 23, 2026

HyukjinKwon wants to merge 2 commits into apache:master from HyukjinKwon:ci-fix/connect-rp

2 participants
