[SPARK-57615][CONNECT][TESTS] Wait for the Connect server before creating a ResourceProfile in test_profile_before_sc_for_connect#56676
Conversation
…ting a ResourceProfile in test_profile_before_sc_for_connect test_profile_before_sc_for_connect creates a ResourceProfile over Spark Connect right after SparkSession.builder.remote(...).getOrCreate(). In the Connect-only local-cluster build, start-connect-server.sh returns before the local-cluster SparkContext is fully initialized, so the first commands can fail server-side (observed as a bare java.lang.AssertionError or an 'Application error processing RPC' on the first job). The test is the first to run in that step and fast-fails about 60% of the time. Wait for the server to be ready (a trivial job succeeds via testing.eventually) before creating the ResourceProfile. Co-authored-by: Isaac
| spark.range(1).count() | ||
| return True | ||
|
|
||
| eventually(timeout=120, expected_exceptions=(Exception,))(_server_ready)() |
There was a problem hiding this comment.
nit: if we have a more explicit exception to catch, it might be better than having Exception here.
There was a problem hiding this comment.
Good point - narrowed it to PySparkException (the base class of the SparkConnect* exceptions the not-yet-ready server raises) instead of bare Exception.
I kept it at the PySparkException level rather than a specific SparkConnect* subclass for two reasons: (1) importing the connect-specific exception at module top-level would require grpc and break test collection on grpc-less builds, and (2) the not-ready server can surface the race as a few different SparkConnect* subclasses, so narrowing further risks not retrying the very error we're guarding against. Also moved the eventually import up to the module-level imports.
There was a problem hiding this comment.
BTW, this is more fore debugging yet to identify why it is hanging
There was a problem hiding this comment.
Oh so the dumped stack trace is not enough? pystack worked right?
There was a problem hiding this comment.
It runs a separate server so the output better has to be thrown to the console which should be easier. Anyway, this fix is both for debugging and actual fixing :-). Should be fine
…on to PySparkException and hoist eventually import - Move 'from pyspark.testing.utils import eventually' to the module-level imports. - Catch PySparkException (the base of the SparkConnect* exceptions the not-yet-ready server raises) instead of bare Exception. Kept at the PySparkException level rather than a specific SparkConnect* subclass to avoid a grpc-dependent top-level import (which would break test collection on grpc-less builds) and to not over-narrow the retry condition.
6e123b6 to
890546b
Compare
|
Merged to master and branch-4.x. |
…ing a ResourceProfile in test_profile_before_sc_for_connect
### What changes were proposed in this pull request?
`test_profile_before_sc_for_connect` creates a `ResourceProfile` over Spark Connect immediately
after `SparkSession.builder.remote(...).getOrCreate()`. This PR makes the test wait for the Connect
server to be ready before doing so, using the existing `pyspark.testing.eventually` helper to retry a
trivial job until it succeeds:
```python
from pyspark.testing.utils import eventually
def _server_ready() -> bool:
spark.range(1).count()
return True
eventually(timeout=120, expected_exceptions=(Exception,))(_server_ready)()
rp.id
```
### Why are the changes needed?
The scheduled "Build / Python-only, Connect-only (Python 3.11)" build runs this test in its
`Run tests (local-cluster)` step, where the server is started with
`start-connect-server.sh --master "local-cluster[2, 4, 1024]"`. That script returns before the
local-cluster `SparkContext` is fully initialized, so the first command(s) issued against it can
fail server-side. `test_connect_resources` is the first test in that step, so it races server
startup and fails intermittently (~60% of runs), observed as a bare `java.lang.AssertionError` on
`rp.id`, or `SparkConnectGrpcException: Application error processing RPC` on the first job. When the
cluster happens to be ready, the test passes (~22-77s). Waiting for readiness first makes it
deterministic.
This is a test-only stabilization. The underlying server behavior (an internal error leaking on a
very-early command before the context is ready) is a separate, deeper robustness concern and is not
addressed here.
### Does this PR introduce _any_ user-facing change?
No. Test-only change.
### How was this patch tested?
Ran the scheduled workflow (`build_python_connect.yml`) on a fork. The Connect-only build is green
end-to-end, including the previously-flaky `local-cluster` step, on two consecutive runs:
- https://github.com/HyukjinKwon/spark/actions/runs/27969109059 (attempt 1 and attempt 2: both
`Run tests (local)` and `Run tests (local-cluster)` green)
The default `build_and_test` on this branch is also green:
https://github.com/HyukjinKwon/spark/actions/runs/27973120689
Note: the Connect-only build's `Run tests (local)` step also requires the import fix in #56644
(SPARK-57598); the validation runs above were performed on a branch carrying both changes so the
`local-cluster` step is reached. This PR contains only the `test_connect_resources.py` change.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)
Closes #56676 from HyukjinKwon/ci-fix/connect-rp.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit 3a5d616)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
What changes were proposed in this pull request?
test_profile_before_sc_for_connectcreates aResourceProfileover Spark Connect immediatelyafter
SparkSession.builder.remote(...).getOrCreate(). This PR makes the test wait for the Connectserver to be ready before doing so, using the existing
pyspark.testing.eventuallyhelper to retry atrivial job until it succeeds:
Why are the changes needed?
The scheduled "Build / Python-only, Connect-only (Python 3.11)" build runs this test in its
Run tests (local-cluster)step, where the server is started withstart-connect-server.sh --master "local-cluster[2, 4, 1024]". That script returns before thelocal-cluster
SparkContextis fully initialized, so the first command(s) issued against it canfail server-side.
test_connect_resourcesis the first test in that step, so it races serverstartup and fails intermittently (~60% of runs), observed as a bare
java.lang.AssertionErroronrp.id, orSparkConnectGrpcException: Application error processing RPCon the first job. When thecluster happens to be ready, the test passes (~22-77s). Waiting for readiness first makes it
deterministic.
This is a test-only stabilization. The underlying server behavior (an internal error leaking on a
very-early command before the context is ready) is a separate, deeper robustness concern and is not
addressed here.
Does this PR introduce any user-facing change?
No. Test-only change.
How was this patch tested?
Ran the scheduled workflow (
build_python_connect.yml) on a fork. The Connect-only build is greenend-to-end, including the previously-flaky
local-clusterstep, on two consecutive runs:Run tests (local)andRun tests (local-cluster)green)The default
build_and_teston this branch is also green:https://github.com/HyukjinKwon/spark/actions/runs/27973120689
Note: the Connect-only build's
Run tests (local)step also requires the import fix in #56644(SPARK-57598); the validation runs above were performed on a branch carrying both changes so the
local-clusterstep is reached. This PR contains only thetest_connect_resources.pychange.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)