Skip to content

[SPARK-57615][CONNECT][TESTS] Wait for the Connect server before creating a ResourceProfile in test_profile_before_sc_for_connect#56676

Closed
HyukjinKwon wants to merge 2 commits into
apache:masterfrom
HyukjinKwon:ci-fix/connect-rp
Closed

[SPARK-57615][CONNECT][TESTS] Wait for the Connect server before creating a ResourceProfile in test_profile_before_sc_for_connect#56676
HyukjinKwon wants to merge 2 commits into
apache:masterfrom
HyukjinKwon:ci-fix/connect-rp

Conversation

@HyukjinKwon

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

test_profile_before_sc_for_connect creates a ResourceProfile over Spark Connect immediately
after SparkSession.builder.remote(...).getOrCreate(). This PR makes the test wait for the Connect
server to be ready before doing so, using the existing pyspark.testing.eventually helper to retry a
trivial job until it succeeds:

from pyspark.testing.utils import eventually

def _server_ready() -> bool:
    spark.range(1).count()
    return True

eventually(timeout=120, expected_exceptions=(Exception,))(_server_ready)()
rp.id

Why are the changes needed?

The scheduled "Build / Python-only, Connect-only (Python 3.11)" build runs this test in its
Run tests (local-cluster) step, where the server is started with
start-connect-server.sh --master "local-cluster[2, 4, 1024]". That script returns before the
local-cluster SparkContext is fully initialized, so the first command(s) issued against it can
fail server-side. test_connect_resources is the first test in that step, so it races server
startup and fails intermittently (~60% of runs), observed as a bare java.lang.AssertionError on
rp.id, or SparkConnectGrpcException: Application error processing RPC on the first job. When the
cluster happens to be ready, the test passes (~22-77s). Waiting for readiness first makes it
deterministic.

This is a test-only stabilization. The underlying server behavior (an internal error leaking on a
very-early command before the context is ready) is a separate, deeper robustness concern and is not
addressed here.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the scheduled workflow (build_python_connect.yml) on a fork. The Connect-only build is green
end-to-end, including the previously-flaky local-cluster step, on two consecutive runs:

The default build_and_test on this branch is also green:
https://github.com/HyukjinKwon/spark/actions/runs/27973120689

Note: the Connect-only build's Run tests (local) step also requires the import fix in #56644
(SPARK-57598); the validation runs above were performed on a branch carrying both changes so the
local-cluster step is reached. This PR contains only the test_connect_resources.py change.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

…ting a ResourceProfile in test_profile_before_sc_for_connect

test_profile_before_sc_for_connect creates a ResourceProfile over Spark Connect right after
SparkSession.builder.remote(...).getOrCreate(). In the Connect-only local-cluster build,
start-connect-server.sh returns before the local-cluster SparkContext is fully initialized, so the
first commands can fail server-side (observed as a bare java.lang.AssertionError or an
'Application error processing RPC' on the first job). The test is the first to run in that step and
fast-fails about 60% of the time.

Wait for the server to be ready (a trivial job succeeds via testing.eventually) before creating the
ResourceProfile.

Co-authored-by: Isaac
spark.range(1).count()
return True

eventually(timeout=120, expected_exceptions=(Exception,))(_server_ready)()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we have a more explicit exception to catch, it might be better than having Exception here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - narrowed it to PySparkException (the base class of the SparkConnect* exceptions the not-yet-ready server raises) instead of bare Exception.

I kept it at the PySparkException level rather than a specific SparkConnect* subclass for two reasons: (1) importing the connect-specific exception at module top-level would require grpc and break test collection on grpc-less builds, and (2) the not-ready server can surface the race as a few different SparkConnect* subclasses, so narrowing further risks not retrying the very error we're guarding against. Also moved the eventually import up to the module-level imports.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this is more fore debugging yet to identify why it is hanging

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh so the dumped stack trace is not enough? pystack worked right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It runs a separate server so the output better has to be thrown to the console which should be easier. Anyway, this fix is both for debugging and actual fixing :-). Should be fine

…on to PySparkException and hoist eventually import

- Move 'from pyspark.testing.utils import eventually' to the module-level imports.
- Catch PySparkException (the base of the SparkConnect* exceptions the not-yet-ready
  server raises) instead of bare Exception. Kept at the PySparkException level rather
  than a specific SparkConnect* subclass to avoid a grpc-dependent top-level import
  (which would break test collection on grpc-less builds) and to not over-narrow the
  retry condition.
@HyukjinKwon

Copy link
Copy Markdown
Member Author

Merged to master and branch-4.x.

HyukjinKwon added a commit that referenced this pull request Jun 23, 2026
…ing a ResourceProfile in test_profile_before_sc_for_connect

### What changes were proposed in this pull request?

`test_profile_before_sc_for_connect` creates a `ResourceProfile` over Spark Connect immediately
after `SparkSession.builder.remote(...).getOrCreate()`. This PR makes the test wait for the Connect
server to be ready before doing so, using the existing `pyspark.testing.eventually` helper to retry a
trivial job until it succeeds:

```python
from pyspark.testing.utils import eventually

def _server_ready() -> bool:
    spark.range(1).count()
    return True

eventually(timeout=120, expected_exceptions=(Exception,))(_server_ready)()
rp.id
```

### Why are the changes needed?

The scheduled "Build / Python-only, Connect-only (Python 3.11)" build runs this test in its
`Run tests (local-cluster)` step, where the server is started with
`start-connect-server.sh --master "local-cluster[2, 4, 1024]"`. That script returns before the
local-cluster `SparkContext` is fully initialized, so the first command(s) issued against it can
fail server-side. `test_connect_resources` is the first test in that step, so it races server
startup and fails intermittently (~60% of runs), observed as a bare `java.lang.AssertionError` on
`rp.id`, or `SparkConnectGrpcException: Application error processing RPC` on the first job. When the
cluster happens to be ready, the test passes (~22-77s). Waiting for readiness first makes it
deterministic.

This is a test-only stabilization. The underlying server behavior (an internal error leaking on a
very-early command before the context is ready) is a separate, deeper robustness concern and is not
addressed here.

### Does this PR introduce _any_ user-facing change?

No. Test-only change.

### How was this patch tested?

Ran the scheduled workflow (`build_python_connect.yml`) on a fork. The Connect-only build is green
end-to-end, including the previously-flaky `local-cluster` step, on two consecutive runs:

- https://github.com/HyukjinKwon/spark/actions/runs/27969109059 (attempt 1 and attempt 2: both
  `Run tests (local)` and `Run tests (local-cluster)` green)

The default `build_and_test` on this branch is also green:
https://github.com/HyukjinKwon/spark/actions/runs/27973120689

Note: the Connect-only build's `Run tests (local)` step also requires the import fix in #56644
(SPARK-57598); the validation runs above were performed on a branch carrying both changes so the
`local-cluster` step is reached. This PR contains only the `test_connect_resources.py` change.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

Closes #56676 from HyukjinKwon/ci-fix/connect-rp.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit 3a5d616)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants