experimental/ssh: surface connect failures instead of hanging #5456
Draft
anton-107 wants to merge 1 commit into
Improve diagnostics when `databricks ssh connect` fails.

- Surface bootstrap job-run errors: when the SSH server bootstrap job reaches a terminal/failed state, fetch the run's state message, notebook error/trace, and run-page URL and show them, instead of the generic "server metadata error / metadata.json doesn't exist".
- Guard against hangs when the server is up but the handshake never completes (e.g. the container image has no OpenSSH server, so the server can't launch /usr/sbin/sshd and holds the websocket open). The client now aborts after a handshake timeout with an actionable hint, and exits promptly when the server closes the connection, instead of hanging until ssh's ConnectTimeout.
- Add an openssh-server hint when ssh exits with its connection-failure code (255).

Tests cover the failed-run message formatting, the fast exit on server close, and the handshake timeout.

WIP: the missing-sshd path still incurs a handshake-timeout wait; a server-side pre-flight sshd check (tracked separately) would turn it into an immediate, clear job failure.

Co-authored-by: Isaac
Commit: c61de58
27 interesting tests: 15 SKIP, 6 flaky, 6 RECOVERED
Why
Originated from a customer case: `databricks ssh connect` to a dedicated cluster whose Docker container image was missing an OpenSSH server (`/usr/sbin/sshd`). The failure surfaced terribly: either a generic `server metadata error / metadata.json doesn't exist`, or the client just hung (the local `ssh` waited on its 360s `ConnectTimeout`). The root cause was buried in the cluster's job-run logs.

This PR improves the diagnostics for `ssh connect` failures.

What
1. Surface bootstrap job-run errors. When the SSH server bootstrap job reaches a terminal/failed state, fetch the run's state message, notebook error/trace, and run-page URL and show them, both when the task terminates before reaching RUNNING and when it dies afterwards, during metadata polling. (`experimental/ssh/internal/client/client.go`)
2. Guard against hangs when the server is up but the handshake never completes. If the container image has no `sshd`, the server can't launch `/usr/sbin/sshd` on connect and holds the websocket open, so both proxy loops block forever. The client now runs the proxy loops in the background and aborts after a handshake timeout (no server response) with an actionable hint, and also exits promptly when the server does close the connection. (`experimental/ssh/internal/proxy/client.go`)
3. Add an openssh-server hint when `ssh` exits with its connection-failure code (255). (`spawnSSHClient`)

Tests
- `client_internal_test.go`: failed-run message formatting (state message + trace + run URL), truncation, terminal-state detection (SDK mocks).
- `proxy/client_server_test.go`: fast exit when the server closes the connection; abort on the handshake timeout when the server sends nothing.

All `experimental/ssh/...` tests pass; lint clean.

Status / follow-ups (WIP)
- The missing-`sshd` path still incurs a ~30s handshake-timeout wait before failing. The cleaner fix is a server-side pre-flight `sshd` check (fail the bootstrap job immediately with a clear message), tracked separately; that would turn this case into an instant, clear job failure handled by improvement 1 (surfacing bootstrap job-run errors).
- The handshake timeout could be shortened or made configurable.
This pull request and its description were written by Isaac.