experimental/ssh: surface connect failures instead of hanging #5456

Draft
anton-107 wants to merge 1 commit into main from anton-107/ssh-surface-connect-errors


@anton-107
Contributor

Why

This originated from a customer case: databricks ssh connect was run against a dedicated
cluster whose Docker container image was missing an OpenSSH server (/usr/sbin/sshd). The
failure was surfaced poorly: either as a generic "server metadata error / metadata.json
doesn't exist" message, or as a client hang (the local ssh waited out its 360s
ConnectTimeout). The root cause was buried in the cluster's job-run logs.

This PR improves the diagnostics for ssh connect failures.

What

  1. Surface bootstrap job-run errors. When the SSH server bootstrap job reaches a
    terminal/failed state, fetch the run's state message, notebook error/trace, and run-page
    URL and show them. This applies both when the task terminates before reaching RUNNING
    and when it dies afterwards, during metadata polling.
    (experimental/ssh/internal/client/client.go)

  2. Guard against hangs when the server is up but the handshake never completes. If the
    container image has no sshd, the server can't launch /usr/sbin/sshd on connect and
    holds the websocket open, so both proxy loops block forever. The client now runs the
    proxy loops in the background and aborts after a handshake timeout (no server response)
    with an actionable hint, and also exits promptly when the server does close the
    connection. (experimental/ssh/internal/proxy/client.go)

  3. Add an openssh-server hint when ssh exits with its connection-failure code (255).
    (spawnSSHClient)

Tests

  • client_internal_test.go: failed-run message formatting (state message + trace + run URL),
    truncation, terminal-state detection (SDK mocks).
  • proxy/client_server_test.go: fast exit when the server closes the connection; abort on the
    handshake timeout when the server sends nothing.

All experimental/ssh/... tests pass; lint clean.

Status / follow-ups (WIP)

  • The missing-sshd path still incurs a ~30s handshake-timeout wait before failing. The
    cleaner fix is a server-side pre-flight sshd check (fail the bootstrap job immediately
    with a clear message), tracked separately; that would turn this case into an instant,
    clear job failure handled by improvement #1.
  • The handshake timeout (30s) is conservative and currently a package constant; could be
    shortened or made configurable.
  • The proxy error and the outer 255 hint are slightly redundant; may consolidate.

This pull request and its description were written by Isaac.

Improve diagnostics when `databricks ssh connect` fails.

- Surface bootstrap job-run errors: when the SSH server bootstrap job
  reaches a terminal/failed state, fetch the run's state message,
  notebook error/trace, and run-page URL and show them, instead of the
  generic "server metadata error / metadata.json doesn't exist".
- Guard against hangs when the server is up but the handshake never
  completes (e.g. the container image has no OpenSSH server, so the
  server can't launch /usr/sbin/sshd and holds the websocket open). The
  client now aborts after a handshake timeout with an actionable hint,
  and exits promptly when the server closes the connection, instead of
  hanging until ssh's ConnectTimeout.
- Add an openssh-server hint when ssh exits with its connection-failure
  code (255).

Tests cover the failed-run message formatting, the fast exit on server
close, and the handshake timeout.

WIP: the missing-sshd path still incurs a handshake-timeout wait; a
server-side pre-flight sshd check (tracked separately) would turn it
into an immediate, clear job failure.

Co-authored-by: Isaac
@anton-107 anton-107 temporarily deployed to test-trigger-is June 5, 2026 15:06 — with GitHub Actions Inactive
@eng-dev-ecosystem-bot
Collaborator

Commit: c61de58

Run: 27022851886

| Env | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|
| 💚 aws linux | | 7 | 15 | 261 | 923 | 6:57 |
| 🔄 aws windows | 3 | 6 | 15 | 261 | 921 | 9:57 |
| 🔄 aws-ucws linux | 2 | 7 | 15 | 355 | 837 | 8:04 |
| 💚 aws-ucws windows | | 7 | 15 | 359 | 835 | 9:11 |
| 💚 azure linux | | 1 | 17 | 264 | 921 | 5:50 |
| 💚 azure windows | | 1 | 17 | 266 | 919 | 8:16 |
| 💚 azure-ucws linux | | 1 | 17 | 362 | 833 | 8:07 |
| 💚 azure-ucws windows | | 1 | 17 | 364 | 831 | 9:19 |
| 💚 gcp linux | | 1 | 17 | 260 | 924 | 6:47 |
| 🔄 gcp windows | 2 | 1 | 17 | 260 | 922 | 13:53 |

27 interesting tests: 15 SKIP, 6 flaky, 6 RECOVERED
| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🔄 TestAccept | 💚 R | 🔄 f | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/grants/select | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🔄 TestAccept/selftest/record_cloud/pipeline-crud | ✅ p | 🔄 f | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 🔄 TestAccept/selftest/record_cloud/pipeline-crud/DATABRICKS_BUNDLE_ENGINE=direct | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | 🔄 f |
| 🔄 TestAccept/selftest/record_cloud/pipeline-crud/DATABRICKS_BUNDLE_ENGINE=terraform | ✅ p | 🔄 f | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | 🔄 f |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🔄 TestFsCpFileToFileFileNotOverwritten | ✅ p | ✅ p | 🔄 f | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 🔄 TestFsCpFileToFileFileNotOverwritten/dbfs_to_uc-volumes | 🙈 s | 🙈 s | 🔄 f | ✅ p | 🙈 s | 🙈 s | ✅ p | ✅ p | 🙈 s | 🙈 s |

Top 23 slowest tests (at least 2 minutes):
duration env testname
4:52 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:33 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:26 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:18 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:15 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:23 azure windows TestAccept
3:18 azure-ucws windows TestAccept
3:15 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:14 aws-ucws windows TestAccept
3:06 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:05 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:59 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:58 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:53 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:50 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:49 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:48 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:42 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:39 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:34 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:28 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:22 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
