Skip to content

Wire server discovery protocol into thv serve#4319

Open
JAORMX wants to merge 2 commits intomainfrom
wire-server-discovery-protocol
Open

Wire server discovery protocol into thv serve#4319
JAORMX wants to merge 2 commits intomainfrom
wire-server-discovery-protocol

Conversation

@JAORMX
Copy link
Collaborator

@JAORMX JAORMX commented Mar 23, 2026

Summary

RFC-0034 proposes a long-running local server architecture where thv serve advertises itself via a discovery file so clients (CLI, Studio) can find it without hardcoded ports. The discovery package (pkg/server/discovery/) was already fully implemented and tested but had zero imports anywhere in the codebase. This PR wires it in, giving us a working discovery protocol that immediately benefits Studio and the skills CLI client.

  • The server now writes a discovery file ($XDG_STATE_HOME/toolhive/server/server.json) on startup with the actual listen URL, PID, and a cryptographic nonce, and removes it on shutdown
  • The /health endpoint returns the nonce via X-Toolhive-Nonce so clients can verify server instance identity (prevents PID-reuse confusion)
  • The skills client now auto-discovers the server via the discovery file before falling back to TOOLHIVE_API_URL or localhost:8080, with loopback and socket-path validation on discovered URLs
  • Fixes: SIGTERM handling (was only SIGINT), 30-second shutdown timeout (was unbounded), symlink rejection on the discovery file read path, directory permission tightening, constant-time nonce comparison
  • A startup guard refuses to start if another healthy server is already running

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Manual testing: run thv serve, verify server.json appears with correct URL/PID/nonce, curl -v /health shows X-Toolhive-Nonce header, stop server and verify file is removed, start a second server while first is running and verify it refuses.

Changes

File Change
pkg/server/discovery/*.go Discovery package (was implemented but unwired) with security hardening: symlink check on read, dir permission tightening, constant-time nonce comparison, exported URL validators
pkg/api/server.go Nonce generation, WithNonce() builder, ListenURL(), discovery write/remove in lifecycle, startup guard, shutdown timeout
pkg/api/v1/healthcheck.go Nonce parameter, X-Toolhive-Nonce header on 204 responses
cmd/thv/app/server.go Add SIGTERM to signal handling
pkg/skills/client/client.go Discovery-first URL resolution with loopback/socket-path validation
pkg/api/server_test.go Tests for generateNonce, ListenURL (TCP + Unix)
pkg/api/v1/healtcheck_test.go Tests for nonce header present/absent/unhealthy
pkg/server/discovery/*_test.go Tests for symlink read rejection, dir permission tightening, paralleltest fixes

Does this introduce a user-facing change?

thv serve now writes a discovery file that allows other tools (Studio, skills CLI) to automatically find the running server without requiring TOOLHIVE_API_URL or knowing the port. This is transparent to users -- existing behavior is preserved as a fallback.

Special notes for reviewers

  • The pkg/server/discovery/ package was previously implemented and tested in a separate effort but never wired in. This PR includes it as new files since it had no prior commits.
  • The startup guard prevents silent orphaning: if a healthy server is already running, the new server refuses to start. Stale discovery files (from crashes/SIGKILL) are automatically cleaned up.
  • The nonce is explicitly not a secret -- it is a server instance identifier returned in a plaintext HTTP header and stored in a 0600 file. It prevents PID-reuse confusion, not unauthorized access.
  • Security review found no high/critical issues. TOCTOU gaps on symlink checks are mitigated by the 0700 directory permissions (single-user trust model).
  • Future work: thv server start/stop/status subcommands, CLI auto-start, lock file singleton, SSE events, Studio integration.

Large PR Justification

  • This contains both the setup of the server discovery and the usage.

Generated with Claude Code

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Mar 23, 2026
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 36bc774 to 7fdf8fa Compare March 23, 2026 14:13
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 23, 2026
@codecov
Copy link

codecov bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 46.22222% with 121 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.13%. Comparing base (df1a5cd) to head (fc8052c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/api/server.go 14.00% 42 Missing and 1 partial ⚠️
pkg/skills/client/client.go 17.94% 30 Missing and 2 partials ⚠️
pkg/server/discovery/discovery.go 52.38% 14 Missing and 6 partials ⚠️
pkg/server/discovery/discover.go 41.93% 17 Missing and 1 partial ⚠️
pkg/server/discovery/health.go 89.83% 3 Missing and 3 partials ⚠️
pkg/api/v1/healthcheck.go 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4319      +/-   ##
==========================================
+ Coverage   69.08%   69.13%   +0.05%     
==========================================
  Files         478      481       +3     
  Lines       48432    48667     +235     
==========================================
+ Hits        33457    33646     +189     
- Misses      12314    12417     +103     
+ Partials     2661     2604      -57     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 7fdf8fa to c166dd9 Compare March 24, 2026 06:04
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from c166dd9 to c6e6ece Compare March 24, 2026 06:10
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from c6e6ece to 6948298 Compare March 24, 2026 06:19
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
Copy link
Member

@aponcedeleonch aponcedeleonch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work wiring in the discovery protocol. The migration path is seamless (no server.json existed before, so StateNotFound handles all upgrade scenarios cleanly), and the rollback edge case also works correctly (nonce mismatch triggers overwrite). A few non-blocking observations below, mostly around multi-process safety and context propagation.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 6948298 to e3196be Compare March 24, 2026 11:05
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
@github-actions github-actions bot dismissed their stale review March 24, 2026 11:05

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from e3196be to 4ca07b2 Compare March 24, 2026 11:11
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
Copy link
Member

@aponcedeleonch aponcedeleonch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code reviewed against PR description. The implementation looks solid — discovery protocol is well-structured with good security considerations (symlink rejection, loopback validation, constant-time nonce comparison, directory permission tightening, startup guard).

Mismatches between PR description and actual code:

  1. SIGTERM handling not in this PR: The Summary and Changes table claim cmd/thv/app/server.go was changed to add SIGTERM signal handling, but this file is not modified in the diff. This change was likely landed separately (possibly in #4327 based on the recent commit history: 5769a5d2 Gracefully handle SIGTERM in thv HTTP server).

  2. Shutdown timeout not added by this PR: The description claims "30-second shutdown timeout (was unbounded)" as a fix, but shutdownTimeout = 30 * time.Second appears to be pre-existing code, not introduced here.

  3. Discovery vs env var priority is described backwards: The Summary says the skills client "auto-discovers the server via the discovery file before falling back to TOOLHIVE_API_URL", but the actual code in pkg/skills/client/client.go checks TOOLHIVE_API_URL first (explicit override wins), then tries discovery, then falls back to localhost:8080. The NewDefaultClient godoc and the numbered resolution logic are correct — only the Summary bullet inverts the priority.

These are description-only issues; the code itself is correct and well-implemented.

JAORMX and others added 2 commits March 24, 2026 12:38
The discovery package (pkg/server/discovery/) was already implemented
and tested but had zero imports in the codebase. This wires it into
the serve command so clients (CLI, Studio) can auto-discover a running
server without hardcoded ports or environment variables.

On startup, thv serve now generates a cryptographic nonce, writes a
discovery file to $XDG_STATE_HOME/toolhive/server/server.json with
the actual listen URL (supporting port 0 and Unix sockets), and
returns the nonce via the X-Toolhive-Nonce health check header. On
shutdown the file is removed.

The skills client now tries discovery before falling back to the
TOOLHIVE_API_URL env var or the default localhost:8080, with loopback
and socket-path validation on discovered URLs.

Additional fixes: SIGTERM handling in the serve command, a 30-second
shutdown timeout (was unbounded), symlink rejection on the discovery
file read path, directory permission tightening after MkdirAll, and
constant-time nonce comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
- Wrap writeDiscoveryFile check-then-write in WithFileLock to prevent
  TOCTOU race when two servers start simultaneously
- Log FindProcess errors at Debug level instead of silently discarding
- Consolidate ListenURL tests into a table-driven test
- Rename healtcheck_test.go to healthcheck_test.go (fix typo)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 4ca07b2 to fc8052c Compare March 24, 2026 12:38
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
Copy link
Contributor

@blkt blkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple questions, but overall looks good to me.

Comment on lines -15 to +19
func HealthcheckRouter(containerRuntime rt.Runtime) http.Handler {
routes := &healthcheckRoutes{containerRuntime: containerRuntime}
// The nonce parameter, when non-empty, is returned via the X-Toolhive-Nonce
// header so clients can verify they are talking to the expected server instance.
func HealthcheckRouter(containerRuntime rt.Runtime, nonce string) http.Handler {
routes := &healthcheckRoutes{containerRuntime: containerRuntime, nonce: nonce}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I might be wrong, but I assume the nonce is checked for security reasons, could you clarify what's the scenario we're protecting from?

Comment on lines +98 to +99
merged = append(merged, httpOpts...)
merged = append(merged, opts...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this operation shadows httpOpts with opts when they both provide the same WithXXX function with different parameters (assuming said WithXXX function always sets the value and does not perform more complex operations). It's probably worth a comment clarifying this is working as intended.

Comment on lines +125 to +126
socketPath, err := discovery.ParseUnixSocketPath(serverURL)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this validation and the one below could happen as part of a WithBaseURL (or pair of WithUnixSocket and WithHTTPSocket) function, improving encapsulation.

Comment on lines +64 to +89
func buildHealthClient(serverURL string) (*http.Client, string, error) {
switch {
case strings.HasPrefix(serverURL, "unix://"):
socketPath, err := ParseUnixSocketPath(serverURL)
if err != nil {
return nil, "", err
}
client := &http.Client{
Transport: &http.Transport{
DialContext: func(_ context.Context, _, _ string) (net.Conn, error) {
return net.Dial("unix", socketPath)
},
},
}
return client, "http://localhost/health", nil

case strings.HasPrefix(serverURL, "http://"):
if err := ValidateLoopbackURL(serverURL); err != nil {
return nil, "", err
}
return &http.Client{}, serverURL + "/health", nil

default:
return nil, "", fmt.Errorf("unsupported URL scheme: %s", serverURL)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this logic seems duplicated with the one below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants