Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
36bc774 to
7fdf8fa
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4319 +/- ##
==========================================
+ Coverage 69.08% 69.13% +0.05%
==========================================
Files 478 481 +3
Lines 48432 48667 +235
==========================================
+ Hits 33457 33646 +189
- Misses 12314 12417 +103
+ Partials 2661 2604 -57 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7fdf8fa to
c166dd9
Compare
c166dd9 to
c6e6ece
Compare
c6e6ece to
6948298
Compare
aponcedeleonch
left a comment
There was a problem hiding this comment.
Nice work wiring in the discovery protocol. The migration path is seamless (no server.json existed before, so StateNotFound handles all upgrade scenarios cleanly), and the rollback edge case also works correctly (nonce mismatch triggers overwrite). A few non-blocking observations below, mostly around multi-process safety and context propagation.
6948298 to
e3196be
Compare
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
e3196be to
4ca07b2
Compare
aponcedeleonch
left a comment
There was a problem hiding this comment.
Code reviewed against PR description. The implementation looks solid — discovery protocol is well-structured with good security considerations (symlink rejection, loopback validation, constant-time nonce comparison, directory permission tightening, startup guard).
Mismatches between PR description and actual code:
-
SIGTERM handling not in this PR: The Summary and Changes table claim
cmd/thv/app/server.gowas changed to add SIGTERM signal handling, but this file is not modified in the diff. This change was likely landed separately (possibly in #4327 based on the recent commit history:5769a5d2 Gracefully handle SIGTERM in thv HTTP server). -
Shutdown timeout not added by this PR: The description claims "30-second shutdown timeout (was unbounded)" as a fix, but
shutdownTimeout = 30 * time.Secondappears to be pre-existing code, not introduced here. -
Discovery vs env var priority is described backwards: The Summary says the skills client "auto-discovers the server via the discovery file before falling back to
TOOLHIVE_API_URL", but the actual code inpkg/skills/client/client.gochecksTOOLHIVE_API_URLfirst (explicit override wins), then tries discovery, then falls back tolocalhost:8080. TheNewDefaultClientgodoc and the numbered resolution logic are correct — only the Summary bullet inverts the priority.
These are description-only issues; the code itself is correct and well-implemented.
The discovery package (pkg/server/discovery/) was already implemented and tested but had zero imports in the codebase. This wires it into the serve command so clients (CLI, Studio) can auto-discover a running server without hardcoded ports or environment variables. On startup, thv serve now generates a cryptographic nonce, writes a discovery file to $XDG_STATE_HOME/toolhive/server/server.json with the actual listen URL (supporting port 0 and Unix sockets), and returns the nonce via the X-Toolhive-Nonce health check header. On shutdown the file is removed. The skills client now tries discovery before falling back to the TOOLHIVE_API_URL env var or the default localhost:8080, with loopback and socket-path validation on discovered URLs. Additional fixes: SIGTERM handling in the serve command, a 30-second shutdown timeout (was unbounded), symlink rejection on the discovery file read path, directory permission tightening after MkdirAll, and constant-time nonce comparison. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
- Wrap writeDiscoveryFile check-then-write in WithFileLock to prevent TOCTOU race when two servers start simultaneously - Log FindProcess errors at Debug level instead of silently discarding - Consolidate ListenURL tests into a table-driven test - Rename healtcheck_test.go to healthcheck_test.go (fix typo) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
4ca07b2 to
fc8052c
Compare
blkt
left a comment
There was a problem hiding this comment.
I left a couple questions, but overall looks good to me.
| func HealthcheckRouter(containerRuntime rt.Runtime) http.Handler { | ||
| routes := &healthcheckRoutes{containerRuntime: containerRuntime} | ||
| // The nonce parameter, when non-empty, is returned via the X-Toolhive-Nonce | ||
| // header so clients can verify they are talking to the expected server instance. | ||
| func HealthcheckRouter(containerRuntime rt.Runtime, nonce string) http.Handler { | ||
| routes := &healthcheckRoutes{containerRuntime: containerRuntime, nonce: nonce} |
There was a problem hiding this comment.
question: I might be wrong, but I assume the nonce is checked for security reasons, could you clarify what's the scenario we're protecting from?
| merged = append(merged, httpOpts...) | ||
| merged = append(merged, opts...) |
There was a problem hiding this comment.
nit: this operation shadows httpOpts with opts when they both provide the same WithXXX function with different parameters (assuming said WithXXX function always sets the value and does not perform more complex operations). It's probably worth a comment clarifying this is working as intended.
| socketPath, err := discovery.ParseUnixSocketPath(serverURL) | ||
| if err != nil { |
There was a problem hiding this comment.
nit: this validation and the one below could happen as part of a WithBaseURL (or pair of WithUnixSocket and WithHTTPSocket) function, improving encapsulation.
| func buildHealthClient(serverURL string) (*http.Client, string, error) { | ||
| switch { | ||
| case strings.HasPrefix(serverURL, "unix://"): | ||
| socketPath, err := ParseUnixSocketPath(serverURL) | ||
| if err != nil { | ||
| return nil, "", err | ||
| } | ||
| client := &http.Client{ | ||
| Transport: &http.Transport{ | ||
| DialContext: func(_ context.Context, _, _ string) (net.Conn, error) { | ||
| return net.Dial("unix", socketPath) | ||
| }, | ||
| }, | ||
| } | ||
| return client, "http://localhost/health", nil | ||
|
|
||
| case strings.HasPrefix(serverURL, "http://"): | ||
| if err := ValidateLoopbackURL(serverURL); err != nil { | ||
| return nil, "", err | ||
| } | ||
| return &http.Client{}, serverURL + "/health", nil | ||
|
|
||
| default: | ||
| return nil, "", fmt.Errorf("unsupported URL scheme: %s", serverURL) | ||
| } | ||
| } |
There was a problem hiding this comment.
nit: this logic seems duplicated with the one below.
Summary
RFC-0034 proposes a long-running local server architecture where
thv serveadvertises itself via a discovery file so clients (CLI, Studio) can find it without hardcoded ports. The discovery package (pkg/server/discovery/) was already fully implemented and tested but had zero imports anywhere in the codebase. This PR wires it in, giving us a working discovery protocol that immediately benefits Studio and the skills CLI client.$XDG_STATE_HOME/toolhive/server/server.json) on startup with the actual listen URL, PID, and a cryptographic nonce, and removes it on shutdown/healthendpoint returns the nonce viaX-Toolhive-Nonceso clients can verify server instance identity (prevents PID-reuse confusion)TOOLHIVE_API_URLorlocalhost:8080, with loopback and socket-path validation on discovered URLsType of change
Test plan
task test)task lint-fix)Manual testing: run
thv serve, verifyserver.jsonappears with correct URL/PID/nonce,curl -v /healthshowsX-Toolhive-Nonceheader, stop server and verify file is removed, start a second server while first is running and verify it refuses.Changes
pkg/server/discovery/*.gopkg/api/server.goWithNonce()builder,ListenURL(), discovery write/remove in lifecycle, startup guard, shutdown timeoutpkg/api/v1/healthcheck.goX-Toolhive-Nonceheader on 204 responsescmd/thv/app/server.goSIGTERMto signal handlingpkg/skills/client/client.gopkg/api/server_test.gogenerateNonce,ListenURL(TCP + Unix)pkg/api/v1/healtcheck_test.gopkg/server/discovery/*_test.goDoes this introduce a user-facing change?
thv servenow writes a discovery file that allows other tools (Studio, skills CLI) to automatically find the running server without requiringTOOLHIVE_API_URLor knowing the port. This is transparent to users -- existing behavior is preserved as a fallback.Special notes for reviewers
pkg/server/discovery/package was previously implemented and tested in a separate effort but never wired in. This PR includes it as new files since it had no prior commits.thv server start/stop/statussubcommands, CLI auto-start, lock file singleton, SSE events, Studio integration.Large PR Justification
Generated with Claude Code