Add instance health checks#234

Merged
sjmiller609 merged 13 commits into main from hypeship/add-healthcheck-policy
May 18, 2026

Conversation

sjmiller609 (Collaborator) commented May 16, 2026

Summary

  • add instance health_check policy and health_status response fields for http, tcp, and exec probes
  • add a health check controller owned by the instance manager, with timing, thresholds, start-period handling, and runtime status persistence
  • start health checks while instances are Initializing or Running, but keep the public health status at starting until the instance reaches Running
  • add HTTP healthcheck assertions to TestCreateInstanceWithNetwork so the VM-starting network path waits for persisted healthy status
  • wire the controller into the api process and document lifecycle semantics in lib/healthcheck/README.md
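The threshold and start-period handling described in the summary can be sketched as a small state tracker. This is a minimal illustration, not the PR's implementation: the `Policy`, `Tracker`, `Observe`, and `PublicStatus` names, the `unhealthy` status value, and the rule that failures inside the start period are ignored while the status is still starting are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Status values are illustrative; only "starting" and "healthy" are
// mentioned in the PR text, "unhealthy" is an assumption.
type Status string

const (
	StatusStarting  Status = "starting"
	StatusHealthy   Status = "healthy"
	StatusUnhealthy Status = "unhealthy"
)

// Policy is a hypothetical subset of a health check policy.
type Policy struct {
	StartPeriod      time.Duration
	SuccessThreshold int
	FailureThreshold int
}

// Tracker keeps consecutive-result counters for one instance.
type Tracker struct {
	policy    Policy
	startedAt time.Time
	status    Status
	successes int
	failures  int
}

func NewTracker(p Policy, startedAt time.Time) *Tracker {
	return &Tracker{policy: p, startedAt: startedAt, status: StatusStarting}
}

// Observe records one probe result taken at time now and returns the
// resulting internal status.
func (t *Tracker) Observe(ok bool, now time.Time) Status {
	if ok {
		t.successes++
		t.failures = 0
		if t.successes >= t.policy.SuccessThreshold {
			t.status = StatusHealthy
		}
		return t.status
	}
	t.successes = 0
	// Assumption: failures inside the start period do not count while the
	// check has not yet succeeded.
	if t.status == StatusStarting && now.Sub(t.startedAt) < t.policy.StartPeriod {
		return t.status
	}
	t.failures++
	if t.failures >= t.policy.FailureThreshold {
		t.status = StatusUnhealthy
	}
	return t.status
}

// PublicStatus reports "starting" until the instance itself is Running,
// regardless of what the probes have observed so far.
func PublicStatus(internal Status, instanceRunning bool) Status {
	if !instanceRunning {
		return StatusStarting
	}
	return internal
}

// demoStatuses walks one tracker through a short probe history.
func demoStatuses() []Status {
	start := time.Unix(0, 0)
	tr := NewTracker(Policy{StartPeriod: 10 * time.Second, SuccessThreshold: 1, FailureThreshold: 2}, start)
	return []Status{
		tr.Observe(false, start.Add(5*time.Second)),  // failure inside the start period
		tr.Observe(true, start.Add(15*time.Second)),  // first success
		tr.Observe(false, start.Add(20*time.Second)), // one failure, below the threshold
		tr.Observe(false, start.Add(25*time.Second)), // second consecutive failure
		PublicStatus(StatusHealthy, false),           // instance not yet Running
	}
}

func main() {
	fmt.Println(demoStatuses())
}
```

Keeping the public mapping in a separate function mirrors the third bullet: probes can run during Initializing without the API reporting healthy before the instance reaches Running.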

Tests

  • go test ./lib/healthcheck
  • go test ./lib/instances -run TestCreateInstanceWithNetwork -count=0
  • go test ./lib/instances -run 'TestHealthCheck|TestValidateCreateRequestHealthCheck|TestValidateUpdateInstanceRequest|TestManagerUpdateInstanceHealthCheckOnlyPublishesLifecycleUpdate|TestLifecycleEventMetrics_ObserveSubscribersQueueDepthAndDrops|TestLifecycleSubscribers'
  • go test ./cmd/api/api -run 'TestCreateInstance_MapsHealthCheckPolicy|TestUpdateInstance_MapsHealthCheckPatch|TestCreateInstance_MapsAutoStandbyPolicy|TestUpdateInstance_MapsAutoStandbyPatch'
  • go test ./cmd/api -run TestDoesNotExist
  • go test ./lib/providers

Notes

  • go test ./lib/instances -run TestCreateInstanceWithNetwork -count=1 was attempted twice; both runs failed before instance creation because the existing nginx image readiness wait still saw image status pending after 60s.
  • go test ./cmd/api/api is currently blocked by Docker Hub unauthenticated pull rate limits and local network bridge permissions in existing integration tests.
  • make generate-wire is currently blocked because the checked-in wire binary was built with Go 1.24 and this package now requires Go 1.25; wire_gen.go was updated in the same small shape and go test ./cmd/api -run TestDoesNotExist passes.

Note

Medium Risk
Adds a new health-check policy surface area, background controller, and metadata persistence path; bugs could impact instance API behavior and metadata writes, but lifecycle state is intentionally unchanged.

Overview
Adds first-class instance health checks via new health_check policy (HTTP/TCP/exec) and health_status fields in the API/OpenAPI, including request validation, defaulting/normalization, and bidirectional mapping between OAPI and domain types.
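As a rough illustration of one of the three probe kinds, a TCP probe with a timeout could look like the sketch below. The `tcpProbe` function and its signature are hypothetical and not taken from lib/healthcheck; only the standard library `net.Dialer` is used.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// tcpProbe succeeds if a TCP connection to addr can be opened within
// timeout. The name and signature are illustrative only.
func tcpProbe(ctx context.Context, addr string, timeout time.Duration) error {
	d := net.Dialer{Timeout: timeout}
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return fmt.Errorf("tcp probe %s: %w", addr, err)
	}
	return conn.Close()
}

// tcpProbeDemo probes a local listener while it is open and again after
// it has been closed.
func tcpProbeDemo() (openOK bool, closedOK bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	addr := ln.Addr().String()
	openOK = tcpProbe(context.Background(), addr, time.Second) == nil
	ln.Close()
	closedOK = tcpProbe(context.Background(), addr, time.Second) == nil
	return openOK, closedOK, nil
}

func main() {
	open, closed, err := tcpProbeDemo()
	fmt.Println(open, closed, err)
}
```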

Introduces a new lib/healthcheck package plus an instances.HealthCheckController that subscribes to lifecycle events, schedules probes with interval/timeout/threshold/start-period semantics, and persists per-instance runtime status; the controller is wired into the API process startup. Instance metadata now stores health_check_runtime, and saveMetadata switches to atomic temp-file + rename writes; update flows reset health runtime when the health check policy changes, and tests/integration tests are expanded to cover health-check behavior and lifecycle metrics labeling.
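The temp-file + rename pattern mentioned for saveMetadata is commonly written along the following lines. This is a generic sketch of the technique, not the PR's code: `writeFileAtomic` and the file names in the demo are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the same directory as path
// and then renames it over path, so readers see either the old or the new
// content, never a partial write.
func writeFileAtomic(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	// On any failure, clean up the temp file.
	defer func() {
		if err != nil {
			tmp.Close()
			os.Remove(tmp.Name())
		}
	}()
	if _, err = tmp.Write(data); err != nil {
		return fmt.Errorf("write temp file: %w", err)
	}
	if err = tmp.Chmod(perm); err != nil {
		return fmt.Errorf("chmod temp file: %w", err)
	}
	// Flush file contents to disk before the rename makes them visible.
	if err = tmp.Sync(); err != nil {
		return fmt.Errorf("sync temp file: %w", err)
	}
	if err = tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err = os.Rename(tmp.Name(), path); err != nil {
		return fmt.Errorf("rename temp file: %w", err)
	}
	return nil
}

// atomicWriteDemo overwrites the same file twice and reads it back.
func atomicWriteDemo() (string, error) {
	dir, err := os.MkdirTemp("", "metadata-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "metadata.json")
	for _, body := range []string{`{"v":1}`, `{"v":2}`} {
		if err := writeFileAtomic(path, []byte(body), 0o644); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	fmt.Println(atomicWriteDemo())
}
```

Creating the temp file in the target's own directory keeps the rename on one filesystem, which is what makes it atomic on POSIX systems. The sketch does not fsync the parent directory, which a stricter durability guarantee would require.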

Reviewed by Cursor Bugbot for commit e0e1dda. Bugbot is set up for automated code reviews on this repo.

github-actions Bot commented May 16, 2026

✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Add instance health checks
hypeman-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ hypeman-typescript studio · code

Your SDK build had a failure in the lint CI job, which is a regression from the base state.
generate ✅ build ✅ lint ❗ test ✅

hypeman-go studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ build ⏭️ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@6335d0e4156a205becb27419bf593b14580fba45

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-05-18 15:04:47 UTC

@sjmiller609 sjmiller609 marked this pull request as ready for review May 17, 2026 17:08
Comment thread lib/instances/health_check_controller.go Outdated
@firetiger-agent

Monitoring Plan: Instance Health Checks (PR #234)

This PR adds a new health-check subsystem to hypeman: POST /instances and PUT /instances/{id} now accept an optional health_check policy (HTTP, TCP, or exec probes), and a new HealthCheckController goroutine runs alongside the existing AutoStandbyController to drive periodic probes and persist runtime status. The GET /instances/{id} response gains health_check + health_status fields.

The main risks are: (1) validation errors in toDomainHealthCheck/NormalizePolicy surfacing as unexpected 400s on existing callers who send bodies that incidentally conflict with new fields, (2) the new controller goroutine panicking or leaking timers under high instance churn, and (3) exec probes firing guest-agent commands on instances that lack a guest-agent, triggering error log noise. API 5xx error rate baseline is 0.013–0.018% (30–35 errors/hr out of ~190K–280K req/hr); 400 error baseline is ~267 in the latest 4-hour window. Status updates will be posted automatically on this PR as monitoring progresses.

Key risks to watch:

  • Spike in HTTP 400 responses on /instances endpoints (invalid_health_check errors from new validation path)
  • Unhandled panics or nil pointer dereference errors in HealthCheckController.Run or timer callbacks
  • API 5xx error rate exceeding 0.05% (3× normal baseline) sustained for >15 min
  • Log signals: "failed to set health check runtime" errors appearing, or the "health check controller started" message missing after deploy
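The 5xx criterion above (rate above 0.05% sustained for more than 15 minutes) can be expressed as a check over per-minute samples. The sketch below is an assumption about how such a rule might be evaluated; `MinuteSample`, `sustainedBreach`, and the one-minute sampling granularity are not from the monitoring plan, and the request counts in the demo are made up for illustration.

```go
package main

import "fmt"

// MinuteSample is a hypothetical one-minute aggregate of API traffic.
type MinuteSample struct {
	Requests  int
	Errors5xx int
}

// ratePct returns the 5xx rate of a sample as a percentage.
func ratePct(s MinuteSample) float64 {
	if s.Requests == 0 {
		return 0
	}
	return 100 * float64(s.Errors5xx) / float64(s.Requests)
}

// sustainedBreach reports whether more than `minutes` consecutive samples
// exceed thresholdPct.
func sustainedBreach(samples []MinuteSample, thresholdPct float64, minutes int) bool {
	run := 0
	for _, s := range samples {
		if ratePct(s) > thresholdPct {
			run++
			if run > minutes {
				return true
			}
		} else {
			run = 0
		}
	}
	return false
}

// repeatSample builds n copies of one sample.
func repeatSample(s MinuteSample, n int) []MinuteSample {
	out := make([]MinuteSample, n)
	for i := range out {
		out[i] = s
	}
	return out
}

// breachDemo evaluates three synthetic traffic histories against the
// 0.05% / 15 minute rule.
func breachDemo() (long bool, short bool, interrupted bool) {
	bad := MinuteSample{Requests: 4000, Errors5xx: 3}  // 0.075%, above threshold
	good := MinuteSample{Requests: 4000, Errors5xx: 1} // 0.025%, below threshold
	long = sustainedBreach(repeatSample(bad, 16), 0.05, 15)
	short = sustainedBreach(repeatSample(bad, 15), 0.05, 15)
	mixed := append(repeatSample(bad, 10), good)
	mixed = append(mixed, repeatSample(bad, 10)...)
	interrupted = sustainedBreach(mixed, 0.05, 15)
	return long, short, interrupted
}

func main() {
	fmt.Println(breachDemo())
}
```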

Comment thread lib/instances/health_check_controller.go
Comment thread lib/instances/types.go
Comment thread lib/instances/health_check_controller.go

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 36c8c34.

Comment thread lib/healthcheck/status.go
Comment thread lib/instances/health_check_controller.go
@sjmiller609 sjmiller609 merged commit 9db1f75 into main May 18, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/add-healthcheck-policy branch May 18, 2026 15:03

2 participants