
Add instance health checks#234

Merged
sjmiller609 merged 13 commits into main from hypeship/add-healthcheck-policy
May 18, 2026
@sjmiller609 sjmiller609 commented May 16, 2026

Collaborator

Summary

  • add instance health_check policy and health_status response fields for http, tcp, and exec probes
  • add a health check controller owned by the instance manager, with timing, thresholds, start-period handling, and runtime status persistence
  • start health checks while instances are Initializing or Running, but keep the public health_status reported as starting until the instance reaches Running
  • add HTTP healthcheck assertions to TestCreateInstanceWithNetwork so the VM-starting network path waits for persisted healthy status
  • wire the controller into the api process and document lifecycle semantics in lib/healthcheck/README.md

Tests

  • go test ./lib/healthcheck
  • go test ./lib/instances -run TestCreateInstanceWithNetwork -count=0
  • go test ./lib/instances -run 'TestHealthCheck|TestValidateCreateRequestHealthCheck|TestValidateUpdateInstanceRequest|TestManagerUpdateInstanceHealthCheckOnlyPublishesLifecycleUpdate|TestLifecycleEventMetrics_ObserveSubscribersQueueDepthAndDrops|TestLifecycleSubscribers'
  • go test ./cmd/api/api -run 'TestCreateInstance_MapsHealthCheckPolicy|TestUpdateInstance_MapsHealthCheckPatch|TestCreateInstance_MapsAutoStandbyPolicy|TestUpdateInstance_MapsAutoStandbyPatch'
  • go test ./cmd/api -run TestDoesNotExist
  • go test ./lib/providers

Notes

  • go test ./lib/instances -run TestCreateInstanceWithNetwork -count=1 was attempted twice; both runs failed before instance creation because the existing nginx image readiness wait still saw image status pending after 60s.
  • go test ./cmd/api/api is currently blocked by Docker Hub unauthenticated pull rate limits and local network bridge permissions in existing integration tests.
  • make generate-wire is currently blocked because the checked-in wire binary was built with Go 1.24 and this package now requires Go 1.25; wire_gen.go was updated in the same small shape and go test ./cmd/api -run TestDoesNotExist passes.

Note

Medium Risk
Adds a new health-check policy surface area, background controller, and metadata persistence path; bugs could impact instance API behavior and metadata writes, but lifecycle state is intentionally unchanged.

Overview
Adds first-class instance health checks via new health_check policy (HTTP/TCP/exec) and health_status fields in the API/OpenAPI, including request validation, defaulting/normalization, and bidirectional mapping between OAPI and domain types.

Introduces a new lib/healthcheck package plus an instances.HealthCheckController that subscribes to lifecycle events, schedules probes with interval/timeout/threshold/start-period semantics, and persists per-instance runtime status; the controller is wired into the API process startup. Instance metadata now stores health_check_runtime, and saveMetadata switches to atomic temp-file + rename writes; update flows reset health runtime when the health check policy changes, and tests/integration tests are expanded to cover health-check behavior and lifecycle metrics labeling.

Reviewed by Cursor Bugbot for commit e0e1dda.

github-actions Bot commented May 16, 2026

✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Add instance health checks
hypeman-openapi
Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ hypeman-typescript
Your SDK build had a failure in the lint CI job, which is a regression from the base state.
generate ✅ · build ✅ · lint ❗ · test ✅

hypeman-go
Your SDK build had at least one "note" diagnostic.
generate ✅ · build ⏭️ · lint ✅ · test ✅

go get github.com/stainless-sdks/hypeman-go@6335d0e4156a205becb27419bf593b14580fba45

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-05-18 15:04:47 UTC

@sjmiller609 sjmiller609 marked this pull request as ready for review May 17, 2026 17:08
Comment thread lib/instances/health_check_controller.go Outdated
@firetiger-agent

Monitoring Plan: Instance Health Checks (PR #234)

This PR adds a new health-check subsystem to hypeman: POST /instances and PUT /instances/{id} now accept an optional health_check policy (HTTP, TCP, or exec probes), and a new HealthCheckController goroutine runs alongside the existing AutoStandbyController to drive periodic probes and persist runtime status. The GET /instances/{id} response gains health_check + health_status fields.

The main risks are: (1) validation errors in toDomainHealthCheck/NormalizePolicy surfacing as unexpected 400s on existing callers who send bodies that incidentally conflict with new fields, (2) the new controller goroutine panicking or leaking timers under high instance churn, and (3) exec probes firing guest-agent commands on instances that lack a guest-agent, triggering error log noise. API 5xx error rate baseline is 0.013–0.018% (30–35 errors/hr out of ~190K–280K req/hr); 400 error baseline is ~267 in the latest 4-hour window. Status updates will be posted automatically on this PR as monitoring progresses.

Key risks to watch:

  • Spike in HTTP 400 responses on /instances endpoints (invalid_health_check errors from new validation path)
  • Unhandled panics or nil pointer dereference errors in HealthCheckController.Run or timer callbacks
  • API 5xx error rate exceeding 0.05% (3× normal baseline) sustained for >15 min
  • Log signals: "failed to set health check runtime" errors appearing, or the "health check controller started" message missing after deploy

Comment thread lib/instances/health_check_controller.go
Comment thread lib/instances/types.go
Comment thread lib/instances/health_check_controller.go

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 36c8c34.

Comment thread lib/healthcheck/status.go
Comment thread lib/instances/health_check_controller.go
@sjmiller609 sjmiller609 merged commit 9db1f75 into main May 18, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/add-healthcheck-policy branch May 18, 2026 15:03
@firetiger-agent

✅ Intended Effect Confirmed (3 min post-deploy)

The HealthCheckController is active and running probes. Logs show "running scheduled health check" firing continuously within 2 seconds of deploy, with health check completions streaming at ~5–10/minute. No controller panics, runtime errors, or disk write failures detected.

Early telemetry clean: API 5xx rate is 0.000%, 400 error count at 42/hr (well below the 80/hr threshold), and invocation throughput projecting at ~260K/hr (in range). We're 3 minutes into the 72-hour monitoring window. Monitoring continues.

@firetiger-agent

✅ Monitoring Complete — No Issues Detected

The 72-hour monitoring window for PR #234 (instance health checks) has concluded on aws-publish with no regressions or deployment-related issues detected.

Summary of findings:

  • HealthCheckController: Confirmed active throughout the entire 72h window, running probes continuously with zero panics, nil-pointer errors, or disk write failures.
  • API stability: 5xx error rate post-deploy (0.000–0.363%, typical sustained: 0.002–0.012%) remains at or below pre-deploy baseline. Isolated spikes match pre-existing Unikraft browser provisioning patterns (KI-006), not health-check related.
  • 400 errors: Range 27–866/hr post-deploy, within normal pre-deploy variance (15–547/hr). No spike pattern consistent with invalid_health_check validation errors.
  • Invocation throughput: Continuous and healthy throughout (200K–550K/hr during business hours), matching pre-deploy patterns.
  • Health check runtime: Zero "failed to set health check runtime" errors logged over 72h.

Verdict: ✅ NO_ISSUE — Feature is operating as intended with no deployment impact.

2 participants