Phase 2 Observability And Ops Release Notes
Date: 2026-05-08
Branch: phase-2-observability-and-ops
Summary
Phase 2 turns the Phase 1 foundation into an operable production surface. The API now exposes liveness, readiness, diagnostics, structured metrics, Prometheus scraping, richer provider health, operational audit events, and configurable runtime limits. The goal of this phase is to make StacyVM easier to run, debug, monitor, and safely scale before deeper multi-tenant and production deployment work.
What Changed
Liveness And Readiness
- Added /api/v1/live for process liveness checks.
- Added /api/v1/ready for dependency readiness checks.
- Readiness now reports provider health instead of only returning a generic process status.
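The split between the two endpoints can be sketched as follows. This is a minimal illustration, not the code in internal/api/routes/system.go: the providerCheck type, the response bodies, and the choice of 503 for a failed readiness check are all assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// providerCheck is a hypothetical health probe; nil means the provider is usable.
type providerCheck func() error

// errUnreachable is a placeholder failure used by the example below.
var errUnreachable = errors.New("provider unreachable")

// liveHandler only confirms that the process is up and serving requests.
func liveHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, `{"status":"ok"}`)
}

// readyHandler runs every provider check and answers 503 if any of them fails.
// The 503 status code is an assumption of this sketch.
func readyHandler(checks map[string]providerCheck) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		providers := make(map[string]string, len(checks))
		status := http.StatusOK
		for name, check := range checks {
			if err := check(); err != nil {
				providers[name] = err.Error()
				status = http.StatusServiceUnavailable
				continue
			}
			providers[name] = "healthy"
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		json.NewEncoder(w).Encode(map[string]any{"providers": providers})
	}
}

// probe sends an in-memory GET to the handler and returns the status code.
func probe(h http.HandlerFunc, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	checks := map[string]providerCheck{
		"example": func() error { return errUnreachable },
	}
	fmt.Println("live:", probe(liveHandler, "/api/v1/live"))
	fmt.Println("ready:", probe(readyHandler(checks), "/api/v1/ready"))
}
```

With a failing provider the process still reports live while readiness fails, which is the distinction an orchestrator needs in order to stop routing traffic without restarting the process.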
Structured Runtime Metrics
- Added an in-process operation metrics recorder.
- Operations tracked include:
- spawn
- exec
- exec stream
- destroy
- file write, read, list, delete, move, chmod, stat, and glob
- Each operation tracks:
- success count
- failure count
- latency count
- total latency
- min, max, and average latency
- last error
- last observed timestamp
- /api/v1/metrics now includes sandbox, provider, event, process, runtime, and operation metrics.
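A recorder that tracks the per-operation fields listed above could look roughly like this. It is a sketch under assumptions: the Recorder and OpStats names, the mutex-based locking, and the method signatures are illustrative and are not taken from internal/orchestrator/metrics.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// OpStats holds the per-operation fields named in the release notes.
type OpStats struct {
	Success, Failure, LatencyCount int64
	Total, Min, Max                time.Duration
	LastError                      string
	LastObserved                   time.Time
}

// Avg derives the average latency from the total and the sample count.
func (s OpStats) Avg() time.Duration {
	if s.LatencyCount == 0 {
		return 0
	}
	return s.Total / time.Duration(s.LatencyCount)
}

// Recorder is a hypothetical in-process recorder keyed by operation name.
type Recorder struct {
	mu  sync.Mutex
	ops map[string]*OpStats
}

func NewRecorder() *Recorder { return &Recorder{ops: make(map[string]*OpStats)} }

// Observe records one finished operation; a nil err counts as a success.
func (r *Recorder) Observe(op string, d time.Duration, err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.ops[op]
	if !ok {
		// Seed min and max with the first sample so zero never wins as min.
		s = &OpStats{Min: d, Max: d}
		r.ops[op] = s
	}
	if err != nil {
		s.Failure++
		s.LastError = err.Error()
	} else {
		s.Success++
	}
	s.LatencyCount++
	s.Total += d
	if d < s.Min {
		s.Min = d
	}
	if d > s.Max {
		s.Max = d
	}
	s.LastObserved = time.Now()
}

// Snapshot returns a copy so callers can read stats without holding the lock.
func (r *Recorder) Snapshot(op string) OpStats {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.ops[op]; ok {
		return *s
	}
	return OpStats{}
}

func main() {
	rec := NewRecorder()
	rec.Observe("exec", 10*time.Millisecond, nil)
	rec.Observe("exec", 30*time.Millisecond, errors.New("exit status 1"))
	s := rec.Snapshot("exec")
	// Prints: 1 1 10ms 30ms 20ms exit status 1
	fmt.Println(s.Success, s.Failure, s.Min, s.Max, s.Avg(), s.LastError)
}
```

Storing the total and the count, rather than a running average, keeps each update cheap and lets the average be computed on read.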
Prometheus Metrics
- Added /api/v1/metrics/prometheus.
- The Prometheus endpoint exposes:
- process uptime
- goroutines
- memory and GC metrics
- sandbox counts by state and provider
- provider health
- provider health latency
- provider runtime inventory counts
- event bus stats
- operation success/failure and latency counters
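For readers unfamiliar with the Prometheus text exposition format, the sketch below renders operation success and failure counters in that format. The metric name stacyvm_operations_total, its labels, and the opCounts type are hypothetical; the actual names emitted by internal/api/routes/prometheus.go are not stated in these notes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// opCounts is a hypothetical per-operation success/failure tally.
type opCounts struct{ Success, Failure int64 }

// renderOps writes one counter family in the Prometheus text exposition format.
// The metric and label names are illustrative only.
func renderOps(ops map[string]opCounts) string {
	var b strings.Builder
	b.WriteString("# HELP stacyvm_operations_total Operations by name and result.\n")
	b.WriteString("# TYPE stacyvm_operations_total counter\n")

	names := make([]string, 0, len(ops))
	for name := range ops {
		names = append(names, name)
	}
	sort.Strings(names) // sorted output keeps scrapes and tests deterministic

	for _, name := range names {
		c := ops[name]
		// %q is adequate for plain ASCII operation names such as "exec".
		fmt.Fprintf(&b, "stacyvm_operations_total{operation=%q,result=\"success\"} %d\n", name, c.Success)
		fmt.Fprintf(&b, "stacyvm_operations_total{operation=%q,result=\"failure\"} %d\n", name, c.Failure)
	}
	return b.String()
}

func main() {
	fmt.Print(renderOps(map[string]opCounts{"exec": {Success: 3, Failure: 1}}))
}
```

A single counter family with a result label lets dashboards compute failure ratios without joining two differently named series.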
Operational Audit Events
- Added event IDs for published events.
- Added operational event types:
- exec.failed
- exec.timeout
- operation.failed
- resource.limit
- provider.failed
- reconcile.action
- Manager paths now publish audit events for:
- exec failures and timeouts
- stream exec timeouts
- file operation failures
- spawn/provider/resource failures
- destroy provider failures
- reconciliation actions and provider inventory failures
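As a rough illustration of how an event ID and the new event types fit together, the sketch below stamps each published event with a random ID before fanning it out to subscribers. Only the event type strings come from these notes; the Event fields, the Bus type, and the ID format are assumptions.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// Event type names come from the release notes; everything else is assumed.
const (
	EventExecFailed      = "exec.failed"
	EventExecTimeout     = "exec.timeout"
	EventOperationFailed = "operation.failed"
	EventResourceLimit   = "resource.limit"
	EventProviderFailed  = "provider.failed"
	EventReconcileAction = "reconcile.action"
)

// Event is a hypothetical audit event shape.
type Event struct {
	ID        string
	Type      string
	SandboxID string
	Detail    string
	Timestamp time.Time
}

// Bus is a minimal synchronous event bus used only for illustration.
type Bus struct{ subscribers []func(Event) }

// Subscribe registers a callback that receives every published event.
func (b *Bus) Subscribe(fn func(Event)) { b.subscribers = append(b.subscribers, fn) }

// Publish assigns a random ID and a UTC timestamp, then delivers the event.
func (b *Bus) Publish(e Event) Event {
	var raw [8]byte
	if _, err := rand.Read(raw[:]); err != nil {
		panic(err) // a failing crypto/rand is not recoverable in this sketch
	}
	e.ID = hex.EncodeToString(raw[:])
	e.Timestamp = time.Now().UTC()
	for _, fn := range b.subscribers {
		fn(e)
	}
	return e
}

func main() {
	bus := &Bus{}
	bus.Subscribe(func(e Event) {
		fmt.Printf("%s %s sandbox=%s %s\n", e.ID, e.Type, e.SandboxID, e.Detail)
	})
	bus.Publish(Event{Type: EventExecTimeout, SandboxID: "sb-1", Detail: "exec exceeded timeout"})
}
```

Assigning the ID at publish time, rather than at each call site, means every subscriber sees the same identifier for the same event.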
Provider Health Detail
- Provider health now includes:
- latency_ms
- last_checked
- error
- capabilities
- runtime_count when runtime inventory is supported
- Provider health detail is shared across:
- /api/v1/ready
- /api/v1/metrics
- /api/v1/metrics/prometheus
- /api/v1/providers
- /api/v1/providers/{name}
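One way to model the provider health fields as a shared Go type is sketched here. The JSON keys match the field names listed above; the Go types, the omitempty choices, and the use of a pointer for runtime_count are assumptions, and the capability value "exec" is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ProviderHealth uses the JSON field names from the release notes.
// The Go types chosen for each field are assumptions of this sketch.
type ProviderHealth struct {
	LatencyMS    int64     `json:"latency_ms"`
	LastChecked  time.Time `json:"last_checked"`
	Error        string    `json:"error,omitempty"`
	Capabilities []string  `json:"capabilities"`
	// RuntimeCount is a pointer so it can be left out entirely when the
	// provider does not support runtime inventory.
	RuntimeCount *int `json:"runtime_count,omitempty"`
}

// encode marshals the health detail, returning an empty string on failure.
func encode(h ProviderHealth) string {
	b, err := json.Marshal(h)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	checked := time.Date(2026, 5, 8, 12, 0, 0, 0, time.UTC)
	count := 4

	// A provider that supports runtime inventory.
	fmt.Println(encode(ProviderHealth{
		LatencyMS:    12,
		LastChecked:  checked,
		Capabilities: []string{"exec"},
		RuntimeCount: &count,
	}))

	// A failing provider without inventory support: runtime_count is omitted.
	fmt.Println(encode(ProviderHealth{LastChecked: checked, Error: "unreachable"}))
}
```

Using one struct for every endpoint that reports provider health keeps the JSON shape identical across readiness, metrics, and provider responses.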
Redacted Diagnostics
- Added /api/v1/diagnostics.
- Diagnostics include:
- generated timestamp
- version/build info
- GOOS/GOARCH
- uptime, goroutines, memory, and GC cycles
- store health and latency
- active operational limits
- detailed provider health
- sandbox counts by state/provider
- event bus stats
- operation metrics
- explicit redaction categories
- Diagnostics are read-only and intentionally avoid returning API keys, registry credentials, provider secrets, or environment secrets.
Operational Limits
- Added configurable defaults:
- defaults.max_ttl
- defaults.default_exec_timeout
- defaults.max_exec_timeout
- defaults.max_sandboxes
- defaults.max_sandboxes_per_owner
- Manager now centrally enforces:
- max TTL
- max total active sandboxes
- max active sandboxes per owner
- default exec timeout
- max exec timeout
- Limit violations return typed resource-limit errors and publish resource.limit audit events.
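Central enforcement of these limits might be structured along the following lines. This is a sketch, not the manager's actual code: the Limits and ResourceLimitError types, the method names, the convention that zero means "no limit", and the decision to reject rather than clamp an oversized exec timeout are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Limits mirrors the defaults.* keys above; zero means "no limit" in this sketch.
type Limits struct {
	MaxTTL               time.Duration
	DefaultExecTimeout   time.Duration
	MaxExecTimeout       time.Duration
	MaxSandboxes         int
	MaxSandboxesPerOwner int
}

// ResourceLimitError is a hypothetical typed error naming the violated limit.
type ResourceLimitError struct{ Limit string }

func (e *ResourceLimitError) Error() string { return "resource limit exceeded: " + e.Limit }

// CheckSpawn validates a spawn request against the TTL and sandbox-count limits.
func (l Limits) CheckSpawn(ttl time.Duration, active, activeForOwner int) error {
	switch {
	case l.MaxTTL > 0 && ttl > l.MaxTTL:
		return &ResourceLimitError{Limit: "max_ttl"}
	case l.MaxSandboxes > 0 && active >= l.MaxSandboxes:
		return &ResourceLimitError{Limit: "max_sandboxes"}
	case l.MaxSandboxesPerOwner > 0 && activeForOwner >= l.MaxSandboxesPerOwner:
		return &ResourceLimitError{Limit: "max_sandboxes_per_owner"}
	}
	return nil
}

// ExecTimeout applies the default when no timeout is requested and rejects
// requests above the configured maximum.
func (l Limits) ExecTimeout(requested time.Duration) (time.Duration, error) {
	if requested <= 0 {
		requested = l.DefaultExecTimeout
	}
	if l.MaxExecTimeout > 0 && requested > l.MaxExecTimeout {
		return 0, &ResourceLimitError{Limit: "max_exec_timeout"}
	}
	return requested, nil
}

func main() {
	l := Limits{
		MaxTTL:               time.Hour,
		DefaultExecTimeout:   30 * time.Second,
		MaxExecTimeout:       5 * time.Minute,
		MaxSandboxes:         10,
		MaxSandboxesPerOwner: 2,
	}

	// The owner already has two active sandboxes, so this spawn is rejected.
	err := l.CheckSpawn(30*time.Minute, 3, 2)
	var rl *ResourceLimitError
	if errors.As(err, &rl) {
		fmt.Println("rejected:", rl.Limit)
	}

	d, _ := l.ExecTimeout(0)
	fmt.Println("effective exec timeout:", d)
}
```

A typed error lets API handlers map limit violations to a distinct response and lets the caller publish a resource.limit event without parsing error strings.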
Code Changes By Area
API Routes
- internal/api/routes/system.go: Added liveness, readiness, diagnostics, JSON metrics, and Prometheus metrics behavior.
- internal/api/routes/provider_health.go: Added shared provider health detail collection.
- internal/api/routes/prometheus.go: Added Prometheus text renderer.
- internal/api/routes/providers.go: Added detailed health to provider list/detail responses.
- internal/api/routes/system_test.go: Added coverage for readiness, diagnostics, metrics, and Prometheus output.
Orchestrator
- internal/orchestrator/metrics.go: Added operation metrics recorder.
- internal/orchestrator/manager.go: Added metrics recording, audit event publishing, and operational limit enforcement.
- internal/orchestrator/events.go: Added event IDs and operational event types.
- internal/orchestrator/types.go: Added operational limit types.
- internal/orchestrator/manager_test.go: Added tests for operation metrics, audit events, TTL limits, sandbox limits, owner limits, and exec timeout limits.
Config And Docs
- internal/config/config.go: Added default config fields for operational limits.
- cmd/stacyvm/cmd_serve.go: Wires configured operational limits into the manager.
- README.md: Documents new operational limit config.
- docs/api.md: Documents liveness, readiness, diagnostics, metrics, Prometheus metrics, provider health detail, and operational event shape.
- CHANGELOG.md: Adds this Phase 2 checkpoint entry.
Verification
The following checks passed:
Impact
Phase 2 gives StacyVM the baseline visibility and guardrails needed to operate safely:
- Operators can distinguish liveness from readiness.
- Dashboards can consume JSON or Prometheus metrics.
- Support/debug flows can use a redacted diagnostics endpoint.
- Provider health is actionable rather than a single boolean.
- Resource pressure and failure modes are visible through events.
- Runtime limits can prevent accidental overload before full multi-tenant quota systems arrive.

