Phase 10 Multi-Worker Foundation Release Notes

Date: 2026-05-09
Branch: phase-10-multi-worker-foundation

Summary

Phase 10 starts the enterprise and multi-worker production track. This slice adds the durable worker registry foundation StacyVM needs before scheduler placement, worker ownership, leases, and remote worker RPC can be made production-grade. This is not a full distributed runtime yet. It is the first production-aligned control-plane layer for observing workers, recording heartbeats, and exposing that state through APIs, diagnostics, and metrics.

What Changed

Worker Registry Storage

  • Added a SQLite migration for the workers table.
  • Added durable worker fields for ID, hostname, status, providers, capabilities, capacity, heartbeat timestamp, and lifecycle timestamps.
  • Added store methods for saving, fetching, listing, and deleting worker records.
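
As a rough illustration of the record and store surface described above, the following Go sketch models a worker row and the four store operations against an in-memory map. The Worker struct, its field names and types, memStore, and ErrWorkerNotFound are all hypothetical names chosen for this example; the real internal/store types and the SQLite schema may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"time"
)

// Worker mirrors the durable fields listed in the release notes.
// Field names and types are assumptions, not the real schema.
type Worker struct {
	ID            string
	Hostname      string
	Status        string
	Providers     []string
	Capabilities  []string
	Capacity      int
	LastHeartbeat time.Time
	CreatedAt     time.Time
	UpdatedAt     time.Time
}

// ErrWorkerNotFound is a hypothetical sentinel error for missing records.
var ErrWorkerNotFound = errors.New("worker not found")

// memStore is an in-memory stand-in for the SQLite-backed store.
type memStore struct{ workers map[string]Worker }

func newMemStore() *memStore { return &memStore{workers: map[string]Worker{}} }

// SaveWorker inserts or overwrites the record keyed by worker ID.
func (s *memStore) SaveWorker(w Worker) { s.workers[w.ID] = w }

// GetWorker fetches one record or reports ErrWorkerNotFound.
func (s *memStore) GetWorker(id string) (Worker, error) {
	w, ok := s.workers[id]
	if !ok {
		return Worker{}, ErrWorkerNotFound
	}
	return w, nil
}

// ListWorkers returns all records sorted by ID for stable output.
func (s *memStore) ListWorkers() []Worker {
	out := make([]Worker, 0, len(s.workers))
	for _, w := range s.workers {
		out = append(out, w)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

// DeleteWorker removes the record if it exists.
func (s *memStore) DeleteWorker(id string) { delete(s.workers, id) }

func main() {
	s := newMemStore()
	s.SaveWorker(Worker{ID: "local", Hostname: "node-1", Status: "online", Capacity: 8, LastHeartbeat: time.Now()})
	w, err := s.GetWorker("local")
	fmt.Println(w.ID, w.Status, err)
	s.DeleteWorker("local")
	_, err = s.GetWorker("local")
	fmt.Println(err)
}
```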

Local Worker Registration

  • The API server now registers the current process as the local worker at startup.
  • The API server now refreshes the local worker heartbeat periodically while running.
  • Server shutdown stops the heartbeat loop cleanly.
  • The local record includes configured providers, single-node capabilities, and manager capacity limits.
  • Single-node deployments now appear in the same worker registry surface that future multi-worker deployments will use.
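
The register-then-refresh lifecycle above can be sketched as a ticker loop bound to a cancellable context. This is a minimal sketch, not the server's actual code: runHeartbeat, startHeartbeat, and demoInterval are hypothetical names, and the beat callback stands in for whatever store write the server performs.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// demoInterval is an assumed heartbeat interval used only for this example.
const demoInterval = time.Hour

// runHeartbeat calls beat once immediately (startup registration), then once
// per interval until ctx is cancelled.
func runHeartbeat(ctx context.Context, interval time.Duration, beat func()) {
	beat()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			beat()
		}
	}
}

// startHeartbeat runs the loop in a goroutine and returns a stop function
// that cancels the loop and waits for it to exit, so shutdown is clean.
func startHeartbeat(interval time.Duration, beat func()) (stop func()) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runHeartbeat(ctx, interval, beat)
	}()
	return func() {
		cancel()
		wg.Wait()
	}
}

func main() {
	beats := 0
	stop := startHeartbeat(demoInterval, func() { beats++ })
	stop() // blocks until the loop goroutine has returned
	fmt.Println("beats recorded:", beats)
}
```

Waiting on the goroutine inside stop means no heartbeat write can race with store teardown during shutdown.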

Worker API

  • Added read-only worker discovery:
    • GET /api/v1/workers
    • GET /api/v1/workers/{workerID}
  • Added admin-only worker mutations:
    • POST /api/v1/admin/workers/{workerID}/heartbeat
    • DELETE /api/v1/admin/workers/{workerID}
  • Worker responses include a computed stale flag, which is set when the last heartbeat is older than the freshness window.
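
The stale flag is derived at read time rather than stored. A minimal sketch of that computation follows, assuming a 60-second freshness window; the release notes do not state the real value, and isStale, workerView, view, and at are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// freshnessWindow is an assumed value; the real window is not given in the notes.
const freshnessWindow = 60 * time.Second

// workerView is a hypothetical response shape carrying the computed flag.
type workerView struct {
	ID            string
	LastHeartbeat time.Time
	Stale         bool
}

// isStale reports whether the last heartbeat is older than the window.
// A heartbeat exactly one window old is still treated as fresh.
func isStale(lastHeartbeat, now time.Time, window time.Duration) bool {
	return now.Sub(lastHeartbeat) > window
}

// at builds a fixed UTC timestamp from Unix seconds, for deterministic examples.
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// view computes the flag when building the response, not when storing the record.
func view(id string, lastHeartbeat, now time.Time) workerView {
	return workerView{
		ID:            id,
		LastHeartbeat: lastHeartbeat,
		Stale:         isStale(lastHeartbeat, now, freshnessWindow),
	}
}

func main() {
	now := at(1000)
	fmt.Println(view("local", at(990), now).Stale)    // heartbeat 10s old
	fmt.Println(view("remote-1", at(900), now).Stale) // heartbeat 100s old
}
```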

Sandbox Worker Ownership

  • Added persisted worker_id ownership to sandbox records.
  • New and adopted local sandboxes are stamped with the active worker ID.
  • Scheduler status now reports the current worker ID.
  • Sandbox API responses now include worker_id when ownership is known.

Worker-Aware Scheduler Placement

  • Spawn admission now evaluates worker placement using worker status, heartbeat freshness, provider support, declared capacity, and active sandbox counts.
  • Scheduler status now reports the selected worker and number of eligible workers.
  • Local execution does not fake remote placement: if the scheduler would place work on a remote worker, admission reports remote_worker_rpc_unavailable until the worker RPC slice lands.
  • Stale local worker records are no longer special-cased; real server runs keep the local worker fresh through the heartbeat loop.
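
The admission criteria above can be sketched as a filter followed by a selection step. This is a hedged illustration only: the candidate and decision types, the "most free capacity wins" tie-break, and the no_eligible_worker reason are assumptions made for the example. Only remote_worker_rpc_unavailable comes from the notes.

```go
package main

import "fmt"

// candidate is a hypothetical view of a worker as seen by the scheduler.
type candidate struct {
	ID        string
	Status    string
	Stale     bool
	Providers []string
	Capacity  int // declared sandbox capacity
	Active    int // currently active sandboxes
}

// decision is a hypothetical admission result.
type decision struct {
	WorkerID string
	Eligible int
	Reason   string // empty when the spawn is admitted locally
}

func supports(c candidate, provider string) bool {
	for _, p := range c.Providers {
		if p == provider {
			return true
		}
	}
	return false
}

// eligible applies the checks named in the notes: status, heartbeat
// freshness, provider support, and capacity versus active count.
func eligible(c candidate, provider string) bool {
	return c.Status == "online" && !c.Stale && supports(c, provider) && c.Active < c.Capacity
}

// place picks the eligible worker with the most free capacity (an assumed
// policy) and refuses placement on any worker other than localID.
func place(localID, provider string, workers []candidate) decision {
	var best *candidate
	count := 0
	for i := range workers {
		c := &workers[i]
		if !eligible(*c, provider) {
			continue
		}
		count++
		if best == nil || c.Capacity-c.Active > best.Capacity-best.Active {
			best = c
		}
	}
	switch {
	case best == nil:
		return decision{Reason: "no_eligible_worker"}
	case best.ID != localID:
		return decision{WorkerID: best.ID, Eligible: count, Reason: "remote_worker_rpc_unavailable"}
	default:
		return decision{WorkerID: best.ID, Eligible: count}
	}
}

func main() {
	workers := []candidate{
		{ID: "local", Status: "online", Providers: []string{"provider-a"}, Capacity: 2, Active: 2},
		{ID: "remote-1", Status: "online", Providers: []string{"provider-a"}, Capacity: 4, Active: 1},
	}
	fmt.Printf("%+v\n", place("local", "provider-a", workers))
}
```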

Distributed Lease Foundation

  • Added durable lease records for resource ownership fencing.
  • Added store APIs to acquire, renew, release, get, and list leases.
  • Lease acquisition is holder-aware and expiry-aware: a competing worker cannot acquire an unexpired lease held by another worker.
  • Lease renewals require the current holder and an unexpired lease.
  • Diagnostics and Prometheus now expose lease totals so operators can inspect active and expired lease state.
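
The acquire and renew rules above can be illustrated with an in-memory table. This is a sketch under stated assumptions, not the store's implementation: leaseTable, the error values, the 30-second leaseTTL, and the at helper are hypothetical, and a durable store would need to perform each check-and-write atomically.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// leaseTTL is an assumed lease duration for this example.
const leaseTTL = 30 * time.Second

var (
	errLeaseHeld    = errors.New("lease held by another worker")
	errNotHolder    = errors.New("caller is not the lease holder")
	errLeaseExpired = errors.New("lease expired")
)

type lease struct {
	Holder    string
	ExpiresAt time.Time
}

// leaseTable maps a resource key to its current lease.
type leaseTable map[string]lease

// Acquire succeeds unless another holder owns an unexpired lease.
func (lt leaseTable) Acquire(resource, holder string, now time.Time) error {
	cur, ok := lt[resource]
	if ok && cur.Holder != holder && now.Before(cur.ExpiresAt) {
		return errLeaseHeld
	}
	lt[resource] = lease{Holder: holder, ExpiresAt: now.Add(leaseTTL)}
	return nil
}

// Renew requires the current holder and a lease that has not yet expired.
func (lt leaseTable) Renew(resource, holder string, now time.Time) error {
	cur, ok := lt[resource]
	if !ok || cur.Holder != holder {
		return errNotHolder
	}
	if !now.Before(cur.ExpiresAt) {
		return errLeaseExpired
	}
	lt[resource] = lease{Holder: holder, ExpiresAt: now.Add(leaseTTL)}
	return nil
}

// Release drops the lease if the caller holds it.
func (lt leaseTable) Release(resource, holder string) error {
	if cur, ok := lt[resource]; !ok || cur.Holder != holder {
		return errNotHolder
	}
	delete(lt, resource)
	return nil
}

// at builds a fixed UTC timestamp from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

func main() {
	lt := leaseTable{}
	fmt.Println(lt.Acquire("sandbox/1", "worker-a", at(0)))  // first acquire
	fmt.Println(lt.Acquire("sandbox/1", "worker-b", at(10))) // competing acquire
	fmt.Println(lt.Renew("sandbox/1", "worker-a", at(31)))   // renew after expiry
	fmt.Println(lt.Acquire("sandbox/1", "worker-b", at(31))) // takeover after expiry
}
```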

Lease Enforcement

  • Local spawns now acquire a sandbox lease before persisting the sandbox record.
  • Runtime adoption during reconciliation now acquires a sandbox lease before adopting unknown provider runtimes.
  • Pool VM and pooled logical sandbox creation now acquire leases.
  • Destroy now acquires or renews the local worker lease before mutating provider/runtime/store state.
  • Successful destroy releases the sandbox lease.
  • Added wrong-holder lease tests verifying that local destroy cannot mutate a sandbox owned by another worker.

Worker RPC Contract And Auth Model

  • Added internal/workerproto with transport-neutral worker request and response envelopes.
  • Defined contract methods for heartbeat, spawn, destroy, status, lease renewal, and shutdown.
  • Mutating worker assignments require a lease token in the message contract.
  • Added transport-neutral worker auth claims and initial scopes.
  • Documented the worker trust boundary, suggested headers, lease fencing rules, and Postgres cluster-store guarantees in docs/worker-rpc-contract.md.
  • Remote worker execution remains gated until a network transport enforces this contract.
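
To make the contract shape concrete, here is a hedged sketch of a transport-neutral request envelope with a lease-token check. None of these names are taken from internal/workerproto: Method, Request, the JSON field names, and the choice of spawn and destroy as the methods that require a token are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Method names one of the contract operations listed in the notes.
type Method string

const (
	MethodHeartbeat  Method = "heartbeat"
	MethodSpawn      Method = "spawn"
	MethodDestroy    Method = "destroy"
	MethodStatus     Method = "status"
	MethodLeaseRenew Method = "lease_renew"
	MethodShutdown   Method = "shutdown"
)

// Request is a hypothetical envelope; it carries no transport details.
type Request struct {
	Method     Method          `json:"method"`
	WorkerID   string          `json:"worker_id"`
	LeaseToken string          `json:"lease_token,omitempty"`
	Payload    json.RawMessage `json:"payload,omitempty"`
}

var (
	errUnknownMethod     = errors.New("unknown worker method")
	errMissingLeaseToken = errors.New("mutating request requires a lease token")
)

// requiresLease marks spawn and destroy as mutating assignments.
// Which methods count as mutating is an assumption here.
func requiresLease(m Method) bool {
	return m == MethodSpawn || m == MethodDestroy
}

// Validate rejects unknown methods and mutating requests without a token.
func (r Request) Validate() error {
	switch r.Method {
	case MethodHeartbeat, MethodSpawn, MethodDestroy, MethodStatus, MethodLeaseRenew, MethodShutdown:
	default:
		return errUnknownMethod
	}
	if requiresLease(r.Method) && r.LeaseToken == "" {
		return errMissingLeaseToken
	}
	return nil
}

func main() {
	req := Request{Method: MethodSpawn, WorkerID: "worker-1"}
	fmt.Println(req.Validate()) // rejected: no lease token yet

	req.LeaseToken = "example-token"
	b, err := json.Marshal(req)
	fmt.Println(string(b), err, req.Validate())
}
```

Keeping validation on the envelope itself means any future transport can enforce the same rule before dispatching to a worker.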

Diagnostics And Metrics

  • Diagnostics now include worker totals, online count, stale count, unhealthy count, and worker items.
  • Diagnostics now include lease totals, active count, expired count, and active leases by holder.
  • Diagnostics sandbox summaries now include by_worker counts.
  • Prometheus output now includes:
    • stacyvm_workers_total{status="total"}
    • stacyvm_workers_total{status="online"}
    • stacyvm_workers_total{status="stale"}
    • stacyvm_workers_total{status="unhealthy"}
    • stacyvm_leases_total{status="active"}
    • stacyvm_sandboxes_by_worker_total{worker="local"}
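
For readers who want to see what the worker series look like on the wire, the sketch below renders them in the Prometheus text exposition format. The metric name and label values are taken from the list above; workerCounts and renderWorkerMetrics are hypothetical, and the real handler may also emit HELP and TYPE lines.

```go
package main

import (
	"fmt"
	"strings"
)

// workerCounts is a hypothetical summary of the worker registry.
type workerCounts struct {
	Total, Online, Stale, Unhealthy int
}

// renderWorkerMetrics writes one sample line per status label.
func renderWorkerMetrics(c workerCounts) string {
	rows := []struct {
		status string
		n      int
	}{
		{"total", c.Total},
		{"online", c.Online},
		{"stale", c.Stale},
		{"unhealthy", c.Unhealthy},
	}
	var b strings.Builder
	for _, r := range rows {
		// %q quotes the label value; safe here because the values are plain ASCII.
		fmt.Fprintf(&b, "stacyvm_workers_total{status=%q} %d\n", r.status, r.n)
	}
	return b.String()
}

func main() {
	fmt.Print(renderWorkerMetrics(workerCounts{Total: 2, Online: 1, Stale: 1}))
}
```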

Documentation

  • Updated the changelog with Phase 10 changes.
  • Updated the API reference with worker endpoints and metrics.
  • Updated the README endpoint table with worker discovery.
  • Updated the production readiness checklist with Phase 10 acceptance criteria.

Code Areas

  • internal/store: worker model, lease model, migrations, SQLite CRUD, and sandbox worker_id persistence.
  • internal/workerproto: worker RPC contract and auth claim types.
  • internal/api/routes: worker routes, diagnostics worker summary, and Prometheus worker metrics.
  • internal/api/server.go: local worker startup registration, heartbeat refresh loop, and route mounting.
  • docs: API, README, changelog, production readiness, and release notes.

Verification

  • go test ./internal/store ./internal/api/routes ./internal/api
  • scripts/check-swagger.sh
  • go test ./...
  • git diff --check

Remaining Phase 10 Direction

Phase 10 is complete as a foundation branch. Follow-up phases should implement the network worker daemon/transport, Postgres-backed cluster store, remote lifecycle conformance tests, and OIDC/RBAC for enterprise production.