Worker RPC Contract
Phase 10 defined the control-plane-to-worker contract. Phase 11 wired that contract into a real worker runtime: stacyvm worker can authenticate to the control plane, submit heartbeat state through a worker-only HTTP endpoint, and expose an optional inbound RPC server with --listen.
Phase 12 starts remote sandbox I/O routing. Remote spawn, status, destroy, lease renewal, shutdown/drain, non-streaming exec, live exec-stream calls, file APIs, and console logs now use the worker RPC transport when the scheduler selects a non-local worker that advertises rpc_url.
Contract Package
The Go contract lives in internal/workerproto.
Core RPC methods:
| Method | Direction | Lease required | Purpose |
|---|---|---|---|
| worker.heartbeat | worker to control plane | No | Report liveness, providers, capabilities, and capacity. |
| worker.spawn | control plane to worker | Yes | Assign sandbox creation to the selected worker. |
| worker.destroy | control plane to worker | Yes | Assign sandbox teardown to the owning worker. |
| worker.status | control plane to worker | No | Ask a worker for runtime state. |
| worker.exec | control plane to worker | No | Run a non-streaming command in an owned runtime. |
| worker.exec_stream | control plane to worker | No | Run a command and stream stdout/stderr chunks. |
| worker.file_write | control plane to worker | No | Write file content in an owned runtime. |
| worker.file_read | control plane to worker | No | Read file content from an owned runtime. |
| worker.file_list | control plane to worker | No | List files in an owned runtime. |
| worker.file_delete | control plane to worker | No | Delete a file or directory in an owned runtime. |
| worker.file_move | control plane to worker | No | Move or rename a file in an owned runtime. |
| worker.file_chmod | control plane to worker | No | Change file mode in an owned runtime. |
| worker.file_stat | control plane to worker | No | Stat a file in an owned runtime. |
| worker.file_glob | control plane to worker | No | Evaluate a glob pattern in an owned runtime. |
| worker.logs | control plane to worker | No | Return console log lines from an owned runtime. |
| worker.renew_lease | control plane to worker | Yes | Confirm continued ownership and renew fencing. |
| worker.shutdown | control plane to worker | No | Drain or stop a worker process. |
Lease Fencing
Every mutating lifecycle assignment must include a lease token:
- resource_id is the sandbox ID.
- holder_id must match the selected worker.
- generation is incremented whenever ownership is acquired or renewed.
- expires_at defines when another worker may take over.
Worker Authentication
Worker identity must be separate from user and admin identity. Recommended transport headers for the future network worker:
| Header | Purpose |
|---|---|
| X-Worker-ID | Stable worker ID that must match the token subject. |
| X-Worker-Token or Authorization: Bearer <token> | Worker token signed by the control plane or trusted issuer. |
| X-Request-ID | Request correlation across control plane and worker logs. |
workerproto.AuthClaims carries worker_id, scopes, and expires.
Worker scopes: worker:heartbeat, worker:spawn, worker:destroy, worker:status, worker:exec, worker:files, worker:logs, worker:lease.
Worker authentication can be configured with:
- auth.worker_token as a shared staging token.
- auth.worker_token_file as a secret-mounted file containing the shared staging token.
- auth.worker_tokens.<worker_id> as a per-worker token map for production-aligned staging.
- auth.worker_signing_key for HMAC-SHA256 signed worker tokens using the stacyvm-worker-v1.<payload>.<signature> format.
- auth.worker_signing_key_file as a secret-mounted file containing the active signing key.
- auth.worker_signing_keys as additional verification keys accepted during signing-key rotation.
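Assuming standard YAML nesting of these dotted key names, a production-aligned auth block might look like this sketch; values and paths are placeholders, and the list shape of worker_signing_keys is an assumption:

```yaml
auth:
  # Staging alternatives (do not combine a shared token with signed tokens):
  # worker_token: "shared-staging-token"
  # worker_tokens:
  #   worker-a: "token-for-worker-a"

  # Production: HMAC-SHA256 signed worker tokens, key from a secret-mounted file.
  worker_signing_key_file: /run/secrets/stacyvm/worker-signing-key
  # Previous keys still accepted for verification during signing-key rotation.
  worker_signing_keys:
    - "previous-signing-key"
```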
When a worker ID has an entry in auth.worker_tokens, that worker-specific token takes precedence and the shared token is rejected for that worker ID. This keeps legacy staging configs compatible while giving production deployments individually rotatable worker credentials.
Signed worker token payloads are base64url JSON claims with worker_id, jti, aud, optional worker scopes, iat, optional nbf, and exp. The authenticated X-Worker-ID must match the signed worker_id, expired or not-yet-valid tokens are rejected, and any non-worker scopes are ignored. Tokens with iat are capped at a 15 minute lifetime with 30 seconds of clock skew tolerance. stacyvm worker can derive short-lived heartbeat and lease-renewal tokens with aud=worker:control-plane from auth.worker_signing_key when no static --worker-token or auth.worker_token is provided. Operators can also issue a token explicitly with stacyvm worker token <worker-id> --ttl 5m --format json to capture the token ID and expiry metadata.
The same signed token format is accepted by worker RPC servers for control-plane-to-worker calls, but RPC tokens use aud=worker:rpc. When the control plane has auth.worker_signing_key and no shared auth.worker_token, it mints short-lived RPC-audience tokens for the target worker before calling /rpc. This lets remote spawn, status, exec, file, log, preview, and destroy routing avoid static shared worker RPC credentials.
Issued tokens include a jti token ID. During an incident, add a compromised token ID to auth.worker_revoked_token_ids; both worker-to-control-plane routes and control-plane-to-worker RPC reject matching signed tokens.
Token incident-response runbook:
stacyvm worker token inspect decodes token metadata without verifying the signature. Use it to recover worker_id, jti, aud, and expiry metadata from an already-captured token before adding the jti value to auth.worker_revoked_token_ids; do not treat inspected claims as authenticated identity. stacyvm worker token verify validates the signature against auth.worker_signing_key, accepts auth.worker_signing_keys during rotation, applies optional --worker-id and --audience expectations, and rejects configured revoked token IDs.
For production services, prefer auth.worker_token_file, auth.worker_signing_key_file, --worker-token-file, --worker-signing-key-file, --signing-key-file, and --verification-key-file with secret-mounted files over passing long-lived worker secrets directly in YAML, shell history, or environment variables. The config loader rejects ambiguous pairs such as auth.worker_signing_key plus auth.worker_signing_key_file, and stacyvm config lint --production reports whether worker token and signing-key values came from secret file references or inline config. stacyvm worker --worker-token-file reloads the token file for each heartbeat and lease-renewal request, so an external issuer or sidecar can rotate short-lived signed worker tokens without restarting the worker process.
stacyvm worker token rotation-plan emits a no-secret checklist, config sketch, and validation commands for a two-key rotation window. No-downtime signing-key rotation uses this sequence:
- Set the new key as auth.worker_signing_key.
- Move the previous key into auth.worker_signing_keys.
- Restart or reload workers so they mint with the new key.
- Wait until all old worker tokens have expired.
- Remove the old key from auth.worker_signing_keys.
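Assuming standard YAML nesting of these keys, the overlap window in the sequence above might look like this sketch (values are placeholders):

```yaml
auth:
  # Step 1: the new key is the active minting key.
  worker_signing_key_file: /run/secrets/stacyvm/worker-signing-key-new
  # Step 2: the previous key stays valid for verification only.
  worker_signing_keys:
    - "PREVIOUS-KEY"
  # Steps 3-5: reload workers, wait out old token lifetimes,
  # then remove PREVIOUS-KEY from worker_signing_keys.
```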
stacyvm config lint --production warns when the active signing key is repeated in auth.worker_signing_keys, when duplicate rotation keys are configured, or when a shared auth.worker_token remains configured beside signed worker tokens.
Worker RPC mTLS
Signed worker tokens authenticate the worker identity at the application layer. Enterprise deployments should also protect worker RPC transport with mTLS when worker RPC crosses a host or network boundary. Worker RPC TLS is opt-in through worker.rpc_tls:
server_cert_file and server_key_file serve the inbound /rpc endpoint. When client_ca_file is set, the worker requires and verifies a client certificate from the control plane.
On control-plane nodes, ca_file verifies worker server certificates, client_cert_file and client_key_file present the control-plane client identity, and server_name pins the expected worker certificate name when DNS or advertised rpc_url hostnames differ.
insecure_skip_verify exists only for throwaway local tests and should fail production config lint.
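Assuming worker.rpc_tls nests as YAML, a worker-side sketch with placeholder paths (the control-plane-side fields ca_file, client_cert_file, client_key_file, and server_name live in the control-plane config, whose exact key path is not specified here):

```yaml
worker:
  rpc_tls:
    # Serve the inbound /rpc endpoint over TLS.
    server_cert_file: /etc/stacyvm/tls/worker.crt
    server_key_file: /etc/stacyvm/tls/worker.key
    # When set, require and verify a control-plane client certificate.
    client_ca_file: /etc/stacyvm/tls/control-plane-ca.crt
    # Throwaway local tests only; production config lint should fail this.
    insecure_skip_verify: false
```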
Current Phase 11 heartbeat endpoint:
The worker-only heartbeat route requires X-Worker-ID plus X-Worker-Token and rejects requests where the authenticated worker ID differs from the {workerID} path.
Current Phase 11 worker RPC endpoint:
The RPC endpoint accepts workerproto.Request envelopes, requires the same worker headers, and currently implements worker.status, worker.exec, worker.exec_stream, file operations, worker.logs, worker.renew_lease, worker.spawn, worker.destroy, and worker.shutdown.
For worker.spawn, the request carries a control-plane sandbox_id and the response returns both that ID and the provider runtime_id. The control plane should persist that mapping before routing later status, exec, file, or destroy operations to the owning worker.
Remote workers advertise their control-plane callback endpoint through heartbeat capacity:
When a worker advertises rpc_url and either signed worker RPC tokens or a static worker token are configured, the control plane acquires the sandbox lease for that worker, calls worker.spawn, persists the selected worker_id, and stores the returned provider runtime ID for later routing.
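For illustration only, a heartbeat capacity fragment using the two field names this page mentions (capacity.rpc_url here and capacity.preview_domain below); the surrounding envelope and any other capacity fields are assumptions:

```json
{
  "capacity": {
    "rpc_url": "https://worker-a.internal:8443/rpc",
    "preview_domain": "preview.worker-a.example.com"
  }
}
```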
Sandbox reads use the persisted worker_id and provider runtime_id to call worker.status on the owning worker. If the worker reports a changed state, the control plane updates its stored sandbox state. If the worker is temporarily unreachable, the control plane keeps serving the cached record and logs the refresh failure at debug level.
Remote destroy uses the same persisted ownership tuple. The control plane fetches the durable sandbox lease, presents it to worker.destroy, updates sandbox state to destroyed, and releases the lease after the worker confirms teardown.
Remote non-streaming exec uses the same persisted ownership tuple without acquiring a new lifecycle lease. The control plane sends command, argv mode, environment, workdir, timeout, provider, sandbox ID, and provider runtime ID to worker.exec. The worker runs the command against its local provider registry and returns exit code, stdout, and stderr. The control plane still writes normal exec logs and emits the same audit, event, metric, and timeout behavior used by local exec.
Remote exec stream uses worker.exec_stream with X-Worker-Stream: ndjson. The worker flushes each stdout/stderr chunk as an NDJSON workerproto.Response, and the control plane exposes those chunks through the manager’s existing stream channel API. Clients that do not request NDJSON can still receive the buffered ExecStreamResult response shape.
Remote file APIs use the same ownership tuple. The control plane validates/scopes paths, then sends the provider runtime ID and requested file operation to the owning worker. Dedicated remote sandboxes are not treated as pool sandboxes just because VMID stores the provider runtime ID; pool workspace scoping remains local-pool only.
Remote console logs use worker.logs with the persisted provider runtime ID, so workers read logs from the runtime they actually own instead of the control-plane sandbox ID.
Remote preview metadata uses capacity.preview_domain from the owning worker. The control plane returns that domain on remote-owned sandboxes so SDKs and the dashboard build URLs for the worker or cluster ingress that can actually reach the runtime. If a worker does not advertise a preview domain, the control plane falls back to server.preview_domain.
worker.shutdown is a drain signal. After receiving it, the worker rejects new worker.spawn assignments and reports draining in future heartbeats, which keeps it out of scheduler placement. Existing sandboxes keep their worker ownership while the worker is fresh and draining.
Startup reconciliation applies a conservative remote ownership policy:
- Fresh draining workers keep existing sandbox ownership.
- Stale, offline, or missing workers cause non-expired remote-owned sandboxes to become unhealthy.
- Expired remote-owned sandboxes become expired and release their durable lease.
- The control plane does not pretend to migrate a stateful runtime to another worker. Real reassignment requires provider-level snapshot or migration support.
Cluster Store Semantics
SQLite remains suitable for single-node and local development. Enterprise multi-worker mode should use Postgres or another store with equivalent guarantees. Required lease guarantees:
- Atomic acquire by resource_id.
- Acquire succeeds when no lease exists, the lease is expired, or the same holder renews ownership.
- Acquire fails when a different holder owns an unexpired lease.
- Renew succeeds only for the current holder and only before expiry.
- Release succeeds only for the current holder.
- Concurrent acquire attempts must serialize on the lease row.
A conforming store needs unique keying by resource_id, transactional upsert semantics, and row-level contention safety. Clock skew must be bounded because expiry is time-based.
Current Limits
Remote placement returns remote_worker_rpc_unavailable unless the selected worker advertises rpc_url and the control plane can authenticate to worker RPC with either short-lived signed RPC-audience tokens or a static worker token. Postgres-backed cluster storage and signed production worker identity are wired into the current transport; deployment certification still needs the target host/runtime conformance checks listed in cluster-conformance.
