
Worker RPC Contract

Phase 10 defined the control-plane-to-worker contract. Phase 11 wired that contract into a real worker runtime: stacyvm worker can authenticate to the control plane, submit heartbeat state through a worker-only HTTP endpoint, and expose an optional inbound RPC server with --listen. Phase 12 starts remote sandbox I/O routing: remote spawn, status, destroy, lease renewal, shutdown/drain, non-streaming exec, live exec-stream calls, file APIs, and console logs now use the worker RPC transport when the scheduler selects a non-local worker that advertises rpc_url.

Contract Package

The Go contract lives in internal/workerproto. Core envelope:
{
  "id": "req-123",
  "method": "worker.spawn",
  "worker_id": "worker-a",
  "lease": {
    "resource_id": "sb-abc123",
    "holder_id": "worker-a",
    "generation": 4,
    "expires_at": "2026-05-09T10:31:00Z"
  },
  "params": {}
}
Supported methods:
| Method | Direction | Lease required | Purpose |
| --- | --- | --- | --- |
| worker.heartbeat | worker to control plane | No | Report liveness, providers, capabilities, and capacity. |
| worker.spawn | control plane to worker | Yes | Assign sandbox creation to the selected worker. |
| worker.destroy | control plane to worker | Yes | Assign sandbox teardown to the owning worker. |
| worker.status | control plane to worker | No | Ask a worker for runtime state. |
| worker.exec | control plane to worker | No | Run a non-streaming command in an owned runtime. |
| worker.exec_stream | control plane to worker | No | Run a command and stream stdout/stderr chunks. |
| worker.file_write | control plane to worker | No | Write file content in an owned runtime. |
| worker.file_read | control plane to worker | No | Read file content from an owned runtime. |
| worker.file_list | control plane to worker | No | List files in an owned runtime. |
| worker.file_delete | control plane to worker | No | Delete a file or directory in an owned runtime. |
| worker.file_move | control plane to worker | No | Move or rename a file in an owned runtime. |
| worker.file_chmod | control plane to worker | No | Change file mode in an owned runtime. |
| worker.file_stat | control plane to worker | No | Stat a file in an owned runtime. |
| worker.file_glob | control plane to worker | No | Evaluate a glob pattern in an owned runtime. |
| worker.logs | control plane to worker | No | Return console log lines from an owned runtime. |
| worker.renew_lease | control plane to worker | Yes | Confirm continued ownership and renew fencing. |
| worker.shutdown | control plane to worker | No | Drain or stop a worker process. |
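The lease-required column can be expressed in code. The helper below is a hypothetical sketch, not part of the published contract package:

```go
package main

import "fmt"

// leaseRequired reports whether a worker RPC method is a mutating
// lifecycle assignment that must carry a lease token. The method set
// mirrors the table above; the helper itself is illustrative.
func leaseRequired(method string) bool {
	switch method {
	case "worker.spawn", "worker.destroy", "worker.renew_lease":
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(leaseRequired("worker.spawn"))  // true
	fmt.Println(leaseRequired("worker.status")) // false
}
```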

Lease Fencing

Every mutating lifecycle assignment must include a lease token:
  • resource_id is the sandbox ID.
  • holder_id must match the selected worker.
  • generation is incremented whenever ownership is acquired or renewed.
  • expires_at defines when another worker may take over.
Workers must reject mutating work if the lease holder does not match their own worker ID or if the lease is expired. The control plane must renew leases before long-running operations cross the expiry window.

Worker Authentication

Worker identity must be separate from user and admin identity. Recommended transport headers for the future network worker:
| Header | Purpose |
| --- | --- |
| X-Worker-ID | Stable worker ID that must match the token subject. |
| X-Worker-Token or Authorization: Bearer <token> | Worker token signed by the control plane or trusted issuer. |
| X-Request-ID | Request correlation across control plane and worker logs. |
Validated worker tokens should produce workerproto.AuthClaims:
  • worker_id
  • scopes
  • expires
Initial scopes:
  • worker:heartbeat
  • worker:spawn
  • worker:destroy
  • worker:status
  • worker:exec
  • worker:files
  • worker:logs
  • worker:lease
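A claims structure matching the fields above might look like the sketch below; the Go field names are assumptions, not the actual workerproto.AuthClaims definition:

```go
package main

import (
	"fmt"
	"time"
)

// AuthClaims mirrors the claim fields listed above. Field names here
// are illustrative, not the shipped workerproto.AuthClaims type.
type AuthClaims struct {
	WorkerID string
	Scopes   []string
	Expires  time.Time
}

// HasScope reports whether the validated token grants a worker scope.
func (c AuthClaims) HasScope(scope string) bool {
	for _, s := range c.Scopes {
		if s == scope {
			return true
		}
	}
	return false
}

func main() {
	c := AuthClaims{
		WorkerID: "worker-a",
		Scopes:   []string{"worker:heartbeat", "worker:lease"},
		Expires:  time.Now().Add(5 * time.Minute),
	}
	fmt.Println(c.HasScope("worker:heartbeat")) // true
	fmt.Println(c.HasScope("worker:spawn"))     // false
}
```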
Workers must not accept user API keys or admin API keys for worker RPC. Control-plane admin access and worker execution access are separate trust boundaries. Current control-plane worker authentication accepts any of:
  • auth.worker_token as a shared staging token.
  • auth.worker_token_file as a secret-mounted file containing the shared staging token.
  • auth.worker_tokens.<worker_id> as a per-worker token map for production-aligned staging.
  • auth.worker_signing_key for HMAC-SHA256 signed worker tokens using the stacyvm-worker-v1.<payload>.<signature> format.
  • auth.worker_signing_key_file as a secret-mounted file containing the active signing key.
  • auth.worker_signing_keys as additional verification keys accepted during signing-key rotation.
When a worker has an entry in auth.worker_tokens, that worker-specific token takes precedence and the shared token is rejected for that worker ID. This keeps legacy staging configs compatible while giving production deployments individually rotatable worker credentials.

Signed worker token payloads are base64url JSON claims with worker_id, jti, aud, optional worker scopes, iat, optional nbf, and exp. The authenticated X-Worker-ID must match the signed worker_id, expired or not-yet-valid tokens are rejected, and any non-worker scopes are ignored. Tokens with iat are capped at a 15-minute lifetime with 30 seconds of clock-skew tolerance.

stacyvm worker can derive short-lived heartbeat and lease-renewal tokens with aud=worker:control-plane from auth.worker_signing_key when no static --worker-token or auth.worker_token is provided. Operators can also issue a token explicitly with stacyvm worker token <worker-id> --ttl 5m --format json to capture the token ID and expiry metadata.

The same signed token format is accepted by worker RPC servers for control-plane-to-worker calls, but RPC tokens use aud=worker:rpc. When the control plane has auth.worker_signing_key and no shared auth.worker_token, it mints short-lived RPC-audience tokens for the target worker before calling /rpc. This lets remote spawn, status, exec, file, log, preview, and destroy routing avoid static shared worker RPC credentials.

Issued tokens include a jti token ID. During an incident, add a compromised token ID to auth.worker_revoked_token_ids; both worker-to-control-plane routes and control-plane-to-worker RPC reject matching signed tokens. Token incident-response runbook:
stacyvm worker token worker-a --signing-key-file /run/secrets/stacyvm-worker-signing-key --ttl 5m --format json
stacyvm worker token inspect '<signed-worker-token>'
stacyvm worker token verify '<signed-worker-token>' --signing-key-file /run/secrets/stacyvm-worker-signing-key --worker-id worker-a --audience worker:control-plane
stacyvm worker token rotation-plan --new-key-ref /run/secrets/stacyvm-worker-signing-key-new --previous-key-ref /run/secrets/stacyvm-worker-signing-key-old --ttl 5m
stacyvm worker token inspect decodes token metadata without verifying the signature. Use it to recover worker_id, jti, aud, and expiry metadata from an already-captured token before adding the jti value to auth.worker_revoked_token_ids; do not treat inspected claims as authenticated identity. stacyvm worker token verify validates the signature against auth.worker_signing_key, accepts auth.worker_signing_keys during rotation, applies optional --worker-id and --audience expectations, and rejects configured revoked token IDs.

For production services, prefer auth.worker_token_file, auth.worker_signing_key_file, --worker-token-file, --worker-signing-key-file, --signing-key-file, and --verification-key-file with secret-mounted files over passing long-lived worker secrets directly in YAML, shell history, or environment variables. The config loader rejects ambiguous pairs such as auth.worker_signing_key plus auth.worker_signing_key_file, and stacyvm config lint --production reports whether worker token and signing-key values came from secret file references or inline config.

stacyvm worker --worker-token-file reloads the token file for each heartbeat and lease-renewal request, so an external issuer or sidecar can rotate short-lived signed worker tokens without restarting the worker process. stacyvm worker token rotation-plan emits a no-secret checklist, config sketch, and validation commands for a two-key rotation window. No-downtime signing-key rotation uses this sequence:
  1. Set the new key as auth.worker_signing_key.
  2. Move the previous key into auth.worker_signing_keys.
  3. Restart or reload workers so they mint with the new key.
  4. Wait until all old worker tokens have expired.
  5. Remove the old key from auth.worker_signing_keys.
stacyvm config lint --production warns when the active signing key is repeated in auth.worker_signing_keys, when duplicate rotation keys are configured, or when a shared auth.worker_token remains configured beside signed worker tokens.
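The stacyvm-worker-v1.<payload>.<signature> shape described above can be sketched with HMAC-SHA256. What exactly is signed (the payload alone versus the version prefix plus payload) and the claim layout are assumptions in this sketch; consult the real implementation before relying on any detail:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// signToken builds a token in the assumed
// stacyvm-worker-v1.<base64url payload>.<base64url signature> shape.
func signToken(claims map[string]any, key []byte) string {
	raw, _ := json.Marshal(claims)
	payload := base64.RawURLEncoding.EncodeToString(raw)
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte("stacyvm-worker-v1." + payload)) // assumed signing input
	return "stacyvm-worker-v1." + payload + "." +
		base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// verifyToken checks the HMAC and decodes the base64url JSON claims.
// Claim validation (aud, exp, nbf, jti revocation) would happen on the
// returned map; it is omitted here to keep the sketch short.
func verifyToken(token string, key []byte) (map[string]any, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 || parts[0] != "stacyvm-worker-v1" {
		return nil, fmt.Errorf("malformed token")
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(parts[0] + "." + parts[1]))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	if !hmac.Equal([]byte(want), []byte(parts[2])) {
		return nil, fmt.Errorf("bad signature")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, err
	}
	var claims map[string]any
	if err := json.Unmarshal(raw, &claims); err != nil {
		return nil, err
	}
	return claims, nil
}

func main() {
	key := []byte("demo-key")
	tok := signToken(map[string]any{"worker_id": "worker-a", "aud": "worker:rpc"}, key)
	claims, err := verifyToken(tok, key)
	fmt.Println(err == nil, claims["worker_id"])
}
```

Verification against multiple keys during rotation would simply try auth.worker_signing_key first and then each entry of auth.worker_signing_keys.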

Worker RPC mTLS

Signed worker tokens authenticate the worker identity at the application layer. Enterprise deployments should also protect worker RPC transport with mTLS when worker RPC crosses a host or network boundary. Worker RPC TLS is opt-in through worker.rpc_tls:
worker:
  listen_addr: "0.0.0.0:7430"
  rpc_tls:
    enabled: true
    server_cert_file: "/etc/stacyvm/tls/worker.crt"
    server_key_file: "/etc/stacyvm/tls/worker.key"
    client_ca_file: "/etc/stacyvm/tls/control-plane-ca.crt"
    ca_file: "/etc/stacyvm/tls/worker-ca.crt"
    client_cert_file: "/etc/stacyvm/tls/control-plane.crt"
    client_key_file: "/etc/stacyvm/tls/control-plane.key"
    server_name: "worker-a.internal"
    insecure_skip_verify: false
On worker nodes, server_cert_file and server_key_file serve the inbound /rpc endpoint. When client_ca_file is set, the worker requires and verifies a client certificate from the control plane.

On control-plane nodes, ca_file verifies worker server certificates, client_cert_file and client_key_file present the control-plane client identity, and server_name pins the expected worker certificate name when DNS or advertised rpc_url hostnames differ. insecure_skip_verify exists only for throwaway local tests and should fail production config lint.

Current Phase 11 heartbeat endpoint:
POST /api/v1/worker/{workerID}/heartbeat
The endpoint requires X-Worker-ID plus X-Worker-Token and rejects requests where the authenticated worker ID differs from the {workerID} path.

Current Phase 11 worker RPC endpoint:
POST /rpc
Run it with:
stacyvm worker --listen 127.0.0.1:7430
The endpoint accepts workerproto.Request envelopes, requires the same worker headers, and currently implements worker.status, worker.exec, worker.exec_stream, file operations, worker.logs, worker.renew_lease, worker.spawn, worker.destroy, and worker.shutdown.

For worker.spawn, the request carries a control-plane sandbox_id and the response returns both that ID and the provider runtime_id. The control plane should persist that mapping before routing later status, exec, file, or destroy operations to the owning worker.

Remote workers advertise their control-plane callback endpoint through heartbeat capacity:
{
  "capacity": {
    "max_sandboxes": 10,
    "rpc_url": "http://worker-a.internal:7430",
    "preview_domain": "worker-a.preview.example.com"
  }
}
When the scheduler selects a non-local worker with rpc_url and signed worker RPC tokens or a static worker token are configured, the control plane acquires the sandbox lease for that worker, calls worker.spawn, persists the selected worker_id, and stores the returned provider runtime ID for later routing.

Sandbox reads use the persisted worker_id and provider runtime_id to call worker.status on the owning worker. If the worker reports a changed state, the control plane updates its stored sandbox state. If the worker is temporarily unreachable, the control plane keeps serving the cached record and logs the refresh failure at debug level.

Remote destroy uses the same persisted ownership tuple. The control plane fetches the durable sandbox lease, presents it to worker.destroy, updates sandbox state to destroyed, and releases the lease after the worker confirms teardown.

Remote non-streaming exec uses the same persisted ownership tuple without acquiring a new lifecycle lease. The control plane sends command, argv mode, environment, workdir, timeout, provider, sandbox ID, and provider runtime ID to worker.exec. The worker runs the command against its local provider registry and returns exit code, stdout, and stderr. The control plane still writes normal exec logs and emits the same audit, event, metric, and timeout behavior used by local exec.

Remote exec stream uses worker.exec_stream with X-Worker-Stream: ndjson. The worker flushes each stdout/stderr chunk as an NDJSON workerproto.Response, and the control plane exposes those chunks through the manager's existing stream channel API. Clients that do not request NDJSON can still receive the buffered ExecStreamResult response shape.

Remote file APIs use the same ownership tuple. The control plane validates and scopes paths, then sends the provider runtime ID and requested file operation to the owning worker.
Dedicated remote sandboxes are not treated as pool sandboxes just because VMID stores the provider runtime ID; pool workspace scoping remains local-pool only.

Remote console logs use worker.logs with the persisted provider runtime ID, so workers read logs from the runtime they actually own instead of the control-plane sandbox ID.

Remote preview metadata uses capacity.preview_domain from the owning worker. The control plane returns that domain on remote-owned sandboxes so SDKs and the dashboard build URLs for the worker or cluster ingress that can actually reach the runtime. If a worker does not advertise a preview domain, the control plane falls back to server.preview_domain.

worker.shutdown is a drain signal. After receiving it, the worker rejects new worker.spawn assignments and reports draining in future heartbeats, which keeps it out of scheduler placement. Existing sandboxes keep their worker ownership while the worker is fresh and draining.

Startup reconciliation applies a conservative remote ownership policy:
  • Fresh draining workers keep existing sandbox ownership.
  • Stale, offline, or missing workers cause non-expired remote-owned sandboxes to become unhealthy.
  • Expired remote-owned sandboxes become expired and release their durable lease.
  • The control plane does not pretend to migrate a stateful runtime to another worker. Real reassignment requires provider-level snapshot or migration support.
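The policy bullets above reduce to a small decision function. A hypothetical sketch with assumed inputs, not the actual reconciliation code:

```go
package main

import "fmt"

// reconcile maps a remote-owned sandbox to a startup-reconciliation
// outcome. Inputs and outcome strings are illustrative; a fresh worker
// keeps ownership even while draining, and no migration is attempted.
func reconcile(workerFresh, workerDraining, sandboxExpired bool) string {
	switch {
	case sandboxExpired:
		return "expired; release durable lease"
	case workerFresh:
		// Fresh workers, draining or not, keep existing ownership.
		return "keep ownership"
	default:
		// Stale, offline, or missing worker.
		return "mark unhealthy"
	}
}

func main() {
	fmt.Println(reconcile(true, true, false))   // keep ownership
	fmt.Println(reconcile(false, false, false)) // mark unhealthy
	fmt.Println(reconcile(false, false, true))  // expired; release durable lease
}
```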
Current Phase 11 control-plane lease renewal endpoint:
POST /api/v1/worker/{workerID}/leases/{resourceID}/renew
The worker RPC handler validates the presented lease token before calling this endpoint. The control plane only renews unexpired leases held by the authenticated worker.

Cluster Store Semantics

SQLite remains suitable for single-node and local development. Enterprise multi-worker mode should use Postgres or another store with equivalent guarantees. Required lease guarantees:
  • Atomic acquire by resource_id.
  • Acquire succeeds when no lease exists, the lease is expired, or the same holder renews ownership.
  • Acquire fails when a different holder owns an unexpired lease.
  • Renew succeeds only for the current holder and only before expiry.
  • Release succeeds only for the current holder.
  • Concurrent acquire attempts must serialize on the lease row.
In Postgres terms, lease acquire should be implemented with a unique key on resource_id, transactional upsert semantics, and row-level contention safety. Clock skew must be bounded because expiry is time-based.

Current Limits

Remote placement returns remote_worker_rpc_unavailable unless the selected worker advertises rpc_url and the control plane can authenticate to worker RPC with either short-lived signed RPC-audience tokens or a static worker token. Postgres-backed cluster storage and signed production worker identity are wired into the current transport; deployment certification still needs the target host/runtime conformance checks listed in cluster-conformance.