Skip to content

Decouple Sandbox Binaries / Config from ActorTemplate#261

Merged
Benjamin Elder (BenTheElder) merged 6 commits into
agent-substrate:mainfrom
BenTheElder:sandbox-bins
Jun 17, 2026
Merged

Decouple Sandbox Binaries / Config from ActorTemplate#261
Benjamin Elder (BenTheElder) merged 6 commits into
agent-substrate:mainfrom
BenTheElder:sandbox-bins

Conversation

@BenTheElder

Copy link
Copy Markdown
Collaborator

Fixes #253

It's a good idea to open an issue first for discussion.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

Should make something like #239 less awkward, and also #255 (we don't have to cram more fields into actortemplate for the shim binary and update all of the demo actortemplates to include them).

@BenTheElder Benjamin Elder (BenTheElder) marked this pull request as ready for review June 17, 2026 00:43
SandboxClassGvisor SandboxClass = "gvisor"
// SandboxClassMicroVM is the micro-VM runtime (cmd/ateom-microvm); needs
// /dev/kvm and vhost devices.
SandboxClassMicroVM SandboxClass = "microvm"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably shouldn't squat on the "microvm" name for one particular microvm implementation?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been on the fence about this.

We could call it "cloud-hypervisor" (ignores kata, verbose) "kata" (ignores specific vm implementation).

I think it's OK to squat on this, most users that want a microvm probably don't need a more specific name, it gives us some freedom to iterate on the implementation details.

gvisor on the other hand seems to be in a class of it's own with a different model and no comparable alternatives.

If there's a really compelling additional microvm-like backend ... we'll come up with a name for it ("firecracker" ?), so far the other folks looking at this have settled on chv as well I think.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to call it "cloud-hypervisor" or "chv"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to stick with microvm for now. If this is fully extensible later then a different implementation can use the same sandbox class name and be more "portable" for users.

"chv" is too opaque to an end-user and kata + chv are currently implementation details that don't even exist yet.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later we might want to move to an open string, but I don't think we are there yet and I didn't want to stall uVM further. When we have 2 decent ateoms we will have a better idea what extensible looks like. Modal for example offers "sandbox" (gvisor) and "vm sandbox" (chv?).

// and decouples sandbox binary selection from ActorTemplate.
//
// +genclient
// +genclient:nonNamespaced

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why non-namespaced? Wouldn't namespaced, and in the same namespace as the workerpool make sense?

A cluster scoped resource might be restrictive if you're not the admin of the Kubernetes cluster.

@BenTheElder Benjamin Elder (BenTheElder) Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why non-namespaced? Wouldn't namespaced, and in the same namespace as the workerpool make sense?

I was patterning it after StorageClass which is cluster scoped and has one tagged default.

So you could avoid configuring this repeatedly if you just want all worker pools of the same sandboxClass to use the same binaries.

It's not quite the same since you have one default SandboxConfig per sandboxClass.

A cluster scoped resource might be restrictive if you're not the admin of the Kubernetes cluster.

Yeah, at first blush I thought that might be a feature, it's pretty privileged to configure this, since we'll run the referenced binaries with high permissions.

I imagine similar to StorageClass a platform / vendor might want to configure these across all namespaces?

I don't feel super strongly about this part though.

I was mostly thinking about how to factor this out towards something semi-extensible that the casual user could ignore. I'm genuinely not sure if nonNamespaced makes sense.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly orthogonal, but I've also been mulling putting defaults into the ateom binaries, so if you really don't care you don't have to configure this at all.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you could avoid configuring this repeatedly if you just want all worker pools of the same sandboxClass to use the same binaries.

I also don't feel strongly. However, I think that actually having it namespaced does not preclude referencing it from a different namespace. So a cluster administrator could provide some default ones in a special namespace, and give everyone permission to refer to them (via some K8s RBAC, I guess)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it’s the same, but the GatewayClass is also cluster scoped. I agree that namespaced can be shared, but given that other similar resources are cluster scoped, it seems easier to start there and then fan out later. What do yall think?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should merge and iterate on this point. It's easy enough to refactor and we have a lot to add on top.

@BenTheElder

Copy link
Copy Markdown
Collaborator Author

[trivial rebase to pickup CI fix]

@BenTheElder

Copy link
Copy Markdown
Collaborator Author

Benjamin Elder (@BenTheElder) Benjamin Elder (BenTheElder) dismissed Taahir Ahmed (ahmedtd)’s stale review via 10f24ce

This is a weird github-ism. AFAICT what I did was ... push with an additional commit? But it's phrased as if I went out of my way to "dismiss" the review ...

Default: true,
Assets: map[string]map[string]atev1alpha1.AssetFile{
"amd64": {"runsc": {
URL: "gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when do these get pruned? (from the gcs bucket)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, but this value isn't new to the PR. This is what we ship in clusters. We can update both when a release has the fixes we need.

Comment thread cmd/ateapi/internal/controlapi/sandbox_assets.go
Comment thread cmd/atelet/main.go
@dims

Copy link
Copy Markdown
Collaborator

Benjamin Elder (@BenTheElder) there is consensus that anyone can create their own microvm implementation without that code living in our repo? (just making sure we are all in violent agreement).

@BenTheElder

Benjamin Elder (BenTheElder) commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Benjamin Elder (@BenTheElder) there is consensus that anyone can create their own microvm implementation without that code living in our repo? (just making sure we are all in violent agreement).

It sounds like a nice future but landing one that works first is the top priority for POC 2 ("alpha", end of June). This is going way out of the way to make the layering more prepared for that sort of thing over just landing it but it doesn't solve all of the blockers.

EDIT: IOW: yes ... eventually, but it's not clear yet if that's an ateom, an atelet or what. It's not totally clear if the ateom/atelet split will even survive, or if we should merge these two ateoms later (and let the same worker fungibly run different sandbox types, I intentionally didn't do that for now so we can figure out different ateoms / worker pools). This PR gets us closer to extensible.

…late

Introduce a cluster-scoped SandboxConfig CRD holding the sandbox binaries
(per sandbox class, per arch: url + sha256) and decouple binary selection
from the ActorTemplate:

- New SandboxConfig with SandboxClass (gvisor|microvm), a Default flag
  (cluster default per class), and Assets[arch][name]=AssetFile.
- WorkerPool gains SandboxClass + SandboxConfigName (resolve by name, else
  the cluster default for the class).
- ActorTemplate drops Runsc and the runsc/auth config types.
- Regenerated CRDs, deepcopy, and clientset/listers/informers.
- docs/api-guide.md: document SandboxConfig and the new WorkerPool fields.
RunRequest now carries a backend-agnostic SandboxAssets (sandbox_class +
assets keyed by arch then asset name). CheckpointRequest/RestoreRequest
reserve the old runsc field: checkpoint uses the version recorded on-node at
Run/Restore, and restore reads it from the snapshot manifest.
- Fetch sandbox binaries from RunRequest.SandboxAssets (content-addressed,
  cached); for gVisor this is the "runsc" asset passed to ateom as RunscPath.
- Record the running version in per-actor on-node state at Run/Restore so a
  later Checkpoint can re-fetch and pin it (ateompath.ActorSandboxAssetsFile).
- Write a snapshot manifest (manifest.json) beside the checkpoint images at
  Checkpoint and read it at Restore, so restore is self-describing across
  nodes. Adds ategcs.SendBytesToGCS for the manifest upload.
On boot-from-spec (Run), resolve the actor's WorkerPool, take its
SandboxClass, and pick the named or cluster-default SandboxConfig, sending
the result as RunRequest.SandboxAssets. Checkpoint/Restore no longer send
sandbox config (atelet uses the on-node record / snapshot manifest).

- New resolveSandboxAssets helper; thread WorkerPool + SandboxConfig listers
  through NewService/NewActorWorkflow into the resume Run step.
- Drop the RunscConfig plumbing from resume/pause/suspend.
- Grant ate-api-server get/watch/list on workerpools + sandboxconfigs.
- functional_test: create a WorkerPool + default gvisor SandboxConfig.
- Ship a cluster-default gvisor SandboxConfig (gvisor-default) carrying the
  runsc assets and apply it in install-ate.sh so gVisor pools work out of the
  box; ensure_crds also checks for the SandboxConfig CRD.
- Remove the now-removed runsc block from all demo templates and the e2e
  probe fixture; they rely on the default SandboxConfig.
- e2e demo_test: stop copying ActorTemplate.Runsc; carry the WorkerPool's
  SandboxClass/SandboxConfigName instead.
The CRD already requires each AssetFile to set url + sha256 (sha256 must be 64
hex). Add a ValidatingAdmissionPolicy enforcing the per-class requirement the
CRD schema can't express: a gvisor SandboxConfig must define a "runsc" asset
for every architecture under spec.assets (fail-closed at apply time, instead of
only failing later when an actor tries to boot). Applied before the default
SandboxConfig in install-ate.sh so the defaults are validated too.

Add an envtest test that loads the shipped policy, waits for it to activate, and
covers both the VAP (gvisor must have runsc) and the CRD-level url/sha256
requirements.
SHA256 string `json:"sha256"`
}

// SandboxConfigSpec is the desired state of a SandboxConfig.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I originally read the issue I think I was envisioning something more akin to SandboxClass. Wherein the Sandbox class would define the assets, and there could be any number of them with arbitrary names. This one seems to instead have hardcoded class names. If we're going to use hardcoded class names I have a couple of follow-up questions:

  1. why not make the config blocks for gvisor vs ch explicit rather than a generic map?
  2. Why would a user need multiple SandboxConfig?

@BenTheElder Benjamin Elder (BenTheElder) Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] This one seems to instead have hardcoded class names. [...]

To start, because it's not really possible to bring a custom one yet. But you can see how that would update pretty easily (here, the other parts are trickier)

why not make the config blocks for gvisor vs ch explicit rather than a generic map?

It's not really necessary so far. They both just need a set of assets which should be recorded alongside the snapshot. A per-class VAP (see last commit) can ensure you have the right list of assets while the generic config shape can make sure you have set url+sha etc.

I think other properties of the sandbox, like say, memory capacity, volume mounts, will be on the actortemplate side.

Why would a user need multiple SandboxConfig?

You can set different SandboxConfig on different workerpools. For example to A/B a new gvisor release.

EDIT: Which to an ActorTemplate is still the same "sandbox class", it's more akin to a different release of that class. Everything in SandboxConfig is a versioned asset for a sandboxClass. It might need a different name ...

@BenTheElder Benjamin Elder (BenTheElder) merged commit 00a90a8 into agent-substrate:main Jun 17, 2026
8 checks passed
@BenTheElder Benjamin Elder (BenTheElder) deleted the sandbox-bins branch June 17, 2026 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Decouple Sandbox Binaries From Actor Template

4 participants