Decouple Sandbox Binaries / Config from ActorTemplate by BenTheElder · Pull Request #261 · agent-substrate/substrate

Benjamin Elder (BenTheElder) · 2026-06-17T00:37:22Z

Fixes #253

It's a good idea to open an issue first for discussion.

Tests pass
Appropriate changes to documentation are included in the PR

Should make something like #239 less awkward, and also #255 (we don't have to cram more fields into actortemplate for the shim binary and update all of the demo actortemplates to include them).

Taahir Ahmed (ahmedtd) · 2026-06-17T00:52:53Z

+	SandboxClassGvisor SandboxClass = "gvisor"
+	// SandboxClassMicroVM is the micro-VM runtime (cmd/ateom-microvm); needs
+	// /dev/kvm and vhost devices.
+	SandboxClassMicroVM SandboxClass = "microvm"


We probably shouldn't squat on the "microvm" name for one particular microvm implementation?

I've been on the fence about this.

We could call it "cloud-hypervisor" (ignores kata, verbose) "kata" (ignores specific vm implementation).

I think it's OK to squat on this, most users that want a microvm probably don't need a more specific name, it gives us some freedom to iterate on the implementation details.

gvisor on the other hand seems to be in a class of it's own with a different model and no comparable alternatives.

If there's a really compelling additional microvm-like backend ... we'll come up with a name for it ("firecracker" ?), so far the other folks looking at this have settled on chv as well I think.

+1 to call it "cloud-hypervisor" or "chv"

I'd prefer to stick with microvm for now. If this is fully extensible later then a different implementation can use the same sandbox class name and be more "portable" for users.

"chv" is too opaque to an end-user and kata + chv are currently implementation details that don't even exist yet.

Later we might want to move to an open string, but I don't think we are there yet and I didn't want to stall uVM further. When we have 2 decent ateoms we will have a better idea what extensible looks like. Modal for example offers "sandbox" (gvisor) and "vm sandbox" (chv?).

Taahir Ahmed (ahmedtd) · 2026-06-17T00:57:22Z

+// and decouples sandbox binary selection from ActorTemplate.
+//
+// +genclient
+// +genclient:nonNamespaced


Why non-namespaced? Wouldn't namespaced, and in the same namespace as the workerpool make sense?

A cluster scoped resource might be restrictive if you're not the admin of the Kubernetes cluster.

Why non-namespaced? Wouldn't namespaced, and in the same namespace as the workerpool make sense?

I was patterning it after StorageClass which is cluster scoped and has one tagged default.

So you could avoid configuring this repeatedly if you just want all worker pools of the same sandboxClass to use the same binaries.

It's not quite the same since you have one default SandboxConfig per sandboxClass.

A cluster scoped resource might be restrictive if you're not the admin of the Kubernetes cluster.

Yeah, at first blush I thought that might be a feature, it's pretty privileged to configure this, since we'll run the referenced binaries with high permissions.

I imagine similar to StorageClass a platform / vendor might want to configure these across all namespaces?

I don't feel super strongly about this part though.

I was mostly thinking about how to factor this out towards something semi-extensible that the casual user could ignore. I'm genuinely not sure if nonNamespaced makes sense.

Slightly orthogonal, but I've also been mulling putting defaults into the ateom binaries, so if you really don't care you don't have to configure this at all.

So you could avoid configuring this repeatedly if you just want all worker pools of the same sandboxClass to use the same binaries.

I also don't feel strongly. However, I think that actually having it namespaced does not preclude referencing it from a different namespace. So a cluster administrator could provide some default ones in a special namespace, and give everyone permission to refer to them (via some K8s RBAC, I guess)

Not that it’s the same, but the GatewayClass is also cluster scoped. I agree that namespaced can be shared, but given that other similar resources are cluster scoped, it seems easier to start there and then fan out later. What do yall think?

I think we should merge and iterate on this point. It's easy enough to refactor and we have a lot to add on top.

Benjamin Elder (BenTheElder) · 2026-06-17T01:56:53Z

[trivial rebase to pickup CI fix]

Benjamin Elder (BenTheElder) · 2026-06-17T03:50:48Z

Benjamin Elder (@BenTheElder) Benjamin Elder (BenTheElder) dismissed Taahir Ahmed (ahmedtd)’s stale review via 10f24ce

This is a weird github-ism. AFAICT what I did was ... push with an additional commit? But it's phrased as if I went out of my way to "dismiss" the review ...

Davanum Srinivas (dims) · 2026-06-17T12:37:28Z

+			Default:      true,
+			Assets: map[string]map[string]atev1alpha1.AssetFile{
+				"amd64": {"runsc": {
+					URL:    "gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc",


when do these get pruned? (from the gcs bucket)

No idea, but this value isn't new to the PR. This is what we ship in clusters. We can update both when a release has the fixes we need.

Davanum Srinivas (dims) · 2026-06-17T12:51:57Z

Benjamin Elder (@BenTheElder) there is consensus that anyone can create their own microvm implementation without that code living in our repo? (just making sure we are all in violent agreement).

Benjamin Elder (BenTheElder) · 2026-06-17T14:39:42Z

Benjamin Elder (@BenTheElder) there is consensus that anyone can create their own microvm implementation without that code living in our repo? (just making sure we are all in violent agreement).

It sounds like a nice future but landing one that works first is the top priority for POC 2 ("alpha", end of June). This is going way out of the way to make the layering more prepared for that sort of thing over just landing it but it doesn't solve all of the blockers.

EDIT: IOW: yes ... eventually, but it's not clear yet if that's an ateom, an atelet or what. It's not totally clear if the ateom/atelet split will even survive, or if we should merge these two ateoms later (and let the same worker fungibly run different sandbox types, I intentionally didn't do that for now so we can figure out different ateoms / worker pools). This PR gets us closer to extensible.

…late Introduce a cluster-scoped SandboxConfig CRD holding the sandbox binaries (per sandbox class, per arch: url + sha256) and decouple binary selection from the ActorTemplate: - New SandboxConfig with SandboxClass (gvisor|microvm), a Default flag (cluster default per class), and Assets[arch][name]=AssetFile. - WorkerPool gains SandboxClass + SandboxConfigName (resolve by name, else the cluster default for the class). - ActorTemplate drops Runsc and the runsc/auth config types. - Regenerated CRDs, deepcopy, and clientset/listers/informers. - docs/api-guide.md: document SandboxConfig and the new WorkerPool fields.

RunRequest now carries a backend-agnostic SandboxAssets (sandbox_class + assets keyed by arch then asset name). CheckpointRequest/RestoreRequest reserve the old runsc field: checkpoint uses the version recorded on-node at Run/Restore, and restore reads it from the snapshot manifest.

- Fetch sandbox binaries from RunRequest.SandboxAssets (content-addressed, cached); for gVisor this is the "runsc" asset passed to ateom as RunscPath. - Record the running version in per-actor on-node state at Run/Restore so a later Checkpoint can re-fetch and pin it (ateompath.ActorSandboxAssetsFile). - Write a snapshot manifest (manifest.json) beside the checkpoint images at Checkpoint and read it at Restore, so restore is self-describing across nodes. Adds ategcs.SendBytesToGCS for the manifest upload.

On boot-from-spec (Run), resolve the actor's WorkerPool, take its SandboxClass, and pick the named or cluster-default SandboxConfig, sending the result as RunRequest.SandboxAssets. Checkpoint/Restore no longer send sandbox config (atelet uses the on-node record / snapshot manifest). - New resolveSandboxAssets helper; thread WorkerPool + SandboxConfig listers through NewService/NewActorWorkflow into the resume Run step. - Drop the RunscConfig plumbing from resume/pause/suspend. - Grant ate-api-server get/watch/list on workerpools + sandboxconfigs. - functional_test: create a WorkerPool + default gvisor SandboxConfig.

- Ship a cluster-default gvisor SandboxConfig (gvisor-default) carrying the runsc assets and apply it in install-ate.sh so gVisor pools work out of the box; ensure_crds also checks for the SandboxConfig CRD. - Remove the now-removed runsc block from all demo templates and the e2e probe fixture; they rely on the default SandboxConfig. - e2e demo_test: stop copying ActorTemplate.Runsc; carry the WorkerPool's SandboxClass/SandboxConfigName instead.

The CRD already requires each AssetFile to set url + sha256 (sha256 must be 64 hex). Add a ValidatingAdmissionPolicy enforcing the per-class requirement the CRD schema can't express: a gvisor SandboxConfig must define a "runsc" asset for every architecture under spec.assets (fail-closed at apply time, instead of only failing later when an actor tries to boot). Applied before the default SandboxConfig in install-ate.sh so the defaults are validated too. Add an envtest test that loads the shipped policy, waits for it to activate, and covers both the VAP (gvisor must have runsc) and the CRD-level url/sha256 requirements.

Eitan Yarmush (EItanya) · 2026-06-17T16:06:46Z

+	SHA256 string `json:"sha256"`
+}
+
+// SandboxConfigSpec is the desired state of a SandboxConfig.


When I originally read the issue I think I was envisioning something more akin to SandboxClass. Wherein the Sandbox class would define the assets, and there could be any number of them with arbitrary names. This one seems to instead have hardcoded class names. If we're going to use hardcoded class names I have a couple of follow-up questions:

why not make the config blocks for gvisor vs ch explicit rather than a generic map?

Why would a user need multiple SandboxConfig?

[...] This one seems to instead have hardcoded class names. [...]

To start, because it's not really possible to bring a custom one yet. But you can see how that would update pretty easily (here, the other parts are trickier)

why not make the config blocks for gvisor vs ch explicit rather than a generic map?

It's not really necessary so far. They both just need a set of assets which should be recorded alongside the snapshot. A per-class VAP (see last commit) can ensure you have the right list of assets while the generic config shape can make sure you have set url+sha etc.

I think other properties of the sandbox, like say, memory capacity, volume mounts, will be on the actortemplate side.

Why would a user need multiple SandboxConfig?

You can set different SandboxConfig on different workerpools. For example to A/B a new gvisor release.

EDIT: Which to an ActorTemplate is still the same "sandbox class", it's more akin to a different release of that class. Everything in SandboxConfig is a versioned asset for a sandboxClass. It might need a different name ...

Benjamin Elder (BenTheElder) marked this pull request as ready for review June 17, 2026 00:43

Benjamin Elder (BenTheElder) mentioned this pull request Jun 17, 2026

WIP: introduce ateom-cloud-hypervisor #239

Closed

2 tasks

Taahir Ahmed (ahmedtd) previously approved these changes Jun 17, 2026

View reviewed changes

Benjamin Elder (BenTheElder) force-pushed the sandbox-bins branch from 50c978b to 165fde6 Compare June 17, 2026 01:54

Benjamin Elder (BenTheElder) dismissed Taahir Ahmed (ahmedtd)’s stale review via 10f24ce June 17, 2026 02:15

Davanum Srinivas (dims) reviewed Jun 17, 2026

View reviewed changes

Comment thread cmd/ateapi/internal/controlapi/sandbox_assets.go

Davanum Srinivas (dims) reviewed Jun 17, 2026

View reviewed changes

Comment thread cmd/atelet/main.go

Benjamin Elder (BenTheElder) added 3 commits June 17, 2026 08:37

Benjamin Elder (BenTheElder) force-pushed the sandbox-bins branch from 10f24ce to f5be2fa Compare June 17, 2026 15:38

Benjamin Elder (BenTheElder) added 3 commits June 17, 2026 08:49

Benjamin Elder (BenTheElder) force-pushed the sandbox-bins branch from f5be2fa to ca581e1 Compare June 17, 2026 15:49

Eitan Yarmush (EItanya) reviewed Jun 17, 2026

View reviewed changes

Taahir Ahmed (ahmedtd) approved these changes Jun 17, 2026

View reviewed changes

Benjamin Elder (BenTheElder) merged commit 00a90a8 into agent-substrate:main Jun 17, 2026
8 checks passed

Benjamin Elder (BenTheElder) deleted the sandbox-bins branch June 17, 2026 17:11

Benjamin Elder (BenTheElder) mentioned this pull request Jun 17, 2026

Add kata + cloud-hypervisor micro-VM sandbox class #269

Closed

Conversation

Benjamin Elder (BenTheElder) commented Jun 17, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Benjamin Elder (BenTheElder) Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Benjamin Elder (BenTheElder) commented Jun 17, 2026

Uh oh!

Benjamin Elder (BenTheElder) commented Jun 17, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Davanum Srinivas (dims) commented Jun 17, 2026

Uh oh!

Benjamin Elder (BenTheElder) commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Benjamin Elder (BenTheElder) Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Benjamin Elder (BenTheElder) Jun 17, 2026 •

edited

Loading

Benjamin Elder (BenTheElder) commented Jun 17, 2026 •

edited

Loading

Benjamin Elder (BenTheElder) Jun 17, 2026 •

edited

Loading