Skip to content

feat: substrate support for BYO and python runtimes for SandboxAgent CR#2043

Draft
jmhbh wants to merge 6 commits into
mainfrom
feat/substrate-python-byo
Draft

feat: substrate support for BYO and python runtimes for SandboxAgent CR#2043
jmhbh wants to merge 6 commits into
mainfrom
feat/substrate-python-byo

Conversation

@jmhbh

@jmhbh jmhbh commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

No description provided.

Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
@github-actions github-actions Bot added the enhancement New feature or request label Jun 17, 2026
@chromatic-com

chromatic-com Bot commented Jun 17, 2026

Copy link
Copy Markdown

Warning

Testing paused

Monthly snapshot limit reached. Update your plan for additional snapshots and to resume testing.

@EItanya EItanya left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome job overall. I have some design questions here about getting rid of old templates and how that should work.

Comment thread go/api/v1alpha2/agent_types.go Outdated
// Substrate SandboxAgents always use Go; regular Agents honor spec.declarative.runtime.
// All agents (including substrate SandboxAgents) honor spec.declarative.runtime, defaulting
// to Python when unset.
func EffectiveDeclarativeRuntimeForAgent(agent AgentObject) DeclarativeRuntime {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like an unnecessary wrapper now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. removing this function

Comment thread python/Dockerfile Outdated
# distroless/cc provides glibc + libstdc++ (required by the standalone CPython build) but no
# shell or package manager. Agents that need in-container code execution / bash tools use the
# "full" image (python/Dockerfile.full) instead.
FROM gcr.io/distroless/cc-debian12:nonroot

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pick a tag/digest for this so we don't randomly get upgraded?

// golden until its own golden is Ready (see ResolveCurrentActorTemplate), so this never causes
// downtime. Returns true when nothing more remains to retire. Errors are logged, not surfaced, so a
// transient ate-api failure doesn't wedge reconciliation.
func (r *SandboxAgentController) reconcileSubstrateBlueGreen(ctx context.Context, sa *v1alpha2.SandboxAgent) bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is very interesting, I think we need to think through this whole use-case. I wonder if we can use some sort of name based on the hash, similar to a GenerateName so that everything is eventually consistent, rather than blocking

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is definitely worth discussing in more detail. These changes are intended to addres the current gap where a config change would not be reflected in the agent (ie: if changing the model provider in the sandbox agent). So, to your point of using a name based on the hash, the actor template name is now a combination of its truncated name + truncated config hash. New golden snapshots are created on creation of the new template, and we write the config hash as an annotation to the actor template which is embedded in the actor id (we retire the old template as a part of this function). When the new actor is being resolved on the request path with the new template's hash any misses will trigger actor creation and the request will be held until the actor is reachable. So right now on the request path, if you rollout a config change and message the agent while the config is being rolled out the if the new golden isn't ready the request will be served by the previous actor (there's some nuance here depending on the available workers) but if it is ready then a new actor will be created and the request will be held until the new actor can serve the request.

Comment on lines +52 to +64
if r.SubstrateLifecycle == nil {
return true
}
retireDone, err := r.SubstrateLifecycle.RetireSupersededTemplates(ctx, sa)
if err != nil {
sandboxAgentControllerLog.Info("retiring superseded substrate templates failed (will retry)",
"sandboxagent", sa.Namespace+"/"+sa.Name, "err", err.Error())
return true
}

// Best-effort reap of stale session actors keyed to a previous config. Not required for
// correctness (config-hashed ids mean they're never reused), so failures don't requeue.
if r.SubstrateActorBackend != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These constant nil checks are super ugly, we gotta figure that out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, cleaned this up.

Comment on lines +196 to +203
// ReapStaleSessionActors deletes this agent's per-session actors that were created from a
// superseded ActorTemplate (before a config change). It only deletes SUSPENDED actors: a RUNNING
// one may be mid-request, and kagent's transport suspends session actors after each request, so a
// stale actor converges to SUSPENDED on its own — force-suspending it could cut a live response.
// With config-hashed actor ids these actors are never addressed again, so this is storage hygiene,
// not correctness, and is best-effort. (Superseded GOLDEN actors are handled by
// Lifecycle.RetireSupersededTemplates, which removes them with their template once a newer
// template is Ready.) Returns true when no stale suspended session actors remain.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should do this. Just because the template is gone doesn't mean the actor is invalid. What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that they aren't invalid but these actors would essentially be unreachable given our current implementation since we now use the config hash on the actor template to resolve actors. I'll look into what the cost is on the substrate side for leaving these actors around but given the current implementation I think it makes more sense to just remove them.

// config change must produce a NEW ActorTemplate (substrate snapshots once and
// no-ops in PhaseReady); folding the hash into the template name does that, and the
// annotation lets the chat path/reaper key session actors to the active template.
SandboxConfigHashAnnotation = "kagent.dev/config-hash"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define this somewhere globally, or use the generated name like I mentioned above

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to a separate consts package.

// then surfaces as a gVisor "inconsistent private memory files on restore" error because the
// golden snapshot captures only the pause container. The Go static binary needs none of this.
// Keep in sync with the final-stage ENV block of python/Dockerfile.
func pythonRuntimeImageEnv() []corev1.EnvVar {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels hacky, is there someway we can fix substrate, or the gvisor impl to properly do this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, this can be fixed in substrate. I can open a fix for this upstream.

Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Comment thread Makefile
CONTROLLER_IMAGE_TAG ?= $(VERSION)
UI_IMAGE_TAG ?= $(VERSION)
APP_IMAGE_TAG ?= $(VERSION)
APP_FULL_IMAGE_TAG ?= $(VERSION)-full

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a part of moving away from the base chainguard images since we currently cannot pin the images we are introducing a -full suffixed tag which uses node:20-bookworm-slim as the base image to include shell, apt, and the sandbox runtime. This mirrors the existing pattern for the golang images. The existing unsuffixed flavor now uses a distroless base image instead.

Comment thread go/api/v1alpha2/agent_types.go Outdated
// Substrate SandboxAgents always use Go; regular Agents honor spec.declarative.runtime.
// All agents (including substrate SandboxAgents) honor spec.declarative.runtime, defaulting
// to Python when unset.
func EffectiveDeclarativeRuntimeForAgent(agent AgentObject) DeclarativeRuntime {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. removing this function

Comment on lines +52 to +64
if r.SubstrateLifecycle == nil {
return true
}
retireDone, err := r.SubstrateLifecycle.RetireSupersededTemplates(ctx, sa)
if err != nil {
sandboxAgentControllerLog.Info("retiring superseded substrate templates failed (will retry)",
"sandboxagent", sa.Namespace+"/"+sa.Name, "err", err.Error())
return true
}

// Best-effort reap of stale session actors keyed to a previous config. Not required for
// correctness (config-hashed ids mean they're never reused), so failures don't requeue.
if r.SubstrateActorBackend != nil {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, cleaned this up.

// golden until its own golden is Ready (see ResolveCurrentActorTemplate), so this never causes
// downtime. Returns true when nothing more remains to retire. Errors are logged, not surfaced, so a
// transient ate-api failure doesn't wedge reconciliation.
func (r *SandboxAgentController) reconcileSubstrateBlueGreen(ctx context.Context, sa *v1alpha2.SandboxAgent) bool {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is definitely worth discussing in more detail. These changes are intended to addres the current gap where a config change would not be reflected in the agent (ie: if changing the model provider in the sandbox agent). So, to your point of using a name based on the hash, the actor template name is now a combination of its truncated name + truncated config hash. New golden snapshots are created on creation of the new template, and we write the config hash as an annotation to the actor template which is embedded in the actor id (we retire the old template as a part of this function). When the new actor is being resolved on the request path with the new template's hash any misses will trigger actor creation and the request will be held until the actor is reachable. So right now on the request path, if you rollout a config change and message the agent while the config is being rolled out the if the new golden isn't ready the request will be served by the previous actor (there's some nuance here depending on the available workers) but if it is ready then a new actor will be created and the request will be held until the new actor can serve the request.

Comment on lines +196 to +203
// ReapStaleSessionActors deletes this agent's per-session actors that were created from a
// superseded ActorTemplate (before a config change). It only deletes SUSPENDED actors: a RUNNING
// one may be mid-request, and kagent's transport suspends session actors after each request, so a
// stale actor converges to SUSPENDED on its own — force-suspending it could cut a live response.
// With config-hashed actor ids these actors are never addressed again, so this is storage hygiene,
// not correctness, and is best-effort. (Superseded GOLDEN actors are handled by
// Lifecycle.RetireSupersededTemplates, which removes them with their template once a newer
// template is Ready.) Returns true when no stale suspended session actors remain.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that they aren't invalid but these actors would essentially be unreachable given our current implementation since we now use the config hash on the actor template to resolve actors. I'll look into what the cost is on the substrate side for leaving these actors around but given the current implementation I think it makes more sense to just remove them.

// then surfaces as a gVisor "inconsistent private memory files on restore" error because the
// golden snapshot captures only the pause container. The Go static binary needs none of this.
// Keep in sync with the final-stage ENV block of python/Dockerfile.
func pythonRuntimeImageEnv() []corev1.EnvVar {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, this can be fixed in substrate. I can open a fix for this upstream.

jmhbh added 4 commits June 18, 2026 15:49
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants