feat: substrate support for BYO and python runtimes for SandboxAgent CR#2043
feat: substrate support for BYO and python runtimes for SandboxAgent CR#2043jmhbh wants to merge 6 commits into
Conversation
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
|
Warning Testing pausedMonthly snapshot limit reached. Update your plan for additional snapshots and to resume testing. |
EItanya
left a comment
There was a problem hiding this comment.
Awesome job overall. I have some design questions here about getting rid of old templates and how that should work.
| // Substrate SandboxAgents always use Go; regular Agents honor spec.declarative.runtime. | ||
| // All agents (including substrate SandboxAgents) honor spec.declarative.runtime, defaulting | ||
| // to Python when unset. | ||
| func EffectiveDeclarativeRuntimeForAgent(agent AgentObject) DeclarativeRuntime { |
There was a problem hiding this comment.
This feels like an unnecessary wrapper now?
There was a problem hiding this comment.
agreed. removing this function
| # distroless/cc provides glibc + libstdc++ (required by the standalone CPython build) but no | ||
| # shell or package manager. Agents that need in-container code execution / bash tools use the | ||
| # "full" image (python/Dockerfile.full) instead. | ||
| FROM gcr.io/distroless/cc-debian12:nonroot |
There was a problem hiding this comment.
Can we pick a tag/digest for this so we don't randomly get upgraded?
| // golden until its own golden is Ready (see ResolveCurrentActorTemplate), so this never causes | ||
| // downtime. Returns true when nothing more remains to retire. Errors are logged, not surfaced, so a | ||
| // transient ate-api failure doesn't wedge reconciliation. | ||
| func (r *SandboxAgentController) reconcileSubstrateBlueGreen(ctx context.Context, sa *v1alpha2.SandboxAgent) bool { |
There was a problem hiding this comment.
So this is very interesting, I think we need to think through this whole use-case. I wonder if we can use some sort of name based on the hash, similar to a GenerateName so that everything is eventually consistent, rather than blocking
There was a problem hiding this comment.
yeah this is definitely worth discussing in more detail. These changes are intended to addres the current gap where a config change would not be reflected in the agent (ie: if changing the model provider in the sandbox agent). So, to your point of using a name based on the hash, the actor template name is now a combination of its truncated name + truncated config hash. New golden snapshots are created on creation of the new template, and we write the config hash as an annotation to the actor template which is embedded in the actor id (we retire the old template as a part of this function). When the new actor is being resolved on the request path with the new template's hash any misses will trigger actor creation and the request will be held until the actor is reachable. So right now on the request path, if you rollout a config change and message the agent while the config is being rolled out the if the new golden isn't ready the request will be served by the previous actor (there's some nuance here depending on the available workers) but if it is ready then a new actor will be created and the request will be held until the new actor can serve the request.
| if r.SubstrateLifecycle == nil { | ||
| return true | ||
| } | ||
| retireDone, err := r.SubstrateLifecycle.RetireSupersededTemplates(ctx, sa) | ||
| if err != nil { | ||
| sandboxAgentControllerLog.Info("retiring superseded substrate templates failed (will retry)", | ||
| "sandboxagent", sa.Namespace+"/"+sa.Name, "err", err.Error()) | ||
| return true | ||
| } | ||
|
|
||
| // Best-effort reap of stale session actors keyed to a previous config. Not required for | ||
| // correctness (config-hashed ids mean they're never reused), so failures don't requeue. | ||
| if r.SubstrateActorBackend != nil { |
There was a problem hiding this comment.
These constant nil checks are super ugly, we gotta figure that out.
There was a problem hiding this comment.
agreed, cleaned this up.
| // ReapStaleSessionActors deletes this agent's per-session actors that were created from a | ||
| // superseded ActorTemplate (before a config change). It only deletes SUSPENDED actors: a RUNNING | ||
| // one may be mid-request, and kagent's transport suspends session actors after each request, so a | ||
| // stale actor converges to SUSPENDED on its own — force-suspending it could cut a live response. | ||
| // With config-hashed actor ids these actors are never addressed again, so this is storage hygiene, | ||
| // not correctness, and is best-effort. (Superseded GOLDEN actors are handled by | ||
| // Lifecycle.RetireSupersededTemplates, which removes them with their template once a newer | ||
| // template is Ready.) Returns true when no stale suspended session actors remain. |
There was a problem hiding this comment.
I don't think we should do this. Just because the template is gone doesn't mean the actor is invalid. What do you think?
There was a problem hiding this comment.
I agree that they aren't invalid but these actors would essentially be unreachable given our current implementation since we now use the config hash on the actor template to resolve actors. I'll look into what the cost is on the substrate side for leaving these actors around but given the current implementation I think it makes more sense to just remove them.
| // config change must produce a NEW ActorTemplate (substrate snapshots once and | ||
| // no-ops in PhaseReady); folding the hash into the template name does that, and the | ||
| // annotation lets the chat path/reaper key session actors to the active template. | ||
| SandboxConfigHashAnnotation = "kagent.dev/config-hash" |
There was a problem hiding this comment.
Can we define this somewhere globally, or use the generated name like I mentioned above
There was a problem hiding this comment.
moved to a separate consts package.
| // then surfaces as a gVisor "inconsistent private memory files on restore" error because the | ||
| // golden snapshot captures only the pause container. The Go static binary needs none of this. | ||
| // Keep in sync with the final-stage ENV block of python/Dockerfile. | ||
| func pythonRuntimeImageEnv() []corev1.EnvVar { |
There was a problem hiding this comment.
This feels hacky, is there someway we can fix substrate, or the gvisor impl to properly do this?
There was a problem hiding this comment.
yup, this can be fixed in substrate. I can open a fix for this upstream.
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
| CONTROLLER_IMAGE_TAG ?= $(VERSION) | ||
| UI_IMAGE_TAG ?= $(VERSION) | ||
| APP_IMAGE_TAG ?= $(VERSION) | ||
| APP_FULL_IMAGE_TAG ?= $(VERSION)-full |
There was a problem hiding this comment.
as a part of moving away from the base chainguard images since we currently cannot pin the images we are introducing a -full suffixed tag which uses node:20-bookworm-slim as the base image to include shell, apt, and the sandbox runtime. This mirrors the existing pattern for the golang images. The existing unsuffixed flavor now uses a distroless base image instead.
| // Substrate SandboxAgents always use Go; regular Agents honor spec.declarative.runtime. | ||
| // All agents (including substrate SandboxAgents) honor spec.declarative.runtime, defaulting | ||
| // to Python when unset. | ||
| func EffectiveDeclarativeRuntimeForAgent(agent AgentObject) DeclarativeRuntime { |
There was a problem hiding this comment.
agreed. removing this function
| if r.SubstrateLifecycle == nil { | ||
| return true | ||
| } | ||
| retireDone, err := r.SubstrateLifecycle.RetireSupersededTemplates(ctx, sa) | ||
| if err != nil { | ||
| sandboxAgentControllerLog.Info("retiring superseded substrate templates failed (will retry)", | ||
| "sandboxagent", sa.Namespace+"/"+sa.Name, "err", err.Error()) | ||
| return true | ||
| } | ||
|
|
||
| // Best-effort reap of stale session actors keyed to a previous config. Not required for | ||
| // correctness (config-hashed ids mean they're never reused), so failures don't requeue. | ||
| if r.SubstrateActorBackend != nil { |
There was a problem hiding this comment.
agreed, cleaned this up.
| // golden until its own golden is Ready (see ResolveCurrentActorTemplate), so this never causes | ||
| // downtime. Returns true when nothing more remains to retire. Errors are logged, not surfaced, so a | ||
| // transient ate-api failure doesn't wedge reconciliation. | ||
| func (r *SandboxAgentController) reconcileSubstrateBlueGreen(ctx context.Context, sa *v1alpha2.SandboxAgent) bool { |
There was a problem hiding this comment.
yeah this is definitely worth discussing in more detail. These changes are intended to addres the current gap where a config change would not be reflected in the agent (ie: if changing the model provider in the sandbox agent). So, to your point of using a name based on the hash, the actor template name is now a combination of its truncated name + truncated config hash. New golden snapshots are created on creation of the new template, and we write the config hash as an annotation to the actor template which is embedded in the actor id (we retire the old template as a part of this function). When the new actor is being resolved on the request path with the new template's hash any misses will trigger actor creation and the request will be held until the actor is reachable. So right now on the request path, if you rollout a config change and message the agent while the config is being rolled out the if the new golden isn't ready the request will be served by the previous actor (there's some nuance here depending on the available workers) but if it is ready then a new actor will be created and the request will be held until the new actor can serve the request.
| // ReapStaleSessionActors deletes this agent's per-session actors that were created from a | ||
| // superseded ActorTemplate (before a config change). It only deletes SUSPENDED actors: a RUNNING | ||
| // one may be mid-request, and kagent's transport suspends session actors after each request, so a | ||
| // stale actor converges to SUSPENDED on its own — force-suspending it could cut a live response. | ||
| // With config-hashed actor ids these actors are never addressed again, so this is storage hygiene, | ||
| // not correctness, and is best-effort. (Superseded GOLDEN actors are handled by | ||
| // Lifecycle.RetireSupersededTemplates, which removes them with their template once a newer | ||
| // template is Ready.) Returns true when no stale suspended session actors remain. |
There was a problem hiding this comment.
I agree that they aren't invalid but these actors would essentially be unreachable given our current implementation since we now use the config hash on the actor template to resolve actors. I'll look into what the cost is on the substrate side for leaving these actors around but given the current implementation I think it makes more sense to just remove them.
| // then surfaces as a gVisor "inconsistent private memory files on restore" error because the | ||
| // golden snapshot captures only the pause container. The Go static binary needs none of this. | ||
| // Keep in sync with the final-stage ENV block of python/Dockerfile. | ||
| func pythonRuntimeImageEnv() []corev1.EnvVar { |
There was a problem hiding this comment.
yup, this can be fixed in substrate. I can open a fix for this upstream.
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
Signed-off-by: JM Huibonhoa <jm.huibonhoa@solo.io>
No description provided.