Problem Statement
Use case: an agent inside an OpenShell sandbox needs to talk to allow-listed
services that don't speak HTTP. Concretely, PostgreSQL via psycopg, but the
same shape applies to Redis, MongoDB, MySQL, gRPC clients without
HTTPS_PROXY support, and anything else with a non-HTTP wire protocol. The
policy already has a protocol: tcp field that the parser accepts but
that has no runtime effect.
Our agent is a long-running Slack bot built on the Claude Agent SDK,
currently deployed as a systemd service on EC2. Its tool surface includes
a Bash runner that the SDK uses to spawn a Python data-access CLI as a
subprocess; that CLI talks to several PostgreSQL databases, all inside the
same AWS VPC. The Slack and Anthropic paths run through HTTPS_PROXY and
work cleanly inside an OpenShell sandbox — we have a working end-to-end
spike that confirms this. The blocker is that the CLI's database
connections are plain TCP (PostgreSQL wire protocol on port 5432, no HTTP
layer), so they rely on getaddrinfo rather than the proxy. This shape —
agent SDK orchestrating native-protocol data clients — is common across
LangChain, LlamaIndex, and MCP-tool patterns, not specific to our app.
Proposed Design
When the supervisor loads or reloads a sandbox policy, walk every endpoint
with protocol: tcp and resolve its hostname against the gateway-side
resolver (which already has working DNS). For each successful resolution,
write an <ip> <hostname> line into the sandbox's /etc/hosts. The write
happens before Landlock is applied, so the read-only-/etc constraint at
runtime is unaffected. Failed lookups log a warning and the endpoint is
treated as unreachable; sandbox creation does not fail (consistent with
how a missing host would behave on a non-sandboxed machine).
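As a sketch of that load-time step, assuming a hypothetical `build_hosts_entries` helper and a `{protocol, host}` endpoint shape (neither is OpenShell's real API):

```python
import socket
from typing import Callable, Dict, List

# Sketch only: function names and the endpoint dict shape are assumptions
# for illustration, not OpenShell's actual internals.

def build_hosts_entries(
    endpoints: List[dict],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, str]:
    """Resolve each protocol: tcp endpoint once at policy load."""
    entries: Dict[str, str] = {}
    for ep in endpoints:
        if ep.get("protocol") != "tcp":
            continue
        try:
            entries[ep["host"]] = resolve(ep["host"])
        except OSError as exc:
            # Per the design: warn and treat the endpoint as unreachable,
            # but never fail sandbox creation over a missing host.
            print(f"warning: could not resolve {ep['host']}: {exc}")
    return entries

def render_hosts(entries: Dict[str, str]) -> str:
    # One "<ip> <hostname>" line per resolved endpoint, appended to the
    # sandbox's /etc/hosts before Landlock is applied.
    return "".join(f"{ip} {host}\n" for host, ip in entries.items())
```

The resolver is injected so the supervisor can point it at the gateway-side resolver rather than the sandbox's (non-functional) one.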
Existing TCP enforcement reuses the same iptables/netns shape that backs
HTTP CONNECT today. Outbound TCP destined to a resolved-and-allowed
(IP, port) tuple is intercepted at the host-side veth (10.200.0.1)
and forwarded; outbound TCP to anything else is dropped. The supervisor
maintains an in-memory {resolved_ip → policy_endpoint} map so the policy
decision stays hostname-keyed even though wire-level traffic matches by
IP — the same model the proxy already uses for HTTP CONNECT, just without
the HTTP layer.
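A minimal sketch of that map, with hypothetical class and method names (the real supervisor would also program iptables at this point; here that step is only a comment):

```python
from typing import Optional, Tuple

class TcpAllowSet:
    """In-memory {resolved_ip -> policy_endpoint} map kept by the supervisor.
    Illustrative only; not OpenShell's actual data structure."""

    def __init__(self) -> None:
        self._by_ip: dict = {}

    def allow(self, resolved_ip: str, hostname: str, port: int) -> None:
        # The real path would also append an ACCEPT rule for
        # (resolved_ip, port) at the host-side veth; everything else
        # stays on the default DROP.
        self._by_ip[resolved_ip] = (hostname, port)

    def match(self, dst_ip: str, dst_port: int) -> Optional[str]:
        # Wire-level traffic matches by IP, but the decision (and the
        # NET:OPEN log line) stays keyed by the policy hostname.
        ep: Optional[Tuple[str, int]] = self._by_ip.get(dst_ip)
        if ep and ep[1] == dst_port:
            return ep[0]
        return None
```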
Lifecycle
- On sandbox create: resolve all protocol: tcp endpoints, populate
/etc/hosts, then hand off to the supervisor.
- On policy set (hot reload): re-resolve, atomically rewrite
/etc/hosts (or update in place), and refresh the IP allow set in
iptables. Cached IPs for endpoints removed from the policy are evicted.
- TTL handling: re-resolve on every policy reload. Apps that need
long-running connections (e.g. a postgres pool) tolerate this because
the resolved IP is captured at connect time; new connections after a
policy change pick up the new IP.
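The reload path above can be sketched as two small pieces, an atomic rewrite plus an eviction diff; the file path and the `{hostname: ip}` entry shape are assumptions:

```python
import os
import tempfile

def atomic_write_hosts(path: str, entries: dict) -> None:
    # Write to a temp file in the same directory, then rename: readers see
    # either the old /etc/hosts or the new one, never a partial rewrite.
    body = "".join(f"{ip} {host}\n" for host, ip in entries.items())
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.replace(tmp, path)

def ips_to_evict(old: dict, new: dict) -> set:
    # IPs to drop from the iptables allow set: endpoints removed from the
    # policy, or whose re-resolution rotated to a new address.
    return {ip for host, ip in old.items() if new.get(host) != ip}
```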
Observability
Each TCP NET:OPEN / NET:CLOSE event in openshell logs includes the
matched policy endpoint by hostname (mirroring today's HTTP events), and
a new DNS:RESOLVE event records the supervisor-side resolution so
operators can see when an entry was added or rotated.
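A DNS:RESOLVE record might look like the following; the field names mirror the shape of NET:OPEN events but are illustrative, not OpenShell's actual schema:

```python
import json
import time

def dns_resolve_event(hostname: str, ip: str, action: str = "added") -> str:
    # Hypothetical event shape for illustration only.
    return json.dumps({
        "event": "DNS:RESOLVE",
        "hostname": hostname,   # policy endpoint the entry belongs to
        "ip": ip,               # address written into /etc/hosts
        "action": action,       # "added" on create, "rotated" on reload
        "ts": int(time.time()),
    })
```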
Out of scope
Arbitrary DNS resolution, broader resolver access from inside the
sandbox, or policy entries without explicit hostnames. The sandbox itself
never gets a usable resolver — it just gets a static answer for the hosts
the policy already allows.
Alternatives Considered
Sandbox-internal DNS forwarder (NXDOMAIN for non-policy hosts)
A small UDP/53 listener inside the sandbox netns (or on the host-side
veth at 10.200.0.1:53) that answers only for hostnames present in the
loaded policy and returns NXDOMAIN for everything else. Resolution
happens inline at query time using the gateway's resolver.
More general than /etc/hosts injection — handles short-TTL hosts and
wildcard CNAME chains cleanly without a re-resolve cycle on every policy
reload. The cost is a long-running listener in (or alongside) the sandbox
and a dedicated UDP/53 code path. For the named-host policy model we
already have, this is strictly more code than /etc/hosts and not
strictly necessary; preferring it would only matter once the policy
language grows wildcard hostnames.
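To make "strictly more code" concrete, here is a stdlib-only sketch of the forwarder's answer path, with a hypothetical allow table and no TCP fallback, EDNS, or AAAA handling:

```python
import socket
import struct

# Hypothetical policy table: hostname -> IP resolved via the gateway.
ALLOWED = {"apidb.example.internal": "10.0.12.7"}

def parse_qname(packet: bytes):
    """Return (hostname, raw question section) from a DNS query."""
    i, labels = 12, []          # question starts after the 12-byte header
    while packet[i]:
        n = packet[i]
        labels.append(packet[i + 1:i + 1 + n].decode())
        i += 1 + n
    question = packet[12:i + 5]  # QNAME + null byte + QTYPE + QCLASS
    return ".".join(labels), question

def build_reply(packet: bytes) -> bytes:
    qname, question = parse_qname(packet)
    txid = packet[:2]
    ip = ALLOWED.get(qname)
    if ip is None:
        # NXDOMAIN: QR=1, RD=1, RA=1, RCODE=3
        return txid + struct.pack(">HHHHH", 0x8183, 1, 0, 0, 0) + question
    answer = (b"\xc0\x0c"                          # pointer to QNAME at offset 12
              + struct.pack(">HHIH", 1, 1, 30, 4)  # A, IN, TTL=30s, RDLEN=4
              + socket.inet_aton(ip))
    return txid + struct.pack(">HHHHH", 0x8180, 1, 1, 0, 0) + question + answer
```

Serving this from a UDP socket bound at 10.200.0.1:53 (plus the resolve-at-query-time hook into the gateway resolver) is the extra code the alternative carries.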
SOCKS5 endpoint on the existing proxy
Expose SOCKS5 alongside the existing HTTP CONNECT endpoint at
10.200.0.1. Clients configured with rdns=true send the hostname to
the proxy, which resolves and connects per the policy. No /etc/hosts
or sandbox-side DNS is needed.
Works for clients that natively support SOCKS5 — but database wire
protocols generally don't. psycopg, mysql-connector-python, the
mongo and redis Python clients, and most compiled clients in Go, Rust,
or Java need either a PySocks-style monkey-patch (Python only) or a
sidecar TCP-to-SOCKS5 bridge. That pushes per-language complexity into
every adopter of OpenShell rather than solving the problem once in the
runtime.
Acceptable as an additional path, but not as a replacement for
hostname-aware TCP egress.
LD_PRELOAD / sitecustomize.py shims to redirect getaddrinfo
Not pursued. Either approach effectively bypasses the policy enforcement
the rest of the sandbox relies on (the application sees an IP that the
sandbox supervisor never approved). It would also be invisible to
openshell logs, defeating the auditability of the existing model.
Agent Investigation
Spike Findings
We built and ran an end-to-end spike validating an OpenShell sandbox as a
deployment target for an existing production agent. PR with all artefacts:
https://github.com/cim-data-engineering/agent-toni/pull/17. OpenShell 0.0.36
on both Colima/macOS arm64 and Amazon Linux 2023 / EC2 amd64 (same VPC as
the postgres targets). Both hosts reproduce the failure identically.
What worked
Every layer of the sandbox model is functional except raw-TCP egress:
- Image build via --from <staged-dir> extending
ghcr.io/nvidia/openshell-community/sandboxes/base
- K3s push + pull, supervisor + sshd plumbing, sandbox user (uid 998)
- Filesystem policy under Landlock with mixed read-only / read-write paths
- Process policy with binary symlink resolution covering uv-managed Python
interpreters
- Network policy hot-reload via openshell policy set --wait
- Anthropic API and Slack Web API egress through the L7 proxy
- Slack Socket Mode WebSocket via TCP-passthrough on wss-primary.slack.com:443
- Secrets injected at sandbox-create time via --upload <local>:/sandbox/upload
What fails: raw-TCP egress to allow-listed hosts
The agent's data-access path is a Python CLI (peak-cli) that the agent
invokes as a subprocess; it opens postgres connections via
psycopg.connect(host='apidb-...cimenviro.com', port=5432). With the
postgres host explicitly named in the policy (protocol: tcp,
enforcement: enforce, access: full) and matching binaries: for the
Python interpreters, the connection still fails:
psycopg.OperationalError: could not translate host name
"apidb-ap-southeast-2-replica-2.cimenviro.com" to address:
Temporary failure in name resolution
Confirmed via openshell sandbox exec:
$ cat /etc/resolv.conf
nameserver 10.43.0.10 # K3s CoreDNS, unreachable from sandbox netns
options ndots:5
$ getent hosts ....com
(no output, NXDOMAIN)
$ getent hosts api.anthropic.com
(no output, NXDOMAIN, even though the bot reaches it fine via HTTPS_PROXY)
openshell logs <sandbox> shows zero NET:DENIED events for the postgres
host. The connection dies at libc getaddrinfo before any traffic enters
the sandbox network namespace's egress path. protocol: tcp is accepted
by the policy parser and visible in openshell policy get --full, but has
no runtime effect.
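The same diagnosis can be scripted. A small probe, not part of OpenShell, that separates the two failure modes the spike observed:

```python
import socket

def classify_egress_failure(host: str, port: int, timeout: float = 3.0) -> str:
    """Distinguish a failure at libc getaddrinfo (invisible to NET:* events)
    from a drop or refusal on the netns egress path."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Matches the spike: "Temporary failure in name resolution" means
        # the connection died before any packet entered the namespace.
        return "dns"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "egress"  # resolved fine, but refused/dropped on the wire
```

Run against the postgres host from inside the sandbox, this returns "dns", confirming the failure is pre-egress.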
Workarounds attempted, blocked
- Pre-resolved /etc/hosts baked into the image at build time
(COPY hosts.append /tmp/ && cat /tmp/hosts.append >> /etc/hosts):
blocked by Read-only file system during the OpenShell build. Docker
normally writes /etc/hosts at runtime, but the OpenShell build
environment locks it down even at build time.
- /etc/hosts rewrite at runtime via the entrypoint script: blocked by
Landlock; /etc sits in our policy's read_only list.
Implementation observations
- Sandbox netns IP 10.200.0.2, host-side veth at 10.200.0.1. Proxy
listens at 10.200.0.1:3128. iptables NAT could redirect TCP destined
to allow-listed IPs through this proxy without needing a SOCKS-aware
client.
- The supervisor logs every connection at NET:OPEN / NET:CLOSE with
the binary path, destination, and policy hit. Adding protocol: tcp
traffic to that pipeline (rather than letting it die at libc) would
give operators the same observability that exists today for HTTP
egress.
- The supervisor already resolves binary-path symlinks at policy-load
time. Doing the same once per allow-listed hostname (resolving via the
gateway's resolver, injecting into /etc/hosts or an internal DNS
forwarder) is a similar shape.
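The NAT observation can be sketched as a rule generator; the chain placement, interface name (veth-sbx), and flag order are assumptions, untested against a real host:

```python
def nat_redirect_rules(allowed, proxy="10.200.0.1:3128"):
    """Generate illustrative iptables commands redirecting allow-listed
    (ip, port) tuples through the existing proxy endpoint."""
    rules = []
    for ip, port in allowed:
        rules.append(
            f"iptables -t nat -A PREROUTING -i veth-sbx -p tcp "
            f"-d {ip} --dport {port} -j DNAT --to-destination {proxy}"
        )
    # Everything not explicitly redirected stays on the default deny.
    rules.append("iptables -A FORWARD -i veth-sbx -j DROP")
    return rules
```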