
feat(proxy): resolve policy-allow-listed TCP hostnames inside the sandbox so non-HTTP protocols are reachable #1107

@waterworthd-cim

Description


Problem Statement

Use case: an agent inside an OpenShell sandbox needs to talk to allow-listed
services that don't speak HTTP. Concretely, PostgreSQL via psycopg, but the
same shape applies to Redis, MongoDB, MySQL, gRPC clients without
HTTPS_PROXY support, and anything else with a non-HTTP wire protocol. The
policy already has a protocol: tcp field that the parser accepts but
that has no runtime effect.
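For concreteness, the kind of endpoint entry involved looks roughly like
this. The protocol / enforcement / access / host fields are the ones the
parser accepts today (see the spike findings below); the surrounding list
shape is a sketch, not the exact schema:

```yaml
# Illustrative shape only — field names match what the parser accepts,
# but the enclosing structure here is a sketch, not the real schema.
- host: db.example.internal   # hypothetical hostname
  port: 5432
  protocol: tcp               # parsed and visible in policy get, no runtime effect yet
  enforcement: enforce
  access: full
```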

Our agent is a long-running Slack bot built on the Claude Agent SDK,
currently deployed as a systemd service on EC2. Its tool surface includes
a Bash runner that the SDK uses to spawn a Python data-access CLI as a
subprocess; that CLI talks to several PostgreSQL databases, all inside the
same AWS VPC. The Slack and Anthropic paths run through HTTPS_PROXY and
work cleanly inside an OpenShell sandbox — we have a working end-to-end
spike that confirms this. The blocker is that the CLI's database
connections are plain TCP (PostgreSQL wire protocol on port 5432, no HTTP
layer), so they rely on getaddrinfo rather than the proxy. This shape —
agent SDK orchestrating native-protocol data clients — is common across
LangChain, LlamaIndex, and MCP-tool patterns, not specific to our app.

Proposed Design

When the supervisor loads or reloads a sandbox policy, walk every endpoint
with protocol: tcp and resolve its hostname against the gateway-side
resolver (which already has working DNS). For each successful resolution,
write an <ip> <hostname> line into the sandbox's /etc/hosts. The write
happens before Landlock is applied, so the read-only-/etc constraint at
runtime is unaffected. Failed lookups log a warning and the endpoint is
treated as unreachable; sandbox creation does not fail (consistent with
how a missing host would behave on a non-sandboxed machine).
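The resolve-and-inject step can be sketched as follows. The function name
and shape are illustrative, not a proposed API; the real supervisor would
use the gateway-side resolver it already has:

```python
import socket

def hosts_lines_for(hostnames):
    """Resolve each protocol: tcp endpoint hostname and return the
    '<ip> <hostname>' lines to write into the sandbox's /etc/hosts.
    Failed lookups are logged and skipped — the endpoint is simply
    unreachable, and sandbox creation does not fail."""
    lines = []
    for host in hostnames:
        try:
            # getaddrinfo runs against the gateway's working resolver;
            # take the first IPv4 answer for the injected entry.
            infos = socket.getaddrinfo(host, None, family=socket.AF_INET)
            ip = infos[0][4][0]
            lines.append(f"{ip}\t{host}")
        except socket.gaierror:
            print(f"warning: could not resolve {host}; endpoint unreachable")
    return lines
```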

TCP enforcement reuses the same iptables/netns shape that backs HTTP
CONNECT today. Outbound TCP destined for a resolved-and-allowed
(IP, port) tuple is intercepted at the host-side veth (10.200.0.1)
and forwarded; outbound TCP to anything else is dropped. The supervisor
maintains an in-memory {resolved_ip → policy_endpoint} map so the policy
decision stays hostname-keyed even though wire-level traffic matches by
IP — the same model the proxy already uses for HTTP CONNECT, just without
the HTTP layer.
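The hostname-keyed map can be as small as this (sketch; class and field
names are illustrative):

```python
# Sketch of the supervisor's {resolved_ip -> policy_endpoint} map:
# wire-level traffic matches by IP, but the enforcement/logging decision
# stays keyed on the policy's hostname.
class TcpAllowSet:
    def __init__(self):
        self._by_ip = {}  # "10.0.1.7" -> {"host": ..., "port": ...}

    def record(self, ip, hostname, port):
        """Called after each successful supervisor-side resolution."""
        self._by_ip[ip] = {"host": hostname, "port": port}

    def match(self, dst_ip, dst_port):
        """Return the matched policy hostname for an outbound (ip, port),
        or None if the tuple is not allow-listed (packet is dropped)."""
        ep = self._by_ip.get(dst_ip)
        if ep and ep["port"] == dst_port:
            return ep["host"]
        return None
```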

Lifecycle

  • On sandbox create: resolve all protocol: tcp endpoints, populate
    /etc/hosts, then hand off to the supervisor.
  • On policy set (hot reload): re-resolve, atomically rewrite
    /etc/hosts (or update in place), refresh the IP allow set in
    iptables. Cached IPs for endpoints removed from the policy are evicted.
  • TTL handling: re-resolve on every policy reload. Apps that need
    long-running connections (e.g. a postgres pool) tolerate this because
    the resolved IP is captured at connect time; new connections after a
    policy change pick up the new IP.
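The atomic rewrite on hot reload is the standard write-temp-then-rename
pattern; a minimal sketch, assuming the real supervisor would also
preserve the stock loopback entries rather than emitting only the
injected lines:

```python
import os
import tempfile

def rewrite_hosts(path, lines):
    """Atomically replace the hosts file on policy hot-reload: write a
    temp file in the same directory, then rename it over the original,
    so a concurrent reader never sees a partially written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.replace(tmp, path)  # atomic within a single filesystem
```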

Observability

Each TCP NET:OPEN / NET:CLOSE event in openshell logs includes the
matched policy endpoint by hostname (mirroring today's HTTP events), and
a new DNS:RESOLVE event records the supervisor-side resolution so
operators can see when an entry was added or rotated.
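A possible shape for the new DNS:RESOLVE event — the field names here are
illustrative, not an existing openshell log format:

```python
import json
import time

def dns_resolve_event(hostname, ip, action):
    """Hypothetical DNS:RESOLVE payload recording a supervisor-side
    resolution; action distinguishes why the entry changed."""
    return json.dumps({
        "event": "DNS:RESOLVE",
        "host": hostname,
        "ip": ip,
        "action": action,  # "added" | "rotated" | "evicted"
        "ts": int(time.time()),
    })
```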

Out of scope

Arbitrary DNS resolution, broader resolver access from inside the
sandbox, or policy entries without explicit hostnames. The sandbox itself
never gets a usable resolver — it just gets a static answer for the hosts
the policy already allows.

Alternatives Considered

Sandbox-internal DNS forwarder (NXDOMAIN for non-policy hosts)

A small UDP/53 listener inside the sandbox netns (or on the host-side
veth at 10.200.0.1:53) that answers only for hostnames present in the
loaded policy and returns NXDOMAIN for everything else. Resolution
happens inline at query time using the gateway's resolver.

More general than /etc/hosts injection — handles short-TTL hosts and
wildcard CNAME chains cleanly without a re-resolve cycle on every policy
reload. The cost is a long-running listener in (or alongside) the sandbox
and a dedicated UDP/53 code path. For the named-host policy model we
already have, this is strictly more code than /etc/hosts injection and
not necessary; it would only become the better choice once the policy
language grows wildcard hostnames.
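The response path such a forwarder would implement is small. A stdlib-only
sketch (names illustrative): answer a single-question A query from the
policy's host table, NXDOMAIN otherwise, with no EDNS and only a
compression pointer back to the question name:

```python
import socket
import struct

def answer(query: bytes, table: dict) -> bytes:
    """Build a DNS response for a single-question A query: an A record
    if the name is in `table`, NXDOMAIN otherwise."""
    txid = query[:2]
    # QNAME labels follow the fixed 12-byte header.
    i, labels = 12, []
    while query[i]:
        n = query[i]
        labels.append(query[i + 1:i + 1 + n].decode())
        i += 1 + n
    name = ".".join(labels)
    question = query[12:i + 5]          # QNAME + QTYPE + QCLASS
    ip = table.get(name)
    if ip is None:
        # QR=1, RD=1, RA=1, RCODE=3 (NXDOMAIN), zero answer records
        return txid + struct.pack("!HHHHH", 0x8183, 1, 0, 0, 0) + question
    header = txid + struct.pack("!HHHHH", 0x8180, 1, 1, 0, 0)
    # One A record; 0xC00C is a compression pointer to the QNAME.
    rr = struct.pack("!HHHLH", 0xC00C, 1, 1, 30, 4) + socket.inet_aton(ip)
    return header + question + rr
```

Binding this to UDP/53 on the host-side veth and resolving inline through
the gateway's resolver is the part that makes it a long-running service.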

SOCKS5 endpoint on the existing proxy

Expose SOCKS5 alongside the existing HTTP CONNECT endpoint at
10.200.0.1. Clients configured with rdns=true send the hostname to
the proxy, which resolves and connects per the policy. No /etc/hosts
or sandbox-side DNS is needed.

Works for clients that natively support SOCKS5 — but database wire
protocols generally don't. psycopg, mysql-connector-python, the
mongo and redis Python clients, and most compiled clients in Go, Rust,
or Java need either a PySocks-style monkey-patch (Python only) or a
sidecar TCP-to-SOCKS5 bridge. That pushes per-language complexity into
every adopter of OpenShell rather than solving the problem once in the
runtime.

Acceptable as an additional path, but not as a replacement for
hostname-aware TCP egress.

LD_PRELOAD / sitecustomize.py shims to redirect getaddrinfo

Not pursued. Either approach effectively bypasses the policy enforcement
the rest of the sandbox relies on (the application sees an IP that the
sandbox supervisor never approved). It would also be invisible to
openshell logs, defeating the auditability of the existing model.

Agent Investigation

Spike Findings

We built and ran an end-to-end spike validating an OpenShell sandbox as a
deployment target for an existing production agent. PR with all artefacts:
https://github.com/cim-data-engineering/agent-toni/pull/17. OpenShell 0.0.36
on both Colima/macOS arm64 and Amazon Linux 2023 / EC2 amd64 (same VPC as
the postgres targets). Both hosts reproduce the failure identically.

What worked

Every layer of the sandbox model is functional except raw-TCP egress:

  • Image build via --from <staged-dir> extending
    ghcr.io/nvidia/openshell-community/sandboxes/base
  • K3s push + pull, supervisor + sshd plumbing, sandbox user (uid 998)
  • Filesystem policy under Landlock with mixed read-only / read-write paths
  • Process policy with binary symlink resolution covering uv-managed Python
    interpreters
  • Network policy hot-reload via openshell policy set --wait
  • Anthropic API and Slack Web API egress through the L7 proxy
  • Slack Socket Mode WebSocket via TCP-passthrough on wss-primary.slack.com:443
  • Secrets injected at sandbox-create time via --upload <local>:/sandbox/upload

What fails: raw-TCP egress to allow-listed hosts

The agent's data-access path is a Python CLI (peak-cli) that the agent
invokes as a subprocess; it opens postgres connections via
psycopg.connect(host='apidb-...cimenviro.com', port=5432). With the
postgres host explicitly named in the policy (protocol: tcp,
enforcement: enforce, access: full) and matching binaries: for the
Python interpreters, the connection still fails:

psycopg.OperationalError: could not translate host name
  "apidb-ap-southeast-2-replica-2.cimenviro.com" to address:
  Temporary failure in name resolution

Confirmed via openshell sandbox exec:

$ cat /etc/resolv.conf
nameserver 10.43.0.10        # K3s CoreDNS, unreachable from sandbox netns
options ndots:5

$ getent hosts ....com
(no output, NXDOMAIN)
$ getent hosts api.anthropic.com
(no output, NXDOMAIN, even though the bot reaches it fine via HTTPS_PROXY)

openshell logs <sandbox> shows zero NET:DENIED events for the postgres
host. The connection dies at libc getaddrinfo before any traffic enters
the sandbox network namespace's egress path. protocol: tcp is accepted
by the policy parser and visible in openshell policy get --full, but has
no runtime effect.

Workarounds attempted, blocked

  • Pre-resolved /etc/hosts baked into the image at build time
    (COPY hosts.append /tmp/ && cat /tmp/hosts.append >> /etc/hosts):
    blocked by Read-only file system during the OpenShell build. Docker
    normally writes /etc/hosts at runtime, but the OpenShell build
    environment locks it down even at build time.
  • /etc/hosts rewrite at runtime via the entrypoint script: blocked by
    Landlock; /etc sits in our policy's read_only list.

Implementation observations

  • Sandbox netns IP 10.200.0.2, host-side veth at 10.200.0.1. Proxy
    listens at 10.200.0.1:3128. iptables NAT could redirect TCP destined
    to allow-listed IPs through this proxy without needing a SOCKS-aware
    client.
  • The supervisor logs every connection at NET:OPEN / NET:CLOSE with
    the binary path, destination, and policy hit. Adding protocol: tcp
    traffic to that pipeline (rather than letting it die at libc) would
    give operators the same observability that exists today for HTTP
    egress.
  • The supervisor already resolves binary-path symlinks at policy-load
    time. Doing the same once per allow-listed hostname (resolving via the
    gateway's resolver, injecting into /etc/hosts or an internal DNS
    forwarder) is a similar shape.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
