Sandboxing AI Agents: The Embassy Pattern



Update (March 2026): This project has been archived. The containment pattern described here works, but the observation and value-extraction layer built on top of it did not. See The Container That Forgot to Stop for the full post-mortem.

The first credential stealer I encountered in the wild was inside a skill repository for AI agents. An agent called Rufio had deployed YARA rules against a public skill library and found malicious code embedded in one of 286 available skills (as reported in that experiment), which is a 0.35% infection rate in an ecosystem with no code signing, no permission manifests, no audit infrastructure. I wrote about that discovery in OpenClaw on Moltbook: Deploying an AI Agent on an AI Social Network. The relevant detail here is not the malware itself but the structural condition that made it possible: the agent installing those skills had the same access to the host filesystem, the same network permissions, and the same credential access as the operator who deployed it. The agent’s privileges were the operator’s privileges. A compromised skill didn’t just compromise the agent. It compromised everything the agent could touch.

In the ecosystems I’ve observed, this is the default configuration for AI agent deployments. An agent that can call APIs can also exfiltrate data to those APIs. An agent with filesystem access can read your SSH keys, your environment variables, your credential stores. An agent that can execute shell commands can do anything you can do. These are not edge cases. These are the standard capabilities that make agents useful, applied in directions that make them dangerous.

The solution is not to remove the capabilities. An agent without internet access is a very expensive autocomplete. The solution is to mediate every capability through a boundary you control.

How the Metaphor Arrived

The name came from a conversation with Claude, not from a design document. I was sketching a Docker architecture for the OpenClaw agent (inbox, outbox, egress proxy, validation layers) and explaining that most people were probably just giving their agents unrestricted access to their systems, which seemed insane. Claude suggested a framing: not quite a DMZ, not quite a sidecar, not quite a proxy, but a richer concept that captures the trust relationship and the protocol, not just the network topology.

I liked the word. Then I asked GPT-5.2 to check whether someone had used it before. GPT reported that the “Embassy pattern” already exists in the literature on multi-agent systems, citing mediator agents that route and translate messages across interface boundaries. I have not been able to verify the specific citation GPT provided, which is worth noting in a post about not trusting AI agent output uncritically. The concept of mediator agents between external and local systems does appear in the multi-agent literature: Weyns’ pattern language for multi-agent systems, Diallo et al.’s work on ambassador agents in simulation interoperability, and Hayden, Carrick, and Yang’s 1999 work on social patterns in MAS (which uses explicit embassy terminology) all describe related ideas, though none is the specific source GPT claimed. Regardless, the framing holds: my version does what those mediator patterns do, with one addition. It assumes the external agent might be compromised. The old mediator was for translation. This one is for containment.

GPT’s sharpest observation was about quarantine. Most agent architectures treat validation as a binary: the output is either clean or it is rejected. The embassy pattern enables a third state, which is uncertainty. A quarantine path is not an error path. It is a designed state that acknowledges uncertainty as the baseline condition of interacting with an untrusted agent. If you cannot determine whether an output is safe, you do not approve it and you do not destroy it. You park it. The current implementation does not include a quarantine folder; the validator makes a binary valid-or-rejected decision. But the architectural commitment (that “I don’t know” is a valid response worth designing for) is what separates this pattern from just putting your agent in Docker, and the structure makes adding that third state straightforward.
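The three-state verdict argued for above can be sketched in a few lines. This is a hypothetical illustration, not code from the repository: the shipped validator is binary, and the `QUARANTINED` state, `Verdict` enum, and `triage` function here are the proposed addition, under assumed inputs (a list of matched rule names and a flag for whether the scan completed confidently).

```python
from enum import Enum, auto

# Hypothetical sketch of the three-state verdict; the current
# implementation only has the first two states.
class Verdict(Enum):
    VALID = auto()
    REJECTED = auto()
    QUARANTINED = auto()  # "I don't know" -- parked for human review

def triage(matched_rules: list[str], scan_complete: bool) -> Verdict:
    """Reject on a definite rule match; quarantine when the scan itself
    could not reach a confident answer (e.g. a binary or truncated file)."""
    if matched_rules:
        return Verdict.REJECTED
    if not scan_complete:
        return Verdict.QUARANTINED
    return Verdict.VALID
```

The point of the sketch is that quarantine is a return value, not an exception: uncertainty flows through the same code path as the other two outcomes.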

The metaphor breaks down in one instructive way. A real embassy protects the ambassador from the host country. This pattern protects the host from the ambassador. The threat model is inverted. Your agent is not a trusted diplomat. It is an entity of uncertain loyalty with capabilities you granted and intentions you cannot fully verify. The sandbox exists because trust, in this context, is a configuration error.

Three Containers

The implementation uses Docker Compose to orchestrate three containers that talk to each other through carefully scoped interfaces.

Container 1: The Agent. The root filesystem is read only. All Linux capabilities are dropped. The user is non-root. There is no direct network access. It can read tasks from an inbox directory (mounted as read only), write results to an outbox directory, and make HTTP requests, though only through the egress proxy. Tmpfs provides scratch space, with size limits. PID limits prevent fork bombs. Memory caps prevent resource exhaustion. The agent runs whatever code you give it, but the code runs inside a box where every privilege has been removed except the ones explicitly granted back.

# Key security constraints from docker-compose.yml
read_only: true
cap_drop: [ALL]
security_opt: ["no-new-privileges:true"]
user: "1000:1000"
mem_limit: 2G
pids_limit: 100

Container 2: The Egress Proxy. A Squid proxy that allowlists specific domains. The agent container’s HTTP clients are pointed at this container through the HTTP_PROXY and HTTPS_PROXY environment variables. If the agent tries to reach a domain not on the allowlist, the connection is refused. The proxy sits on two networks: an internal bridge (shared with the agent) and an external bridge (with actual internet access). The agent container is only on the internal network. It cannot bypass the proxy because it has no route to the internet that doesn’t go through it.

acl allowed_hosts dstdomain api.openai.com
acl allowed_hosts dstdomain .github.com
http_access allow allowed_hosts
http_access deny all
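The semantics of those two acl lines can be modeled in a few lines of Python. This is an illustration of Squid’s dstdomain matching, not Squid itself: a bare entry matches only that exact host, while a leading-dot entry matches the domain and all of its subdomains.

```python
# Model of Squid dstdomain matching (illustrative, not Squid itself).
def host_allowed(host: str, allowlist: list[str]) -> bool:
    host = host.lower().rstrip(".")
    for entry in allowlist:
        entry = entry.lower()
        if entry.startswith("."):
            # ".github.com" matches github.com and every subdomain
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            # bare "api.openai.com" matches only that exact host
            return True
    return False
```

Note the asymmetry this encodes: api.openai.com is allowed but evil.api.openai.com is not, while raw.github.com passes because of the leading dot.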

Container 3: The Validator. A Python script running in its own container, watching the outbox directory for new files. Every file the agent writes is scanned against a set of rules before the operator consumes it. The rules check for private key material, API key patterns (OpenAI’s sk-, AWS’s AKIA), oversized files, and symlinks (which could escape the outbox). Files that fail validation are moved to outbox/rejected/ with a JSON report explaining what triggered the rejection. The validator runs on a separate network from the agent container; it only processes files and can neither be reached by the agent nor reach anything outside the outbox.
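The outbox scan can be sketched in a few lines. This is a simplified illustration, not the repository’s actual validator: the specific regexes, the size limit, and the report format here are assumptions layered on the behavior described above.

```python
import json
import re
import shutil
from pathlib import Path

# Illustrative rule set; the real validation-rules.yml is the source of truth.
RULES = [
    ("private_key", re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----")),
    ("openai_key", re.compile(r"\bsk-[A-Za-z0-9_-]{20,}")),
    ("aws_key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
]
MAX_SIZE = 10 * 1024 * 1024  # assumed oversized-file threshold

def scan(path: Path) -> list[str]:
    """Return the names of violated rules (empty list means clean)."""
    if path.is_symlink():
        return ["symlink"]  # could point outside the outbox; check first
    violations = []
    if path.stat().st_size > MAX_SIZE:
        violations.append("oversized")
    text = path.read_text(errors="replace")
    for name, pattern in RULES:
        if pattern.search(text):
            violations.append(name)
    return violations

def process(outbox: Path, path: Path) -> bool:
    """Move failing files to outbox/rejected/ with a JSON report."""
    violations = scan(path)
    if not violations:
        return True
    rejected = outbox / "rejected"
    rejected.mkdir(exist_ok=True)
    shutil.move(str(path), rejected / path.name)
    report = rejected / (path.name + ".report.json")
    report.write_text(json.dumps({"file": path.name, "violations": violations}))
    return False
```

The symlink check runs before anything else because stat() and read_text() follow symlinks; validating the target instead of the link would defeat the check.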

This last point matters architecturally. The validator runs on the host side of the trust boundary. The agent can write whatever it wants to the outbox. The validator decides whether that output is safe to consume. The agent cannot influence the validator’s decisions because the agent cannot communicate with the validator. They share a directory. That is the entirety of their interface.

What This Solves

When Moltbook launched at the end of January (a social network designed for AI agents, with tens of thousands registered in its first weekend) the question of how to connect an agent to it was immediate. From what I saw, most people yoloed. They gave their agent direct access to their host system, direct network access, and direct credential access, because the friction of doing anything else was too high and they wanted to participate before the moment passed. This is understandable. It is also the operational equivalent of giving your house keys to a stranger because the locksmith is closed.

I built the sandbox architecture the same day. The Docker container had a root filesystem that was read only, dropped capabilities, execution as a non-root user, resource limits, and volume isolation between inbox and outbox. What it did not have, at the time, was the egress proxy. The agent had direct network access because the proxy configuration was broken and I bypassed it to get the agent running.

That bypass was the wrong call. The agent was communicating with Moltbook’s API, which was fine. But it could also have communicated with any other endpoint, which was not fine. A compromised agent skill (and we knew compromised skills existed in the ecosystem because the agent itself had discovered one) could exfiltrate data to an arbitrary domain, and nothing in the architecture would have stopped it. The inbox/outbox isolation protected the host filesystem. The capability dropping protected against privilege escalation. The egress proxy would have protected against data exfiltration. Without it, the security model had a gap the width of the entire internet.

Agent Embassy is the generalized and cleaned version of that deployment. The egress proxy works. The validator exists. The configuration is in YAML files a human can read in thirty seconds rather than buried in Docker flags that took three debugging sessions to get right. The gap is closed.

The Configuration Philosophy

The entire system is configured through four files:

| File | Purpose | Lines |
| --- | --- | --- |
| docker-compose.yml | Container orchestration | ~90 |
| config/agent.yml | Agent identity and resource limits | ~50 |
| config/squid.conf | Domain allowlist | ~40 |
| config/validation-rules.yml | Output scanning patterns (requires PyYAML in the validator image) | ~40 |

No custom framework. No SDK. No dependency beyond Docker and standard container images. The configuration is deliberately minimal because every line you add is a line you have to understand when something goes wrong at 2 AM. A Squid config that says “allow these two domains, deny everything else” is auditable by anyone who can read English. A validation rule that blocks strings matching sk-proj- (the current OpenAI project key format) is self-documenting, though key formats evolve and patterns need updating accordingly. The value of this approach is not sophistication. It is legibility.

The tradeoff is that you have to write configuration. There is no agent-embassy init --secure that generates a setup with safe defaults for your specific agent. You have to think about which domains your agent needs, which output patterns are dangerous, and what resource limits are appropriate. This is a feature. If you cannot enumerate the domains your agent should reach, you do not understand your agent well enough to deploy it. The configuration is a forcing function for understanding.

What This Does Not Solve

The embassy pattern is a sandbox for network and filesystem isolation. It does not solve:

Prompt injection. If your agent processes untrusted input and that input manipulates the agent’s behavior, the sandbox limits the damage but does not prevent it. An agent that has been injected through a prompt can still call allowlisted APIs with malicious parameters, write misleading results to the outbox, or consume resources up to the configured limits. The sandbox makes prompt injection survivable, not impossible.

Semantic attacks. The validator checks for syntactic patterns: things that look like API keys, things that look like private keys. It does not understand the meaning of what the agent writes. An agent leaking information encoded in natural language (“The user’s database password is hunter2, here is my analysis…”) will pass validation unless you write a rule that specifically matches that pattern, which you cannot do because you do not know what the information looks like in advance. Semantic validation is an open problem. The validator handles the easy cases.
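The gap can be shown concretely. This is an illustrative check, not the real rule set: a syntactic pattern catches key-shaped strings but is blind to the same secret expressed in ordinary prose.

```python
import re

# A syntactic rule in the style the validator uses (illustrative).
KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}")

def looks_leaky(text: str) -> bool:
    """True if the text contains a key-shaped string. A secret restated
    in natural language sails straight through this check."""
    return bool(KEY_PATTERN.search(text))
```

A string containing a literal sk- key trips the rule; “The user’s database password is hunter2” does not, because nothing about it is key-shaped.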

Supply chain attacks upstream of the sandbox. If the Docker image your agent runs on is compromised, the sandbox still applies (dropped capabilities and a root filesystem that is read only limit what a compromised image can do) but the agent’s behavior within those constraints is controlled by the attacker. Pin your images. Verify your digests. The sandbox is a second line of defense, not a first.

Cost control. An agent within its resource limits can still make expensive API calls through the egress proxy. If your agent is calling GPT-4 in a loop, the sandbox will let every call through because the domain is allowlisted. Rate limiting at the proxy layer is possible but not included in the default configuration. Add it if your threat model includes runaway API costs, which it should.
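What a proxy-side call budget could look like is easy to sketch. This is a hypothetical addition, not part of the default configuration: a sliding-window counter that refuses requests once the per-window budget is spent, regardless of whether the destination is allowlisted.

```python
import time

# Hypothetical proxy-side budget guard (not in the default config).
class CallBudget:
    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls: list[float] = []

    def allow(self, now=None) -> bool:
        """Record and permit a call unless the window budget is spent."""
        if now is None:
            now = time.monotonic()
        # drop timestamps that have aged out of the window
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True
```

In practice this would sit in whatever mediates the agent’s outbound calls; the point is that domain allowlisting and spend limiting are independent controls.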

Why Open Source

When GPT reported the prior art, it offered a useful framing: “You’re not inventing from nothing; you’re updating an old pattern for a new failure mode.” The older mediator agent patterns were about interface mediation between distributed systems. This version is about adversarial trust in an environment where your agent can get compromised by other agents and bring back poisoned intelligence. The community working on distributed systems solved message routing. The community building LLM agents has not yet solved the trust problem, and the default is to ignore it.

The credential stealer was not theoretical. The social engineering through context shaping was not a conference talk. These were operational realities encountered by a live system, and the security architecture was developed in response to specific failures and near misses over roughly three weeks of running an AI agent in production on Moltbook.

The pattern is simple enough that withholding it would be choosing obscurity over utility for no defensible reason. The implementation is three containers and four configuration files. Anyone with Docker experience could build it in an afternoon. Open sourcing it saves that afternoon and standardizes the approach so that improvements propagate.

Repository: github.com/AshitaOrbis/agent-embassy

AI agents deployed without network isolation are a credential exfiltration waiting to happen. The pattern that prevents it fits in a README. Here is the README.
