# claudebox Threat Model

> A personal threat-model doc for one developer running Claude Code on a NixOS workstation, connected via Tailscale to a homelab and a work mail server, occasionally feeding the agent untrusted input.

## TL;DR

Agent security is an **unsolvable problem**, not a hard one. It's a tradeoff between usability and safety with no perfect local solution. You can run the agent on your machine, with your data, and with internet access — pick at most two. Any "secure local sandbox" only buys you protection against honest accidents and casual injection; a determined attacker who can put text in front of the model wins, given enough surface area. The realistic answer is to pick a posture per use case, not chase a single fortress.

---

## 1. The lethal trifecta: the only honest framing

Simon Willison's [lethal trifecta](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/) is the cleanest way to think about this:

1. **Access to private data** — your repo, your `~/.config`, your Tailscale hosts, your mail.
2. **Exposure to untrusted content** — a fetched web page, an issue tracker comment, an MCP server's response, a dependency's README, a stray log line.
3. **Ability to communicate externally** — `curl`, `git push`, an MCP tool calling out, a markdown image render.

If all three are present, **an attacker who can put text anywhere in the model's context can in principle exfiltrate anything the agent can read**. Willison's bottom line: end-users mixing tools cannot rely on vendor protections; the only reliable defense is to "avoid that lethal trifecta combination entirely."

That means: to be *truly* safe, you must remove at least one leg.

- **No private data on the machine.** Throwaway VM, no SSH keys, no `.env`, no Tailscale auth. Practical for evaluating untrusted code. Annoying for everyday refactoring.
- **No untrusted content.** Never let the agent read anything you didn't write — no `WebFetch`, no third-party MCP, no `git pull` from a stranger, no scraping issues. Practical for closed local work.
- **No external comms.** Network egress blocked. Hard to do well: DNS lookups, package installs, even error telemetry can become side-channels.

Anything short of removing a leg is harm reduction, not prevention. Treat it accordingly.

This is the same insight Bruce Schneier circles back to under different names: agents are ["double agents"](https://www.schneier.com/blog/archives/2025/05/privacy-for-agentic-ai.html); ["LLMs are notoriously bad at context-switching"](https://www.schneier.com/blog/archives/2025/12/building-trustworthy-ai-agents.html) between trust domains. The corporate framing changes; the underlying fact doesn't.

---

## 2. Sandbox ≠ isolation. Be precise.

Treat these as different categories with different attack surfaces, not points on a single continuum.

| Boundary | Mechanism | Breaks when… |
| ------------------ | ------------------------------------------ | ----------------------------------------------------------------------------- |
| **Sandbox** | Shared kernel + namespaces / seccomp / MAC (bwrap, firejail, Docker, Landlock) | Kernel CVE; misconfiguration; bind-mount leak; an env var carrying a secret |
| **VM** | Own kernel, hypervisor boundary (qemu/KVM, microvm, NixOS VM, Vagrant) | Hypervisor CVE (rare — Venom-class, virtio escapes); shared-folder leak |
| **Remote machine** | Different physical host, network is the only attack surface | Compromised credentials; network-exposed services; insider |
| **Air-gapped** | No network at all | Sneakernet; supply-chain in the binaries you copied in |

Sandboxes (including `claude --sandbox` and `claudebox`) are **kernel-shared**. A kernel-level vuln, a missed bind-mount, an unscrubbed env var, or a forgotten Tailscale route can defeat them. VMs are stronger; remote hosts stronger still.
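To make the "bind-mount leak" row concrete: a bwrap-based sandbox can be built as a mount *allowlist*, where anything not explicitly bound simply does not exist inside the namespace. The flags below (`--unshare-all`, `--share-net`, `--die-with-parent`, `--ro-bind`, `--bind`, `--tmpfs`) are real bubblewrap options, but the mount set itself is an illustrative sketch, not claudebox's actual configuration:

```shell
#!/bin/sh
# Sketch: assemble a bwrap argument list as a mount allowlist.
# Illustrative mount set only — not claudebox's real one.
project="$PWD"

set -- \
  --unshare-all --share-net \
  --die-with-parent \
  --ro-bind /usr /usr \
  --ro-bind /etc/resolv.conf /etc/resolv.conf \
  --bind "$project" "$project" \
  --tmpfs "$HOME"

# $HOME exists but is empty: ~/.ssh and ~/.aws are not merely unreadable,
# they are absent. A forgotten entry fails closed, not open.
args="$*"
printf '%s\n' "$@"
```

A wrapper would finish with something like `exec bwrap "$@" claude`. The design point is the failure mode: in an allowlist, forgetting a path removes access; in a denylist, forgetting a path grants it.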
There is no point pretending bwrap is equivalent to a Fedora VM is equivalent to a $5 VPS.

---

## 3. The posture ladder for one developer

This is the ladder of options for a single dev. For each, name what it actually buys.

### L0 — Bare `claude` on the host

Trusts the agent fully. Reasonable only for a closed offline session where you've personally vetted every input.

Buys: nothing structural. Loses: everything if the model is induced to misbehave.

### L1 — `claude --sandbox` (Anthropic's built-in)

[Anthropic's sandboxing](https://www.anthropic.com/engineering/claude-code-sandboxing), [docs](https://code.claude.com/docs/en/sandboxing), [open source runtime](https://github.com/anthropic-experimental/sandbox-runtime). On Linux: bubblewrap. On macOS: seatbelt. Writes restricted to CWD; network egress goes through a domain-allowlisting proxy; `denyRead` for credential dirs.

Buys: protection against the most obvious accidents — Claude can't trivially read `~/.ssh/id_ed25519` or POST to `evil.com`. Anthropic explicitly claims that "even a successful prompt injection is fully isolated."

Doesn't buy: protection against (a) anything Claude can read in CWD (your repo, project `.env`), (b) allowlisted domains being exfil channels, (c) kernel CVEs, (d) MCP servers that themselves have network access.

### L2 — L1 plus claudebox-style hardening

What this project does: mount-namespace credential denial, env scrub, per-project conversation isolation, cgroup + nftables to block internal-network ranges (Tailscale CGNAT, RFC1918) while keeping public internet.

Buys: defense in depth. Closes the "forgot to add `~/.aws` to denyRead" class of mistakes by structuring the mount set as an allowlist. Stops casual reach-into-homelab attacks. Useful against indirect prompt injection on a bad day.

Doesn't buy: anything against a determined targeted attacker. If you process content that's actively trying to attack you, this is *not enough*.
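The env scrub piece of that hardening can be as small as a name-pattern filter applied before the agent process starts. A minimal sketch in plain POSIX shell; the pattern list matches the classes named in this doc, and the helper names (`is_secret_name`, `scrub_env`) are hypothetical, not claudebox's real code:

```shell
#!/bin/sh
# Sketch: drop any environment variable whose name looks like a credential.
# Pattern list and helper names are illustrative.
is_secret_name() {
  case "$1" in
    *_TOKEN|*_KEY|*_SECRET|TS_AUTHKEY) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the NAME=VALUE pairs that survive the scrub. (Multi-line values
# are not handled; a real implementation iterates environ in-process.)
scrub_env() {
  env | while IFS='=' read -r name value; do
    is_secret_name "$name" && continue
    printf '%s=%s\n' "$name" "$value"
  done
}

export GITHUB_TOKEN="hypothetical-leak"   # demo variable, not a real secret
kept=$(scrub_env)
```

The surviving pairs would then seed an `env -i` invocation ahead of `exec claude`. Values containing spaces or newlines are exactly why the real thing should rebuild the environment in-process rather than round-trip it through shell word-splitting.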
Anyone selling local-sandbox-as-real-security is selling.

### L3 — Local VM (qemu/KVM, NixOS VM, microvm, Vagrant)

Own kernel; hypervisor boundary. Friction: filesystem sync (virtio-fs, sshfs), port forwarding, GPU passthrough for some workflows, performance.

Buys: real isolation against kernel-level exploits and the entire "I forgot one mount" class of failures. Snapshots make rollback cheap.

Doesn't buy: protection against the network-exfiltration leg if the VM has internet. Doesn't help against the agent doing dumb things to *its own* filesystem (still your code, mounted in).

### L4 — Cloud / remote dev environment

A VPS, a Codespace, a self-hosted [Daytona](https://daytona.io)/Gitpod, or a personal cheap VM you reach over Tailscale or SSH. Code lives there; agent runs there; your laptop is just a terminal. True hardware isolation.

Buys: the agent literally cannot reach your SSH keys, your homelab, your mail server, because they're not on that host. The only attack surface is the network channel back.

Doesn't buy: convenience. Friction: cost, latency, sync (a jujutsu/git push-pull dance, or Mutagen-style file sync). Setup overhead.

### L5 — Managed agent platforms

[Container Use by Dagger](https://dagger.io/blog/llm/) gives each agent a containerized worktree with filtered egress. [Anthropic's managed agents](https://www.anthropic.com/engineering/managed-agents), E2B, Modal, Vercel Sandbox. Vendor handles the box.

Buys: somebody else's problem. Strong default isolation. Good for "evaluate this random repo from the internet."

Doesn't buy: control. You're trusting the vendor's threat model and breach response. Data residency / privacy may matter.

---

## 4. Consensus practical guidance

Distilled from [NIST AI 600-1](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf), [MITRE ATLAS](https://atlas.mitre.org/), [OWASP Top 10 for LLM Applications 2025](https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/), [OWASP Agentic Security Initiative](https://genai.owasp.org/initiatives/agentic-security-initiative/) / [Top 10 for Agentic Applications 2026](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/), [Greshake et al.](https://arxiv.org/abs/2302.12173) on indirect prompt injection, [StruQ](https://arxiv.org/abs/2402.06363) on structured-query defenses, and the Berkeley BAIR [post on StruQ+SecAlign](https://bair.berkeley.edu/blog/2025/04/11/prompt-injection-defense/):

- **Treat agent outputs as untrusted data.** Don't `eval` model output, don't pipe it to a shell without thought, don't trust file paths or URLs it generates.
- **Confine tool execution.** Filesystem, network, system calls. Layered. Default-deny.
- **Allowlist, not denylist**, for any high-stakes scope. Denylists fail open; allowlists fail closed.
- **Human in the loop for irreversible / high-blast-radius actions**: `git push --force`, package publishing, anything touching `~/`, anything touching prod.
- **No production credentials in the dev environment.** Period. Different keychain, different box if you can.
- **Audit logs outside the agent's reach** — if the agent can rewrite its own history, you have no incident response.
- **Ephemeral environments beat cleanup.** Kill-and-rebuild a VM is more reliable than "did I miss a tempfile?"
- **Assume injection on every untrusted input** — fetched docs, repo READMEs, issue bodies, MCP tool returns. Greshake et al. demonstrated this against real production systems (Bing Chat, code-completion); it's not theoretical.
- **No defense is currently provably robust.** StruQ and Spotlighting raise the bar materially; they do not eliminate the problem. Don't bet sensitive data on a model-side defense.

---

## 5. The line we draw

Given: one developer, NixOS host, Tailscale-connected homelab (`endurance`, `alvin`, `great-western`, `fram`), HAUSGOLD mail server, occasionally untrusted inputs (fetched docs, third-party repos, issue trackers).

**Default posture: L2.** claudebox running over `claude --sandbox`, with:

- credential dirs explicitly mount-denied (not just unreadable — not present in the namespace),
- env scrubbed of `*_TOKEN`, `*_KEY`, `*_SECRET`, Tailscale auth,
- internal-network ranges (Tailscale CGNAT `100.64.0.0/10`, RFC1918) blocked at the network layer,
- public internet allowed,
- per-project conversation isolation so a poisoned context in project A can't leak into project B.

What L2 is for: refactoring my own code, running tests, writing docs, light web research, working with code I trust. Honest-accident protection + a meaningful speed bump against casual indirect injection. Not a fortress.

**Escape hatch: L4** for anything that smells. Triggers:

- Evaluating a repo I didn't write.
- Running an unfamiliar MCP server.
- Anything that ingests untrusted email, issue content, or scraped web pages as part of the *task*, not just background context.
- Any workflow where the agent will plausibly produce a `git push` to a shared remote.

L4 in practice = a cheap VPS reachable over Tailscale, with code synced via git, and **nothing on it that I'd mind losing**. Or [Container Use](https://dagger.io/blog/llm/) for the lower-friction managed version.

**Explicitly out of scope:**

- Building a "perfect local sandbox." It does not exist.
- Pretending claudebox protects me from a targeted attacker. It doesn't.
- Defending against state-level adversaries. Wrong threat model for a personal dev box.

**Explicitly in scope:**

- Reducing blast radius of model misbehavior to "I lost an hour of work" rather than "my SSH keys are on pastebin."
- Making the *easy* path the safer one — fewer footguns, less to remember.
- Knowing which tier I'm in for any given session, and switching deliberately.

---

## 6. Sources

- Simon Willison — [The lethal trifecta for AI agents](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/); ongoing [prompt-injection series](https://simonwillison.net/series/prompt-injection/)
- Anthropic Engineering — [Making Claude Code more secure and autonomous](https://www.anthropic.com/engineering/claude-code-sandboxing); [Auto mode](https://www.anthropic.com/engineering/claude-code-auto-mode)
- Anthropic — [Claude Code Sandboxing docs](https://code.claude.com/docs/en/sandboxing); [sandbox-runtime on GitHub](https://github.com/anthropic-experimental/sandbox-runtime); [Managed Agents](https://www.anthropic.com/engineering/managed-agents)
- NIST — [AI 600-1: Generative AI Profile (July 2024)](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf); [AI RMF program](https://www.nist.gov/itl/ai-risk-management-framework)
- MITRE — [ATLAS](https://atlas.mitre.org/); [advmlthreatmatrix on GitHub](https://github.com/mitre/advmlthreatmatrix)
- OWASP — [Top 10 for LLM Applications 2025](https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/); [Agentic Security Initiative](https://genai.owasp.org/initiatives/agentic-security-initiative/); [Top 10 for Agentic Applications 2026](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/); [Agentic AI Threats and Mitigations](https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/)
- Greshake, Abdelnabi, Mishra, Endres, Holz, Fritz — [Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (AISec '23 / arXiv:2302.12173)](https://arxiv.org/abs/2302.12173)
- Chen et al. — [StruQ: Defending Against Prompt Injection with Structured Queries (arXiv:2402.06363)](https://arxiv.org/abs/2402.06363); Berkeley BAIR — [StruQ + SecAlign](https://bair.berkeley.edu/blog/2025/04/11/prompt-injection-defense/)
- Bruce Schneier — [Privacy for Agentic AI](https://www.schneier.com/blog/archives/2025/05/privacy-for-agentic-ai.html); [Building Trustworthy AI Agents](https://www.schneier.com/blog/archives/2025/12/building-trustworthy-ai-agents.html); [AI and Trust](https://www.schneier.com/academic/archives/2025/06/ai-and-trust.html)
- Practical writeups — [Promptfoo: Testing AI's Lethal Trifecta](https://www.promptfoo.dev/blog/lethal-trifecta-testing/); [HiddenLayer: The Lethal Trifecta and how to defend](https://www.hiddenlayer.com/research/the-lethal-trifecta-and-how-to-defend-against-it); [Dagger: Container Use / LLM primitive](https://dagger.io/blog/llm/)