# claudebox Threat Model

> A personal threat-model doc for one developer running Claude Code on a NixOS workstation, connected via Tailscale to a homelab and a work mail server, occasionally feeding the agent untrusted input.

## TL;DR

Agent security is an **unsolvable problem**, not a hard one. It's a tradeoff between usability and safety with no perfect local solution. You can run the agent on your machine, with your data, and with internet access — pick at most two. Any "secure local sandbox" only buys you protection against honest accidents and casual injection; a determined attacker who can put text in front of the model wins, given enough surface area. The realistic answer is to pick a posture per use case, not chase a single fortress.

---

## 1. The lethal trifecta: the only honest framing

Simon Willison's [lethal trifecta](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/) is the cleanest way to think about this:

1. **Access to private data** — your repo, your `~/.config`, your Tailscale hosts, your mail.
2. **Exposure to untrusted content** — a fetched web page, an issue tracker comment, an MCP server's response, a dependency's README, a stray log line.
3. **Ability to communicate externally** — `curl`, `git push`, an MCP tool calling out, a markdown image render.

If all three are present, **an attacker who can put text anywhere in the model's context can in principle exfiltrate anything the agent can read**. Willison's bottom line: end-users mixing tools cannot rely on vendor protections; the only reliable defense is to "avoid that lethal trifecta combination entirely."

That means: to be *truly* safe, you must remove at least one leg.

- **No private data on the machine.** Throwaway VM, no SSH keys, no `.env`, no Tailscale auth. Practical for evaluating untrusted code. Annoying for everyday refactoring.
- **No untrusted content.** Never let the agent read anything you didn't write — no `WebFetch`, no third-party MCP, no `git pull` from a stranger, no scraping issues. Practical for closed local work.
- **No external comms.** Network egress blocked. Hard to do well: DNS lookups, package installs, even error telemetry can become side-channels.

Anything short of removing a leg is harm reduction, not prevention. Treat it accordingly.

This is the same insight Bruce Schneier circles back to under different names: agents are ["double agents"](https://www.schneier.com/blog/archives/2025/05/privacy-for-agentic-ai.html); ["LLMs are notoriously bad at context-switching"](https://www.schneier.com/blog/archives/2025/12/building-trustworthy-ai-agents.html) between trust domains. The corporate framing changes; the underlying fact doesn't.

---

## 2. Sandbox ≠ isolation. Be precise.

Treat these as different categories with different attack surfaces, not points on a single continuum.

| Boundary | Mechanism | Breaks when… |
| ------------------ | ------------------------------------------ | ----------------------------------------------------------------------------- |
| **Sandbox** | Shared kernel + namespaces / seccomp / MAC (bwrap, firejail, Docker, Landlock) | Kernel CVE; misconfiguration; bind-mount leak; an env var carrying a secret |
| **VM** | Own kernel, hypervisor boundary (qemu/KVM, microvm, NixOS VM, Vagrant) | Hypervisor CVE (rare — Venom-class, virtio escapes); shared-folder leak |
| **Remote machine** | Different physical host, network is the only attack surface | Compromised credentials; network-exposed services; insider |
| **Air-gapped** | No network at all | Sneakernet; supply-chain in the binaries you copied in |

Sandboxes (including `claude --sandbox` and `claudebox`) are **kernel-shared**. A kernel-level vuln, a missed bind-mount, an unscrubbed env var, or a forgotten Tailscale route can defeat them. VMs are stronger; remote hosts stronger still.
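To make the "bind-mount leak" row concrete: a bwrap-based sandbox can be built as a mount *allowlist*, where anything not explicitly bound simply does not exist inside the namespace. The flags below (`--unshare-all`, `--share-net`, `--die-with-parent`, `--ro-bind`, `--bind`, `--tmpfs`) are real bubblewrap options, but the mount set itself is an illustrative sketch, not claudebox's actual configuration:

```shell
#!/bin/sh
# Sketch: assemble a bwrap argument list as a mount allowlist.
# Illustrative mount set only — not claudebox's real one.
project="$PWD"

set -- \
  --unshare-all --share-net \
  --die-with-parent \
  --ro-bind /usr /usr \
  --ro-bind /etc/resolv.conf /etc/resolv.conf \
  --bind "$project" "$project" \
  --tmpfs "$HOME"

# $HOME exists but is empty: ~/.ssh and ~/.aws are not merely unreadable,
# they are absent. A forgotten entry fails closed, not open.
args="$*"
printf '%s\n' "$@"
```

A wrapper would finish with something like `exec bwrap "$@" claude`. The design point is the failure mode: in an allowlist, forgetting a path removes access; in a denylist, forgetting a path grants it.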
There is no point pretending bwrap is equivalent to a Fedora VM is equivalent to a $5 VPS.

---

## 3. The posture ladder for one developer

This is the ladder of options for a single dev. For each, name what it actually buys.

### L0 — Bare `claude` on the host

Trusts the agent fully. Reasonable only for a closed offline session where you've personally vetted every input.

Buys: nothing structural. Loses: everything if the model is induced to misbehave.

### L1 — `claude --sandbox` (Anthropic's built-in)

[Anthropic's sandboxing](https://www.anthropic.com/engineering/claude-code-sandboxing), [docs](https://code.claude.com/docs/en/sandboxing), [open source runtime](https://github.com/anthropic-experimental/sandbox-runtime). On Linux: bubblewrap. On macOS: seatbelt. Writes restricted to CWD; network egress goes through a domain-allowlisting proxy; `denyRead` for credential dirs.

Buys: protection against the most obvious accidents — Claude can't trivially read `~/.ssh/id_ed25519` or POST to `evil.com`. Anthropic explicitly claims that "even a successful prompt injection is fully isolated."

Doesn't buy: protection against (a) anything Claude can read in CWD (your repo, project `.env`), (b) allowlisted domains being exfil channels, (c) kernel CVEs, (d) MCP servers that themselves have network access.

### L2 — L1 plus claudebox-style hardening

What this project does: mount-namespace credential denial, env scrub, per-project conversation isolation, cgroup + nftables to block internal-network ranges (Tailscale CGNAT, RFC1918) while keeping public internet.

Buys: defense in depth. Closes the "forgot to add `~/.aws` to denyRead" class of mistakes by structuring the mount set as an allowlist. Stops casual reach-into-homelab attacks. Useful against indirect prompt injection on a bad day.

Doesn't buy: anything against a determined targeted attacker. If you process content that's actively trying to attack you, this is *not enough*.
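The env scrub piece of that hardening can be as small as a name-pattern filter applied before the agent process starts. A minimal sketch in plain POSIX shell; the pattern list matches the classes named in this doc, and the helper names (`is_secret_name`, `scrub_env`) are hypothetical, not claudebox's real code:

```shell
#!/bin/sh
# Sketch: drop any environment variable whose name looks like a credential.
# Pattern list and helper names are illustrative.
is_secret_name() {
  case "$1" in
    *_TOKEN|*_KEY|*_SECRET|TS_AUTHKEY) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the NAME=VALUE pairs that survive the scrub. (Multi-line values
# are not handled; a real implementation iterates environ in-process.)
scrub_env() {
  env | while IFS='=' read -r name value; do
    is_secret_name "$name" && continue
    printf '%s=%s\n' "$name" "$value"
  done
}

export GITHUB_TOKEN="hypothetical-leak"   # demo variable, not a real secret
kept=$(scrub_env)
```

The surviving pairs would then seed an `env -i` invocation ahead of `exec claude`. Values containing spaces or newlines are exactly why the real thing should rebuild the environment in-process rather than round-trip it through shell word-splitting.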
Anyone selling local-sandbox-as-real-security is selling.

### L3 — Local VM (qemu/KVM, NixOS VM, microvm, Vagrant)

Own kernel; hypervisor boundary. Friction: filesystem sync (virtio-fs, sshfs), port forwarding, GPU passthrough for some workflows, performance.

Buys: real isolation against kernel-level exploits and the entire "I forgot one mount" class of failures. Snapshots make rollback cheap.

Doesn't buy: protection against the network-exfiltration leg if the VM has internet. Doesn't help against the agent doing dumb things to *its own* filesystem (still your code, mounted in).

### L4 — Cloud / remote dev environment

A VPS, a Codespace, a self-hosted [Daytona](https://daytona.io)/Gitpod, or a personal cheap VM you reach over Tailscale or SSH. Code lives there; agent runs there; your laptop is just a terminal. True hardware isolation.

Buys: the agent literally cannot reach your SSH keys, your homelab, your mail server, because they're not on that host. The only attack surface is the network channel back.

Doesn't buy: convenience. Friction: cost, latency, sync (a jujutsu/git push-pull dance, or Mutagen-style file sync). Setup overhead.

### L5 — Managed agent platforms

[Container Use by Dagger](https://dagger.io/blog/llm/) gives each agent a containerized worktree with filtered egress. [Anthropic's managed agents](https://www.anthropic.com/engineering/managed-agents), E2B, Modal, Vercel Sandbox. Vendor handles the box.

Buys: somebody else's problem. Strong default isolation. Good for "evaluate this random repo from the internet."

Doesn't buy: control. You're trusting the vendor's threat model and breach response. Data residency / privacy may matter.

---

## 4. Consensus practical guidance

Distilled from [NIST AI 600-1](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf), [MITRE ATLAS](https://atlas.mitre.org/), [OWASP Top 10 for LLM Applications 2025](https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/), [OWASP Agentic Security Initiative](https://genai.owasp.org/initiatives/agentic-security-initiative/) / [Top 10 for Agentic Applications 2026](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/), [Greshake et al.](https://arxiv.org/abs/2302.12173) on indirect prompt injection, [StruQ](https://arxiv.org/abs/2402.06363) on structured-query defenses, and the Berkeley BAIR [post on StruQ+SecAlign](https://bair.berkeley.edu/blog/2025/04/11/prompt-injection-defense/):

- **Treat agent outputs as untrusted data.** Don't `eval` model output, don't pipe it to a shell without thought, don't trust file paths or URLs it generates.
- **Confine tool execution.** Filesystem, network, system calls. Layered. Default-deny.
- **Allowlist, not denylist**, for any high-stakes scope. Denylists fail open; allowlists fail closed.
- **Human in the loop for irreversible / high-blast-radius actions**: `git push --force`, package publishing, anything touching `~/`, anything touching prod.
- **No production credentials in the dev environment.** Period. Different keychain, different box if you can.
- **Audit logs outside the agent's reach** — if the agent can rewrite its own history, you have no incident response.
- **Ephemeral environments beat cleanup.** Kill-and-rebuild a VM is more reliable than "did I miss a tempfile?"
- **Assume injection on every untrusted input** — fetched docs, repo READMEs, issue bodies, MCP tool returns. Greshake et al. demonstrated this against real production systems (Bing Chat, code-completion); it's not theoretical.
- **No defense is currently provably robust.** StruQ and Spotlighting raise the bar materially; they do not eliminate the problem. Don't bet sensitive data on a model-side defense.

---

## 5. The line we draw

Given: one developer, NixOS host, Tailscale-connected homelab (`endurance`, `alvin`, `great-western`, `fram`), HAUSGOLD mail server, occasionally untrusted inputs (fetched docs, third-party repos, issue trackers).

**Default posture: L2.** claudebox running over `claude --sandbox`, with:

- credential dirs explicitly mount-denied (not just unreadable — not present in the namespace),
- env scrubbed of `*_TOKEN`, `*_KEY`, `*_SECRET`, Tailscale auth,
- internal-network ranges (Tailscale CGNAT `100.64.0.0/10`, RFC1918) blocked at the network layer,
- public internet allowed,
- per-project conversation isolation so a poisoned context in project A can't leak into project B.

What L2 is for: refactoring my own code, running tests, writing docs, light web research, working with code I trust. Honest-accident protection + a meaningful speed bump against casual indirect injection. Not a fortress.

**Escape hatch: L4** for anything that smells. Triggers:

- Evaluating a repo I didn't write.
- Running an unfamiliar MCP server.
- Anything that ingests untrusted email, issue content, or scraped web pages as part of the *task*, not just background context.
- Any workflow where the agent will plausibly produce a `git push` to a shared remote.

L4 in practice = a cheap VPS reachable over Tailscale, with code synced via git, and **nothing on it that I'd mind losing**. Or [Container Use](https://dagger.io/blog/llm/) for the lower-friction managed version.

**Explicitly out of scope:**

- Building a "perfect local sandbox." It does not exist.
- Pretending claudebox protects me from a targeted attacker. It doesn't.
- Defending against state-level adversaries. Wrong threat model for a personal dev box.

**Explicitly in scope:**

- Reducing blast radius of model misbehavior to "I lost an hour of work" rather than "my SSH keys are on pastebin."
- Making the *easy* path the safer one — fewer footguns, less to remember.
- Knowing which tier I'm in for any given session, and switching deliberately.

---

## 6. Sources

- Simon Willison — [The lethal trifecta for AI agents](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/); ongoing [prompt-injection series](https://simonwillison.net/series/prompt-injection/)
- Anthropic Engineering — [Making Claude Code more secure and autonomous](https://www.anthropic.com/engineering/claude-code-sandboxing); [Auto mode](https://www.anthropic.com/engineering/claude-code-auto-mode)
- Anthropic — [Claude Code Sandboxing docs](https://code.claude.com/docs/en/sandboxing); [sandbox-runtime on GitHub](https://github.com/anthropic-experimental/sandbox-runtime); [Managed Agents](https://www.anthropic.com/engineering/managed-agents)
- NIST — [AI 600-1: Generative AI Profile (July 2024)](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf); [AI RMF program](https://www.nist.gov/itl/ai-risk-management-framework)
- MITRE — [ATLAS](https://atlas.mitre.org/); [advmlthreatmatrix on GitHub](https://github.com/mitre/advmlthreatmatrix)
- OWASP — [Top 10 for LLM Applications 2025](https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/); [Agentic Security Initiative](https://genai.owasp.org/initiatives/agentic-security-initiative/); [Top 10 for Agentic Applications 2026](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/); [Agentic AI Threats and Mitigations](https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/)
- Greshake, Abdelnabi, Mishra, Endres, Holz, Fritz — [Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (AISec '23 / arXiv:2302.12173)](https://arxiv.org/abs/2302.12173)
- Chen et al. — [StruQ: Defending Against Prompt Injection with Structured Queries (arXiv:2402.06363)](https://arxiv.org/abs/2402.06363); Berkeley BAIR — [StruQ + SecAlign](https://bair.berkeley.edu/blog/2025/04/11/prompt-injection-defense/)
- Bruce Schneier — [Privacy for Agentic AI](https://www.schneier.com/blog/archives/2025/05/privacy-for-agentic-ai.html); [Building Trustworthy AI Agents](https://www.schneier.com/blog/archives/2025/12/building-trustworthy-ai-agents.html); [AI and Trust](https://www.schneier.com/academic/archives/2025/06/ai-and-trust.html)
- Practical writeups — [Promptfoo: Testing AI's Lethal Trifecta](https://www.promptfoo.dev/blog/lethal-trifecta-testing/); [HiddenLayer: The Lethal Trifecta and how to defend](https://www.hiddenlayer.com/research/the-lethal-trifecta-and-how-to-defend-against-it); [Dagger: Container Use / LLM primitive](https://dagger.io/blog/llm/)