# claudebox Threat Model
A personal threat-model doc for one developer running Claude Code on a NixOS workstation, connected via Tailscale to a homelab and a work mail server, occasionally feeding the agent untrusted input.
## TL;DR
Agent security is an unsolvable problem, not a hard one. It's a tradeoff between usability and safety with no perfect local solution. You can run the agent on your machine, with your data, and with internet access — pick at most two. Any "secure local sandbox" only buys you protection against honest accidents and casual injection; a determined attacker who can put text in front of the model wins, given enough surface area. The realistic answer is to pick a posture per use case, not chase a single fortress.
## 1. The lethal trifecta: the only honest framing
Simon Willison's lethal trifecta is the cleanest way to think about this:
- Access to private data — your repo, your `~/.config`, your Tailscale hosts, your mail.
- Exposure to untrusted content — a fetched web page, an issue tracker comment, an MCP server's response, a dependency's README, a stray log line.
- Ability to communicate externally — `curl`, `git push`, an MCP tool calling out, a markdown image render.
If all three are present, an attacker who can put text anywhere in the model's context can in principle exfiltrate anything the agent can read. Willison's bottom line: end-users mixing tools cannot rely on vendor protections; the only reliable defense is to "avoid that lethal trifecta combination entirely."
That means: to be truly safe, you must remove at least one leg.
- No private data on the machine. Throwaway VM, no SSH keys, no `.env`, no Tailscale auth. Practical for evaluating untrusted code. Annoying for everyday refactoring.
- No untrusted content. Never let the agent read anything you didn't write — no `WebFetch`, no third-party MCP, no `git pull` from a stranger, no scraping issues. Practical for closed local work.
- No external comms. Network egress blocked. Hard to do well: DNS lookups, package installs, even error telemetry can become side-channels.
Anything short of removing a leg is harm reduction, not prevention. Treat it accordingly.
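The "remove a leg" rule is mechanical enough to write down as a predicate. A minimal sketch — the `Session` fields are illustrative labels for the three legs, not any real claudebox configuration schema:

```python
from dataclasses import dataclass

@dataclass
class Session:
    # Illustrative fields, not a real config schema.
    has_private_data: bool       # repo secrets, ~/.config, Tailscale, mail
    reads_untrusted_input: bool  # web fetches, third-party MCP, issue bodies
    has_external_egress: bool    # curl, git push, outbound MCP calls

def is_lethal(s: Session) -> bool:
    """All three legs present => exfiltration is possible in principle."""
    return (s.has_private_data
            and s.reads_untrusted_input
            and s.has_external_egress)

# A throwaway-VM posture removes the private-data leg:
throwaway = Session(has_private_data=False,
                    reads_untrusted_input=True,
                    has_external_egress=True)
assert not is_lethal(throwaway)
```

The point of writing it this way: the predicate is a conjunction, so flipping any single leg to `False` is sufficient — which is exactly why harm reduction (weakening a leg without removing it) never makes `is_lethal` false.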
This is the same insight Bruce Schneier circles back to under different names: agents are "double agents"; "LLMs are notoriously bad at context-switching" between trust domains. The corporate framing changes; the underlying fact doesn't.
## 2. Sandbox ≠ isolation. Be precise.
Treat these as different categories with different attack surfaces, not points on a single continuum.
| Boundary | Mechanism | Breaks when… |
|---|---|---|
| Sandbox | Shared kernel + namespaces / seccomp / MAC (bwrap, firejail, Docker, Landlock) | Kernel CVE; misconfiguration; bind-mount leak; an env var carrying a secret |
| VM | Own kernel, hypervisor boundary (qemu/KVM, microvm, NixOS VM, Vagrant) | Hypervisor CVE (rare — Venom-class, virtio escapes); shared-folder leak |
| Remote machine | Different physical host, network is the only attack surface | Compromised credentials; network-exposed services; insider |
| Air-gapped | No network at all | Sneakernet; supply-chain in the binaries you copied in |
Sandboxes (including `claude --sandbox` and claudebox) are kernel-shared. A kernel-level vuln, a missed bind-mount, an unscrubbed env var, or a forgotten Tailscale route can defeat them. VMs are stronger; remote hosts stronger still. There is no point pretending `bwrap` is equivalent to a Fedora VM is equivalent to a $5 VPS.
## 3. The posture ladder for one developer
This is the ladder of options for a single dev. For each, name what it actually buys.
### L0 — Bare `claude` on the host
Trusts the agent fully. Reasonable only for a closed offline session where you've personally vetted every input. Buys: nothing structural. Loses: everything if the model is induced to misbehave.
### L1 — `claude --sandbox` (Anthropic's built-in)
Anthropic's built-in sandboxing, with public docs and an open-source runtime. On Linux: bubblewrap. On macOS: Seatbelt. Writes are restricted to the CWD; network egress goes through a domain-allowlisting proxy; `denyRead` covers credential dirs.
Buys: protection against the most obvious accidents — Claude can't trivially read `~/.ssh/id_ed25519` or POST to `evil.com`. Anthropic explicitly claims that "even a successful prompt injection is fully isolated."
Doesn't buy: protection against (a) anything Claude can read in CWD (your repo, project .env), (b) allowlisted domains being exfil channels, (c) kernel CVEs, (d) MCP servers that themselves have network access.
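Point (b) deserves a concrete illustration. A domain-allowlisting proxy boils down to a default-deny hostname check; the sketch below uses a hypothetical allowlist (the real proxy's list is whatever you configure) and shows why an allowlisted host is still an exfiltration channel:

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration; not Anthropic's defaults.
ALLOWED_HOSTS = {"api.anthropic.com", "pypi.org", "files.pythonhosted.org"}

def egress_allowed(url: str) -> bool:
    """Default-deny: only exact hostnames on the allowlist may be contacted."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

assert egress_allowed("https://pypi.org/simple/requests/")
assert not egress_allowed("https://evil.example/collect")
# The catch: an allowlisted host still carries bits. A query string is
# a perfectly good side channel, and the check waves it through:
assert egress_allowed("https://pypi.org/simple/?q=hex-of-your-secret")
```

This is why "domain allowlist" is harm reduction, not a removed leg: any host the agent may reach, with any request shape, is a channel.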
### L2 — L1 plus claudebox-style hardening
What this project does: mount-namespace credential denial, env scrub, per-project conversation isolation, cgroup + nftables to block internal-network ranges (Tailscale CGNAT, RFC1918) while keeping public internet.
Buys: defense in depth. Closes the "forgot to add ~/.aws to denyRead" class of mistakes by structuring the mount set as an allowlist. Stops casual reach-into-homelab attacks. Useful against indirect prompt injection on a bad day.
Doesn't buy: anything against a determined targeted attacker. If you process content that's actively trying to attack you, this is not enough. Anyone selling local-sandbox-as-real-security is selling.
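For concreteness, the env-scrub layer can be sketched as a glob filter over the environment — a minimal sketch of the idea, not claudebox's actual implementation:

```python
import fnmatch

# Deny-patterns from the posture above. A real scrub should also drop
# known-sensitive exact names that don't match any glob (e.g. a
# Tailscale auth key variable) — treat this list as illustrative.
DENY_PATTERNS = ["*_TOKEN", "*_KEY", "*_SECRET"]

def scrub(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with secret-looking variables removed."""
    return {
        k: v for k, v in env.items()
        if not any(fnmatch.fnmatch(k, pat) for pat in DENY_PATTERNS)
    }

clean = scrub({"PATH": "/usr/bin", "GITHUB_TOKEN": "ghp_x", "API_SECRET": "s"})
assert clean == {"PATH": "/usr/bin"}
```

The scrubbed dict is what you would hand to the sandboxed process (e.g. `subprocess.run(..., env=clean)`), so the secret never exists inside the namespace at all — which is the same allowlist-over-denylist instinct as the mount handling.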
### L3 — Local VM (qemu/KVM, NixOS VM, microvm, Vagrant)
Own kernel; hypervisor boundary. Friction: filesystem sync (virtio-fs, sshfs), port forwarding, GPU passthrough for some workflows, performance.
Buys: real isolation against kernel-level exploits and the entire "I forgot one mount" class of failures. Snapshots make rollback cheap.
Doesn't buy: protection against the network-exfiltration leg if the VM has internet. Doesn't help against the agent doing dumb things to its own filesystem — it's still your code, mounted in.
### L4 — Cloud / remote dev environment
A VPS, a Codespace, a self-hosted Daytona/Gitpod, or a personal cheap VM you reach over Tailscale or SSH. Code lives there; agent runs there; your laptop is just a terminal. True hardware isolation.
Buys: the agent literally cannot reach your SSH keys, your homelab, your mail server, because they're not on that host. The only attack surface is the network channel back.
Doesn't buy: convenience. Friction: cost, latency, code sync (the jujutsu/git push-pull dance, or Mutagen-style file sync), setup overhead.
### L5 — Managed agent platforms
Container Use by Dagger gives each agent a containerized worktree with filtered egress. Others in this class: Anthropic's managed agents, E2B, Modal, Vercel Sandbox. The vendor handles the box.
Buys: somebody else's problem. Strong default isolation. Good for "evaluate this random repo from the internet."
Doesn't buy: control. You're trusting the vendor's threat model and breach response. Data residency / privacy may matter.
## 4. Consensus practical guidance
Distilled from NIST AI 600-1, MITRE ATLAS, OWASP Top 10 for LLM Applications 2025, OWASP Agentic Security Initiative / Top 10 for Agentic Applications 2026, Greshake et al. on indirect prompt injection, StruQ on structured-query defenses, and the Berkeley BAIR post on StruQ+SecAlign:
- Treat agent outputs as untrusted data. Don't `eval` model output, don't pipe it to a shell without thought, don't trust file paths or URLs it generates.
- Confine tool execution. Filesystem, network, system calls. Layered. Default-deny.
- Allowlist, not denylist, for any high-stakes scope. Denylists fail open; allowlists fail closed.
- Human in the loop for irreversible / high-blast-radius actions: `git push --force`, package publishing, anything touching `~/`, anything touching prod.
- No production credentials in the dev environment. Period. Different keychain, different box if you can.
- Audit logs outside the agent's reach — if the agent can rewrite its own history, you have no incident response.
- Ephemeral environments beat cleanup. Kill-and-rebuild a VM is more reliable than "did I miss a tempfile?"
- Assume injection on every untrusted input — fetched docs, repo READMEs, issue bodies, MCP tool returns. Greshake et al. demonstrated this against real production systems (Bing Chat, code-completion); it's not theoretical.
- No defense is currently provably robust. StruQ and Spotlighting raise the bar materially; they do not eliminate the problem. Don't bet sensitive data on a model-side defense.
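"Don't trust file paths it generates" has a standard mechanical form: resolve the proposed path and refuse anything that escapes the project root. A common traversal guard, sketched here with hypothetical names, not claudebox code:

```python
from pathlib import Path

def resolve_inside(root: Path, proposed: str) -> Path:
    """Resolve a model-proposed path; refuse anything escaping root.

    Note: resolve() also collapses symlinks, so a symlink planted
    inside the project pointing at ~/.ssh is caught too.
    """
    base = root.resolve()
    target = (base / proposed).resolve()
    if target != base and base not in target.parents:
        raise PermissionError(f"path escapes project root: {proposed}")
    return target

project = Path("/tmp/demo-project")
ok = resolve_inside(project, "src/main.py")
assert ok.name == "main.py"

try:
    resolve_inside(project, "../../home/user/.ssh/id_ed25519")
except PermissionError:
    pass  # refused, as intended
```

Same shape for URLs: parse, normalize, then check against an allowlist — always allowlist after normalization, never string-match the raw output.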
## 5. The line we draw
Given: one developer, NixOS host, Tailscale-connected homelab (endurance, alvin, great-western, fram), HAUSGOLD mail server, occasional untrusted inputs (fetched docs, third-party repos, issue trackers).
Default posture: L2. claudebox running over `claude --sandbox`, with:
- credential dirs explicitly mount-denied (not just unreadable — not present in the namespace),
- env scrubbed of `*_TOKEN`, `*_KEY`, `*_SECRET`, Tailscale auth,
- internal-network ranges (Tailscale CGNAT `100.64.0.0/10`, RFC1918) blocked at the network layer,
- public internet allowed,
- per-project conversation isolation so a poisoned context in project A can't leak into project B.
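The range-blocking bullet is enforced with nftables in practice; the range logic itself is small enough to show with the stdlib `ipaddress` module (a sketch of the policy, not the enforcement mechanism):

```python
import ipaddress

# The ranges named in the posture: Tailscale CGNAT plus RFC1918 space.
BLOCKED_NETS = [
    ipaddress.ip_network("100.64.0.0/10"),   # CGNAT, used by Tailscale
    ipaddress.ip_network("10.0.0.0/8"),      # RFC1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC1918
]

def egress_permitted(ip: str) -> bool:
    """True only for addresses outside every blocked internal range."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in BLOCKED_NETS)

assert not egress_permitted("100.101.102.103")  # a Tailscale peer
assert not egress_permitted("192.168.1.10")     # homelab host
assert egress_permitted("140.82.112.3")         # public internet
```

Note the asymmetry: this blocks ranges but allows everything else, so it is a denylist for the private-data leg and deliberately leaves the external-comms leg open — which is exactly why L2 is harm reduction, not a removed leg.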
What L2 is for: refactoring my own code, running tests, writing docs, light web research, working with code I trust. Honest-accident protection + a meaningful speed bump against casual indirect injection. Not a fortress.
Escape hatch: L4 for anything that smells. Triggers:
- Evaluating a repo I didn't write.
- Running an unfamiliar MCP server.
- Anything that ingests untrusted email, issue content, or scraped web pages as part of the task, not just background context.
- Any workflow where the agent will plausibly produce a `git push` to a shared remote.
L4 in practice = a cheap VPS reachable over Tailscale, with code synced via git, and nothing on it that I'd mind losing. Or Container Use for the lower-friction managed version.
Explicitly out of scope:
- Building a "perfect local sandbox." It does not exist.
- Pretending claudebox protects me from a targeted attacker. It doesn't.
- Defending against state-level adversaries. Wrong threat model for a personal dev box.
Explicitly in scope:
- Reducing blast radius of model misbehavior to "I lost an hour of work" rather than "my SSH keys are on pastebin."
- Making the easy path the safer one — fewer footguns, less to remember.
- Knowing which tier I'm in for any given session, and switching deliberately.
## 6. Sources
- Simon Willison — The lethal trifecta for AI agents; ongoing prompt-injection series
- Anthropic Engineering — Making Claude Code more secure and autonomous; Auto mode
- Anthropic — Claude Code Sandboxing docs; sandbox-runtime on GitHub; Managed Agents
- NIST — AI 600-1: Generative AI Profile (July 2024); AI RMF program
- MITRE — ATLAS; advmlthreatmatrix on GitHub
- OWASP — Top 10 for LLM Applications 2025; Agentic Security Initiative; Top 10 for Agentic Applications 2026; Agentic AI Threats and Mitigations
- Greshake, Abdelnabi, Mishra, Endres, Holz, Fritz — Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (AISec '23 / arXiv:2302.12173)
- Chen et al. — StruQ: Defending Against Prompt Injection with Structured Queries (arXiv:2402.06363); Berkeley BAIR — StruQ + SecAlign
- Bruce Schneier — Privacy for Agentic AI; Building Trustworthy AI Agents; AI and Trust
- Practical writeups — Promptfoo: Testing AI's Lethal Trifecta; HiddenLayer: The Lethal Trifecta and how to defend; Dagger: Container Use / LLM primitive