vndee/llm-sandbox

A firecracker for LLM-generated code: run it, don't trust it

Python library that wraps containers around AI-written code so you can execute without executing yourself.

1.1k stars · Python · Coding Assistants, Agents
Velocity (7d): +1.5 ★/day · Trend: steady

What it does

LLM Sandbox is a Python library that spins up isolated containers to run code produced by large language models. You pass a string of Python, JavaScript, Java, C++, Go, or R; it handles dependency installation, execution, and cleanup inside Docker, Kubernetes, or Podman. The API is a context manager: with SandboxSession(lang="python") as session: and you’re off.
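
The context-manager lifecycle described here (set up on entry, execute, clean up on exit) can be illustrated with a stdlib-only stand-in. This is a sketch, not llm-sandbox's implementation: ToySession is a hypothetical class that runs code in a child Python process inside a temporary directory, which provides no real isolation.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


class ToySession:
    """Hypothetical stand-in for a sandbox session (NOT the llm-sandbox API).

    Uses a child process and a temp directory instead of a container,
    so it offers no real isolation.
    """

    def __enter__(self):
        # Setup: create a scratch workspace (a real sandbox would start a container).
        self.workdir = Path(tempfile.mkdtemp(prefix="toy-sandbox-"))
        return self

    def run(self, code: str, timeout: float = 10.0) -> str:
        # Write the untrusted string to a file and run it in a separate interpreter.
        script = self.workdir / "main.py"
        script.write_text(code, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(script)],
            cwd=self.workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.stdout

    def __exit__(self, exc_type, exc, tb):
        # Cleanup: remove the workspace even if the body raised.
        shutil.rmtree(self.workdir, ignore_errors=True)
        return False


with ToySession() as session:
    output = session.run("print(2 + 3)")
    workdir = session.workdir
```

The point of the with-block is that cleanup runs whether or not the generated code fails, so a crashed run does not leave a workspace (or, in the real library, a container) behind.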

The interesting bit

The library treats LLM output as inherently radioactive. It doesn’t just exec the string: it negotiates with container runtimes, manages on-the-fly package installs, and can extract artifacts like matplotlib plots as base64 without ever letting the code touch your host filesystem. The new MCP server integration means Claude Desktop can now delegate code execution to a locked-down sandbox rather than hoping for the best.
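
The base64 hand-off can be sketched with the standard library alone. How llm-sandbox packages artifacts internally is not covered here, so this example assumes only that image bytes reach the host as a base64 string. The names tiny_png and payload are hypothetical, and the 1x1 image stands in for a real plot.

```python
import base64
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    # A PNG chunk is: length, type, data, CRC32 over type + data.
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png() -> bytes:
    """Build a 1x1 8-bit grayscale PNG as a placeholder for a generated plot."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    scanline = b"\x00\x00"  # filter type 0, then one black pixel
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(scanline))
        + _chunk(b"IEND", b"")
    )


# "Sandbox side" (assumed): binary artifact becomes a text-safe string.
png_bytes = tiny_png()
payload = base64.b64encode(png_bytes).decode("ascii")

# "Host side": decode the string back to bytes; no shared filesystem needed.
decoded = base64.b64decode(payload)
```

Because the payload is plain ASCII text, it can travel over whatever channel carries stdout or an API response, and the host decides where, or whether, to write it to disk.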

Key highlights

  • Multi-language support: Python, JavaScript/Node.js, Java, C++, Go, R with automatic dependency resolution (pip, npm, Maven/Gradle, CRAN)
  • Three container backends: Docker, Kubernetes, Podman, plus remote Docker host support with TLS
  • Interactive sessions via InteractiveSandboxSession keep an IPython kernel alive across multiple run() calls, so state persists between calls like notebook cells
  • ArtifactSandboxSession captures generated plots and visualizations automatically
  • Container pooling to pre-warm and reuse environments for faster startup
  • File copy to/from sandbox, custom images, custom Dockerfiles, and resource limits (CPU, memory, time, network)
  • MCP server support for Model Context Protocol clients
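
The container pooling item above can be pictured as a queue of pre-started environments handed out and returned. A minimal sketch, assuming nothing about llm-sandbox's actual pool API: ToyPool and start_env are hypothetical names, and a dict stands in for a container handle.

```python
import queue
from contextlib import contextmanager
from itertools import count


class ToyPool:
    """Hypothetical pre-warmed pool (NOT the llm-sandbox API)."""

    def __init__(self, factory, size: int):
        self._idle = queue.Queue()
        # Pre-warm: pay the startup cost for every environment up front.
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        # Blocks (up to timeout) if every environment is currently in use.
        env = self._idle.get(timeout=timeout)
        try:
            yield env
        finally:
            # Return the environment for reuse instead of destroying it.
            self._idle.put(env)


_ids = count(1)


def start_env():
    # Stand-in for an expensive container start; returns a fake handle.
    return {"id": next(_ids)}


pool = ToyPool(start_env, size=2)
used = []
for _ in range(4):
    with pool.acquire() as env:
        used.append(env["id"])
```

Four requests are served by only two environments. A real pool would also need to reset or health-check an environment before handing it to the next caller, which this sketch omits.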

Caveats

  • Interactive sessions currently only support Python; other languages get fresh state per run()
  • The README is enthusiastic about “security policies” but doesn’t detail how to define or enforce them
  • Kubernetes backend requires passing raw pod manifests; no higher-level abstraction
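
The first caveat, persistent state versus fresh state per run(), can be shown with two toy runners. This is an in-process illustration of the semantics only: PersistentRunner and FreshRunner are hypothetical, and exec() in the host interpreter is the opposite of a sandbox.

```python
class PersistentRunner:
    """Hypothetical: one namespace shared across run() calls, like notebook cells."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> None:
        exec(code, self.namespace)


class FreshRunner:
    """Hypothetical: every run() starts from an empty namespace."""

    def run(self, code: str) -> None:
        exec(code, {})


persistent = PersistentRunner()
persistent.run("x = 41")
persistent.run("y = x + 1")  # x is still defined from the previous call
value = persistent.namespace["y"]

fresh = FreshRunner()
fresh.run("x = 41")
try:
    fresh.run("y = x + 1")  # x was discarded with the previous namespace
    fresh_error = None
except NameError as exc:
    fresh_error = exc
```

In practice this means multi-step workflows in non-Python languages have to send everything they need in a single run() call or pass state through files.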

Verdict

Worth a look if you’re building agents, code evaluation pipelines, or LLM tools that need to actually run the code they generate. Skip it if you already have a hardened internal sandbox or only need single-language execution with no isolation requirements.
