ChatGPT in a Docker container, minus the cloud and the NDAs
A one-click, offline Llama 2 chatbot that keeps your prompts on your hardware.

What it does
LlamaGPT wraps Llama 2 and Code Llama models in a familiar ChatGPT-style web UI, served locally via Docker. It includes an OpenAI-compatible API at localhost:3001, so existing tools can point at your basement server instead of someone else’s GPU farm. The project is maintained by Umbrel, makers of a home-server OS, and it shows: deployment targets include umbrelOS, M1/M2 Macs, generic x86/arm64 Docker hosts, and Kubernetes.
The interesting bit The real work here is packaging, not model training. LlamaGPT glues together McKay Wrigley’s Chatbot UI, Georgi Gerganov’s llama.cpp, and Andrei’s Python bindings, then adds automated model downloads and hardware-specific run scripts. The benchmarks are unusually honest: a Raspberry Pi 4 manages 0.9 tokens/sec on the 7B model, while an M1 Max hits 54 tokens/sec. You know exactly what you’re getting into.
Key highlights
- Ships quantized models from 7B to 70B (and Code Llama variants), with memory requirements clearly listed
- CUDA support for Nvidia GPUs; Metal support for Apple Silicon
- Kubernetes manifests included for cluster deployments
- OpenAI-compatible API with auto-generated docs at
/docs - One-click install via umbrelOS App Store
Caveats
- Custom models and runtime model switching are on the roadmap but not yet implemented
- First launch downloads multi-gigabyte models and may appear hung for several minutes
- Benchmarks only cover M1 Max MacBook Pro for Code Llama models; other hardware is untested
Verdict Good fit for privacy-paranoid developers, homelabbers, or anyone whose internet is unreliable. Skip it if you need model flexibility today or if your hardware is closer to the Pi 4 than the M1 Max.