Is distributed-llama open source?

Yes — b4rtaz/distributed-llama is open source, released under the MIT license.

What language is distributed-llama written in?

b4rtaz/distributed-llama is primarily written in C++.

How popular is distributed-llama?

b4rtaz/distributed-llama has 3k stars on GitHub.

Where can I find distributed-llama?

b4rtaz/distributed-llama is on GitHub at https://github.com/b4rtaz/distributed-llama.

b4rtaz/distributed-llama

Run 70B Models on a Cluster of Mac Minis and Raspberry Pis

Distributed Llama splits large language models across home devices over Ethernet so one machine’s RAM and CPU no longer dictate your model size.

★3k stars C++ Inference · Serving Language Models

What it does

Distributed Llama is a C++ inference engine that shards a neural network across a root node and 2^n - 1 workers on a local network, so the total node count is always a power of two. It uses tensor parallelism and synchronizes layers over Ethernet, letting you pool RAM and CPU cores from disparate machines (Mac Minis, Raspberry Pis, Linux boxes, or Windows desktops) to run models that would not fit on one device. The root node handles the model weights and state synchronization while also computing its own slice of the network.
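
To make the tensor-parallel idea concrete, here is a minimal sketch in plain Python. It is not the project's C++ code. It assumes a single weight matrix whose output rows divide evenly among the nodes: each node multiplies only its own slice by the shared input, and concatenating the per-node outputs reproduces the single-machine result. The function names (shard_rows, node_forward, gather) and the example matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, used as the single-machine reference."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def shard_rows(w: Matrix, n_nodes: int) -> List[Matrix]:
    """Split the output rows of w into n_nodes equal, contiguous slices."""
    if len(w) % n_nodes != 0:
        raise ValueError("row count must be divisible by the node count")
    rows_per_node = len(w) // n_nodes
    return [w[i * rows_per_node:(i + 1) * rows_per_node] for i in range(n_nodes)]


def node_forward(w_slice: Matrix, x: Vector) -> Vector:
    """Work done by one node: multiply only its slice by the shared input."""
    return matvec(w_slice, x)


def gather(partials: List[Vector]) -> Vector:
    """Synchronization step: concatenate per-node outputs in node order."""
    return [value for part in partials for value in part]


# Hypothetical 4x3 weight matrix split across 2 nodes (root + 1 worker).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 1.0, 1.0],
     [2.0, 2.0, 2.0]]
x = [1.0, 2.0, 3.0]

slices = shard_rows(W, 2)
sharded = gather([node_forward(s, x) for s in slices])
print(sharded == matvec(W, x))  # the sharded result matches the reference
```

Each node stores only its slice of the weights, which is why this style of sharding divides RAM as well as compute. In a real cluster the gather step is the part that crosses the network.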

The interesting bit

The project treats your LAN like a high-speed fabric for a compute cluster rather than a mere file-sharing pipe. Workers are deliberately stateless: they need no model configuration, just an IP and port, so you can add or remove commodity hardware without reconfiguring the model itself.

Key highlights

Supports the Llama 3.x, DeepSeek R1 Distill, and Qwen 3 model families, including MoE models, on CPU and on an experimental Vulkan backend.
Optimized for ARM and x86_64 AVX2; runs on Linux, macOS, and Windows.
Splits both computation and RAM usage across nodes, with the root node bearing a slightly heavier memory load.
Provides CLI chat, benchmark inference, a worker daemon, and an API server mode.
Ships with a launcher that downloads compatible models and tokenizers automatically.
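
As a rough illustration of why splitting RAM matters, the sketch below estimates the weight memory each node would hold for a q40 model. It rests on assumptions not stated on this page: that q40 stores 32 weights in an 18-byte block (16 bytes of 4-bit values plus a 2-byte scale, about 4.5 bits per weight), that weights divide evenly across nodes, and that KV cache, activation buffers, and the root node's extra load are ignored. The 70-billion parameter count is a round illustrative figure. Treat the output as a back-of-envelope estimate, not a measurement.

```python
# Assumed q40 layout: 32 weights per block, 16 bytes of nibbles + 2-byte scale.
Q40_BLOCK_WEIGHTS = 32
Q40_BLOCK_BYTES = 18


def q40_weight_bytes(n_params: int) -> float:
    """Approximate bytes needed to store n_params weights in the assumed q40 layout."""
    return n_params / Q40_BLOCK_WEIGHTS * Q40_BLOCK_BYTES


def per_node_gib(n_params: int, n_nodes: int) -> float:
    """Weight memory per node in GiB, assuming a perfectly even split."""
    return q40_weight_bytes(n_params) / n_nodes / 2**30


# Node counts follow the power-of-two rule; 70e9 parameters is illustrative.
for nodes in (1, 2, 4, 8):
    gib = per_node_gib(70_000_000_000, nodes)
    print(f"{nodes} node(s): ~{gib:.1f} GiB of weights each")
```

Under these assumptions, doubling the node count halves the weights each machine must hold, which is the effect that lets several small-memory devices serve a model none of them could load alone.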

Caveats

Node counts are strictly powers of two (1, 2, 4, 8…) and cannot exceed the number of KV heads in the chosen model.
Only a narrow set of quantization pairings is currently supported: q40 models with q80 buffer types, or fully f32 configurations.
Vulkan support is explicitly marked experimental.
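
The node-count and quantization caveats above can be expressed as a small pre-flight check. This is a hypothetical helper written for illustration, not part of the project: the function name check_cluster, its parameters, and the 8-KV-head example model are all assumptions.

```python
# (weight type, buffer type) pairings listed in the caveats above.
SUPPORTED_PAIRS = {("q40", "q80"), ("f32", "f32")}


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and everything else."""
    return n >= 1 and (n & (n - 1)) == 0


def check_cluster(n_nodes: int, n_kv_heads: int,
                  weight_type: str, buffer_type: str) -> list:
    """Return a list of human-readable problems; an empty list means the setup passes."""
    problems = []
    if not is_power_of_two(n_nodes):
        problems.append(f"node count {n_nodes} is not a power of two")
    if n_nodes > n_kv_heads:
        problems.append(f"node count {n_nodes} exceeds {n_kv_heads} KV heads")
    if (weight_type, buffer_type) not in SUPPORTED_PAIRS:
        problems.append(f"unsupported pairing {weight_type}/{buffer_type}")
    return problems


# Hypothetical model with 8 KV heads.
print(check_cluster(4, 8, "q40", "q80"))   # passes all three checks
print(check_cluster(3, 8, "q40", "f32"))   # fails the node-count and pairing checks
print(check_cluster(16, 8, "f32", "f32"))  # fails the KV-head limit
```

Running a check like this before wiring up hardware makes the practical ceiling visible: for a model with 8 KV heads, the usable cluster sizes are 1, 2, 4, or 8 nodes.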

Verdict

Tinkerers with a shelf of underutilized ARM boards or homelab boxes finally have a practical excuse to wire them together. If you already own a single high-end GPU or a workstation with enough RAM, this adds complexity you probably do not need.
