b4rtaz/distributed-llama
A C++ distributed inference engine that clusters home devices together via tensor parallelism to accelerate LLM serving.

This project implements distributed LLM inference by connecting multiple devices into a cluster. It uses tensor parallelism to divide each layer's computation among the devices, and high-speed synchronization over Ethernet to combine their partial results, which accelerates model serving. It runs large models such as Llama 3 and Qwen 3 across heterogeneous hardware, including CPUs, GPUs via Vulkan, and Raspberry Pi devices. A single command sets up and runs popular quantized models on the cluster.
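
To make the idea of tensor parallelism concrete, here is a minimal single-process sketch in C++. It is an illustration under stated assumptions, not the project's actual code: the function names `matvecSlice` and `shardedMatvec` are hypothetical, the weight matrix is assumed to be split by contiguous row blocks, and the network synchronization step is simulated by concatenating vectors in memory. The real engine may shard and synchronize differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of distributed-llama's API).
// Computes rows [rowBegin, rowEnd) of y = W * x, where W is stored
// row-major with `cols` columns. This is the work one node would do.
std::vector<float> matvecSlice(const std::vector<float>& w,
                               const std::vector<float>& x,
                               std::size_t cols,
                               std::size_t rowBegin,
                               std::size_t rowEnd) {
    assert(x.size() == cols);
    assert(rowBegin <= rowEnd && rowEnd * cols <= w.size());
    std::vector<float> out;
    out.reserve(rowEnd - rowBegin);
    for (std::size_t r = rowBegin; r < rowEnd; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            acc += w[r * cols + c] * x[c];
        }
        out.push_back(acc);
    }
    return out;
}

// Hypothetical driver: simulates `nNodes` devices, each owning an equal,
// contiguous block of rows. Appending each partial result to `y` stands in
// for the synchronization that a real cluster would perform over the network.
std::vector<float> shardedMatvec(const std::vector<float>& w,
                                 const std::vector<float>& x,
                                 std::size_t rows,
                                 std::size_t cols,
                                 std::size_t nNodes) {
    assert(nNodes > 0 && rows % nNodes == 0);
    const std::size_t rowsPerNode = rows / nNodes;
    std::vector<float> y;
    y.reserve(rows);
    for (std::size_t node = 0; node < nNodes; ++node) {
        const std::vector<float> part = matvecSlice(
            w, x, cols, node * rowsPerNode, (node + 1) * rowsPerNode);
        y.insert(y.end(), part.begin(), part.end());
    }
    return y;
}
```

Calling `shardedMatvec(w, x, rows, cols, 2)` gives the same result as `shardedMatvec(w, x, rows, cols, 1)`, because each row of the output depends only on one row of the weights. In a real cluster each slice would run on a separate device in parallel, so each device holds and multiplies only its share of the weights, and only the small output slices need to be exchanged.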