Is OnnxStream open source?

Yes — vitoplantamura/OnnxStream is an open-source project tracked on heatdrop.

What language is OnnxStream written in?

vitoplantamura/OnnxStream is primarily written in C++.

How popular is OnnxStream?

vitoplantamura/OnnxStream has 2.1k stars on GitHub.

Where can I find OnnxStream?

vitoplantamura/OnnxStream is on GitHub at https://github.com/vitoplantamura/OnnxStream.


vitoplantamura/OnnxStream

An inference engine that starves RAM on purpose

A C++ ONNX inference library built to minimize memory above all else, enabling billion-parameter models to run on devices like the 512MB Raspberry Pi Zero 2.

★2.1k stars C++ Inference · Serving ML Frameworks



What it does

OnnxStream is a compact C++ inference engine for ONNX models that deliberately minimizes memory usage instead of chasing raw speed. It splits the inference engine from weight loading through a WeightsProvider abstraction, so weights can stream from disk, be prefetched, or sit in RAM as needed. The library handles Stable Diffusion 1.5 and XL, Mistral 7B, TinyLlama, YOLOv8, and Whisper across ARM, x86, WASM, and RISC-V, using XNNPACK for acceleration.
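The split between engine and weight loading can be illustrated with a small sketch. Every name below (TensorSource, ResidentSource, StreamingSource, sum_all_weights) is hypothetical and chosen for illustration; this is not OnnxStream's actual WeightsProvider API, only one way such an abstraction might look, assuming the engine asks for a tensor by name and releases it once the operator has finished.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface, NOT OnnxStream's real API: the engine acquires a
// tensor by name and releases it when the operator no longer needs it.
struct TensorSource {
    virtual ~TensorSource() = default;
    virtual const std::vector<float>& acquire(const std::string& name) = 0;
    virtual void release(const std::string& name) = 0;
};

// Keeps every tensor resident in RAM: fastest, highest memory use.
class ResidentSource : public TensorSource {
public:
    explicit ResidentSource(std::map<std::string, std::vector<float>> all)
        : all_(std::move(all)) {}
    const std::vector<float>& acquire(const std::string& name) override {
        return all_.at(name);
    }
    void release(const std::string&) override {}  // nothing to free
private:
    std::map<std::string, std::vector<float>> all_;
};

// Holds one tensor at a time. The loader is supplied by the caller and could
// read a file or fetch over the network; the buffer is freed on release.
class StreamingSource : public TensorSource {
public:
    using Loader = std::function<std::vector<float>(const std::string&)>;
    explicit StreamingSource(Loader loader) : loader_(std::move(loader)) {}
    const std::vector<float>& acquire(const std::string& name) override {
        current_ = loader_(name);
        peak_ = std::max(peak_, current_.size());
        return current_;
    }
    void release(const std::string&) override {
        current_.clear();
        current_.shrink_to_fit();
    }
    // Largest number of elements held at once, for comparison with the total.
    std::size_t peak_elements() const { return peak_; }
private:
    Loader loader_;
    std::vector<float> current_;
    std::size_t peak_ = 0;
};

// Toy "engine": visits layers in order and never touches two at once.
float sum_all_weights(TensorSource& src, const std::vector<std::string>& layers) {
    float total = 0.0f;
    for (const auto& name : layers) {
        for (float w : src.acquire(name)) total += w;
        src.release(name);
    }
    return total;
}
```

Because the toy engine only sees the interface, swapping a resident source for a streaming one changes peak memory (the largest single tensor instead of the sum of all tensors) without changing the result.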

The interesting bit

The architecture treats model weights as a stream rather than a preload. Dynamic and static quantization, attention slicing, and tiled VAE decoding let OnnxStream run SDXL 1.0 in under 300MB of RAM, compared with a typical recommendation of 12GB of VRAM. The cost is time: a 10-step image takes about 11 hours on a Pi Zero 2. A custom WeightsProvider can even pull parameters over HTTP without writing to local disk, which is the literal reason for the “Stream” in the name.
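To make the memory argument concrete, here is a minimal sketch of the general idea behind attention slicing. It is not taken from OnnxStream's source; the function name sliced_attention, the Matrix alias, and the slice_rows parameter are assumptions made for illustration. Instead of materializing the full query-by-key score matrix, the sketch processes a few query rows at a time, so the scratch buffer holds slice_rows × keys values rather than queries × keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major, illustrative only

// Computes softmax(Q K^T / sqrt(d)) V, visiting slice_rows query rows at a time.
Matrix sliced_attention(const Matrix& q, const Matrix& k, const Matrix& v,
                        std::size_t slice_rows) {
    assert(slice_rows > 0 && !q.empty() && !k.empty() && k.size() == v.size());
    const std::size_t d = q[0].size();
    const float inv_sqrt_d = 1.0f / std::sqrt(static_cast<float>(d));
    Matrix out(q.size(), std::vector<float>(v[0].size(), 0.0f));

    for (std::size_t start = 0; start < q.size(); start += slice_rows) {
        const std::size_t end = std::min(q.size(), start + slice_rows);
        // Scratch buffer for this slice only: (end - start) x k.size().
        Matrix scores(end - start, std::vector<float>(k.size()));

        for (std::size_t i = start; i < end; ++i) {
            std::vector<float>& row = scores[i - start];
            float max_score = -std::numeric_limits<float>::infinity();
            for (std::size_t j = 0; j < k.size(); ++j) {
                float dot = 0.0f;
                for (std::size_t c = 0; c < d; ++c) dot += q[i][c] * k[j][c];
                row[j] = dot * inv_sqrt_d;
                max_score = std::max(max_score, row[j]);
            }
            // Numerically stable softmax over the row.
            float denom = 0.0f;
            for (float& s : row) { s = std::exp(s - max_score); denom += s; }
            // Weighted sum of value rows.
            for (std::size_t j = 0; j < k.size(); ++j) {
                const float w = row[j] / denom;
                for (std::size_t c = 0; c < v[j].size(); ++c) out[i][c] += w * v[j][c];
            }
        }
    }
    return out;
}
```

Each output row depends only on its own query row, so the slice size changes peak scratch memory but not the result. Smaller slices trade some throughput for a lower memory ceiling.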

Key highlights

Runs SDXL 1.0 in 298MB RAM on a Raspberry Pi Zero 2 (512MB total) without adding swap space or offloading intermediate results to disk
Claims 55× lower memory than OnnxRuntime for SD 1.5’s UNET, with a 50%–200% latency increase on CPU
Supports 41 common ONNX operators, FP16, and 8-bit asymmetric quantization with percentile calibration
Ships as a single implementation file plus header, with Python, C#, and JavaScript/WASM bindings
GPU support via cuBLAS is available but currently limited to FP16/FP32 and only the LLM application
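The quantization highlight above can be illustrated with a generic sketch of 8-bit asymmetric quantization using percentile calibration. This is a textbook-style formulation, not OnnxStream's implementation; the names QuantParams, percentile, calibrate, quantize, and dequantize, and the nearest-rank percentile rule, are assumptions made for illustration. The idea is to pick the quantization range from percentiles of observed values rather than the raw minimum and maximum, so that rare outliers do not waste most of the 256 available levels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative parameters: real = (q - zero_point) * scale.
struct QuantParams {
    float scale;
    std::int32_t zero_point;
};

// Nearest-rank percentile; p is a fraction in [0, 1]. Takes a copy to sort.
float percentile(std::vector<float> v, double p) {
    assert(!v.empty() && p >= 0.0 && p <= 1.0);
    std::sort(v.begin(), v.end());
    const std::size_t idx =
        static_cast<std::size_t>(std::llround(p * static_cast<double>(v.size() - 1)));
    return v[idx];
}

// Chooses [lo, hi] from the (1 - clip) and clip percentiles of the samples.
// The range is widened to include 0 so that zero is exactly representable.
QuantParams calibrate(const std::vector<float>& samples, double clip) {
    const float lo = std::min(0.0f, percentile(samples, 1.0 - clip));
    const float hi = std::max(0.0f, percentile(samples, clip));
    float scale = (hi - lo) / 255.0f;
    if (scale == 0.0f) scale = 1.0f;  // degenerate case: all samples are zero
    long zp = std::lround(-lo / scale);
    zp = std::min(255L, std::max(0L, zp));
    return QuantParams{scale, static_cast<std::int32_t>(zp)};
}

// Maps a float to an unsigned 8-bit level, saturating outside the range.
std::uint8_t quantize(float x, const QuantParams& p) {
    const long q = std::lround(x / p.scale) + p.zero_point;
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, q)));
}

// Maps an 8-bit level back to an approximate float.
float dequantize(std::uint8_t q, const QuantParams& p) {
    return static_cast<float>(static_cast<std::int32_t>(q) - p.zero_point) * p.scale;
}
```

With clip set to 1.0 the range is the plain minimum and maximum; a value such as 0.99 ignores the top and bottom 1% of samples, which then saturate at 0 or 255 when quantized.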

Caveats

GPU acceleration is at an early stage and restricted to the LLM chat app, leaving most workloads CPU-bound
Only 41 ONNX operators are implemented, so less common architectures may fail to load
The engine executes operations sequentially; while most individual operators are multithreaded, graph-level parallelism is not the goal

Verdict

Reach for OnnxStream when you need to squeeze large models into small devices, browser WASM sandboxes, or RAM-constrained servers. Look elsewhere if your priority is low latency, full ONNX compliance, or broad GPU acceleration.
