A self-hosted AI workspace that bolts chat, agents, email triage, calendars, and deep research onto your own hardware.
Inference · Serving
A Rust vector index that squeezes 31 GB of float32 embeddings into 4 GB without a training phase, then outruns FAISS on the query.
VoxCPM2 generates speech directly from text using continuous diffusion, no discrete audio tokens required.
A local proxy that turns sixteen scattered LLM free tiers into one OpenAI-compatible endpoint with automatic failover.
The pi project bundles a CLI coding agent with an unusually paranoid approach to supply-chain security.
whichllm ranks local models by real benchmark scores, not parameter count, and tells you which ones actually fit your hardware.
An open-source gateway for splitting AI subscriptions across teams without breaking native tools.
A dependency-free C/C++ inference engine that squeezes large language models onto laptops, phones, and browsers through aggressive quantization and hand-rolled kernels.
OpenSquilla routes each turn to the cheapest capable LLM, keeping persistent memory and tool use identical across CLI, Web UI, and chat channels.
A Go-based gateway that cross-converts between OpenAI, Claude, and Gemini formats so you don't have to pick sides in the API format wars.
Cosmos 3 tries to unify video generation, robot action prediction, and physical reasoning inside a single 16B–64B Mixture-of-Transformers architecture.
OpenMed packages clinical entity extraction and HIPAA-grade de-identification into models small enough for Apple Silicon and impatient DevOps teams.
A Go proxy that exposes Gemini CLI, Claude Code, Codex, and Grok through standard OpenAI-compatible APIs—no API keys required, just your existing OAuth logins.
Local proxy that auto-falls back to free models when your paid quota dies mid-session.
AirLLM slices giant transformers into layer shards so they fit in consumer VRAM without quantization or distillation.
A visual programming interface for image, video, 3D, and audio generation that treats model pipelines as composable graphs.
Self-hosted chat UI that unifies OpenAI, Anthropic, Google, AWS, and two dozen other providers under one roof.
LiteLLM is the adapter layer that stops your codebase from fracturing across a dozen provider SDKs.
A Chinese speech toolkit that bundles ASR, diarization, emotion detection, and streaming into one MIT-licensed package.
Ollama wraps llama.cpp in a one-line installer and a model registry so you can run open weights without reading a dozen READMEs.



