A Rust vector index that squeezes 31 GB of float32 embeddings into 4 GB without a training phase, then outruns FAISS at query time.
Inference · Serving
VoxCPM2 generates speech directly from text using continuous diffusion, no discrete audio tokens required.
whichllm ranks local models by real benchmark scores, not parameter count, and tells you which ones actually fit your hardware.
OpenMed packages clinical entity extraction and HIPAA-grade de-identification into models small enough for Apple Silicon and impatient DevOps teams.
Cosmos 3 tries to unify video generation, robot action prediction, and physical reasoning inside a single 16B–64B Mixture-of-Transformers architecture.
AirLLM slices giant transformers into layer shards so they fit in consumer VRAM without quantization or distillation.
A Go-based gateway that cross-converts between OpenAI, Claude, and Gemini formats so you don't have to pick sides in the API format wars.
A Chinese speech toolkit that bundles ASR, diarization, emotion detection, and streaming into one MIT-licensed package.
Nango turns natural language into deployable TypeScript integration code, then runs it on managed infrastructure.
A dependency-free C/C++ inference engine that squeezes large language models onto laptops, phones, and browsers through aggressive quantization and hand-rolled kernels.
Self-hosted chat UI that unifies OpenAI, Anthropic, Google, AWS, and two dozen other providers under one roof.
LiteLLM is the adapter layer that stops your codebase from fracturing across a dozen provider SDKs.
An open-source gateway for splitting AI subscriptions across teams without breaking native tools.
Voice-Pro bundles Whisper, F5-TTS, CosyVoice, and a dozen other tools into a single Gradio interface for creators who want ElevenLabs-like results without the API bills.
A local proxy that turns sixteen scattered LLM free tiers into one OpenAI-compatible endpoint with automatic failover.
TileRT squeezes millisecond-level latency out of hundred-billion-parameter models by decomposing operators into tile-level tasks and overlapping compute, I/O, and communication across 8 GPUs.
The pi project bundles a CLI coding agent with an unusually paranoid approach to supply-chain security.
A living spreadsheet of which AI providers actually let you call their models for free, with rate limits and gotchas spelled out.
A locally-run frontend that wrangles dozens of LLM APIs, image generators, and TTS into one obsessively customizable interface.
A purpose-built inference and fine-tuning stack that treats M-series chips as first-class citizens instead of afterthoughts.




