A Rust vector index that squeezes 31 GB of float32 embeddings into 4 GB without a training phase, then outruns FAISS on the query.
Inference · Serving
newcomers · gaining speed
whichllm ranks local models by real benchmark scores, not parameter count, and tells you which ones actually fit your hardware.
VoxCPM2 generates speech directly from text using continuous diffusion, with no discrete audio tokens required.
OpenMed packages clinical entity extraction and HIPAA-grade de-identification into models small enough for Apple Silicon and impatient DevOps teams.
Cosmos 3 tries to unify video generation, robot action prediction, and physical reasoning inside a single 16B–64B Mixture-of-Transformers architecture.
AirLLM slices giant transformers into layer shards so they fit in consumer VRAM without quantization or distillation.
A Go-based gateway that cross-converts between OpenAI, Claude, and Gemini formats so you don't have to pick sides in the API format wars.
A Chinese speech toolkit that bundles ASR, diarization, emotion detection, and streaming into one MIT-licensed package.
Nango turns natural language into deployable TypeScript integration code, then runs it on managed infrastructure.
Self-hosted chat UI that unifies OpenAI, Anthropic, Google, AWS, and two dozen other providers under one roof.
A dependency-free C/C++ inference engine that squeezes large language models onto laptops, phones, and browsers through aggressive quantization and hand-rolled kernels.
TileRT squeezes millisecond-level latency out of hundred-billion-parameter models by decomposing operators into tile-level tasks and overlapping compute, I/O, and communication across 8 GPUs.
LiteLLM is the adapter layer that stops your codebase from fracturing across a dozen provider SDKs.
Voice-Pro bundles Whisper, F5-TTS, CosyVoice, and a dozen other tools into a single Gradio interface for creators who want ElevenLabs-like results without the API bills.
A living spreadsheet of which AI providers actually let you call their models for free, with rate limits and gotchas spelled out.
LUPINE fakes the CUDA driver so CPU-only machines can rent remote GPUs as if they were local.
Manifest picks the cheapest model that can handle each query, mixing API keys, subscriptions, and local hardware in one endpoint.
A locally-run frontend that wrangles dozens of LLM APIs, image generators, and TTS into one obsessively customizable interface.
A purpose-built inference and fine-tuning stack that treats M-series chips as first-class citizens instead of afterthoughts.
MLX-VLM crams speculative decoding, continuous batching, and KV cache quantization into a Mac-native toolkit for running multimodal models locally.






