michaelfeil/infinity
A high-throughput, low-latency REST API for serving text-embedding, reranking, CLIP, and ColPali models on GPU, CPU, or Apple MPS.

Infinity is a serving engine that provides a REST API for deploying and running ML models including text-embeddings, reranking models, and multimodal models like CLIP and ColPali. It supports multiple inference backends (PyTorch, ONNX/TensorRT via optimum, CTranslate2) and hardware targets including NVIDIA CUDA, AMD ROCM, CPU, AWS Inferentia2, and Apple MPS. The engine uses dynamic batching and per-worker tokenization to maximize throughput.