← all repositories

michaelfeil/infinity

A high-throughput, low-latency REST API for serving text-embedding, reranking, CLIP, and ColPali models on GPU, CPU, or Apple MPS.

infinity
Velocity · 7d
+2.9
★ / day
Trend
steady
star history

Infinity is a serving engine that provides a REST API for deploying and running ML models including text-embeddings, reranking models, and multimodal models like CLIP and ColPali. It supports multiple inference backends (PyTorch, ONNX/TensorRT via optimum, CTranslate2) and hardware targets including NVIDIA CUDA, AMD ROCM, CPU, AWS Inferentia2, and Apple MPS. The engine uses dynamic batching and per-worker tokenization to maximize throughput.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.