Is text-embeddings-inference open source?

Yes — huggingface/text-embeddings-inference is open source, released under the Apache-2.0 license.

What language is text-embeddings-inference written in?

huggingface/text-embeddings-inference is primarily written in Rust.

How popular is text-embeddings-inference?

huggingface/text-embeddings-inference has 5k stars on GitHub.

Where can I find text-embeddings-inference?

huggingface/text-embeddings-inference is on GitHub at https://github.com/huggingface/text-embeddings-inference.

huggingface/text-embeddings-inference

Serving embeddings without the PyTorch compilation nap

A Rust toolkit for serving open-source embedding and reranking models that skips graph compilation and boots fast enough to feel serverless.

★ 5k stars · Rust · Inference · Serving · RAG · Search

What it does

Text Embeddings Inference (TEI) is a Rust-based server for open-source text embeddings, reranking, and sequence classification. It bundles optimized kernels like Flash Attention and cuBLASLt into a compact container that exposes an HTTP or gRPC API, loading weights from Safetensors or ONNX. The project supports a broad catalog of architectures, from classic BERT and MPNet to newer Qwen3 and Gemma3 checkpoints.
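To make the HTTP API concrete, here is a minimal client sketch in Python using only the standard library. It assumes a TEI container is already running and reachable at a base URL you supply (the `http://localhost:8080` used in the note below is a placeholder, not something this page specifies), and that the server accepts a JSON body with an `inputs` field on its `/embed` route and returns one vector per input. Check the server's own API docs before relying on either assumption.

```python
import json
import urllib.request


def build_embed_request(base_url: str, texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for the assumed /embed route."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, texts: list[str]) -> list[list[float]]:
    """Send the request; assumes the response is a JSON list of float vectors."""
    request = build_embed_request(base_url, texts)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())
```

With a server listening locally, `embed("http://localhost:8080", ["What is deep learning?"])` would return a list containing one embedding vector. Splitting request construction from sending keeps the payload shape easy to inspect without a live server.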

The interesting bit

The standout feature is the absence of a model graph compilation step, which TEI pairs with small images to target serverless cold starts. It batches requests dynamically by token count rather than fixed payload sizes, though you still have to set the max-batch-tokens ceiling manually.
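Token-based batching is easier to picture with a toy example. The sketch below greedily packs queued requests into batches whose total token count stays under a ceiling. It is an illustration of the general idea only, not TEI's actual scheduler; the function name and the greedy first-fit policy are assumptions made for this example.

```python
def batch_by_token_budget(token_counts: list[int], max_batch_tokens: int) -> list[list[int]]:
    """Group request indices, in arrival order, so each batch fits the token budget."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for index, tokens in enumerate(token_counts):
        if tokens > max_batch_tokens:
            # A single request larger than the budget can never be scheduled.
            raise ValueError(f"request {index} has {tokens} tokens, over the budget")
        if current and current_tokens + tokens > max_batch_tokens:
            # Adding this request would overflow the budget: close the batch.
            batches.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches


# Four requests of 100, 200, 300 and 50 tokens with a 400-token ceiling.
print(batch_by_token_budget([100, 200, 300, 50], 400))  # [[0, 1], [2, 3]]
```

A fixed batch size of four would have sent all 650 tokens in one pass. Under a token budget, the batch boundary moves with the length of the inputs, which is why the ceiling has to be matched to the memory available on the target hardware.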

Key highlights

Supports embeddings, reranking, and sequence classification across dozens of model families including JinaBERT, Mistral, Alibaba GTE, and ModernBERT.
Runs on NVIDIA GPUs via CUDA, Apple Silicon via Metal, and offers experimental ROCm support for AMD Instinct cards.
Production-ready telemetry via OpenTelemetry tracing and Prometheus metrics.
Runs fully offline in air-gapped environments without phoning home to the Hugging Face Hub.
Optional SPLADE pooling for sparse lexical retrieval.
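For readers unfamiliar with SPLADE pooling, the sketch below shows the max-pooled formulation commonly associated with SPLADE: each vocabulary term's weight is the maximum over input tokens of log(1 + ReLU(logit)). It uses plain Python lists for clarity, and the function name is hypothetical. Whether TEI's implementation matches this step for step is an assumption, not something this page confirms.

```python
import math


def splade_pool(token_logits: list[list[float]]) -> list[float]:
    """Collapse per-token vocabulary logits into one sparse term-weight vector.

    token_logits[i][j] is the masked-language-model logit for vocabulary
    entry j at input token i. Assumes at least one token row.
    """
    vocab_size = len(token_logits[0])
    return [
        # ReLU zeroes out negative logits, log1p dampens large ones,
        # and max keeps the strongest activation across tokens.
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]


# Two tokens, three vocabulary entries (made-up logits for illustration).
weights = splade_pool([[-1.0, 0.0, 1.5], [2.0, -3.0, 0.0]])
active_terms = [j for j, w in enumerate(weights) if w > 0.0]
```

Terms that never receive a positive logit end up with weight zero, which is what makes the resulting vector sparse and suitable for lexical-style retrieval over an inverted index.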

Caveats

AMD Instinct GPU support via ROCm is explicitly labeled experimental.
The max-batch-tokens ceiling is left to the operator; the README warns that the tool cannot infer the optimal limit automatically.

Verdict

Worth evaluating if you operate embedding or reranking microservices where boot latency and container size matter. Skip it if you need generative decoding or a model architecture outside the supported list.
