Is tevatron open source?

Yes — texttron/tevatron is open source, released under the Apache-2.0 license.

What language is tevatron written in?

texttron/tevatron is primarily written in Python.

How popular is tevatron?

texttron/tevatron has 742 stars on GitHub.

Where can I find tevatron?

texttron/tevatron is on GitHub at https://github.com/texttron/tevatron.

← all repositories

texttron/tevatron

One retrieval toolkit, two frameworks, zero patience for glue code

Tevatron trains billion-scale neural retrievers on both PyTorch/GPU and JAX/TPU without making you rewrite your pipeline.

★742 stars Python RAG · Search ML Frameworks Inference · Serving

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does Tevatron is a training and inference toolkit for dense document retrieval. It fine-tunes LLMs (via LoRA) into embedding models, encodes queries and passages into vectors, then runs similarity search. The same workflow runs on PyTorch with DeepSpeed/FlashAttention/vLLM, or on JAX with TPU/GPU support.

The interesting bit The project ships self-contained HuggingFace datasets and directly loads SoTA pretrained embedders like BGE and E5. The JAX path is genuinely separate—not a wrapper—using custom tevax modules with chunked gradient caching (grad_cache, passage_num_chunks) to fit large models on TPU. The PyTorch path is more conventional but plugs into the familiar ecosystem.

Key highlights

LoRA fine-tuning of 7B+ retrievers with a specific target module list (q/k/v/o projections plus MLP gates)
Dual-stack: PyTorch (DeepSpeed ZeRO-3, bf16, gradient checkpointing) or JAX (mesh parallelism, TPU-optimized)
Self-contained HF datasets for multilingual and multi-modal retrieval; image fields optional in JSONL schema
Direct encoding→retrieval pipeline outputs standard query_id passage_id score ranking files
Example claims MRR@10=42.3 on MS MARCO with “straightforward training” of Mistral-7B

Caveats

v1 features not yet migrated to v2; README warns users to pull the v1 branch if they need legacy functionality
JAX GPU setup requires NVIDIA’s jax-toolbox container or careful manual dependency resolution (magix, GradCache from separate repos)
Training times are substantial: ~70 hours on 4×A6000 (PyTorch) or ~35 hours on v4-8 TPU (JAX) for the MS MARCO example

Verdict Worth a look if you’re doing research-scale dense retrieval and want to switch between GPU and TPU without rebuilding your pipeline. Skip if you need stable production APIs or are allergic to pip install -e . from multiple git repos.

Frequently asked

What is texttron/tevatron?: Tevatron trains billion-scale neural retrievers on both PyTorch/GPU and JAX/TPU without making you rewrite your pipeline.
Is tevatron open source?: Yes — texttron/tevatron is open source, released under the Apache-2.0 license.
What language is tevatron written in?: texttron/tevatron is primarily written in Python.
How popular is tevatron?: texttron/tevatron has 742 stars on GitHub.
Where can I find tevatron?: texttron/tevatron is on GitHub at https://github.com/texttron/tevatron.