Is t2v_metrics open source?

Yes — linzhiqiu/t2v_metrics is open source, released under the Apache-2.0 license.

What language is t2v_metrics written in?

linzhiqiu/t2v_metrics is primarily written in Python.

How popular is t2v_metrics?

linzhiqiu/t2v_metrics has 595 stars on GitHub.

Where can I find t2v_metrics?

linzhiqiu/t2v_metrics is on GitHub at https://github.com/linzhiqiu/t2v_metrics.

← all repositories

linzhiqiu/t2v_metrics

Google’s chosen replacement for CLIPScore is open source

t2v_metrics wraps VQAScore and related benchmarks into a single Python API for judging text-to-image, video, and 3D models, since CLIPScore fumbles compositional prompts.

★595 stars Python Image · Video · Audio LLMOps · Eval

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does

This library packages VQAScore, ITMScore, and legacy CLIPScore variants behind a single Python API to score generated images, videos, and 3D assets against their text prompts. It also ships with the GenAI-Bench and CameraBench benchmarks for standardized comparisons across models.

The interesting bit

Instead of embedding text and images into the same vector space like CLIP, VQAScore flips the problem: it treats alignment as a visual question-answering task, asking a vision-language model whether a prompt accurately describes a generated asset. This inversion apparently catches compositional details—like which person is “angry” and which is “happy”—that similarity-based metrics smear together, which is partly why Google DeepMind and NVIDIA have adopted the bundled GenAI-Bench benchmark.

Key highlights

Supports more than 20 model families for scoring, ranging from open weights such as Qwen2.5-VL and LLaVA-Video to proprietary APIs including GPT-4o and Gemini.
Bundles GenAI-Bench for compositional image generation and CameraBench for camera-motion understanding in video generators like Kling and Runway.
The underlying CLIP-FlanT5 checkpoints have been downloaded over 2 million times on Hugging Face, per the README.
Also exposes legacy metrics—CLIPScore, PickScore, HPSv2—so you can run the same head-to-head comparisons against prior art.

Caveats

The largest VQAScore models, such as clip-flant5-xxl, recommend 40 GB of VRAM; you’ll need to downsize to smaller variants if your GPU lacks that much memory.
Certain model backends carry conflicting dependencies, so the full model zoo may require isolated environments rather than one tidy venv.

Verdict

If you are training or benchmarking diffusion transformers, video models, or 3D generators, this belongs in your evaluation stack. If you just need a quick CLIP similarity check for a casual side project, it is probably overkill.

Frequently asked

What is linzhiqiu/t2v_metrics?: t2v_metrics wraps VQAScore and related benchmarks into a single Python API for judging text-to-image, video, and 3D models, since CLIPScore fumbles compositional prompts.
Is t2v_metrics open source?: Yes — linzhiqiu/t2v_metrics is open source, released under the Apache-2.0 license.
What language is t2v_metrics written in?: linzhiqiu/t2v_metrics is primarily written in Python.
How popular is t2v_metrics?: linzhiqiu/t2v_metrics has 595 stars on GitHub.
Where can I find t2v_metrics?: linzhiqiu/t2v_metrics is on GitHub at https://github.com/linzhiqiu/t2v_metrics.