Is Chat-UniVi open source?

Yes — PKU-YuanGroup/Chat-UniVi is open source, released under the Apache-2.0 license.

What language is Chat-UniVi written in?

PKU-YuanGroup/Chat-UniVi is primarily written in Python.

How popular is Chat-UniVi?

PKU-YuanGroup/Chat-UniVi has 943 stars on GitHub.

Where can I find Chat-UniVi?

PKU-YuanGroup/Chat-UniVi is on GitHub at https://github.com/PKU-YuanGroup/Chat-UniVi.

← all repositories

PKU-YuanGroup/Chat-UniVi

One visual vocabulary for images and video

It builds a single dynamic tokenizer so one Vicuna-based model can reason over both still frames and motion without modality-specific plumbing.

★943 stars Python Language Models Image · Video · Audio

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does

Chat-UniVi is a multimodal LLM built on Vicuna that reads both images and video through one shared visual encoder. It represents either modality as a set of dynamic visual tokens—fewer tokens than typical approaches, but supposedly sufficient to capture spatial detail in photos and temporal relationships in clips. The 7B and 13B checkpoints are trained jointly on mixed image-and-video data, so the same weights handle both tasks without architectural switching.

The interesting bit

The project’s real pitch is efficiency through unification: the authors claim the 13B model trains from scratch in full parameters on eight A100 GPUs in about three days. In a field where multimodal training often requires warehouse-scale compute, that is a modest hardware bill—assuming the number holds up.

Key highlights

Single model processes both images and video; no separate pipelines or inference-time switches.
Dynamic visual tokens compress both spatial image detail and temporal video context into a lean sequence.
13B variant trains in full parameters on 8× A100 GPUs in roughly 3 days, per the authors.
Accepted as a CVPR 2024 Highlight (top 3% of 11,532 submissions).
Handles variable-length videos without zero-padding.

Caveats

The authors revised their video temporal evaluation score downward in April 2024, from 57.8 to 47.9, after an earlier reporting error.
Chinese conversation support is currently bolted on via a translation API, not native multilingual training.

Verdict

A sensible choice if you want one open-weight model for mixed image-and-video QA without maintaining two separate stacks. Look elsewhere if you need a fully native multilingual model or if the revised video benchmarks fall below your threshold.

Frequently asked

What is PKU-YuanGroup/Chat-UniVi?: It builds a single dynamic tokenizer so one Vicuna-based model can reason over both still frames and motion without modality-specific plumbing.
Is Chat-UniVi open source?: Yes — PKU-YuanGroup/Chat-UniVi is open source, released under the Apache-2.0 license.
What language is Chat-UniVi written in?: PKU-YuanGroup/Chat-UniVi is primarily written in Python.
How popular is Chat-UniVi?: PKU-YuanGroup/Chat-UniVi has 943 stars on GitHub.
Where can I find Chat-UniVi?: PKU-YuanGroup/Chat-UniVi is on GitHub at https://github.com/PKU-YuanGroup/Chat-UniVi.