lucidrains/pi-zero-pytorch
A PyTorch implementation of π₀, the robotic foundation model from Physical Intelligence that combines flow matching with vision-language model components for robot action prediction.

This repository implements the π₀ architecture proposed by Physical Intelligence. The architecture can be summarized as a simplified Transfusion model with influences from Stable Diffusion 3: it uses flow matching instead of diffusion for policy generation and adopts joint attention from MMDiT. The model takes vision inputs, language commands, and joint state as input and outputs robot actions, building on a pretrained PaliGemma 2B vision-language model backbone. It uses Flex Attention to mix autoregressive and bidirectional attention patterns across different token types.
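The two mechanisms named above, flow matching and mixed attention, can be illustrated with a small plain-Python sketch. This is not the repository's code: the repository works with PyTorch tensors and Flex Attention, and every function name here (`flow_matching_pair`, `euler_sample`, `mixed_attention_mask`) is hypothetical. The sketch assumes a straight-line path from noise to action for flow matching, and assumes that prefix tokens (vision and language) attend causally while action tokens attend bidirectionally; the actual assignment of patterns to token types may differ.

```python
def flow_matching_pair(action, noise, t):
    """Training target for flow matching on a straight noise-to-action path.

    Returns the interpolated point x_t and the velocity the network
    would be trained to predict at that point (assumed linear path).
    """
    x_t = [(1 - t) * n + t * a for a, n in zip(action, noise)]
    velocity = [a - n for a, n in zip(action, noise)]
    return x_t, velocity


def euler_sample(velocity_fn, noise, steps=10):
    """Generate an action by integrating the velocity field from t=0 to t=1.

    velocity_fn(x, t) stands in for the trained model (hypothetical).
    """
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def mixed_attention_mask(prefix_len, action_len):
    """Boolean mask[q][k]: True if query token q may attend to key token k.

    Assumption for illustration: the first `prefix_len` tokens are causal
    (autoregressive) and cannot see action tokens; the remaining
    `action_len` action tokens see the whole prefix and each other.
    """
    total = prefix_len + action_len
    mask = []
    for q in range(total):
        if q < prefix_len:
            row = [k <= q for k in range(total)]  # causal within the prefix
        else:
            row = [True] * total  # bidirectional over actions, full prefix
        mask.append(row)
    return mask
```

With a real model, `velocity_fn` would be the network's forward pass conditioned on the vision, language, and joint-state tokens, and the mask would typically be expressed as a mask function passed to Flex Attention rather than materialized as a dense matrix.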