google-research/pix2seq
Google Research implementation of Pix2Seq, a TensorFlow 2 framework that converts RGB images into semantic sequences for object detection and other vision tasks.

Pix2Seq is a multi-task computer vision framework that reformulates tasks like object detection as sequence generation problems. The codebase supports both autoregressive decoding and diffusion-based generative models (including FitTransformer, Bit Diffusion, and RIN). Built in TensorFlow 2 with efficient TPU/GPU support, it provides pretrained checkpoints for tasks like object detection on datasets like Objects365 and provides Colab notebooks for inference.