thu-ml/RoboticsDiffusionTransformer
A 1-billion-parameter diffusion transformer, trained on 1M+ robot episodes, that predicts bimanual manipulation actions from language instructions and RGB images.

RDT-1B is a foundation model for robot manipulation that uses a diffusion transformer architecture to generate sequences of robot actions. Given a natural language instruction and multi-view RGB observations, the model predicts the upcoming actions for a dual-arm robotic system. It is pre-trained on over 1 million multi-robot episodes and can be fine-tuned for specific bimanual tasks. The implementation includes PyTorch model code, DeepSpeed-based training scripts, pre-trained checkpoints on HuggingFace, and real-robot deployment examples.
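
To make the idea of generating an action sequence by diffusion more concrete, here is a minimal, dependency-free sketch of a generic DDPM-style reverse sampling loop over an action chunk. It is not the repository's code or its actual sampler: the function names (`sample_action_chunk`, `make_oracle_denoiser`, `linear_betas`), the linear noise schedule, and the step count are all illustrative assumptions. The stand-in "oracle" denoiser also ignores language and image conditioning, which a real model would take as input.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (a common DDPM default; an assumption here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def sample_action_chunk(denoiser, horizon, action_dim, num_steps=50, seed=0):
    """Generic DDPM ancestral sampling of a (horizon x action_dim) action chunk.

    `denoiser(x, t, alpha_bar)` is a hypothetical callable that returns the
    predicted noise for the noisy chunk `x` at step `t`. In a real model it
    would be a network conditioned on the instruction and camera images.
    """
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]

    # Cumulative products of alphas (alpha_bar_t in DDPM notation).
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from pure Gaussian noise.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]

    for t in reversed(range(num_steps)):
        eps = denoiser(x, t, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # No noise is added on the final step (t == 0).
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [
            [
                (x[i][j] - coef * eps[i][j]) / math.sqrt(alphas[t])
                + sigma * rng.gauss(0.0, 1.0)
                for j in range(action_dim)
            ]
            for i in range(horizon)
        ]
    return x


def make_oracle_denoiser(target):
    """Toy stand-in for a trained network: returns the exact noise that
    would map `target` to the current noisy chunk. Illustration only."""
    def denoiser(x, t, alpha_bar):
        scale = math.sqrt(alpha_bar)
        denom = math.sqrt(1.0 - alpha_bar)
        return [
            [(x[i][j] - scale * target[i][j]) / denom for j in range(len(x[i]))]
            for i in range(len(x))
        ]
    return denoiser


# Usage: denoise random noise into a 4-step chunk of 3-dimensional actions.
target = [[0.1 * i + 0.01 * j for j in range(3)] for i in range(4)]
chunk = sample_action_chunk(make_oracle_denoiser(target), horizon=4, action_dim=3, num_steps=20)
```

With the oracle denoiser, the loop recovers `target` up to floating-point error, which is a convenient sanity check for the update rule. A learned denoiser would instead produce actions consistent with its training data and conditioning inputs.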