liruilong940607/prope
A positional encoding method for multi-view vision transformers that encodes 3D camera relationships into image tokens via relative projective transformation.

PRoPE is a camera-conditioning technique for multi-view vision transformers. It uses the relative projective transformation between cameras to encode the 3D spatial relationships among image tokens from different views. The work covers absolute positional encodings (raymaps), relative pose encodings, and the PRoPE method, which binds camera parameters to transformer inputs. Standalone JAX and PyTorch modules are provided for integration into existing transformer architectures.
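
To make "relative projective transformation" concrete, here is a minimal NumPy sketch, not the repository's API. It assumes each camera is described by 3x3 intrinsics `K` and a 4x4 world-to-camera pose, lifts them into a single 4x4 projective matrix, and forms the relative transform between two cameras. The function names (`lifted_projection`, `relative_projective`, `rigid`) and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def lifted_projection(K, world_to_cam):
    """Lift 3x3 intrinsics to 4x4 and compose with a 4x4 world-to-camera pose."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4 @ world_to_cam

def relative_projective(P_i, P_j):
    """Relative projective transform from camera j's frame to camera i's frame."""
    return P_i @ np.linalg.inv(P_j)

def rigid(yaw, t):
    """Hypothetical helper: 4x4 rigid transform (rotation about z, then translation)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

# Illustrative intrinsics and poses (assumed values, not from the repository).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_a = rigid(0.1, [0.0, 0.0, 2.0])
T_b = rigid(-0.3, [0.5, 0.0, 2.5])

# Re-express both cameras in a different, arbitrary world frame.
G = rigid(1.2, [3.0, -1.0, 0.7])

rel = relative_projective(lifted_projection(K, T_a), lifted_projection(K, T_b))
rel_moved = relative_projective(lifted_projection(K, T_a @ G),
                                lifted_projection(K, T_b @ G))

# The relative transform does not depend on the choice of world frame.
frame_invariant = np.allclose(rel, rel_moved)
```

The point of the sketch is the last check: because the world-frame change `G` cancels in `P_i @ inv(P_j)`, a relative encoding built from this quantity depends only on how the cameras relate to each other, whereas an absolute encoding such as a raymap changes when the world frame changes.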