Haiyang-W/TokenFormer
A transformer architecture that treats model parameters as tokens, applying attention both among input tokens and between input tokens and learned parameter tokens so that models can be scaled more flexibly.

Velocity (7d): +1.0 ★/day · Trend: steady
TokenFormer replaces the fixed linear projection layers of a transformer with an attention mechanism that treats model parameters as learnable tokens: input tokens act as queries, and the parameter tokens supply the keys and values. Because both token-to-token and token-to-parameter interactions go through attention, the same architecture can scale computation and parameter count within one unified framework. The implementation supports training from scratch and provides pre-trained checkpoints for foundation model applications.
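To make the idea of attending over parameter tokens concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the repository's code: the function and variable names (`token_parameter_attention`, `key_params`, `value_params`) are hypothetical, and it uses a standard scaled softmax, whereas the project's actual normalization may differ.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_parameter_attention(inputs, key_params, value_params):
    """Attend from input tokens (queries) to parameter tokens (keys/values).

    inputs:       list of T vectors, each of length d_in
    key_params:   list of n vectors, each of length d_in   (hypothetical name)
    value_params: list of n vectors, each of length d_out  (hypothetical name)
    Returns a list of T vectors, each of length d_out.
    """
    d_in = len(inputs[0])
    d_out = len(value_params[0])
    outputs = []
    for x in inputs:
        # Score the input token against every parameter key token.
        scores = [
            sum(a * b for a, b in zip(x, k)) / math.sqrt(d_in)
            for k in key_params
        ]
        weights = softmax(scores)  # assumption: plain softmax normalization
        # A weighted sum of value tokens stands in for a linear projection.
        out = [
            sum(w * v[j] for w, v in zip(weights, value_params))
            for j in range(d_out)
        ]
        outputs.append(out)
    return outputs


def random_tokens(count, dim, rng):
    """Build `count` random vectors of length `dim` (stand-in for learned values)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
inputs = random_tokens(3, 4, rng)   # 3 input tokens, d_in = 4
keys = random_tokens(8, 4, rng)     # 8 parameter tokens
values = random_tokens(8, 6, rng)   # d_out = 6
small = token_parameter_attention(inputs, keys, values)

# Growing the parameter set: append more key/value tokens.
# The input and output dimensions are unchanged.
keys_big = keys + random_tokens(4, 4, rng)
values_big = values + random_tokens(4, 6, rng)
large = token_parameter_attention(inputs, keys_big, values_big)
```

In this sketch the number of parameter tokens is independent of the input and output widths, so the layer can be enlarged by appending key and value tokens without changing the shapes seen by the rest of the network.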