sail-sg/Adan

An optimizer that dares you to 10× the learning rate

Adan swaps Adam/AdamW for an adaptive Nesterov momentum rule that often tolerates learning rates five to ten times higher.

★ 819 stars | Python | ML Frameworks | Inference · Serving

What it does

Adan is a first-order optimizer implemented in PyTorch that aims to speed up deep model training by combining adaptive gradient scaling with Nesterov momentum. It is designed as a near drop-in replacement for Adam and AdamW, and the repository collects reproduction recipes for vision transformers, ResNets, BERT, GPT-2, and mixture-of-experts language models. The authors also ship a fused CUDA kernel variant that trims memory overhead.

The interesting bit

Most optimizers diverge if you raise the peak learning rate too aggressively. Adan seems to invite it, often running stably at rates five to ten times higher than Adam/AdamW. That property, plus a third moving average that tracks the difference between consecutive gradients alongside the usual first- and second-moment estimates, appears to be why it can match AdamW's results on GPT-2 345M in roughly half the training steps.
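
A scalar sketch of the update rule may make this more concrete. It follows the basic algorithm as described in the Adan paper, uses the paper's beta convention (small values), and leaves out the bias correction and gradient clipping found in the repository's implementation, so it should not be read as the library's code. The function name `adan_minimize` and its parameters are hypothetical, chosen only for this illustration.

```python
import math


def adan_minimize(grad_fn, theta, steps, lr=0.1, betas=(0.02, 0.08, 0.01),
                  weight_decay=0.0, eps=1e-8):
    """Run a simplified scalar Adan loop and return the final parameter.

    Assumptions: paper-style betas, no bias correction, no gradient clipping.
    """
    b1, b2, b3 = betas
    m = 0.0  # moving average of gradients
    v = 0.0  # moving average of gradient differences
    n = 0.0  # moving average of the squared Nesterov-style gradient
    prev_grad = None
    for _ in range(steps):
        g = grad_fn(theta)
        # The first step has no previous gradient, so the difference is zero.
        diff = 0.0 if prev_grad is None else g - prev_grad
        m = (1.0 - b1) * m + b1 * g
        v = (1.0 - b2) * v + b2 * diff
        u = g + (1.0 - b2) * diff
        n = (1.0 - b3) * n + b3 * u * u
        step = lr / (math.sqrt(n) + eps)
        # Proximal weight decay: take the step, then shrink the result.
        theta = (theta - step * (m + (1.0 - b2) * v)) / (1.0 + lr * weight_decay)
        prev_grad = g
    return theta


# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3).
x = adan_minimize(lambda t: 2.0 * (t - 3.0), theta=0.0, steps=20)
```

Because the averages start at zero and this sketch skips bias correction, the first few steps are small and grow as the averages warm up.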

Key highlights

Broad framework adoption: already integrated into NVIDIA NeMo, Hugging Face timm, Meta’s D-Adaptation, OpenMMLab, DeepMind Optax, and Baidu Paddle, and used as the default optimizer in DreamFusion and MDT V2.
Published in IEEE TPAMI (2024) with pre-training results on MoE models (up to 8×0.6B parameters) and GPT-2 345M showing lower perplexity or comparable pass@1 with 50% fewer steps.
Ships with FusedAdan for reduced memory footprint, and the authors note that distributed sharding via ZeroRedundancyOptimizer brings per-GPU memory in line with AdamW.
Exposes a no_prox flag so you can choose between the paper’s original weight-decay update rule and an AdamW-style decoupled decay.
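
The two weight-decay styles behind the no_prox flag can be sketched on a single scalar parameter. This is an illustration only: the helper `weight_decay_step` is hypothetical, and it assumes that no_prox=True selects the AdamW-style decoupled decay while no_prox=False selects the paper's proximal form. Check the repository before relying on that mapping.

```python
def weight_decay_step(param, update, lr, weight_decay, no_prox):
    """Apply one already-scaled update with one of two weight-decay styles.

    `update` stands for the learning-rate-scaled step the optimizer would
    subtract from the parameter. Hypothetical helper, for illustration only.
    """
    if no_prox:
        # Decoupled (AdamW-style): shrink the parameter, then take the step.
        return param * (1.0 - lr * weight_decay) - update
    # Proximal (paper-style): take the step, then divide by (1 + lr * wd).
    return (param - update) / (1.0 + lr * weight_decay)


decoupled = weight_decay_step(1.0, 0.1, lr=0.1, weight_decay=0.5, no_prox=True)
proximal = weight_decay_step(1.0, 0.1, lr=0.1, weight_decay=0.5, no_prox=False)
```

The two forms coincide when weight_decay is zero and differ only slightly when lr * weight_decay is small, which is why the choice mostly matters for large learning rates or strong decay.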

Caveats

On a single GPU, Adan stores slightly more optimizer state than Adam/AdamW, so you may need the fused implementation or distributed sharding to fit the same batch size.
The code’s betas argument expects (1 − β) for each β in the paper, a quirk the README warns about but which is easy to miss when porting hyper-parameters from the paper.
The vision-model results table in the README is truncated, so the full ImageNet sweep is not visible in the rendered documentation.
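
For the betas caveat above, a small conversion helper can reduce porting mistakes. This is a minimal sketch: `paper_to_code_betas` is a hypothetical name, not part of the Adan package, and the values in the example are illustrative.

```python
def paper_to_code_betas(paper_betas):
    """Convert paper-style betas to the (1 - beta) form the code expects.

    Hypothetical helper; it only applies the (1 - beta) mapping that the
    README describes and checks that each value lies in [0, 1].
    """
    converted = []
    for beta in paper_betas:
        if not 0.0 <= beta <= 1.0:
            raise ValueError(f"beta must be in [0, 1], got {beta}")
        converted.append(1.0 - beta)
    return tuple(converted)


# Illustrative paper-style values; the result is approximately (0.98, 0.92, 0.99).
code_betas = paper_to_code_betas((0.02, 0.08, 0.01))
```

The resulting tuple is what you would pass as the optimizer's betas argument.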

Verdict

Try it if you are training vision transformers, LLMs, or diffusion models from scratch and want a widely integrated alternative to AdamW. Skip it if you are already memory-bound on a single GPU and cannot use the fused kernel or distributed optimizers.

Frequently asked

What is sail-sg/Adan? Adan swaps Adam/AdamW for an adaptive Nesterov momentum rule that often tolerates learning rates five to ten times higher.
Is Adan open source? Yes, sail-sg/Adan is open source, released under the Apache-2.0 license.
What language is Adan written in? sail-sg/Adan is primarily written in Python.
How popular is Adan? sail-sg/Adan has 819 stars on GitHub.
Where can I find Adan? sail-sg/Adan is on GitHub at https://github.com/sail-sg/Adan.