Is TinyLlama open source?

Yes — jzhang38/TinyLlama is open source, released under the Apache-2.0 license.

What language is TinyLlama written in?

jzhang38/TinyLlama is primarily written in Python.

How popular is TinyLlama?

jzhang38/TinyLlama has 9k stars on GitHub.

Where can I find TinyLlama?

jzhang38/TinyLlama is on GitHub at https://github.com/jzhang38/TinyLlama.


jzhang38/TinyLlama

A 1.1B Llama trained on 3T tokens in 90 days

TinyLlama aims to pretrain a 1.1B Llama-compatible model on 3 trillion tokens in 90 days, producing a drop-in checkpoint small enough for edge devices and fast enough to serve as the draft model in speculative decoding.

★ 9k stars | Python | Language Models | Inference · Serving


What it does

TinyLlama replicates the Llama 2 architecture and tokenizer at 1.1B parameters, then pretrains it from scratch on roughly 3 trillion tokens drawn from SlimPajama and Starcoderdata. The team releases intermediate checkpoints at regular intervals, up to the final 3T model, and provides basic fine-tuning scripts and chat variants, though they caution these are not heavily tuned.

The interesting bit

The project treats efficient pretraining as a sport: by layering Flash Attention 2, fused kernels for layernorm and SwiGLU, and FSDP across 16 A100-40G GPUs, the team sustains 24k tokens per second per GPU and 56% model FLOPs utilization. At that rate, training on 300B tokens takes 3,456 GPU-hours, fewer than the published figures for Pythia-1.0B and MPT-1.3B. The authors also position the codebase as a reference for anyone pretraining sub-5B models without wrestling with Megatron-LM.
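The throughput figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes a constant 24k tokens per second per GPU for the whole run (real runs include restarts and evaluation pauses); the function names are illustrative and not part of the TinyLlama codebase. With the rounded 24k rate it gives roughly 3,472 GPU-hours for 300B tokens, close to the quoted 3,456 (which equals 16 GPUs running for 9 days), and about 90 days for 3T tokens on 16 GPUs.

```python
# Back-of-the-envelope check of the quoted throughput numbers.
TOKENS_PER_SEC_PER_GPU = 24_000  # quoted per-GPU throughput (rounded)
NUM_GPUS = 16                    # quoted cluster size (A100-40G)
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def gpu_hours(total_tokens: float,
              tokens_per_sec_per_gpu: float = TOKENS_PER_SEC_PER_GPU) -> float:
    """GPU-hours needed to process total_tokens at a fixed per-GPU rate."""
    return total_tokens / tokens_per_sec_per_gpu / SECONDS_PER_HOUR


def wall_clock_days(total_tokens: float,
                    num_gpus: int = NUM_GPUS,
                    tokens_per_sec_per_gpu: float = TOKENS_PER_SEC_PER_GPU) -> float:
    """Wall-clock days for the whole cluster, assuming perfect scaling."""
    cluster_rate = num_gpus * tokens_per_sec_per_gpu
    return total_tokens / cluster_rate / SECONDS_PER_DAY


print(f"300B tokens: {gpu_hours(300e9):,.0f} GPU-hours")           # 3,472
print(f"3T tokens on 16 GPUs: {wall_clock_days(3e12):.1f} days")   # 90.4
```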

Key highlights

Exact Llama 2 architecture and tokenizer, so it drops into llama.cpp, vLLM, and other Llama tooling without adapter layers.
4-bit quantized weights fit in 637 MB; inference hits 71.8 tok/sec on a Mac M2 via llama.cpp at batch size 1, or 7,094 tok/sec on an A40 via vLLM at batch size 100.
Training stack is built on lit-gpt and flash-attention, with fused rotary embeddings and cross-entropy loss to maximize throughput on 40GB GPUs.
The corpus blends natural language and code at a 7:3 ratio, and every major intermediate checkpoint is published with evaluation metrics.
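The 637 MB figure in the highlights can be put in context with a rough size estimate. This is a minimal sketch that assumes exactly 1.1e9 parameters and 1 MB = 10^6 bytes; the helper names are illustrative. The gap between the ideal 4-bit size and the quoted file size is plausibly due to quantization scales and tensors kept at higher precision, but that explanation is an assumption, not something the source states.

```python
# Rough weight-storage estimates for a 1.1B-parameter model.
PARAMS = 1.1e9           # quoted parameter count (rounded)
QUANTIZED_BYTES = 637e6  # quoted 4-bit file size, assuming 1 MB = 10**6 bytes


def ideal_size_mb(params: float, bits_per_weight: float) -> float:
    """Size in MB if every weight used exactly bits_per_weight bits."""
    return params * bits_per_weight / 8 / 1e6


def effective_bits_per_weight(size_bytes: float, params: float) -> float:
    """Average bits actually spent per weight for a file of size_bytes."""
    return size_bytes * 8 / params


print(f"fp16 weights:  {ideal_size_mb(PARAMS, 16):,.0f} MB")  # 2,200
print(f"ideal 4-bit:   {ideal_size_mb(PARAMS, 4):,.0f} MB")   # 550
print(f"effective bits per weight at 637 MB: "
      f"{effective_bits_per_weight(QUANTIZED_BYTES, PARAMS):.2f}")  # 4.63
```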

Caveats

The base model’s learning rate had not fully decayed by the 3T checkpoint, and the authors recommend the chat-tuned variants for now; an acknowledged bos_id bug also caused a sharp performance jump between the 2T and 2.5T checkpoints.
Fine-tuning scripts and rigorous downstream benchmarks are still on the TODO list, so the project remains a research reference rather than a turnkey product.
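To illustrate what "not fully decayed" means in the first caveat, here is a generic cosine learning-rate schedule. It is a sketch only: the step counts and learning rates below are hypothetical values chosen for illustration, and this is not presented as the schedule or code the TinyLlama authors used. It shows that a checkpoint taken before the schedule's final step still sits above the learning-rate floor.

```python
import math


def cosine_lr(step: int, total_steps: int, max_lr: float, min_lr: float,
              warmup_steps: int = 0) -> float:
    """Generic cosine decay from max_lr to min_lr with optional linear warmup."""
    if step < warmup_steps:
        # Linear ramp up to max_lr during warmup.
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical numbers: a schedule planned for 1,000 steps, checkpointed at 900.
MAX_LR, MIN_LR, PLANNED_STEPS = 4e-4, 4e-5, 1_000

lr_at_checkpoint = cosine_lr(900, PLANNED_STEPS, MAX_LR, MIN_LR)
lr_at_end = cosine_lr(PLANNED_STEPS, PLANNED_STEPS, MAX_LR, MIN_LR)
print(f"LR at step 900:  {lr_at_checkpoint:.2e}")  # still above the floor
print(f"LR at step 1000: {lr_at_end:.2e}")         # reaches min_lr
```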

Verdict

Reach for TinyLlama if you want a tiny, Llama-compatible model for edge deployment, speculative decoding, or a readable pretraining recipe. Look elsewhere if you need a polished, production-ready chat model with extensive benchmarks today.
