Is TileRT open source?

Yes — tile-ai/TileRT is open source, released under the MIT license.

What language is TileRT written in?

tile-ai/TileRT is primarily written in Python.

How popular is TileRT?

tile-ai/TileRT has 1.6k stars on GitHub.

Where can I find TileRT?

tile-ai/TileRT is on GitHub at https://github.com/tile-ai/TileRT.

tile-ai/TileRT

A runtime that treats LLM inference like a tiling problem

TileRT squeezes millisecond-level latency out of hundred-billion-parameter models by decomposing operators into tile-level tasks and overlapping compute, I/O, and communication across 8 GPUs.

★ 1.6k stars · Python · Inference · Serving · Language Models

What it does

TileRT is an inference runtime optimized for ultra-low-latency scenarios — think high-frequency trading, interactive AI, and real-time coding assistants — rather than throughput-heavy batch processing. It targets millisecond-level time-per-output-token for models like DeepSeek-V3.2 and GLM-5 using a compiler-driven approach that breaks LLM operators into fine-grained tiles, then dynamically reschedules them across an 8-GPU node to keep hardware busy.

The interesting bit

The “tile-level runtime engine” is the core trick: instead of monolithic kernel launches, computation, I/O, and communication are decomposed and overlapped at tile granularity. The project claims this minimizes idle time without sacrificing model size or quality. It’s also already in production — GLM-5.1-highspeed on Z.ai runs on TileRT, so this isn’t purely researchware.
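
To make the overlap idea concrete, here is a toy timing model in Python. It is not TileRT's scheduler: the function names and the per-tile costs are made up for illustration, and the only assumption is that each tile passes through three stages (load, compute, communicate) that can run concurrently on different tiles.

```python
def monolithic_time(tiles):
    """No overlap: every stage of every tile runs back to back."""
    return sum(load + compute + comm for load, compute, comm in tiles)


def pipelined_time(tiles):
    """Stages overlap across tiles: while one tile computes, the next one loads."""
    load_done = compute_done = comm_done = 0.0
    for load, compute, comm in tiles:
        # The loader works through tiles one after another.
        load_done += load
        # Compute needs the tile's data and a free compute unit.
        compute_done = max(compute_done, load_done) + compute
        # Communication needs the tile's result and a free link.
        comm_done = max(comm_done, compute_done) + comm
    return comm_done


# Four identical tiles with hypothetical costs (load, compute, comm) in arbitrary units.
tiles = [(1.0, 2.0, 1.0)] * 4

print("monolithic:", monolithic_time(tiles))  # 16.0
print("pipelined: ", pipelined_time(tiles))   # 10.0
```

In this model the pipelined schedule is bounded by the slowest stage (compute) plus the time to fill and drain the pipeline, which is why finer tiles leave less idle time than one large launch per stage.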

Key highlights

Supports DeepSeek-V3.2 and GLM-5 (FP8), with Multi-Token Prediction (MTP) that reportedly hits ~590 tokens/s under synthetic workloads
Ships as a pre-built binary wheel with hard-pinned dependencies: Python 3.12, PyTorch 2.11.0+cu130, CUDA 13.2, 8× NVIDIA B200
Includes a weight converter that rewrites official Hugging Face checkpoints into per-device shards (*_dev_{0..7}) for direct runtime loading
Two independent backend libraries (libtilert_dsv32.so and libtilert_glm5.so); only one can be loaded per Python process
Underlying compiler techniques planned for open-source release through TileLang and TileScale
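
Because the wheel is pinned so tightly, a quick pre-install check can save a failed deployment. The sketch below compares the local environment against the Python, PyTorch, and GPU-count pins listed above. The helper names are hypothetical and not part of TileRT; the script does not verify the GPU model or the CUDA toolkit version, so treat it as a starting point only.

```python
import sys

# Pins taken from the highlights above.
EXPECTED_PYTHON = (3, 12)
EXPECTED_TORCH = "2.11.0+cu130"
EXPECTED_GPU_COUNT = 8


def find_mismatches(python_version, torch_version, gpu_count):
    """Return human-readable descriptions of every pin the environment misses."""
    problems = []
    if tuple(python_version[:2]) != EXPECTED_PYTHON:
        problems.append(f"Python {python_version[0]}.{python_version[1]}, expected 3.12")
    if torch_version != EXPECTED_TORCH:
        problems.append(f"PyTorch {torch_version}, expected {EXPECTED_TORCH}")
    if gpu_count != EXPECTED_GPU_COUNT:
        problems.append(f"{gpu_count} visible GPU(s), expected {EXPECTED_GPU_COUNT}")
    return problems


def probe_local_environment():
    """Collect the local values; PyTorch is optional so the script runs without it."""
    try:
        import torch
        return sys.version_info, str(torch.__version__), torch.cuda.device_count()
    except ImportError:
        return sys.version_info, None, 0


if __name__ == "__main__":
    for problem in find_mismatches(*probe_local_environment()):
        print("mismatch:", problem)
```

An empty output means the three checked pins match; it does not guarantee that the wheel will load.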

Caveats

The wheel is strictly ABI-locked; the README explicitly warns that other Python, CUDA, or PyTorch combinations are “untested and not guaranteed to work”
8× B200 is the only validated hardware target; portability to other GPU configurations is unclear
Only two models are currently supported, and they cannot run in the same process
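
One way to live with the one-backend-per-process limit is to give each model its own interpreter. The sketch below shows only the process isolation: the model keys and the worker program are hypothetical, and the worker merely echoes the library it was assigned rather than calling any TileRT API. Whether two such processes can share a single 8-GPU node is not addressed by the source.

```python
import subprocess
import sys

# Library names come from the highlights above; the model keys are made-up labels.
BACKENDS = {
    "deepseek-v3.2": "libtilert_dsv32.so",
    "glm-5": "libtilert_glm5.so",
}

# Placeholder child program. A real worker would import the runtime and load
# exactly one backend here; this one only reports its assignment.
WORKER_SOURCE = "import os, sys; print(f'pid={os.getpid()} backend={sys.argv[1]}')"


def run_in_own_process(model):
    """Start a fresh interpreter for one model so backends never share a process."""
    library = BACKENDS[model]
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE, library],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for model in BACKENDS:
        print(model, "->", run_in_own_process(model))
```

A long-running service would keep the workers alive and talk to them over a queue or socket instead of starting a new interpreter per call.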

Verdict

Worth watching if you run latency-sensitive LLM services at scale and have the B200 fleet to match. For researchers on commodity hardware or anyone wanting model flexibility, this is currently a very expensive, very narrow tool.
