Is LLM4Decompile open source?

Yes — albertan017/LLM4Decompile is open source, released under the MIT license.

What language is LLM4Decompile written in?

albertan017/LLM4Decompile is primarily written in Python.

How popular is LLM4Decompile?

albertan017/LLM4Decompile has 6.8k stars on GitHub.

Where can I find LLM4Decompile?

albertan017/LLM4Decompile is on GitHub at https://github.com/albertan017/LLM4Decompile.

albertan017/LLM4Decompile

Teaching LLMs to reverse-engineer x86_64 binaries

An open-source project that fine-tunes large language models to decompile Linux x86_64 binaries into C code, validating the results by checking whether they re-execute and pass their original tests.

★ 6.8k stars · Python · Language Models · Domain Apps

What it does

LLM4Decompile is a family of open-source language models trained to reverse-engineer compiled code back into C. It takes disassembled Linux x86_64 binaries, either as cleaned objdump output or as Ghidra pseudo-code, and generates human-readable source. Quality is measured by re-executability: the decompiled code must compile and pass its original test assertions.
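To make "cleaned objdump output" concrete, here is a minimal sketch of the kind of cleanup involved: isolate one function from an `objdump -d` listing and drop the addresses, raw bytes, and trailing comments so only the instructions remain. The helper name `clean_objdump`, the sample listing, and the exact cleaning rules are assumptions for illustration; the project's own preprocessing may differ in detail.

```python
import re


def clean_objdump(listing: str, func_name: str) -> str:
    """Extract one function from `objdump -d` text, dropping addresses and raw bytes."""
    header = f"<{func_name}>:"
    if header not in listing:
        raise ValueError(f"function {func_name!r} not found in listing")
    # objdump separates functions with a blank line, so cut at the first one.
    body = listing.split(header, 1)[1].split("\n\n", 1)[0]
    cleaned = [header]
    for line in body.splitlines():
        # Instruction lines have the shape "address:<TAB>raw bytes<TAB>instruction".
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # blank lines and byte-only continuation lines carry no instruction
        instruction = "\t".join(fields[2:])
        instruction = instruction.split("#", 1)[0]  # drop objdump's trailing comment
        instruction = re.sub(r"\s+", " ", instruction).strip()
        if instruction:
            cleaned.append(instruction)
    return "\n".join(cleaned)


# Hypothetical listing, shaped like objdump's AT&T-syntax output.
SAMPLE = (
    "0000000000001129 <func0>:\n"
    "    1129:\tf3 0f 1e fa          \tendbr64\n"
    "    112d:\t55                   \tpush   %rbp\n"
    "    112e:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "    1131:\t8b 05 d9 2e 00 00    \tmov    0x2ed9(%rip),%eax        # 4010 <counter>\n"
    "    1137:\t5d                   \tpop    %rbp\n"
    "    1138:\tc3                   \tret\n"
    "\n"
    "0000000000001139 <main>:\n"
    "    1139:\t55                   \tpush   %rbp\n"
)

print(clean_objdump(SAMPLE, "func0"))
```

Stripping addresses and bytes keeps the model's input short and removes details that change from build to build without affecting what the function does.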

The interesting bit

The team treats decompilation as a neural translation problem rather than classical static analysis, building custom training corpora of two million binary-source pairs and a benchmark that scores outputs by whether they actually run. They also provide a full replication pipeline that trains in roughly 3.5 hours on a single A100 40G GPU for under $20.

Key highlights

Two model families: LLM4Decompile-End translates assembly directly into C, while LLM4Decompile-Ref polishes Ghidra’s pseudo-code.
Parameter counts span 1.3B to 22B; the llm4decompile-9b-v2 model currently leads with a 64.9% re-executability rate on the Decompile benchmark.
The released decompile-bench dataset contains two million binary-source function pairs for training and 70K for evaluation.
A quick-replication training script reaches a re-executability rate of 0.26 (26%) in about 3.5 hours on one A100 40G, costing less than $20.
The newer SK²Decompile pipeline splits work into a structure-recovery “skeleton” phase and an identifier-naming “skin” phase.
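The re-executability figures quoted above are pass rates: the share of decompiled functions that compile and then pass their original tests. The sketch below shows one way such a rate could be computed. It is not the project's evaluation harness. The function names, the convention that the test harness is a C `main` full of `assert` calls, and the plain `gcc` invocation are all assumptions made for illustration.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Tuple


def compiles_and_passes(c_source: str, test_harness: str, timeout: float = 10.0) -> bool:
    """Return True if source plus harness builds with gcc and the binary exits with status 0.

    Assumption: the harness is a C `main` whose failed asserts abort the process,
    so a non-zero exit status means a test failed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        exe = Path(tmp) / "candidate"
        src.write_text(c_source + "\n" + test_harness)
        try:
            build = subprocess.run(
                ["gcc", str(src), "-o", str(exe), "-lm"],
                capture_output=True,
                timeout=timeout,
            )
            if build.returncode != 0:
                return False  # does not even compile
            run = subprocess.run([str(exe)], capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            return False  # gcc missing, binary not runnable, or a hang
        return run.returncode == 0


def re_executability_rate(
    samples: Iterable[Tuple[str, str]],
    check: Callable[[str, str], bool] = compiles_and_passes,
) -> float:
    """Fraction of (decompiled_source, test_harness) pairs that pass `check`."""
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to score")
    passed = sum(1 for source, harness in samples if check(source, harness))
    return passed / len(samples)
```

Passing the checker in as a parameter keeps the scoring logic separate from the compiler, so the rate calculation can be exercised with a stub on machines without `gcc`. Running untrusted model output this way should be done in a sandbox.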

Caveats

Support is currently limited to Linux x86_64 binaries compiled with GCC at optimization levels O0–O3; other architectures and compilers remain on the roadmap.
Performance varies sharply by model size and training budget—the cheap replication model scores around 26% re-executability, well below the flagship 9B model.

Verdict

Reverse engineers who want LLM-assisted cleanup of Ghidra output or a head start on reconstructing C from assembly should try it. If you need broad architecture support or guaranteed correctness, wait for the next revision.
