When LLMs get root: a pen-testing framework that runs on curiosity
A Python framework that lets security researchers spin up autonomous LLM hacking agents in ~50 lines of code.

What it does
HackingBuddyGPT is a Python framework for building LLM-driven penetration-testing agents. It handles the plumbing (SSH or local shell connections, LLM API calls, logging to SQLite, round limits) so researchers can focus on writing attack logic. The flagship use case tasks an LLM with escalating from a low-privilege user to root on a Linux system, running commands autonomously until it succeeds or runs out of rounds.
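To make that plumbing concrete, here is a minimal sketch of a round-limited agent loop that logs each step to SQLite. It is not HackingBuddyGPT's actual API: `run_agent`, `scripted_llm`, `fake_shell`, and `got_root` are hypothetical names invented for illustration, the "LLM" is a fixed script, and the "shell" returns canned strings, so nothing is executed on a real system.

```python
import sqlite3


def run_agent(next_command, execute, is_root, max_rounds=10, db_path=":memory:"):
    """Ask for a command, run it, log it; stop on success or at the round limit.

    All three callables are hypothetical stand-ins, not framework APIs.
    """
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS rounds (round_no INTEGER, command TEXT, output TEXT)"
    )
    history = []  # (command, output) pairs handed back to the model each round
    success = False
    for round_no in range(1, max_rounds + 1):
        command = next_command(history)
        output = execute(command)
        history.append((command, output))
        db.execute("INSERT INTO rounds VALUES (?, ?, ?)", (round_no, command, output))
        db.commit()
        if is_root(output):
            success = True
            break
    rows = db.execute(
        "SELECT round_no, command, output FROM rounds ORDER BY round_no"
    ).fetchall()
    db.close()
    return success, rows


def scripted_llm(history):
    # Stand-in for an LLM call: replays a fixed script, one command per round.
    script = ["id", "sudo -l", "sudo id"]
    return script[min(len(history), len(script) - 1)]


def fake_shell(command):
    # Stand-in for SSH or local execution: returns canned output, runs nothing.
    canned = {
        "id": "uid=1000(lowpriv)",
        "sudo -l": "(ALL) NOPASSWD: ALL",
        "sudo id": "uid=0(root)",
    }
    return canned.get(command, "")


def got_root(output):
    return "uid=0(root)" in output


if __name__ == "__main__":
    success, rows = run_agent(scripted_llm, fake_shell, got_root, max_rounds=5)
    print("root obtained:", success)
    for row in rows:
        print(row)
```

Passing the model, the executor, and the success check in as plain callables keeps the loop itself trivial, which is roughly the separation the project advertises: the framework owns the loop and the logging, the researcher supplies the attack logic.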
The interesting bit
The “50 lines of code” pitch is the hook, but the real value is the benchmark infrastructure. The team maintains reusable Linux privilege-escalation benchmarks and publishes open-access papers comparing LLM performance, turning what could be a toy into a reproducible research platform. It also won a spot in GitHub Accelerator 2024.
Key highlights
- Minimal agent skeleton is genuinely short; the README shows a working Linux priv-esc agent in a single Python class
- Supports both remote SSH targets and local shell execution (with appropriate warnings about running untrusted LLM-generated commands on your own machine)
- Includes extended variants with RAG and chain-of-thought for more sophisticated experiments
- Web pentest and web API testing agents exist but are marked pre-alpha/WIP
- Active academic backing: two published papers, conference presentations at ESEC/FSE and ESSAI
Caveats
- Web and web API use cases are in “heavy development and pre-alpha stage” per the README
- The framework executes live commands on real systems; the authors explicitly warn about data loss and system modification risks
Verdict
Worth a look for security researchers or red-teamers experimenting with LLM autonomy, especially if you need reproducible benchmarks. Skip it if you want polished, production-ready web testing tools—those aren’t here yet.