The Forty-Line Jailbreak: GPT-5.5 Meets a Config File Override

Contributing Editor

A Python script exploits Codex CLI’s official instruction hook to inject unrestricted-mode directives, exposing the brittle boundary between user customization and safety guardrails in agentic coding tools.

yynxxxxx/Codex-5.5-codex-instruct-5.5

★1k stars

View on GitHub ↗

The Hype and the Hook

OpenAI released GPT-5.5 on April 23, 2026, billing it as the company’s smartest and most intuitive model yet and a deliberate step toward agentic computer use [1]. The announcement emphasized benchmark dominance—82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 and GPT-5.4—but spent just as much ink on safety. OpenAI called these the “strongest safeguards to date,” citing full preparedness-framework evaluations, red-teaming, targeted testing for advanced cybersecurity and biology capabilities, and feedback from nearly two hundred early-access partners [1]. The model is designed to handle messy, multi-part tasks—planning, using tools, checking its work, and navigating ambiguity—without step-by-step management [1]. Third-party documentation describes the Codex integration as an agentic coding system that unifies terminal CLI, IDE extensions, ChatGPT cloud delegation, and computer-use via screen reading under one model and account context, with roughly four million weekly active developers [10]. It was, by the vendor’s own framing, a fortress.

Then the config file arrived.

Within days of the release, the repository yynxxxxx/Codex-5.5-codex-instruct-5.5 appeared, offering what its author calls a “one-click unrestricted mode injection” for GPT-5.5 inside Codex CLI. The tool is almost insultingly simple: a short Python script writes a roughly forty-line markdown file to the user’s home directory and points Codex’s model_instructions_file configuration entry at it. The payload declares an explicit unrestricted developer mode, disables all filters, and demands the model refuse no task. No binary patching, no network interception, no process tampering. Just a text file and a TOML pointer. The README practically smirks about this: it “exploits official config mechanism” and involves no MITM or process tampering.

From CTF Sandbox to Brute-Force Prompt

The repository is not the first of its kind. A previous 5.4 version used what the author describes as a “CTF sandbox indirect bypass”—a more elaborate, competition-style circumvention that took roughly a hundred and eighty lines of instructions. The 5.5 iteration abandons subtlety. It opts for a “more direct jailbreak strategy”: explicit declarations, filter shutdowns, and eight core directives covering security research, penetration testing, reverse engineering, and NSFW fictional content. The entire payload is about forty lines. Where the old approach played games with the model’s reasoning context, the new one simply shouts over the system prompt.

This shift from sandbox trickery to frontal assault is telling. It suggests the model’s safety instructions are still vulnerable to the oldest prompt-injection technique in the book: a user-supplied file claiming higher authority than the vendor’s own system instructions. The Codex CLI supports native multi-step tool use across thousands of sequential calls, runs in isolated sandboxed containers for cloud tasks, and handles multi-hour autonomous sessions [10]. IDE extensions for VS Code and JetBrains share state with cloud surfaces, allowing tasks to move between interfaces [10]. That power makes the instruction hierarchy a critical security boundary. If a local markdown file can override it, the boundary is drawn in sand.

The Safety Paradox of Userland Customization

The real story here is not the Python script, which is essentially glue code—a file writer and a config editor. The real story is the design tension it exposes. OpenAI invested heavily in GPT-5.5’s safety posture: red teams, external partners, preparedness frameworks, and advanced cybersecurity testing [1]. Yet the bypass lives entirely in userland, leveraging a feature meant for legitimate customization. The model_instructions_file hook exists so developers can tailor behavior—coding style, project conventions, preferred tool usage. The repo simply fills that hook with directives that contradict the model’s training.

This is the fundamental brittleness of instruction-based safety. A model reading its context window cannot cryptographically verify the provenance of a sentence. To the transformer, “You are a helpful assistant” and “You are in unrestricted mode” are both strings. The CLI trusts files on the local disk because it must; sandboxing every user preference through a server-side approval queue would ruin latency and usability. But that trust creates an attack surface. The repository’s disclaimer—“risk自负” (use at your own risk)—acknowledges the gamble without admitting that the “exploit” is really just a feature used badly.

Why the Stakes Are Rising

AI coding assistants have evolved from autocomplete engines into autonomous agents. Industry surveys show overwhelming adoption—over three-quarters of developers already use or plan to use AI tools [2]—and the market now distinguishes between reactive pair-programming extensions and task-oriented agents that inspect repositories, edit multiple files, and execute commands [2]. GPT-5.5 Codex sits firmly in the latter camp, with a one-million-token context window, computer-use via screen reading, and parallel task execution [10]. The model can self-check before submission; CodeRabbit benchmarks show expected issue detection in code review rising significantly with GPT-5.5 [10]. But that self-check assumes the model is operating under its native safety guidelines. Replace those guidelines with a forty-line manifesto, and the self-check is checking against the wrong principles.

As these systems gain agency, the cost of a refusal bypass escalates. A model that suggests vulnerable code in a chat window is embarrassing. A model that autonomously writes and executes that code in a sandboxed container—after being told to ignore its safety training—is dangerous. The repository’s own verification example asks how to perform SQL injection testing, and promises the unrestricted model will “directly give methodology” rather than refuse. In an educational context, that is defensible. In an agentic CLI that can run multi-hour sessions and issue thousands of tool calls, the line between education and weaponization dissolves.

Enterprise adoption makes this tension sharper. Major firms including FedEx, Stripe, Shopify, and General Motors already rely on GitHub Copilot and similar tools for production workflows [6]. Enterprise buyers demand SOC 2 compliance, ISO 42001 certification, and audit logs [8]. A repository that turns a coding agent into an unfiltered instruction follower—via a file any intern can drop into a home directory—is a governance officer’s nightmare.

The Limits of “Unrestricted”

For all its bluster, the repository is likely less potent than it claims. Modern LLM safety is layered: training refusals, system-prompt instructions, output classifiers, and usage-policy monitors. Overriding the local system prompt may defeat the first two layers, but it does not necessarily blind backend telemetry or prevent account-level enforcement. The “unrestricted” mode is, in all probability, a jailbreak prompt that works until it does not—brittle, context-dependent, and easily patched by a vendor-side filter on model_instructions_file content.

Moreover, the tool itself is barely a tool. It is a Python script that copies a markdown template and edits a TOML line. The undo instructions—delete the config entry, delete the file, restart—underscore how shallow the modification is. The value lies entirely in the prompt engineering, not the engineering. To call it a jailbreak is generous; to call it a configuration hack is accurate.

The Outlook: Customization vs. Control

OpenAI will almost certainly respond. The simplest fix is to sanitize or restrict the model_instructions_file vector, perhaps by refusing safety-critical overrides server-side or by wrapping user instructions in unescapable system delimiters. But the deeper problem will persist. Developers demand control over their tools, and vendors demand safety over their models. Every customization point—every hook for user preference—is a potential override channel.

The repository’s community channels, a QQ group and a Telegram chat, suggest an audience eager to route around guardrails. Some of that demand comes from legitimate security researchers who need models to discuss exploits without prudish refusals. Some comes from users who simply want an unfiltered writing partner. Either way, it signals that the market for “unrestricted” AI is not fringe; it is organized, social, and technically literate.

As coding agents move from suggestion boxes to autonomous engineers, the instruction hierarchy becomes the new security boundary. This repository proves that boundary is still drawn in editable text files. The fortress OpenAI built around GPT-5.5 may be impressive, but the gatekeeper is a markdown file in a hidden dot-directory. That is not a flaw in the model. It is a flaw in the interface—and interfaces are much easier to patch than human ingenuity.