Is remove-refusals-with-transformers open source?

Yes — Sumandora/remove-refusals-with-transformers is open source, released under the Apache-2.0 license.

What language is remove-refusals-with-transformers written in?

Sumandora/remove-refusals-with-transformers is primarily written in Python.

How popular is remove-refusals-with-transformers?

Sumandora/remove-refusals-with-transformers has 2k stars on GitHub.

Where can I find remove-refusals-with-transformers?

Sumandora/remove-refusals-with-transformers is on GitHub at https://github.com/Sumandora/remove-refusals-with-transformers.

Sumandora/remove-refusals-with-transformers

Removing LLM refusals without TransformerLens, on a 6GB card

Proof-of-concept code that removes an LLM’s refusal behavior by subtracting a single direction from its hidden states, using only standard Hugging Face Transformers.

★ 2k stars · Python · Language Models · LLMOps · Eval

What it does

This proof-of-concept removes an LLM’s refusal behavior by subtracting a single “refusal direction” from its hidden states, using only standard Hugging Face Transformers. It avoids the TransformerLens library entirely, which means it can target many models in the HF catalog, at least in theory. The author tested the approach on an RTX 2060 with 6 GB of VRAM, so most testing has been on sub-3B models, though larger models are said to work too.
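
To make "subtracting a single direction" concrete, here is a minimal sketch in plain Python. It assumes the subtraction means projecting the refusal direction out of a hidden-state vector, which is one common reading of the technique and not necessarily the exact operation the repository performs. The function names `normalize` and `ablate_direction` are hypothetical, and a real implementation would work on tensors rather than lists.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Assumption: "subtracting a direction" is read here as projecting it out,
    i.e. h' = h - (h . r) r for a unit vector r.
    """
    unit = normalize(direction)
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy 3-dimensional "hidden state" and a toy refusal direction along axis 1.
hidden = [3.0, 4.0, 0.0]
direction = [0.0, 2.0, 0.0]
ablated = ablate_direction(hidden, direction)
print(ablated)  # [3.0, 0.0, 0.0]
```

After the edit, the vector has no component left along the chosen direction, while everything orthogonal to it is unchanged.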

The interesting bit

The code walks the model's layer list directly (typically model.model.layers) to apply the edit, rather than relying on an interpretability framework. That direct approach broadens compatibility, but it also means custom architectures can trip it up: the README notes that some Qwen variants use model.transformer.h instead and currently fail.
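
The attribute-name problem can be illustrated with a small lookup helper. This is a hypothetical sketch, not code from the repository: `find_decoder_layers` and `CANDIDATE_PATHS` are names invented here, and the two paths are simply the ones mentioned above, written relative to the model object. Stand-in objects replace real models so the example runs without downloading anything.

```python
from types import SimpleNamespace

# Attribute paths relative to the model object. Assumption: only the two
# layouts mentioned in the text are covered; other architectures may differ.
CANDIDATE_PATHS = ("model.layers", "transformer.h")


def find_decoder_layers(model, candidate_paths=CANDIDATE_PATHS):
    """Return (path, layers) for the first attribute path that resolves."""
    for path in candidate_paths:
        obj = model
        for name in path.split("."):
            obj = getattr(obj, name, None)
            if obj is None:
                break  # this path does not exist on the model
        else:
            return path, obj
    raise AttributeError(
        f"no known layer container found; tried {', '.join(candidate_paths)}"
    )


# Stand-ins for real models, shaped like the two layouts described above.
llama_like = SimpleNamespace(model=SimpleNamespace(layers=["block0", "block1"]))
qwen_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0"]))

print(find_decoder_layers(llama_like)[0])  # model.layers
print(find_decoder_layers(qwen_like)[0])   # transformer.h
```

Resolving the layer list is only one part of supporting an architecture. Whether such a fallback would be enough to make the failing Qwen variants work is not something the source confirms.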

Key highlights

Runs without TransformerLens, relying purely on transformers and standard model objects.
Tested on consumer hardware (RTX 2060 6 GB), mostly with <3B parameter models.
Computes the refusal direction from paired harmful and harmless instruction datasets.
Supports mixed quantization settings between the analysis and inference phases.
The technique is drawn from a LessWrong post positing that refusal is mediated by a single direction.
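
As a rough illustration of computing a direction from paired datasets, the sketch below takes the difference between the mean activation on harmful prompts and the mean activation on harmless prompts, then normalizes it. Difference of means is assumed here as the estimator, and the activation values are toy numbers; `mean_vector` and `refusal_direction` are hypothetical names, not functions from the repository.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: each inner list is the hidden state collected for one
    prompt at a fixed layer and token position.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("the two means coincide; no direction to extract")
    return [x / norm for x in diff]


# Toy 2-dimensional activations (made-up numbers, for illustration only).
harmful_acts = [[1.0, 4.0], [3.0, 6.0]]   # mean is [2.0, 5.0]
harmless_acts = [[1.0, 0.0], [3.0, 2.0]]  # mean is [2.0, 1.0]

direction = refusal_direction(harmful_acts, harmless_acts)
print(direction)  # [0.0, 1.0]
```

In this toy case the two groups differ only along the second axis, so that axis is recovered as the direction.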

Caveats

The author explicitly calls the implementation “crude” and proof-of-concept.
Some models with custom architectures (e.g., certain Qwen implementations) fail because layer attribute names differ from the hardcoded expectations.
Only a narrow range of small models has been rigorously tested, owing to GPU memory constraints.

Verdict

Mechanistic interpretability researchers and red-teamers curious about low-dependency refusal removal should take a look; anyone needing a polished, architecture-agnostic jailbreak tool should look elsewhere.

Frequently asked

What is Sumandora/remove-refusals-with-transformers?: Proof-of-concept code that removes an LLM’s refusal behavior by subtracting a single direction from its hidden states, using only standard Hugging Face Transformers.