Sumandora/remove-refusals-with-transformers
A proof-of-concept tool that removes refusal behavior from LLMs by steering activations using pure HuggingFace Transformers.

This project implements a technique for removing refusal behavior from LLMs without using TransformerLens. It computes a refusal direction from the model's activations on harmful and harmless prompts, then applies that direction during inference to suppress refusals. The implementation uses only HuggingFace Transformers and supports quantized models on consumer GPUs. The approach is based on the finding that refusal in LLMs is mediated by a single direction in activation space.
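The two steps described above, computing a direction from activations and applying it at inference, can be sketched on toy data. This is a minimal illustration, not the project's actual code. It assumes the direction is the normalized difference between mean activations on harmful and harmless prompts, and that "applying" it means projecting that direction out of an activation. Plain Python lists stand in for the tensors a real model would produce, and the function names (mean_vector, refusal_direction, ablate) are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: a difference-of-means estimate; the project may differ.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("both prompt sets have identical mean activations")
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit `direction`."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - proj * d for a, d in zip(activation, direction)]


# Toy 3-dimensional "activations"; real ones would come from a chosen
# layer of the model, one vector per prompt.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, harmless)
steered = ablate([5.0, 2.0, -1.0], direction)
```

In a real setup the same projection would presumably run on each layer's hidden states during generation, for example through a forward hook, so the ablated activation carries no component along the refusal direction.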