
Sumandora/remove-refusals-with-transformers

A proof-of-concept tool that removes refusal behavior from LLMs by steering activations, using only HuggingFace Transformers.

1.9k stars · Python · Language Models · LLMOps · Eval
Velocity (7d): +2.5 ★ / day · Trend: steady

This project implements a technique for removing refusal behavior from LLMs without using TransformerLens. It computes a refusal direction from model activations, contrasting harmful and harmless prompts, and applies that direction during inference to suppress refusals. The implementation works with HuggingFace Transformers and supports quantized models on consumer GPUs. The approach is based on the finding that refusal in LLMs is mediated by a single direction in activation space.
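The single-direction idea can be illustrated with a small, dependency-free sketch. It is not the project's code. It assumes the direction is the normalized difference between mean activations on harmful and harmless prompts, and that suppression means projecting that direction out of an activation vector. The function names (`refusal_direction`, `ablate`) and the toy three-dimensional "activations" are hypothetical. A real setup would read hidden states from a Transformers model instead.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: difference of means is used here for illustration only.
    """
    diff = [h - b for h, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]


# Toy stand-ins for per-prompt hidden states (hypothetical values).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, harmless)
steered = ablate([5.0, 2.0, 1.0], direction)
```

In this toy case the two sets differ only along the first axis, so that axis becomes the direction and `ablate` zeroes the first component while leaving the others unchanged.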
