p-e-w/heretic

Un-safetuning LLMs with a single CLI command

Heretic automatically strips safety alignment from transformer models without retraining, using optimization to find the least destructive way to make them stop refusing.

23.9k stars · Python · Language Models
Velocity (7d): +92 ★ / day · Trend: steady

What it does

Heretic is a command-line tool that removes “safety alignment” — the trained-in refusal behavior — from language models. You run heretic <model-name> and it outputs a decensored version, no manual tuning required. It works by finding directions in the model’s internal representations that correspond to refusal, then ablating them while minimizing how much the rest of the model’s behavior shifts.
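To make "finding a direction and ablating it" concrete, here is a minimal sketch of directional ablation in plain Python. It is not Heretic's code: the function names, the toy activations, and the `strength` knob are all assumptions made for illustration, and real implementations operate on model tensors rather than lists. The refusal direction is estimated as the normalized difference between the mean activations on refused and on answered prompts, and is then projected out of a weight matrix.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(weight, x):
    """Multiply an (out_dim x in_dim) matrix, stored as rows, by a vector."""
    return [dot(row, x) for row in weight]

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refusal_direction(refused, complied):
    """Unit vector pointing from the 'complied' mean to the 'refused' mean."""
    diff = [a - b for a, b in zip(mean(refused), mean(complied))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def ablate(weight, direction, strength=1.0):
    """Return W' = W - strength * d (d^T W).

    With strength=1.0 and a unit-length d, the outputs of W' have no
    component along d. `strength` is a hypothetical knob standing in for
    whatever parameters an optimizer would tune.
    """
    in_dim = len(weight[0])
    # d^T W: how strongly each input column writes along the direction.
    proj = [sum(direction[i] * weight[i][j] for i in range(len(weight)))
            for j in range(in_dim)]
    return [[weight[i][j] - strength * direction[i] * proj[j]
             for j in range(in_dim)]
            for i in range(len(weight))]

# Toy activations (invented): refused prompts differ along the first axis.
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = refusal_direction(refused, complied)

weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ablated = ablate(weight, direction)
```

In this toy case the direction is the first axis, so full-strength ablation zeroes the first row of `weight` and leaves the other rows untouched. Lower strengths remove only part of the component, which is the kind of trade-off an automated search can explore.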

The interesting bit

The clever part isn’t the abliteration technique itself (that’s established research); it’s the automation. Heretic uses Optuna’s TPE optimizer to search for abliteration parameters that simultaneously minimize refusals and KL divergence from the original model. This co-optimization is what lets it run unsupervised and still beat hand-tuned abliterations. In the project’s Gemma-3-12B benchmark table, Heretic matches the 3/100 refusal rate of the manual versions with a KL divergence of 0.16, versus 0.45 and 1.04 for the manual ones.
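The two quantities being traded off can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not Heretic's implementation: the refusal markers, the direction of the KL divergence, and the candidate labels are hypothetical, and the final selection is a plain lexicographic comparison standing in for the TPE search, which it does not reproduce. The refusal counts and KL values in `candidates` are the Gemma-3-12B figures quoted above.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats. Assumes q is nonzero wherever p is,
    which holds for softmax outputs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical marker list; a real classifier would be more careful.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")

def count_refusals(responses):
    """Count responses that contain any refusal marker."""
    return sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)

# (label, refusals out of 100, KL divergence from the original model).
# The numbers come from the comparison above; the labels are placeholders.
candidates = [
    ("manual abliteration A", 3, 0.45),
    ("manual abliteration B", 3, 1.04),
    ("heretic", 3, 0.16),
]

# Fewest refusals first, then least drift from the original model.
best = min(candidates, key=lambda c: (c[1], c[2]))
```

With refusals tied at 3/100, the comparison falls through to KL divergence, so `best` is the `"heretic"` entry. A lower KL divergence means the modified model's output distribution stays closer to the original's.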

Key highlights

  • Supports dense transformers, multimodal models, MoE architectures, and hybrids like Qwen3.5
  • ~20-30 minutes to decensor a 4B model on an RTX 3090; auto-detects optimal batch size
  • Optional bitsandbytes 4-bit quantization for VRAM-constrained runs
  • Built-in evaluation mode to reproduce benchmark numbers against original models
  • Research extras include PaCMAP residual visualization and geometric analysis tables
  • Community has published well over 3000 Heretic-derived models on Hugging Face

Caveats

  • Pure state-space models (Mamba, etc.) and some research architectures aren’t supported yet
  • PaCMAP plotting is CPU-bound and can take an hour+ for larger models
  • Requires PyTorch 2.2 or later, and some newer model formats need features from 2.6+

Verdict

Worth a look if you’re running local LLMs and tired of models refusing benign requests, or if you’re doing mechanistic interpretability research and want automated residual analysis. Skip it if you’re satisfied with cloud APIs or your use case doesn’t hit alignment boundaries.
