TransformerLensOrg/TransformerLens
A library for inspecting and editing the internal activations of GPT-2 style language models to reverse-engineer learned algorithms.

TransformerLens is a mechanistic interpretability library that loads over 50 open-source language models and exposes their internal activations. It lets users cache any intermediate activation and attach functions that edit, remove, or replace activations as the model runs. The library’s goal is to reverse-engineer the algorithms that trained models have learned, working directly from their weights, in support of research into how large language models work internally.
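The cache-and-hook workflow described above can be illustrated with a toy, dependency-free sketch. This is not TransformerLens code. `ToyHookedModel`, its two elementwise "layers", and the hook names `blocks.0.hook_out` and `blocks.1.hook_out` are all hypothetical. The method names `run_with_cache` and `run_with_hooks` deliberately echo the library's methods of the same names, and so does the convention that a hook returning `None` leaves the activation unchanged. In the real library, however, activations are PyTorch tensors and a hook receives a hook-point object rather than a plain name string.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical hook signature: takes the activation and the hook point's name.
# Returning a list replaces the activation; returning None leaves it unchanged.
HookFn = Callable[[List[float], str], Optional[List[float]]]


class ToyHookedModel:
    """Hypothetical two-layer toy model with a named hook point after each layer."""

    def __init__(self) -> None:
        # Each "layer" is a simple elementwise function standing in for a real block.
        self.layers: List[Tuple[str, Callable[[List[float]], List[float]]]] = [
            ("blocks.0.hook_out", lambda xs: [2.0 * x for x in xs]),
            ("blocks.1.hook_out", lambda xs: [x + 1.0 for x in xs]),
        ]

    def _forward(self, xs: List[float], hooks: List[Tuple[str, HookFn]]) -> List[float]:
        for name, layer in self.layers:
            xs = layer(xs)
            # Run every hook registered at this point; a non-None return replaces the activation.
            for hook_name, fn in hooks:
                if hook_name == name:
                    result = fn(xs, name)
                    if result is not None:
                        xs = result
        return xs

    def run_with_hooks(self, xs: List[float], fwd_hooks: List[Tuple[str, HookFn]]) -> List[float]:
        """Run the model with temporary hooks that may edit activations."""
        return self._forward(list(xs), fwd_hooks)

    def run_with_cache(self, xs: List[float]) -> Tuple[List[float], Dict[str, List[float]]]:
        """Run the model and record a copy of the activation at every hook point."""
        cache: Dict[str, List[float]] = {}

        def save(act: List[float], name: str) -> None:
            cache[name] = list(act)  # store a copy; returning None leaves the activation as-is
            return None

        hooks = [(name, save) for name, _ in self.layers]
        return self._forward(list(xs), hooks), cache


model = ToyHookedModel()

# Caching: inspect every intermediate activation from a single forward pass.
out, cache = model.run_with_cache([1.0, 2.0])
print(out)                           # [3.0, 5.0]
print(cache["blocks.0.hook_out"])    # [2.0, 4.0]


# Ablation: replace the first layer's output with zeros while the model runs.
def zero_ablate(act: List[float], name: str) -> List[float]:
    return [0.0 for _ in act]


ablated = model.run_with_hooks([1.0, 2.0], fwd_hooks=[("blocks.0.hook_out", zero_ablate)])
print(ablated)                       # [1.0, 1.0]
```

Caching records each named activation on one pass without changing the output, while the ablation hook overwrites an activation mid-run so the edit propagates through every later layer; comparing the two outputs is the basic move behind ablation experiments.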