mechanistic interpretability with sparse autoencoders - a rchan26 Collection

updated Sep 3

A collection of papers I found useful for learning how sparse autoencoders can be used to find interpretable features in language models.

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Paper • 2309.08600 • Published Sep 15, 2023 • 13

Scaling and evaluating sparse autoencoders

Paper • 2406.04093 • Published Jun 6, 2024 • 2

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Paper • 2403.19647 • Published Mar 28, 2024 • 3

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Paper • 2408.05147 • Published Aug 9, 2024 • 37

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Paper • 2407.14435 • Published Jul 19, 2024 • 6

Interpreting Attention Layer Outputs with Sparse Autoencoders

Paper • 2406.17759 • Published Jun 25, 2024

Disentangling Dense Embeddings with Sparse Autoencoders

Paper • 2408.00657 • Published Aug 1, 2024

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Paper • 2405.12522 • Published May 21, 2024 • 2

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Paper • 2405.08366 • Published May 14, 2024 • 2