Mechanistic interpretability

Mechanistic interpretability (often shortened to "Mech Interp" or "MI") is a subfield of interpretability that seeks to reverse‑engineer neural networks, which are generally regarded as black boxes, into human‑understandable components or "circuits", revealing the causal pathways by which models process information. Common objects of study include vision models and Transformer-based large language models (LLMs).
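As a rough illustration of how such causal pathways can be probed, the sketch below applies activation patching, one common technique in the field, to a hypothetical two-layer toy model in PyTorch. The model, layer choice, and inputs are illustrative assumptions rather than any particular published setup.

```python
# Minimal activation-patching sketch: test whether a specific internal
# activation causally influences a model's output by copying it from a
# "clean" run into a "corrupted" run. Toy model and inputs are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-layer toy model standing in for a real network.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

clean_input = torch.randn(1, 4)      # input whose behavior we want to explain
corrupted_input = torch.randn(1, 4)  # baseline input with different behavior

# 1. Record the hidden activation (output of the ReLU) on the clean input.
cached = {}
def cache_hook(module, inputs, output):
    cached["act"] = output.detach()

handle = model[1].register_forward_hook(cache_hook)
clean_logits = model(clean_input)
handle.remove()

# 2. Run the corrupted input, but patch in the clean activation at the same
#    layer. If the output moves toward the clean output, that activation
#    lies on a causal pathway for the behavior under study.
def patch_hook(module, inputs, output):
    return cached["act"]  # returning a value replaces the layer's output

handle = model[1].register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
handle.remove()

corrupted_logits = model(corrupted_input)
print("clean:    ", clean_logits)
print("corrupted:", corrupted_logits)
print("patched:  ", patched_logits)  # matches the clean output here, since the
                                     # patched layer is the only path to the output
```

In realistic settings the same idea is applied to individual attention heads or MLP neurons in a large model, and the size of the output shift is used as evidence for or against their membership in a circuit.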