Mechanistic interpretability
Mechanistic interpretability (often shortened to "Mech Interp" or "MI") is a subfield of interpretability research that seeks to reverse‑engineer neural networks, which are generally treated as black boxes, into human‑understandable components or "circuits", revealing the causal pathways by which models process information. Objects of study include, but are not limited to, vision models and Transformer-based large language models (LLMs).
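
A common way to probe such causal pathways is activation patching: recording an internal activation on one input and substituting it into a run on a different input, then checking how the output changes. The following is a minimal sketch, not a reference implementation; the toy model, layer choice, and inputs are hypothetical and stand in for whatever trained network and behaviour is under study.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model standing in for a trained network (hypothetical; any nn.Module
# with addressable submodules would work the same way).
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

clean_input = torch.randn(1, 4)      # input exhibiting the behaviour of interest
corrupted_input = torch.randn(1, 4)  # contrastive input that changes the output

layer = model[1]  # the internal site whose activation we intervene on
cache = {}

# 1. Record the hidden activation on the clean run.
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = layer.register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, but patch in the clean activation.
def patch_hook(module, inputs, output):
    return cache["hidden"]  # returning a value from a forward hook replaces the output

handle = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

corrupted_out = model(corrupted_input)

# If patching this activation moves the corrupted output back toward the clean
# output, the patched site lies on a causal pathway for the behaviour.
print("clean:    ", clean_out)
print("corrupted:", corrupted_out)
print("patched:  ", patched_out)
```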