Mechanistic Interpretability

Reproducing Emotion Vectors, Part I

An independent reproduction of Anthropic's emotion vector research using the open-weight Llama 3.1 8B model. We confirm 9 of 11 verification criteria and uncover how safety alignment shapes causal steering behavior.