Researchers at Northwestern University and EleutherAI unveil a system that generates and evaluates natural language explanations for millions of neural features, advancing the field of AI interpretability.
Research: Automatically Interpreting Millions of Features in Large Language Models. Image Credit: Aree_S / Shutterstock
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, Northwestern University researchers, in collaboration with EleutherAI, focused on using sparse autoencoders (SAEs) to transform deep neural network activations into a sparse, higher-dimensional latent space of potentially more interpretable features. The study utilized open-weight LLMs, such as Llama and Gemma, to perform these experiments.
The authors built an automated pipeline that generated and evaluated natural language explanations for these features using large language models (LLMs). They proposed new techniques for scoring the quality of explanations and demonstrated that SAE features were more interpretable than individual neurons, facilitating large-scale analysis.
Background
LLMs have demonstrated impressive performance across various domains, but understanding their internal representations remains challenging. Early research focused on analyzing individual neuron activations, revealing that many neurons are "polysemantic," meaning they activate in diverse contexts, making them difficult to interpret.
SAEs were introduced as a solution, transforming neuron activations into a sparse, higher-dimensional latent space that is potentially more interpretable. The study leveraged datasets such as RedPajama-v2 and noted that sparsity patterns and activation rates affected interpretability outcomes.
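To make this concrete, the minimal sketch below shows how a top-k SAE maps a dense activation vector into a much wider, mostly zero latent vector; the dimensions and the top-k activation rule are illustrative choices, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Minimal sketch of a top-k SAE: a dense activation vector is projected
    into a much wider latent space and only the k strongest latents are kept."""

    def __init__(self, d_model: int = 4096, d_latent: int = 131_072, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model, bias=False)
        self.k = k

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.encoder(h))
        # Keep only the k largest latents per token; everything else is zeroed.
        topk = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Reconstruction objective: decode the sparse code back to the original activation.
        return self.decoder(self.encode(h))
```

Because only a handful of latents are nonzero for any given token, each latent tends to fire in a narrower, more describable set of contexts than a raw polysemantic neuron.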
Previous work in this area relied on LLMs to generate explanations for neurons by analyzing their activation patterns. However, as LLMs scale, manually explaining millions of SAE features becomes infeasible. Existing methods, such as using Generative Pre-trained Transformer 4 (GPT-4) to explain neuron activations, have limitations in terms of scalability and accuracy.
This paper filled these gaps by presenting an automated framework that generated natural language explanations for millions of SAE latents using LLMs.
The framework introduced efficient techniques for evaluating explanation quality and proposed guidelines for generating more robust explanations. By scaling this process across multiple models and architectures, the authors provided a comprehensive set of high-quality explanations, helping advance the interpretability of large models and supporting downstream applications like model steering and concept localization.
Automated Latent Interpretation Method
The researchers developed an automated framework for interpreting SAE latents within LLMs, using methods to optimize explanation generation and improve evaluation metrics. The authors collected SAE latent activations over a sample of 10 million tokens from the RedPajama-v2 dataset, training their SAEs on Llama 3.1. They found that while many latents activated frequently, others activated only rarely, with context size influencing activation rates.
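A rough sketch of this collection step is shown below; `get_latents` is a hypothetical callable standing in for the model-plus-SAE forward pass, and the number of examples retained per latent is an arbitrary choice.

```python
import heapq
import torch

def collect_top_examples(token_batches, get_latents, n_keep: int = 20):
    """Collect the strongest-activating contexts per latent. `get_latents` is a
    hypothetical stand-in that runs tokens through the model and SAE and returns
    a [batch, seq, n_latents] activation tensor."""
    top: dict[int, list] = {}
    for tokens in token_batches:             # tokens: [batch, seq] LongTensor
        acts = get_latents(tokens)           # [batch, seq, n_latents], mostly zeros
        max_act, _ = acts.max(dim=1)         # strongest activation of each latent per sequence
        for b, latent in torch.nonzero(max_act).tolist():
            heap = top.setdefault(latent, [])
            item = (max_act[b, latent].item(), tokens[b].tolist())
            if len(heap) < n_keep:
                heapq.heappush(heap, item)   # min-heap keeps only the n_keep strongest examples
            else:
                heapq.heappushpop(heap, item)
    return top
```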
The findings highlighted the importance of dataset choice, showing that a larger 131k-latent SAE learned more dataset-specific features than smaller configurations.
This study also examined Gemma 2 9B SAEs, noting that larger models learned more dataset-specific latents than smaller ones. The explanation generation process involved prompting an explainer model with token examples, emphasizing activating tokens and their strengths.
The authors found that focusing on top-activating examples produced concise explanations but sometimes failed to capture the full range of latent activations. To address this, the researchers introduced broader sampling strategies that could better reflect the diversity of latent activations.
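The sketch below illustrates one such broader strategy, mixing top-activating contexts with examples drawn from lower activation ranges; the split sizes are illustrative rather than taken from the paper.

```python
import random
from typing import List, Tuple

def sample_examples(
    examples: List[Tuple[str, float]],   # (context, max activation) pairs for one latent
    n_top: int = 10,
    n_quantiles: int = 4,
    per_quantile: int = 5,
) -> List[Tuple[str, float]]:
    """Keep the top-activating contexts, then draw a few examples from each
    activation quantile so that weaker activations are also represented."""
    ranked = sorted(examples, key=lambda e: e[1], reverse=True)
    selected = ranked[:n_top]
    rest = ranked[n_top:]
    if not rest:
        return selected
    step = max(1, len(rest) // n_quantiles)
    for q in range(n_quantiles):
        bucket = rest[q * step:(q + 1) * step]
        selected.extend(random.sample(bucket, min(per_quantile, len(bucket))))
    return selected
```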
The researchers introduced several scoring methods to assess explanation quality. These included detection (classifying activating versus non-activating contexts), fuzzing (evaluating activations at the token level), and surprisal scoring (measuring the reduction in cross-entropy loss when an explanation is provided).
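As an illustration of the simplest of these, the sketch below outlines detection scoring with a placeholder `judge` callable standing in for the scoring LLM; fuzzing would proceed similarly, but with candidate activating tokens highlighted in the text and judged at the token level.

```python
from typing import Callable, List, Tuple

def detection_score(
    explanation: str,
    contexts: List[Tuple[str, bool]],      # (text, does the latent actually fire here?)
    judge: Callable[[str, str], bool],     # placeholder: True if the judge predicts "activating"
) -> float:
    """Detection scoring sketch: the judge sees only the explanation and a context,
    predicts whether the latent fires, and the score is its accuracy against the
    ground-truth activations."""
    correct = sum(int(judge(explanation, text) == truly_activates)
                  for text, truly_activates in contexts)
    return correct / len(contexts)
```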
Additionally, embedding and intervention scoring methods assessed explanation quality based on context retrieval and the effect of features on model output. Intervention scoring, a novel contribution, was particularly effective in identifying features overlooked by traditional context-based methods.
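The mechanical core of such an intervention can be sketched as a PyTorch forward hook that adds a latent's decoder direction to one layer's output during generation; the layer handle, the scaling factor, and the downstream judging step are assumptions made here for illustration, not the paper's exact procedure.

```python
import torch

def add_latent_direction(layer: torch.nn.Module, direction: torch.Tensor, scale: float = 10.0):
    """Register a forward hook that adds `scale * direction` (a latent's decoder
    vector) to the layer's output. Intervention scoring would then compare text
    generated with and without the hook and ask a judge whether the explanation
    predicts the change. Call .remove() on the returned handle to undo it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)
```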
Overall, the framework improves interpretability for large models by offering scalable, efficient methods for generating and evaluating natural language explanations of SAE latents, filling gaps in existing approaches.
Evaluation of Scoring and Explanation Methods
The authors compared various scoring and explanation methods for interpreting SAE latents. In scoring methods, fuzzing and detection showed the highest correlations with established simulation scoring, suggesting they were practical alternatives for SAE interpretation.
Embedding scoring, though faster, correlated more with detection than fuzzing. Intervention scoring stood out for identifying features missed by context-based methods and distinguishing between trained and random features. The researchers employed alignment techniques, such as the Hungarian algorithm, to ensure semantic consistency across layers, allowing for accurate feature comparisons between layers.
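One way such a matching could be set up is sketched below, using SciPy's linear-sum-assignment solver (a Hungarian-style algorithm) on cosine similarities between decoder directions; treating decoder-direction similarity as the matching cost is an assumption made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_latents(dec_a: np.ndarray, dec_b: np.ndarray) -> list:
    """Match latents of two SAEs (e.g. from adjacent layers) by maximizing cosine
    similarity between their decoder directions, each matrix shaped [n_latents, d_model]."""
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    similarity = a @ b.T                              # [n_latents_a, n_latents_b]
    rows, cols = linear_sum_assignment(-similarity)   # negate to maximize similarity
    return list(zip(rows.tolist(), cols.tolist()))
```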
The explanation methods were tested on more than 500 latents, revealing that explanations drawing on a broader sample of activating examples performed better than those focusing solely on the top examples. Larger explainer models improved explanation scores, though Claude 3.5 Sonnet did not significantly outperform Llama 3.1 70B.
SAEs with more latents achieved higher interpretability scores, with residual stream SAEs performing slightly better than those trained on multilayer perceptron (MLP) outputs. Neurons, even when sparsified, underperformed compared to SAE latents. Feature overlap analysis suggested that training SAEs on fewer residual stream layers might be computationally efficient, though focusing on MLP outputs might yield greater feature diversity.
The study provided a comparative analysis, showing that SAEs trained on specific datasets (like RedPajama vs. Pile) had similar interpretability outcomes despite differences in activation sparsity.
The authors highlighted the potential of embedding and intervention scoring for scalable, effective SAE interpretation. They explored the use of the Hungarian algorithm for aligning latent features across layers, improving semantic consistency in explanations.
Conclusion
In conclusion, the researchers presented a novel automated framework for interpreting SAEs within LLMs, enhancing the explainability of deep neural network activations. By employing innovative scoring techniques and optimizing the explanation generation process, the authors successfully demonstrated that SAEs offered superior interpretability compared to individual neurons.
The framework's efficiency in generating high-quality explanations for millions of SAE latents addressed previous scalability challenges, ultimately advancing model interpretability.
Additionally, the findings underscored the importance of context and broader sampling strategies in improving explanation quality, laying the groundwork for future research in automated interpretability methods. The study suggests that prioritizing a limited set of layers for SAE training and focusing on residual streams could yield better interpretability results with less computational overhead.
Journal reference:
- Preliminary scientific report. Paulo, G., Mallen, A., Juang, C., & Belrose, N. (2024). Automatically Interpreting Millions of Features in Large Language Models. arXiv. DOI: 10.48550/arXiv.2410.13928, https://arxiv.org/abs/2410.13928