Automated Framework Enhances Neural Network Interpretability With Scalable Explanations

Researchers at Northwestern University and EleutherAI unveil a system that generates natural language explanations for millions of neural features, advancing the field of AI interpretability at scale.

Research: Automatically Interpreting Millions of Features in Large Language Models. Image Credit: Aree_S / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, Northwestern University researchers, in collaboration with EleutherAI, used sparse autoencoders (SAEs) to transform deep neural network activations into a sparse, higher-dimensional latent space whose features are more interpretable. The experiments were run on open-weight large language models (LLMs) such as Llama and Gemma.

The authors built an automated pipeline that used LLMs to generate and evaluate natural language explanations for these features. They proposed new techniques for scoring explanation quality and demonstrated that SAE features were more interpretable than individual neurons, facilitating large-scale analysis.

Background

LLMs have demonstrated impressive performance across various domains, but understanding their internal representations remains challenging. Early research focused on analyzing individual neuron activations, revealing that many neurons are "polysemantic," meaning they activate in diverse contexts, making them difficult to interpret.

SAEs were introduced as a solution, transforming neuron activations into a sparse, higher-dimensional latent space that is potentially more interpretable. The study leveraged datasets such as RedPajama-v2, noting sparsity patterns and activation rates, which impacted interpretability outcomes.
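
To make the mechanism concrete, the sketch below shows, under simple assumptions, how an SAE widens and sparsifies an activation (the layer sizes, the top-k sparsity rule, and all names are illustrative choices, not the paper's configuration): a hidden activation is projected into a much wider non-negative code, only the strongest latents are kept, and the original activation is reconstructed from them.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Toy SAE sketch; sizes and the top-k rule are illustrative assumptions."""

    def __init__(self, d_model: int = 768, d_latent: int = 16_384, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k  # number of latents allowed to stay active per token

    def forward(self, activation: torch.Tensor):
        latent = torch.relu(self.encoder(activation))         # wide, non-negative code
        topk = torch.topk(latent, self.k, dim=-1)             # keep the k strongest latents
        sparse = torch.zeros_like(latent).scatter_(-1, topk.indices, topk.values)
        reconstruction = self.decoder(sparse)                 # map back to model space
        return sparse, reconstruction

# Usage: encode one hidden-state vector; at most k latents are non-zero.
sae = TopKSparseAutoencoder()
codes, recon = sae(torch.randn(1, 768))
print((codes != 0).sum().item())
```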

Previous work in this area relied on LLMs to generate explanations for neurons by analyzing their activation patterns. However, as LLMs scale, manually explaining millions of SAE features becomes infeasible. Existing methods, such as using GPT-4 to explain neuron activations, have limitations in scalability and accuracy.

This paper filled these gaps by presenting an automated framework that generated natural language explanations for millions of SAE latents using LLMs.

The framework introduced efficient techniques for evaluating explanation quality and proposed guidelines for generating more robust explanations. By scaling this process across multiple models and architectures, the authors provided a comprehensive set of high-quality explanations, helping advance the interpretability of large models and supporting downstream applications like model steering and concept localization.

Automated Latent Interpretation Method

The researchers developed an automated framework for interpreting SAE latents within LLMs, using methods to optimize explanation generation and improve evaluation metrics. The authors collected SAE latent activations over a sample of 10 million tokens from the RedPajama-v2 dataset, with SAEs trained on Llama 3.1 activations. They identified that while many latents activated frequently, some exhibited sparse activation, with context size influencing activation rates.
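
As a rough sketch of this collection step (the `get_activations` hook and the trained `sae` below are hypothetical placeholders, not the authors' code), one might record, for every latent, the contexts in which it fires and how strongly:

```python
from collections import defaultdict

def collect_latent_examples(token_batches, get_activations, sae, threshold=0.0):
    """Sweep tokenized text and record (tokens, position, strength) for each
    latent that fires; `get_activations` returns a (seq_len, d_model) tensor
    of hidden states and `sae` returns (sparse_codes, reconstruction)."""
    examples = defaultdict(list)                  # latent index -> activation records
    for tokens in token_batches:
        acts = get_activations(tokens)            # hidden states for this context
        codes, _ = sae(acts)                      # (seq_len, d_latent) sparse codes
        for position, row in enumerate(codes):
            for latent_idx in row.nonzero().flatten().tolist():
                strength = row[latent_idx].item()
                if strength > threshold:
                    examples[latent_idx].append((tokens, position, strength))
    return examples
```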

The findings highlighted the importance of dataset choice, showing that a larger SAE with 131k latents learned more dataset-specific features than smaller configurations.

The study also examined Gemma 2 9B SAEs, noting that larger SAEs learned more dataset-specific latents than smaller ones. The explanation generation process involved prompting an explainer model with token examples, emphasizing activating tokens and their strengths.

The authors found that focusing on top-activating examples produced concise explanations but sometimes failed to capture the full range of latent activations. To address this, the researchers introduced broader sampling strategies that could better reflect the diversity of latent activations.
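
A minimal sketch of both steps, assuming the `(tokens, position, strength)` records gathered above and a `decode` function that maps a token id back to text (all names here are hypothetical): examples are drawn from across the activation range rather than only the top, and activating tokens are wrapped in delimiters before being handed to the explainer.

```python
import random

def sample_examples(records, n=16, strategy="quantile"):
    """Pick examples for the explainer: 'top' keeps only the strongest
    activations, while 'quantile' (a simple stand-in for broader sampling)
    draws from across the whole activation range."""
    ranked = sorted(records, key=lambda r: r[2], reverse=True)  # r = (tokens, pos, strength)
    n = min(n, len(ranked))
    if strategy == "top":
        return ranked[:n]
    step = max(1, len(ranked) // n)
    return [random.choice(ranked[i:i + step]) for i in range(0, step * n, step)]

def build_explainer_prompt(records, decode):
    """Format examples for an explainer LLM; the << >> delimiters and the
    wording are illustrative assumptions, not the paper's exact prompt."""
    lines = []
    for tokens, position, strength in records:
        pieces = [decode(t) for t in tokens]
        pieces[position] = f"<<{pieces[position]}>>"   # mark the activating token
        lines.append("".join(pieces) + f"  (activation: {strength:.2f})")
    return (
        "The examples below all activate the same sparse-autoencoder latent. "
        "Tokens wrapped in << >> are where it fires. "
        "Write a short explanation of what the latent detects.\n\n" + "\n".join(lines)
    )
```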

The researchers introduced several scoring methods to assess explanation quality. These included detection (classifying contexts as activating or non-activating), fuzzing (evaluating activations at the token level), and surprisal scoring (measuring the reduction in cross-entropy loss).
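
For instance, detection scoring can be sketched as follows (the `judge` callable stands in for an LLM prompt asking whether a context matches the explanation; it is a placeholder, not the authors' implementation):

```python
import random

def detection_score(explanation, activating, non_activating, judge, n=10):
    """Detection-scoring sketch: the judge sees only the explanation and a
    shuffled mix of contexts and must label each as activating or not; the
    score is the fraction of correct labels."""
    n = min(n, len(activating), len(non_activating))
    sampled = [(ctx, True) for ctx in random.sample(activating, n)]
    sampled += [(ctx, False) for ctx in random.sample(non_activating, n)]
    random.shuffle(sampled)
    correct = sum(judge(explanation, ctx) == label for ctx, label in sampled)
    return correct / len(sampled)
```

Fuzzing is similar in spirit but operates at the token level, asking whether highlighted tokens within a context are the ones the explanation predicts should activate.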

Additionally, embedding and intervention scoring methods assessed explanation quality based on context retrieval and the effect of features on model output. Intervention scoring, a novel contribution, was particularly effective in identifying features overlooked by traditional context-based methods.
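
Embedding scoring can be approximated along these lines (a sketch assuming the explanation and contexts have already been embedded with some off-the-shelf text-embedding model; the ranking metric below is an illustrative choice):

```python
import numpy as np

def embedding_score(explanation_vec, activating_vecs, non_activating_vecs):
    """Embedding-scoring sketch: rank contexts by cosine similarity to the
    explanation embedding and measure how often activating contexts come out
    ahead of non-activating ones (an AUC-style retrieval score)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = [cosine(explanation_vec, v) for v in activating_vecs]
    neg = [cosine(explanation_vec, v) for v in non_activating_vecs]
    wins = sum(p > q for p in pos for q in neg)   # correctly ordered pairs
    return wins / (len(pos) * len(neg))
```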

Overall, the framework improves interpretability for large models by offering scalable, efficient methods for generating and evaluating natural language explanations of SAEs, filling gaps in existing approaches.

Evaluation of Scoring and Explanation Methods

The authors compared various scoring and explanation methods for interpreting SAE latents. Among the scoring methods, fuzzing and detection showed the highest correlations with established simulation scoring, suggesting they were practical alternatives to simulation scoring for SAE interpretation.

Embedding scoring, though faster, correlated more strongly with detection than with fuzzing. Intervention scoring stood out for identifying features missed by context-based methods and for distinguishing trained features from random ones. The researchers employed alignment techniques, such as the Hungarian algorithm, to ensure semantic consistency across layers and enable accurate feature comparisons between them.
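
The alignment step can be sketched with SciPy's Hungarian-algorithm solver; treating each latent as its decoder direction and matching on cosine similarity is an assumption made here for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_latents(decoder_a, decoder_b):
    """Match the latents of one SAE to those of another (e.g. adjacent layers).
    Rows are latent directions; cosine similarity is an illustrative choice."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    similarity = a @ b.T                               # pairwise cosine similarity
    rows, cols = linear_sum_assignment(-similarity)    # maximize total similarity
    return list(zip(rows.tolist(), cols.tolist()))

# Usage with small random decoder matrices (8 latents, 16 dimensions each).
print(align_latents(np.random.randn(8, 16), np.random.randn(8, 16))[:3])
```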

The explanation methods were tested on 500+ latents, revealing that explanations using a broader sampling of activating examples performed better than those focusing solely on the top examples. Larger explainer models improved explanation scores, though Claude 3.5 Sonnet did not significantly outperform Llama 3.1 70B.

SAEs with more latents achieved higher interpretability scores, with residual-stream SAEs performing slightly better than those trained on multilayer perceptron (MLP) outputs. Neurons, while sparser, underperformed compared to SAE latents. Feature-overlap analysis suggested that training SAEs on fewer residual-stream layers could be computationally efficient, though focusing on MLP outputs might yield greater feature diversity.

The study provided a comparative analysis showing that SAEs trained on different datasets (RedPajama-v2 vs. the Pile) had similar interpretability outcomes despite differences in activation sparsity.

The authors highlighted the potential of embedding and intervention scoring for scalable, effective SAE interpretation. They explored the use of the Hungarian algorithm for aligning latent features across layers, improving semantic consistency in explanations.

Conclusion

In conclusion, the researchers presented a novel automated framework for interpreting SAE latents within LLMs, enhancing the explainability of deep neural network activations. By employing innovative scoring techniques and optimizing the explanation generation process, the authors demonstrated that SAE latents offered superior interpretability compared to individual neurons.

The framework's efficiency in generating high-quality explanations for millions of SAE latents addressed previous scalability challenges, ultimately advancing model interpretability.

Additionally, the findings underscored the importance of context and broader sampling strategies in improving explanation quality, laying the groundwork for future research in automated interpretability methods. The study suggests that prioritizing a limited set of layers for SAE training and focusing on residual streams could yield better interpretability results with less computational overhead.

Journal reference:
  • Preliminary scientific report. Paulo, G., Mallen, A., Juang, C., & Belrose, N. (2024). Automatically Interpreting Millions of Features in Large Language Models. arXiv. DOI: 10.48550/arXiv.2410.13928, https://arxiv.org/abs/2410.13928

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python, and has worked on group projects involving Computer Vision, Image Classification, and App Development.
