In an article recently posted to the Meta Research website, researchers explored mechanistic interpretability (MI), which aims to understand the internal decision-making processes of neural networks. They argued that high-dimensional neural networks could learn useful low-dimensional representations of their training data, providing insights that align with human domain knowledge. The authors used nuclear physics as a case study to demonstrate how MI could reveal new understanding of complex problems through models trained to solve them.
Background
Understanding high-dimensional phenomena through low-dimensional theories is a core aspect of scientific inquiry and machine learning. Recent advancements have demonstrated that deep learning models, much like traditional scientific methods, extract low-dimensional representations from complex, high-dimensional data.
Notably, research has focused on disentangling these representations to identify meaningful, interpretable factors of variation. However, despite the success in various domains, gaps remain in fully leveraging these models for MI, particularly in translating their findings into actionable scientific insights.
This paper addressed these gaps by applying machine learning to nuclear physics, a field known for its intricate theories and rich data. By examining how neural networks trained on nuclear data uncovered and mirrored human-derived nuclear theories, the study extended MI beyond mere prediction accuracy, illustrating how deep learning models could provide genuine scientific insight by aligning their learned representations with established physical concepts.
Modular Arithmetic Primer
Recent research into interpretability has examined models trained on modular arithmetic tasks, revealing clean, insightful embeddings. These models, once understood mechanistically, provided clear algorithms and progress measures for generalization. This paper built on that foundation by introducing latent space topography (LST), a technique for visualizing and interpreting model embeddings.
Applying this method to modular addition revealed that the network computed modular sums by averaging the embeddings of its two inputs and then indexing into the result, a mechanism sketched below. The authors extended this approach to nuclear physics, exploring whether models trained on this more complex domain could likewise reveal and align with established scientific concepts.
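To make the averaging mechanism concrete, here is a minimal numerical sketch (an illustration, not the authors' code): with unit-circle "clock" embeddings E(a) = exp(2πia/p), the mean of two embeddings points along the angle π(a+b)/p up to a sign, and squaring the averaged complex number doubles the angle and removes that ambiguity, so the modular sum can be read off directly.

```python
import numpy as np

p = 97  # modulus (odd, so the mean of two embeddings is never exactly zero)

def embed(a: int) -> complex:
    """Unit-circle ('clock') embedding of a residue, as a complex number."""
    return np.exp(2j * np.pi * a / p)

def mod_add(a: int, b: int) -> int:
    """Recover (a + b) mod p from the averaged embeddings alone."""
    m = (embed(a) + embed(b)) / 2  # points along angle pi*(a+b)/p, up to sign
    angle = np.angle(m ** 2)       # squaring doubles the angle, removing the sign
    return round(angle * p / (2 * np.pi)) % p

# Sanity check over all input pairs
assert all(mod_add(a, b) == (a + b) % p
           for a in range(p) for b in range(p))
```

In the trained networks the indexing step is learned rather than computed analytically, but the geometry it exploits is the same: the modular sum is encoded in the direction of the averaged embedding.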
A Physics Case Study
Nuclear physics served as an ideal case study due to its rich history of well-established theories and its ongoing open challenges. The research involved analyzing principal component projections of nuclear data, which revealed periodic and helical structures, suggesting that the learned representations encode physically meaningful, interpretable structure.
The experimental setup involved predicting properties such as binding energy and charge radius for a range of nuclei, using learned embeddings of proton and neutron numbers together with attention mechanisms to model these predictions. The goal was to extract model-derived features and compare them with established nuclear theories, providing deeper understanding and validation of the model's interpretations.
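The paper's exact architecture is not reproduced here, but a minimal PyTorch sketch conveys the general setup: learned embeddings for proton and neutron numbers feed a small network that regresses a scalar nuclear property. All names and sizes below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class NuclearRegressor(nn.Module):
    """Toy model: embed (Z, N), combine, and regress a scalar such as
    binding energy. Dimensions are illustrative, not the paper's."""
    def __init__(self, max_z: int = 120, max_n: int = 180, d: int = 64):
        super().__init__()
        self.z_emb = nn.Embedding(max_z, d)  # learned proton-number embeddings
        self.n_emb = nn.Embedding(max_n, d)  # learned neutron-number embeddings
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, z: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
        h = torch.cat([self.z_emb(z), self.n_emb(n)], dim=-1)
        return self.mlp(h).squeeze(-1)

model = NuclearRegressor()
z, n = torch.tensor([26]), torch.tensor([30])  # e.g., iron-56
print(model(z, n).shape)  # torch.Size([1]): one scalar prediction per nucleus
```

The interpretability analyses described later operate on the learned embedding tables (the rows of `z_emb.weight` and `n_emb.weight`), for instance by projecting them onto their leading principal components.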
Principal Component Analysis
Principal component analysis (PCA) is a common tool for dimensionality reduction, but its assumptions can produce misleading results, such as phantom oscillations: sinusoidal principal components that emerge even from non-oscillatory data. Despite these pitfalls, PCA remains valuable.

For example, low-rank approximations built from only a few leading principal components often retain most of a model's performance. Additionally, PCA can reveal rich, informative structure in data, such as the periodicity and even-odd splits seen in embeddings from models trained on modular arithmetic and nuclear physics, enhancing interpretability beyond noise.
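Phantom oscillations are straightforward to reproduce. The sketch below (an illustration, not taken from the paper) applies PCA to independent Gaussian random walks, which contain no periodic signal, yet the leading principal components emerge as smooth sinusoids, matching the Karhunen-Loève eigenfunctions of Brownian motion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_walks, T = 2000, 200

# Independent random walks: cumulative sums of white noise (nothing periodic).
X = rng.standard_normal((n_walks, T)).cumsum(axis=1)

# PCA via eigendecomposition of the covariance across time points.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n_walks - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
pcs = eigvecs[:, ::-1]  # columns sorted by decreasing variance

# The k-th PC closely matches sin((k - 1/2) * pi * t / T): a "phantom" oscillation.
t = np.arange(1, T + 1)
for k in (1, 2, 3):
    theory = np.sin((k - 0.5) * np.pi * t / T)
    theory /= np.linalg.norm(theory)
    overlap = abs(pcs[:, k - 1] @ theory)
    print(f"PC{k} overlap with sinusoid: {overlap:.3f}")  # typically close to 1
```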
Analysis of Embedding Structures and Model Interpretability
The experiments explored the interpretability of neural network embeddings for nuclear physics data, specifically focusing on proton and neutron numbers. Initial findings, supported by analogous structure in language-model embeddings, suggested that the embeddings exhibited interpretable structure.
PCA revealed a notable helical pattern in proton number embeddings, indicating an ordered structure that aided in binding energy predictions. This helix structure aligned with known nuclear physics principles, such as the volume term of the semi-empirical mass formula.
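What a "helical" embedding structure means can be illustrated with synthetic data (a toy construction under stated assumptions, not the trained model's actual embeddings): rows that combine a linear trend with a periodic circle, hidden in a random high-dimensional basis, reappear as a helix in the top three principal components.

```python
import numpy as np

rng = np.random.default_rng(1)
Z, d, period = 100, 64, 10  # number of nuclei, embedding dim, assumed period

# Build a noisy helix (linear trend + circle) hidden in a random 64-dim basis.
z = np.arange(Z)
helix = np.stack([0.05 * z,                       # "volume-like" linear direction
                  np.cos(2 * np.pi * z / period),
                  np.sin(2 * np.pi * z / period)], axis=1)
basis, _ = np.linalg.qr(rng.standard_normal((d, 3)))   # random orthonormal frame
E = helix @ basis.T + 0.05 * rng.standard_normal((Z, d))  # embedding matrix

# PCA: project embeddings onto their top three principal components.
Ec = E - E.mean(axis=0)
U, S, Vt = np.linalg.svd(Ec, full_matrices=False)
proj = Ec @ Vt[:3].T

# One PC grows monotonically with z; the other two trace a circle: a helix.
print(np.corrcoef(proj[:, 0], z)[0, 1])  # magnitude near 1: linear trend
print(proj[:, 1:3].std(axis=0))          # roughly equal spreads: circular pair
```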
Further analysis highlighted a correlation between the degree of order in the embeddings and the model's generalization performance. One notable feature was the separation of even and odd numbers in the embeddings; this split affected model accuracy, suggesting that parity, which underlies nuclear pairing effects, played a significant role in the learned representation. In addition, experiments on the model's hidden layers revealed that their principal components corresponded to known physics terms, such as the volume and pairing terms of the semi-empirical mass formula and corrections from nuclear shell theory.
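For reference, the semi-empirical mass formula that the learned features are compared against combines volume, surface, Coulomb, asymmetry, and pairing contributions. A direct implementation with commonly quoted coefficient values (different fits vary slightly) is:

```python
def semf_binding_energy(Z: int, N: int) -> float:
    """Semi-empirical mass formula binding energy in MeV.
    Coefficients are commonly quoted fit values; fits differ slightly."""
    A = Z + N
    a_V, a_S, a_C, a_A, a_P = 15.75, 17.8, 0.711, 23.7, 11.18
    volume    = a_V * A                            # bulk nuclear attraction
    surface   = -a_S * A ** (2 / 3)                # surface nucleons bind less
    coulomb   = -a_C * Z * (Z - 1) / A ** (1 / 3)  # proton-proton repulsion
    asymmetry = -a_A * (N - Z) ** 2 / A            # neutron-proton imbalance cost
    if Z % 2 == 0 and N % 2 == 0:
        pairing = a_P / A ** 0.5    # even-even nuclei are extra stable
    elif Z % 2 == 1 and N % 2 == 1:
        pairing = -a_P / A ** 0.5   # odd-odd nuclei are least stable
    else:
        pairing = 0.0
    return volume + surface + coulomb + asymmetry + pairing

# Example: iron-56 (Z=26, N=30) gives roughly 8.8 MeV per nucleon.
print(semf_binding_energy(26, 30) / 56)
```

The volume term a_V·A grows linearly with mass number, the behavior the helical embedding's linear direction was found to track, while the pairing term flips sign with parity, matching the even-odd split observed in the embeddings.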
Conclusion
In conclusion, the researchers demonstrated the potential of MI in neural networks to yield valuable scientific insights. By applying MI to nuclear physics, they showed that neural network embeddings captured interpretable structures, such as helical patterns and parity splits, that aligned with established physical theories like the semi-empirical mass formula.
Analysis of hidden layer activations further supported that these models learned and utilized physically meaningful representations. The use of latent space topography provided a comprehensive understanding of the models' predictive algorithms. These results affirmed that neural networks could not only predict but also elucidate complex scientific phenomena, paving the way for future discoveries.