Introduction
If we think of interpretability as a kind of “anatomy of neural networks,” most of the circuits thread has involved studying tiny little veins – looking at the small-scale, at individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn’t address.
In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures: individual organs like the heart, or entire organ systems like the respiratory system. And so we wonder: is there a “respiratory system” or “heart” or “brain region” of an artificial neural network? Do neural networks have any emergent structures that we could study that are larger-scale than circuits?
This article describes branch specialization, one of three larger “structural phenomena” we’ve been able to observe in neural networks. (The other two, equivariance and weight banding, have separate dedicated articles.) Branch specialization occurs when neural network layers are split up into branches. The neurons and circuits tend to self-organize, clumping related functions into each branch and forming larger functional units – a kind of “neural network brain region.” We find evidence that these structures implicitly exist in neural networks without branches, and that branches are simply reifying structures that otherwise exist.
The earliest example of branch specialization that we’re aware of comes from AlexNet (Krizhevsky et al., 2012).
The first two layers of AlexNet are split into two branches which can’t communicate until they rejoin after the second layer. This structure was used to maximize the efficiency of training the model on two GPUs, but the authors noticed something very curious happened as a result. The neurons in the first layer organized themselves into two groups: black-and-white Gabor filters formed on one branch and low-frequency color detectors formed on the other branch.
Although the first layer of AlexNet is the only example of branch specialization we’re aware of being discussed in the literature, it seems to be a common phenomenon. We find that branch specialization happens in later hidden layers, not just the first layer. It occurs in both low-level and high-level features. It occurs in a wide range of models, including places you might not expect it – for example, residual blocks in resnets can functionally be branches and specialize. Finally, branch specialization appears to surface as a structural phenomenon in plain convolutional nets, even without any particular structure causing it.
Is there a large-scale structure to how neural networks operate? How are features and circuits organized within the model? Does network architecture influence the features and circuits that form? Branch specialization hints at an exciting story related to all of these questions.
What is a branch?
Many neural network architectures have branches: sequences of layers which temporarily don’t have access to the “parallel” information flowing through other branches, even though that information is still passed to later layers.
In the past, models with explicitly-labeled branches were popular (such as AlexNet and the Inception family of networks). More recently, branches often exist implicitly: for example, residual networks can be rewritten as branched models, as we discuss below.
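To make the branch structure concrete, here is a minimal PyTorch sketch of an AlexNet-style split; the module name, channel counts, and kernel sizes are illustrative rather than AlexNet’s exact configuration. The two branches only rejoin, by channel concatenation, after their second layer:

    import torch
    import torch.nn as nn

    class TwoBranchStem(nn.Module):
        """Sketch of a two-branch stem: two parallel sequences of layers
        that cannot exchange information until they are concatenated."""

        def __init__(self, in_channels=3, branch_channels=48):
            super().__init__()

            def branch():
                return nn.Sequential(
                    nn.Conv2d(in_channels, branch_channels, 11, stride=4),
                    nn.ReLU(),
                    nn.Conv2d(branch_channels, branch_channels, 5, padding=2),
                    nn.ReLU(),
                )

            self.branch_a = branch()
            self.branch_b = branch()

        def forward(self, x):
            # Each branch sees only the shared input; later layers see both.
            return torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)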
The implicit branching of residual networks has some important nuances. At first glance, every layer is a two-way branch. But because the branches are combined together by addition, we can actually rewrite the model to reveal that the residual blocks can be understood as branches in parallel:
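Here is a small sketch of that rewrite, using two hypothetical residual blocks (stand-ins, not taken from any particular ResNet), showing that each block only reads from and adds back into a single shared sum, so the blocks behave like branches joined by addition:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Two hypothetical residual blocks; any functions of the right shape would do.
    f1 = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
    f2 = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

    x = torch.randn(4, 8)

    # Sequential view: each block transforms the running activations.
    y = x + f1(x)
    y = y + f2(y)

    # "Parallel branch" view: a single additive stream that every block reads
    # from and writes back into. Block 2's input is just the input plus the
    # outputs of all earlier blocks, so blocks communicate only through the sum.
    stream = x + f1(x) + f2(x + f1(x))

    assert torch.allclose(y, stream)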
We typically see residual blocks specialize in very deep residual networks (e.g. ResNet-152). One hypothesis for why this happens is that, in these models, the exact depth of a layer doesn’t matter much, and the branching aspect becomes more important than the sequential aspect.
One of the conceptual weaknesses of normal branching models is that, although branches can save parameters, mixing values between branches still requires a lot of parameters. However, if you buy the branch interpretation of residual networks, you can see them as a strategy to sidestep this: residual networks intermix branches (effectively block-sparse weights) with low-rank connections (projecting all the blocks into the same sum and then back up). This seems like a really elegant way to handle branching. More practically, it suggests that analysis of residual networks might be well served by paying close attention to the units in the blocks, and that we might expect the residual stream to be unusually polysemantic.
Why does branch specialization occur?
Branch specialization is defined by features organizing between branches. In a normal layer, features are organized randomly: a given feature is equally likely to be found at any neuron in the layer. But in a branched layer, we often see features of a given type cluster in one branch: that branch has specialized in that type of feature.
How does this happen? Our intuition is that there’s a positive feedback loop during training: if features of one type are slightly more prevalent on one branch, related features gain more by forming on that same branch, where they can most easily connect to their inputs and outputs, which in turn makes that branch an even better home for that type of feature.
Another way to think about this is that if you need to cut a neural network into pieces that have limited ability to communicate with each other, it makes sense to organize similar features close together, because they probably need to share more information.
Branch specialization beyond the first layer
So far, the only concrete example we’ve shown of branch specialization is the first and second layer of AlexNet. What about later layers? AlexNet also splits its later layers into branches, after all. This seems to be unexplored, since studying features after the first layer is much harder.
Unfortunately, branch specialization in the later layers of AlexNet is also very subtle. Instead of one overall split, there are dozens of small clusters of neurons, each cluster assigned to a branch. It’s hard to be confident that one isn’t just seeing patterns in noise.
But other models have very clear branch specialization in later layers. This tends to happen when a branch constitutes only a very small fraction of a layer, either because there are many branches or because one is much smaller than others. In these cases, the branch can specialize on a very small subset of the features that exist in a layer and reveal a clear pattern.
For example, most of InceptionV1’s layers have a branched structure. The branches have varying numbers of units, and varying convolution sizes. The 5×5 branch is the smallest branch, and also has the largest convolution size. It’s often very specialized:
This is exceptionally unlikely to have occurred by chance.
For example, all 9 of the black-and-white vs. color detectors in mixed3a are in mixed3a_5x5, despite it being only 32 of the 256 neurons in the layer. The probability of that happening by chance is less than 1 in 10⁸. For a more extreme example, all 30 of the curve-related features in mixed3b are in mixed3b_5x5, despite it being only 96 of the 480 neurons in the layer. The probability of that happening by chance is less than 1 in 10²⁰.
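These odds can be checked with a short calculation: under the null model that features are assigned to neurons uniformly at random, the chance that every feature of a given type lands in one particular branch is a simple ratio of binomial coefficients. A quick sketch, using the counts quoted above:

    from math import comb

    def p_all_in_branch(k, branch_size, layer_size):
        """Chance that all k features of a given type land in one specific
        branch, if features were assigned to neurons uniformly at random."""
        return comb(branch_size, k) / comb(layer_size, k)

    # mixed3a: 9 black-and-white vs. color detectors; the 5x5 branch has 32 of 256 units.
    print(p_all_in_branch(9, 32, 256))    # roughly 2.5e-9, under 1 in 10^8

    # mixed3b: 30 curve-related features; the 5x5 branch has 96 of 480 units.
    print(p_all_in_branch(30, 96, 480))   # roughly 2e-23, under 1 in 10^20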
It’s worth noting one confounding factor which might be influencing the specialization. The 5×5 branches are the smallest branches, but they also have larger convolutions (5×5 instead of 3×3 or 1×1) than their neighbors. Is it the small branch size, or the larger convolution size, that draws the black-and-white vs. color detectors into the mixed3a_5x5 branch and the curve detectors into the mixed3b_5x5 branch?
Why is branch specialization consistent?
Perhaps the most surprising thing about branch specialization is that the same branch specializations seem to occur again and again, across different architectures and tasks.
For example, the branch specialization we observed in AlexNet – the first layer specializing into a black-and-white Gabor branch vs. a low-frequency color branch – is a surprisingly robust phenomenon. It occurs consistently if you retrain AlexNet. It also occurs if you train other architectures with the first few layers split into two branches. It even occurs if you train those models on other natural image datasets, like Places instead of ImageNet. Anecdotally, we also seem to see other types of branch specialization recur. For example, branches that specialize in curve detection appear to be quite common (although InceptionV1’s mixed3b_5x5 is the only one we’ve carefully characterized).
So, why do the same branch specializations occur again and again?
One hypothesis seems very tempting. Notice that many of the same features that form in normal, non-branched models also seem to form in branched models. For example, the first layer of both branched and non-branched models contain Gabor filters and color features. If the same features exist, presumably the same weights exist between them.
Could it be that branching is just surfacing a structure that already exists? Perhaps there are two different subgraphs in the weights between the first and second conv layers of a normal model, with relatively small weights between them, and when you train a branched model these two subgraphs latch onto the branches.
(This would be directionally similar to work finding modular substructures in the weights of ordinary, non-branched networks, such as Filan et al. and Csordás et al.)
To test this, let’s look at models which have non-branched first and second convolutional layers. Let’s take the weights between them and perform a singular value decomposition (SVD) on the absolute values of the weights. This will show us the main factors of variation in which neurons connect to different neurons in the next layer (irrespective of whether those connections are excitatory or inhibitory).
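Here is a minimal NumPy sketch of that analysis. The weight tensor w below is a random stand-in with a hypothetical shape of (second-layer units, first-layer units, kernel height, kernel width); in a real analysis it would be loaded from the trained model, and the loading code depends on the framework.

    import numpy as np

    rng = np.random.default_rng(0)
    # Random stand-in for the trained kernel between the first two conv layers.
    w = rng.standard_normal((64, 64, 3, 3))

    # Take absolute values (connection strength, ignoring whether a weight is
    # excitatory or inhibitory) and collapse the spatial dimensions, leaving
    # one row per second-layer unit and one column per first-layer unit.
    a = np.abs(w).sum(axis=(2, 3))          # shape: (64, 64)

    # SVD of this matrix: the top right-singular vectors are the main factors
    # of variation in how first-layer units connect to second-layer units.
    # (They are of course meaningless for this random stand-in.)
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    top_factor = vt[0]                      # one value per first-layer unit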
Sure enough, the first singular vector (the largest factor of variation) of the weights between the first two convolutional layers of InceptionV1 is color.
We also see that the second factor appears to be frequency. This suggests an interesting prediction: perhaps if we were to split the layer into more than two branches, we’d also observe specialization in frequency in addition to color.
This seems like it may be true. For example, here we see a high-frequency black-and-white branch, a mid-frequency mostly black-and-white branch, a mid-frequency color branch, and a low-frequency color branch.
Parallels to neuroscience
We’ve shown that branch specialization is one example of a structural phenomenon – a larger-scale structure in a neural network. It happens in a variety of situations and neural network architectures, and it happens consistently: certain motifs of specialization, such as color, frequency, and curves, recur across different architectures and tasks.
Returning to our comparison with anatomy, although we hesitate to claim explicit parallels to neuroscience, it’s tempting to draw analogies between branch specialization and the existence of regions of the brain focused on particular tasks.
The visual cortex, the auditory cortex, Broca’s area and Wernicke’s area – these are all examples of brain areas with such consistent specialization across wide populations of people that neuroscientists and psychologists have been able to characterize them as having remarkably consistent functions.

The subspecialization within the V2 area of the primate visual cortex is another strong example from neuroscience: one type of stripe within V2 is sensitive to orientation or luminance, whereas the other type of stripe contains color-selective neurons. We are grateful to Patrick Mineault for noting this analogy, and for further noting that the high-frequency features are consistent with some of the known representations of high-level features in the primate V2 area.
As researchers without expertise in neuroscience, we’re uncertain how useful this connection is, but it may be worth considering whether branch specialization can be a useful model of how specialization might emerge in biological neural networks.
Author Contributions
As with many scientific collaborations, the contributions are difficult to separate because it was a collaborative effort that we wrote together.
Research. The phenomenon of branch specialization was initially observed by Chris Olah. Chris also developed the weight PCA experiments suggesting that it implicitly occurs in non-branched models. This investigation was done in the context of, and informed by, collaborative research into circuits by Nick Cammarata, Gabe Goh, Chelsea Voss, Ludwig Schubert, and Chris. Chelsea and Nick contributed to framing this work in terms of the importance of larger-scale structures on top of circuits.
Infrastructure. Branch specialization was only discovered because an early version of Microscope by Ludwig Schubert made it easy to browse the neurons that exist at certain layers. Michael Petrov, Ludwig and Nick built a variety of infrastructural tools which made our research possible.
Writing and Diagrams. Chelsea wrote the article, based on an initial article by Chris and with Chris’s help. Diagrams were illustrated by both Chelsea and Chris.
Acknowledgments
We are grateful to Brice Ménard for pushing us to investigate whether we can find larger-scale structures such as the one investigated here.
We are grateful to participants of #circuits in the Distill Slack for their engagement on this article, and especially to Alex Bäuerle, Ben Egan, Patrick Mineault, Matt Nolan, and Vincent Tjeng for their remarks on a first draft. We’re grateful to Patrick Mineault for noting the neuroscience comparison to subspecialization within primate V2.
References
- Imagenet classification with deep convolutional neural networks. Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Advances in Neural Information Processing Systems, Vol 25, pp. 1097–1105.
- Visualizing higher-layer features of a deep network. Erhan, D., Bengio, Y., Courville, A. and Vincent, P., 2009. University of Montreal, Vol 1341, pp. 3.
- Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan, K., Vedaldi, A. and Zisserman, A., 2013. arXiv preprint arXiv:1312.6034.
- Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. Nguyen, A., Yosinski, J. and Clune, J., 2016. arXiv preprint arXiv:1602.03616.
- Feature Visualization. Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
- Going deeper with convolutions. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- Neural architecture search with reinforcement learning. Zoph, B. and Le, Q.V., 2016. arXiv preprint arXiv:1611.01578.
- Neural networks are surprisingly modular. Filan, D., Hod, S., Wild, C., Critch, A. and Russell, S., 2020. arXiv preprint arXiv:2003.04881.
- Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks. Csordás, R., Steenkiste, S.v. and Schmidhuber, J., 2020.
- Segregation of form, color, and stereopsis in primate area 18. Hubel, D. and Livingstone, M., 1987. Journal of Neuroscience, Vol 7(11), pp. 3378–3415. DOI: 10.1523/JNEUROSCI.07-11-03378.1987
- Representation of Angles Embedded within Contour Stimuli in Area V2 of Macaque Monkeys. Ito, M. and Komatsu, H., 2004. Journal of Neuroscience, Vol 24(13), pp. 3313–3324. DOI: 10.1523/JNEUROSCI.4364-03.2004
Updates and Corrections
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Reuse
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution in academic contexts, please cite this work as
Voss, et al., "Branch Specialization", Distill, 2021.
BibTeX citation
@article{voss2021branch,
  author = {Voss, Chelsea and Goh, Gabriel and Cammarata, Nick and Petrov, Michael and Schubert, Ludwig and Olah, Chris},
  title = {Branch Specialization},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2020/circuits/branch-specialization},
  doi = {10.23915/distill.00024.008}
}