Branch Specialization


This article is part of the Circuits thread, an experimental format collecting invited short articles and critical commentary delving into the inner workings of neural networks.


If we think of interpretability as a kind of “anatomy of neural networks,” most of the circuits thread has involved studying tiny little veins – looking at the small-scale, at individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn’t address.

In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures: individual organs like the heart, or entire organ systems like the respiratory system. And so we wonder: is there a “respiratory system” or “heart” or “brain region” of an artificial neural network? Do neural networks have any emergent structures that we could study that are larger-scale than circuits?

This article describes branch specialization, one of three larger “structural phenomena” we’ve been able to observe in neural networks. (The other two, equivariance and weight banding, have separate dedicated articles.) Branch specialization occurs when neural network layers are split up into branches. The neurons and circuits tend to self-organize, clumping related functions into each branch and forming larger functional units – a kind of “neural network brain region.” We find evidence that these structures implicitly exist in neural networks without branches, and that branches are simply reifying structures that otherwise exist.

The earliest example of branch specialization that we’re aware of comes from AlexNet. AlexNet is famous for its leap in computer vision performance, arguably starting the deep learning revolution, but buried in the paper is a fascinating, rarely-discussed detail.

The first two layers of AlexNet are split into two branches which can’t communicate until they rejoin after the second layer. This structure was used to maximize the efficiency of training the model on two GPUs, but the authors noticed something very curious happened as a result. The neurons in the first layer organized themselves into two groups: black-and-white Gabor filters formed on one branch and low-frequency color detectors formed on the other branch.

1. Branch specialization in the first two layers of AlexNet. Krizhevsky et al. observed the phenomenon we call branch specialization in the first layer of AlexNet by visualizing their weights to RGB channels; here, we use feature visualization to show how this phenomenon extends to the second layer of each branch.

Although the first layer of AlexNet is the only example of branch specialization we’re aware of being discussed in the literature, it seems to be a common phenomenon. We find that branch specialization happens in later hidden layers, not just the first layer. It occurs in both low-level and high-level features. It occurs in a wide range of models, including places you might not expect it – for example, residual blocks in resnets can functionally be branches and specialize. Finally, branch specialization appears to surface as a structural phenomenon in plain convolutional nets, even without any particular structure causing it.

Is there a large-scale structure to how neural networks operate? How are features and circuits organized within the model? Does network architecture influence the features and circuits that form? Branch specialization hints at an exciting story related to all of these questions.

What is a branch?

Many neural network architectures have branches, sequences of layers which temporarily don’t have access to “parallel” information which is still passed to later layers.

InceptionV1 has nine sets of four-way branches called “Inception blocks.” AlexNet has several two-way branches. Residual networks aren’t typically thought of as having branches, but residual blocks can be seen as a type of branch.

2. Examples of branches in various types of neural network architectures.
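As a minimal sketch of what a branch computes, the toy layer below splits 16 channels into two 8-channel branches that cannot read each other’s inputs until their outputs are concatenated, much like AlexNet’s first two layers. (NumPy; 1×1 channel mixing stands in for convolutions, and all sizes are illustrative, not taken from any real model.)

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# A toy layer with 16 input and 16 output channels, split into two
# 8-channel branches. Each branch sees only its own half of the input
# channels until the outputs are concatenated.
x = rng.normal(size=16)         # activations for 16 channels
W_a = rng.normal(size=(8, 8))   # branch A: reads channels 0..7 only
W_b = rng.normal(size=(8, 8))   # branch B: reads channels 8..15 only

branch_a = relu(W_a @ x[:8])
branch_b = relu(W_b @ x[8:])
y = np.concatenate([branch_a, branch_b])   # the branches rejoin here

# Equivalently, the branched layer is one big layer whose weight matrix
# is block-diagonal: the cross-branch blocks are forced to zero.
W_full = np.zeros((16, 16))
W_full[:8, :8] = W_a
W_full[8:, 8:] = W_b
assert np.allclose(y, relu(W_full @ x))
```

The block-diagonal view makes the constraint explicit: branching doesn’t add computation, it removes the cross-branch weights, which is part of why related features have an incentive to end up in the same branch.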

In the past, models with explicitly-labeled branches were popular (such as AlexNet and the Inception family of networks). In more recent years, these have become less common, but residual networks – which can be seen as implicitly having branches in their residual blocks – have become very common. We also sometimes see branched architectures develop automatically in neural architecture search, an approach where the network architecture is learned.

The implicit branching of residual networks has some important nuances. At first glance, every layer is a two-way branch. But because the branches are combined together by addition, we can actually rewrite the model to reveal that the residual blocks can be understood as branches in parallel:

We typically think of residual blocks as sequential layers, each building on top of the previous one, but we can also conceptualize them as, to some extent, parallel branches reading from and writing to a shared sum via the skip connections. This means that residual blocks can potentially specialize.

3. Residual blocks as branches in parallel.
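The equivalence between the sequential and parallel views can be checked directly in the simplified case of linear residual blocks (a simplification for illustration, since real blocks are nonlinear):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
F1 = rng.normal(size=(n, n))   # residual block 1, linearized
F2 = rng.normal(size=(n, n))   # residual block 2, linearized
x = rng.normal(size=n)

# Sequential view: each block adds its output onto the running stream.
h = x + F1 @ x
y_sequential = h + F2 @ h

# Parallel view: expanding (I + F2)(I + F1)x shows each block acting on
# the input in parallel, plus a higher-order interaction term F2·F1.
y_parallel = x + F1 @ x + F2 @ x + F2 @ (F1 @ x)

assert np.allclose(y_sequential, y_parallel)
```

To the extent the higher-order interaction terms matter less than the first-order ones, a deep residual network behaves like many branches reading from and writing to a shared sum, which is why residual blocks have room to specialize.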

We typically see residual blocks specialize in very deep residual networks (e.g. ResNet-152). One hypothesis for why is that, in these models, the exact depth of a layer doesn’t matter and the branching aspect becomes more important than the sequential aspect.

One of the conceptual weaknesses of normal branching models is that although branches can save parameters, it still requires a lot of parameters to mix values between branches. However, if you buy the branch interpretation of residual networks, you can see them as a strategy to sidestep this: residual networks intermix branches (e.g. block sparse weights) with low-rank connections (projecting all the blocks into the same sum and then back up). This seems like a really elegant way to handle branching. More practically, it suggests that analysis of residual networks might be well-served by paying close attention to the units in the blocks, and that we might expect the residual stream to be unusually polysemantic.

Why does branch specialization occur?

Branch specialization is defined by features organizing between branches. In a normal layer, features are organized randomly: a given feature is just as likely to be any neuron in a layer. But in a branched layer, we often see features of a given type cluster to one branch. The branch has specialized on that type of feature.

How does this happen? Our intuition is that there’s a positive feedback loop during training.

The first part of the branch is incentivized to form features relevant to the second half, while the second half of the branch prefers features which the first half provides primitives for.

4. Hypothetical positive feedback loop of branch specialization during training.

Another way to think about this is that if you need to cut a neural network into pieces that have limited ability to communicate with each other, it makes sense to organize similar features close together, because they probably need to share more information.

Branch specialization beyond the first layer

So far, the only concrete example we’ve shown of branch specialization is the first and second layer of AlexNet. What about later layers? AlexNet also splits its later layers into branches, after all. This seems to be unexplored, since studying features after the first layer is much harder. (For the first layer, one can visualize the RGB weights; for later layers, one needs to use feature visualization.)

Unfortunately, branch specialization in the later layers of AlexNet is also very subtle. Instead of one overall split, it’s more like there are dozens of small clusters of neurons, each cluster being assigned to a branch. It’s hard to be confident that one isn’t just seeing patterns in noise.

But other models have very clear branch specialization in later layers. This tends to happen when a branch constitutes only a very small fraction of a layer, either because there are many branches or because one is much smaller than others. In these cases, the branch can specialize on a very small subset of the features that exist in a layer and reveal a clear pattern.

For example, most of InceptionV1’s layers have a branched structure. The branches have varying numbers of units, and varying convolution sizes. The 5×5 branch is the smallest branch, and also has the largest convolution size. It’s often very specialized:

mixed3a_5x5: The 5×5 branch of mixed3a, a relatively early layer, is specialized in color detection, and especially black-and-white vs. color detection. It also contains a disproportionate number of boundary, eye, and fur detectors, many of which share sub-components with curves.

mixed3b_5x5: This branch contains all 30 of the curve-related features for this layer (curves, double curves, circles, spirals, S-shapes, and more).

mixed4a_5x5: This branch appears to be specialized in complex shapes and 3D geometry detectors. We don’t have a full taxonomy of this layer to allow for a quantitative assessment.

5. Examples of branch specialization in mixed3a_5x5, mixed3b_5x5, and mixed4a_5x5.

This is exceptionally unlikely to have occurred by chance.

For example, all 9 of the black-and-white vs. color detectors in mixed3a are in mixed3a_5x5, despite it being only 32 out of the 256 neurons in the layer. The probability of that happening by chance is less than 1/10⁸. For a more extreme example, all 30 of the curve-related features in mixed3b are in mixed3b_5x5, despite it being only 96 out of the 480 neurons in the layer. The probability of that happening by chance is less than 1/10²⁰.
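These odds follow from a simple counting argument: if features were assigned to neurons uniformly at random, the probability that all k features of a given type land in a branch of b out of n neurons is C(b, k) / C(n, k). A quick sanity check in Python:

```python
from math import comb

def p_all_in_branch(n, b, k):
    """Chance that all k features of a type land in a branch of b
    neurons, out of n neurons total, under uniform random assignment."""
    return comb(b, k) / comb(n, k)

# mixed3a: all 9 BW-vs-color detectors in the 32-unit branch of 256.
p_3a = p_all_in_branch(n=256, b=32, k=9)
# mixed3b: all 30 curve-related features in the 96-unit branch of 480.
p_3b = p_all_in_branch(n=480, b=96, k=30)

assert p_3a < 1e-8    # less than 1 in 10^8
assert p_3b < 1e-20   # less than 1 in 10^20
```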

It’s worth noting one confounding factor which might be influencing the specialization: the 5×5 branches are the smallest branches, but they also have larger convolutions (5×5 instead of 3×3 or 1×1) than their neighbors. There is, however, something which suggests that the branching plays an essential role: mixed3a and mixed3b are adjacent layers which contain relatively similar features and operate at the same scale. If it were only about convolution size, why don’t we see any curves in the mixed3a_5x5 branch or color features in the mixed3b_5x5 branch?

Why is branch specialization consistent?

Perhaps the most surprising thing about branch specialization is that the same branch specializations seem to occur again and again, across different architectures and tasks.

For example, the branch specialization we observed in AlexNet – the first layer specializing into a black-and-white Gabor branch vs. a low-frequency color branch – is a surprisingly robust phenomenon. It occurs consistently if you retrain AlexNet. It also occurs if you train other architectures with the first few layers split into two branches. It even occurs if you train those models on other natural image datasets, like Places instead of ImageNet. Anecdotally, we also seem to see other types of branch specialization recur. For example, finding branches that seem to specialize in curve detection seems to be quite common (although InceptionV1’s mixed3b_5x5 is the only one we’ve carefully characterized).

So, why do the same branch specializations occur again and again?

One hypothesis seems very tempting. Notice that many of the same features that form in normal, non-branched models also seem to form in branched models. For example, the first layer of both branched and non-branched models contains Gabor filters and color features. If the same features exist, presumably the same weights exist between them.

Could it be that branching is just surfacing a structure that already exists? Perhaps there’s two different subgraphs between the weights of the first and second conv layer in a normal model, with relatively small weights between them, and when you train a branched model these two subgraphs latch onto the branches.

(This would be directionally similar to work finding modular substructures within neural networks.)

To test this, let’s look at models which have non-branched first and second convolutional layers. Let’s take the weights between them and perform a singular value decomposition (SVD) on the absolute values of the weights. This will show us the main factors of variation in which neurons connect to different neurons in the next layer (irrespective of whether those connections are excitatory or inhibitory).
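As a sketch of this recipe (NumPy; the random matrix below is only a stand-in for trained weights, which would really be loaded from a model with their spatial positions aggregated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the weights between two conv layers: 64 first-layer
# neurons by 64 second-layer neurons, spatial taps already summed out.
# Real weights would come from a trained model.
W = rng.normal(size=(64, 64))

# SVD of the *absolute* weights: we care about which neurons connect to
# which, not whether a connection is excitatory or inhibitory.
U, S, Vt = np.linalg.svd(np.abs(W))

# Each first-layer neuron gets coordinates from the top left singular
# vectors, and each second-layer neuron from the top right singular
# vectors. Plotting neurons by these coordinates reveals the largest
# factors of variation in connectivity (e.g. color vs. black-and-white).
first_layer_coords = U[:, :2]      # one 2-D point per first-layer neuron
second_layer_coords = Vt[:2, :].T  # one 2-D point per second-layer neuron
```

Neurons that land close together in these coordinates tend to connect to similar neurons in the adjacent layer, which is exactly the structure a branch would latch onto.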

Sure enough, the first singular vector (the largest factor of variation) of the weights between the first two convolutional layers of InceptionV1 corresponds to color.

Figure content: neurons in the first convolutional layer, organized by the left singular vectors of |W|, and neurons in the second convolutional layer, organized by the right singular vectors of |W|. For InceptionV1 (tf-slim version) trained on ImageNet, the first singular vector (singular vector 0, “color?”) separates color features from black-and-white ones, meaning that’s the largest dimension of variation in which neurons connect to which in the next layer; Gabor filters and color features are far apart, meaning they tend to connect to different features in the next layer. The second singular vector (singular vector 1, “frequency?”) appears to track frequency. For InceptionV1 trained on Places365, once more the first singular vector separates color and black-and-white.

6. Singular vectors for the first and second convolutional layers of InceptionV1, trained on ImageNet (above) or Places365 (below). One can think of neurons being plotted closer together in this diagram as meaning they likely tend to connect to similar neurons.

We also see that the second factor appears to be frequency. This suggests an interesting prediction: perhaps if we were to split the layer into more than two branches, we’d also observe specialization in frequency in addition to color.

This seems like it may be true. For example, here we see a high-frequency black-and-white branch, a mid-frequency mostly black-and-white branch, a mid-frequency color branch, and a low-frequency color branch.

7. We constructed a small ImageNet model with the first layer split into four branches. The rest of the model is roughly an InceptionV1 architecture.

Parallels to neuroscience

We’ve shown that branch specialization is one example of a structural phenomenon — a larger-scale structure in a neural network. It happens in a variety of situations and neural network architectures, and it happens with consistency – certain motifs of specialization, such as color, frequency, and curves, happen consistently across different architectures and tasks.

Returning to our comparison with anatomy, although we hesitate to claim explicit parallels to neuroscience, it’s tempting to draw analogies between branch specialization and the existence of regions of the brain focused on particular tasks. The visual cortex, the auditory cortex, Broca’s area, and Wernicke’s area – these are all examples of brain areas with such consistent specialization across wide populations of people that neuroscientists and psychologists have been able to characterize them as having remarkably consistent functions.

The subspecialization within the V2 area of the primate visual cortex is another strong example from neuroscience: one type of stripe within V2 is sensitive to orientation or luminance, whereas the other type contains color-selective neurons. (We are grateful to Patrick Mineault for noting this analogy, and for further noting that the high-frequency features are consistent with some of the known representations of high-level features in the primate V2 area.)

As researchers without expertise in neuroscience, we’re uncertain how useful this connection is, but it may be worth considering whether branch specialization can be a useful model of how specialization might emerge in biological neural networks.


Author Contributions

As with many scientific collaborations, the contributions are difficult to separate because it was a collaborative effort that we wrote together.

Research. The phenomenon of branch specialization was initially observed by Chris Olah. Chris also developed the weight PCA experiments suggesting that it implicitly occurs in non-branched models. This investigation was done in the context of and informed by collaborative research into circuits by Nick Cammarata, Gabe Goh, Chelsea Voss, Ludwig Schubert, and Chris. Chelsea and Nick contributed to framing this work in the importance of larger scale structures on top of circuits.

Infrastructure. Branch specialization was only discovered because an early version of Microscope by Ludwig Schubert made it easy to browse the neurons that exist at certain layers. Michael Petrov, Ludwig and Nick built a variety of infrastructural tools which made our research possible.

Writing and Diagrams. Chelsea wrote the article, based on an initial article by Chris and with Chris’s help. Diagrams were illustrated by both Chelsea and Chris.


We are grateful to Brice Ménard for pushing us to investigate whether we can find larger-scale structures such as the one investigated here.

We are grateful to participants of #circuits in the Distill Slack for their engagement on this article, and especially to Alex Bäuerle, Ben Egan, Patrick Mineault, Matt Nolan, and Vincent Tjeng for their remarks on a first draft. We’re grateful to Patrick Mineault for noting the neuroscience comparison to subspecialization within primate V2.


  1. ImageNet classification with deep convolutional neural networks
    Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Advances in neural information processing systems, Vol 25, pp. 1097–1105.
  2. Visualizing higher-layer features of a deep network
    Erhan, D., Bengio, Y., Courville, A. and Vincent, P., 2009. University of Montreal, Vol 1341, pp. 3.
  3. Deep inside convolutional networks: Visualising image classification models and saliency maps
    Simonyan, K., Vedaldi, A. and Zisserman, A., 2013. arXiv preprint arXiv:1312.6034.
  4. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks
    Nguyen, A., Yosinski, J. and Clune, J., 2016. arXiv preprint arXiv:1602.03616.
  5. Feature Visualization
    Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
  6. Going deeper with convolutions
    Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
  7. Neural architecture search with reinforcement learning
    Zoph, B. and Le, Q.V., 2016. arXiv preprint arXiv:1611.01578.
  8. Neural networks are surprisingly modular
    Filan, D., Hod, S., Wild, C., Critch, A. and Russell, S., 2020. arXiv preprint arXiv:2003.04881.
  9. Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks
    Csordás, R., Steenkiste, S.v. and Schmidhuber, J., 2020.
  10. Segregation of form, color, and stereopsis in primate area 18
    Hubel, D. and Livingstone, M., 1987. Journal of Neuroscience, Vol 7(11), pp. 3378–3415. Society for Neuroscience. DOI: 10.1523/JNEUROSCI.07-11-03378.1987
  11. Representation of Angles Embedded within Contour Stimuli in Area V2 of Macaque Monkeys
    Ito, M. and Komatsu, H., 2004. Journal of Neuroscience, Vol 24(13), pp. 3313–3324. Society for Neuroscience. DOI: 10.1523/JNEUROSCI.4364-03.2004

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.


Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.


For attribution in academic contexts, please cite this work as

Voss, et al., "Branch Specialization", Distill, 2021.

BibTeX citation

  @article{voss2021branch,
  author = {Voss, Chelsea and Goh, Gabriel and Cammarata, Nick and Petrov, Michael and Schubert, Ludwig and Olah, Chris},
  title = {Branch Specialization},
  journal = {Distill},
  year = {2021},
  note = {https://distill.pub/2020/circuits/branch-specialization},
  doi = {10.23915/distill.00024.008}
  }

