Simplifying Neural Networks by Soft Weight-Sharing

paper review

Published

Wednesday, June 22, 2022

Keywords

neural networks, weight sharing, pruning, weight decay

TL;DR

The primary aim of the paper (Nowlan and Hinton 1992) is to reduce the complexity of neural networks by placing a mixture-of-Gaussians prior on the weights, creating a “soft” weight-sharing mechanism. Instead of simply penalizing large weights (as in L2 regularization), this method clusters the weights, allowing some to stay close to zero and others to remain non-zero, depending on their usefulness. Soft weight sharing, used together with weight decay, improves generalization and makes the model more interpretable.

This paper is mentioned in Geoffrey Hinton’s Coursera course as a way to simplify neural networks. The key takeaway is modeling the distribution of weight values as a mixture of Gaussians, which clusters the weights and penalizes the complexity of the model.
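To make the penalty concrete, here is a minimal NumPy sketch (my own illustration, not the paper’s code) of the complexity term: the negative log-likelihood of the weights under a mixture of Gaussians, which would be added to the task loss with some coefficient. The component parameters shown are made-up toy values.

```python
import numpy as np

def mixture_penalty(weights, pi, mu, sigma):
    """Soft weight-sharing penalty: negative log-likelihood of each
    weight under a mixture of Gaussians, summed over all weights."""
    w = weights.reshape(-1, 1)                                # (n_weights, 1)
    # Density of each weight under each Gaussian component
    dens = (pi / (np.sqrt(2 * np.pi) * sigma)
            * np.exp(-(w - mu) ** 2 / (2 * sigma ** 2)))      # (n_weights, k)
    return -np.sum(np.log(dens.sum(axis=1)))

pi = np.array([0.7, 0.3])     # mixing proportions (toy values)
mu = np.array([0.0, 1.0])     # component means: one near zero, one non-zero
sigma = np.array([0.1, 0.1])  # component standard deviations

# Weights tightly clustered around the component means score a much
# lower penalty than weights scattered between the clusters.
clustered = np.array([0.01, -0.02, 0.98, 1.01, 0.0])
scattered = np.array([0.4, -0.5, 0.5, 1.5, 0.3])
p_clustered = mixture_penalty(clustered, pi, mu, sigma)
p_scattered = mixture_penalty(scattered, pi, mu, sigma)
```

The total training objective would then be something like `task_loss + lam * mixture_penalty(...)`, with `lam` trading off data fit against weight simplicity.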

Abstract

One way of simplifying neural networks so they generalize better is to add an extra term to the error function that will penalize complexity. Simple versions of this approach include penalizing the sum of the squares of the weights or penalizing the number of nonzero weights. We propose a more complicated penalty term in which the distribution of weight values is modeled as a mixture of multiple Gaussians. A set of weights is simple if the weights have high probability density under the mixture model. This can be achieved by clustering the weights into subsets with the weights in each cluster having very similar values. Since we do not know the appropriate means or variances of the clusters in advance, we allow the parameters of the mixture model to adapt at the same time as the network learns. Simulations on two different problems demonstrate that this complexity term is more effective than previous complexity terms.

(Nowlan and Hinton 1992)
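Since the cluster means and variances are not known in advance, they adapt while the network trains. The paper does this jointly by gradient descent; as an illustrative sketch only (not the paper’s procedure), an EM-style re-estimation of the mixture parameters from the current weights shows what the adaptation converges to on toy data.

```python
import numpy as np

def update_mixture(weights, pi, mu, sigma):
    """One EM-style re-estimation of the Gaussian mixture over weights.
    Illustrative only: the paper adapts these parameters by gradient
    descent jointly with the network weights."""
    w = weights.reshape(-1, 1)                               # (n, 1)
    dens = (pi / (np.sqrt(2 * np.pi) * sigma)
            * np.exp(-(w - mu) ** 2 / (2 * sigma ** 2)))     # (n, k)
    r = dens / dens.sum(axis=1, keepdims=True)               # responsibilities
    pi_new = r.mean(axis=0)                                  # mixing proportions
    mu_new = (r * w).sum(axis=0) / r.sum(axis=0)             # component means
    var = (r * (w - mu_new) ** 2).sum(axis=0) / r.sum(axis=0)
    return pi_new, mu_new, np.sqrt(var)

# Toy weights forming two clusters, near 0 and near 1
weights = np.array([0.02, -0.01, 0.03, 0.97, 1.02])
pi = np.array([0.5, 0.5])
mu = np.array([-0.2, 1.2])
sigma = np.array([0.3, 0.3])
for _ in range(10):
    pi, mu, sigma = update_mixture(weights, pi, mu, sigma)
# The means migrate to the two weight clusters, so the penalty ends up
# pulling each weight toward a shared value -- the "soft" sharing.
```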

The perplexing idea of clustering weights and some associated quandaries

This notion of clustering weights is odd, to say the least, since weights are just numbers in a data structure. Viewed as a method to reduce the effective number of parameters in the model, however, it begins to make more sense. The idea seems to boil down to prioritizing neural network architectures with some abstract symmetry in the weights, and thus a lower capacity, making them less prone to overfitting.

A few quandaries then arise:

  1. How can we account for weights, gradients, and learning rates being more correlated within a layer than between layers?
  2. There may be other structure, so that the weights are not independent of each other.
    1. In classifiers they are continuous approximations of logic gates.
    2. In regression settings their values approximate continuous variables?
  3. In many networks most of the weights are in the last layer, so we could use a different penalty for the last layer.
  4. Is there a way to impose an abstract symmetry on the weights of a neural network that is commensurate with the problem?
  5. Can we impose multiple such symmetries on the network to give it other advantages?
    • invariance to certain transformations,
    • using it for initialization,
    • making the model more interpretable,
    • once we have learned this mixture distribution of weights, can we use its parameters in batch normalization, layer normalization, and with other regularization techniques like dropout?

The problem:

The main problem in this paper is that of supervised ML:

How to train a model so it will generalize well on unseen data?

In deep learning this problem is exacerbated by the fact that neural networks require fitting lots of parameters while the data available for training is limited. This naturally leads to overfitting: memorizing the data and noise rather than learning the underlying data-generating process.

The paper


Resources

An afterthought

Can we use Bayesian RL to tune the hyper-parameters of the model and dataset? We could perhaps create an RL algorithm that controls the many aspects of training a model. It could explore/exploit different setups on subsets of the data, find variants that converge faster and are more robust by adding constraints at different levels, and identify problems in the dataset (possibly bad labels, etc.). It could also manage ensembles, mixtures of experts, different regularization strategies, and different learning rates and schedules, globally or per layer.

References

Lang, Kevin J, Alex H Waibel, and Geoffrey E Hinton. 1990. “A Time-Delay Neural Network Architecture for Isolated Word Recognition.” Neural Networks 3 (1): 23–43.
LeCun, Yann, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. 1990. “Optimal Brain Damage.” In Advances in Neural Information Processing Systems (NIPS 1989), edited by David Touretzky. Vol. 2. Denver, CO: Morgan Kaufman. https://proceedings.neurips.cc/paper_files/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf.
MacKay, David JC. 1991. “Bayesian Modeling and Neural Networks.” PhD Thesis, Dept. Of Computation and Neural Systems, CalTech. https://www.inference.org.uk/mackay/thesis.pdf.
Morgan, Nelson, and Hervé Bourlard. 1989. “Generalization and Parameter Estimation in Feedforward Nets: Some Experiments.” Advances in Neural Information Processing Systems 2.
Mozer, Michael C, and Paul Smolensky. 1989. “Using Relevance to Reduce Network Size Automatically.” Connection Science 1 (1): 3–16.
Nowlan, Steven J., and Geoffrey E. Hinton. 1992. “Simplifying Neural Networks by Soft Weight-Sharing.” Neural Computation 4 (4): 473–93. https://doi.org/10.1162/neco.1992.4.4.473.
Plaut, D. C., S. J. Nowlan, and G. E. Hinton. 1986. “Experiments on Learning by Back-Propagation.” CMU-CS-86-126. Pittsburgh, PA: Carnegie–Mellon University. https://ni.cmu.edu/~plaut/papers/pdf/PlautNowlanHinton86TR.backprop.pdf.
Weigend, Andreas S, Bernardo A Huberman, and David E Rumelhart. 1990. “Predicting the Future: A Connectionist Approach.” International Journal of Neural Systems 1 (03): 193–209.

Footnotes

  1. in the paper this citation is ambiguous, but I think this is the correct one - based on the abstract↩︎

  2. note: this answers the first of my questions above.↩︎