This paper was mentioned in Geoffrey Hinton’s Coursera course as a way to simplify neural networks.
The main takeaway is modeling the distribution of the weights with a mixture of Gaussians, which clusters the weights and penalizes the complexity of the model.
TL;DR
The primary aim of the paper (Nowlan and Hinton 1992) is to reduce the complexity of neural networks by placing a mixture-of-Gaussians prior on the weights, creating a “soft” weight-sharing mechanism. Instead of simply penalizing large weights (as in L2 regularization), this method clusters the weights, allowing some to stay close to zero and others to remain non-zero, depending on their usefulness. Soft weight sharing, used along with weight decay, improves generalization and makes the model more interpretable.
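To make this concrete, below is a minimal sketch of the mixture-of-Gaussians complexity penalty, written in PyTorch. The function name and the `log_pi` / `log_sigma` parameterization are my own choices for numerical stability, not the paper's notation; the key point, as in the paper, is that the mixture parameters are trained jointly with the network weights.

```python
import math
import torch

def soft_weight_sharing_penalty(weights, log_pi, mu, log_sigma):
    """Negative log-likelihood of the weights under a K-component Gaussian mixture.

    weights   : 1-D tensor holding all network weights, flattened
    log_pi    : (K,) unnormalized log mixing proportions (learned)
    mu        : (K,) component means (learned)
    log_sigma : (K,) component log standard deviations (learned)
    """
    w = weights.unsqueeze(1)                    # (N, 1), broadcasts against (K,)
    sigma = log_sigma.exp()
    log_mix = torch.log_softmax(log_pi, dim=0)  # normalized log mixing proportions
    # log N(w_i | mu_k, sigma_k^2) for every weight/component pair -> (N, K)
    log_gauss = -0.5 * ((w - mu) / sigma) ** 2 - log_sigma - 0.5 * math.log(2 * math.pi)
    # log p(w_i) = logsumexp_k [ log pi_k + log N(w_i | mu_k, sigma_k^2) ]
    log_p = torch.logsumexp(log_mix + log_gauss, dim=1)
    return -log_p.sum()
```

In a training loop this penalty is simply added (with some weighting factor) to the data-misfit loss, and the optimizer updates `log_pi`, `mu`, and `log_sigma` together with the network weights.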
Abstract
One way of simplifying neural networks so they generalize better is to add an extra term to the error function that will penalize complexity. Simple versions of this approach include penalizing the sum of the squares of the weights or penalizing the number of nonzero weights. We propose a more complicated penalty term in which the distribution of weight values is modeled as a mixture of multiple Gaussians. A set of weights is simple if the weights have high probability density under the mixture model. This can be achieved by clustering the weights into subsets with the weights in each cluster having very similar values. Since we do not know the appropriate means or variances of the clusters in advance, we allow the parameters of the mixture model to adapt at the same time as the network learns. Simulations on two different problems demonstrate that this complexity term is more effective than previous complexity terms.
This notion of clustering weights is odd, to say the least, since these are just numbers in a data structure. Viewed as a method to reduce the effective number of parameters in the model, it makes some convoluted sense. The idea seems to boil down to prioritizing neural net architectures with some abstract symmetry in the weights, which have lower capacity and are therefore less prone to overfitting.
- We shall soon see that the authors have attempted to motivate this idea in at least two ways:
  - Weight decay - the penalty is a function of the weights themselves, based on (Plaut, Nowlan, and Hinton 1986).
  - A Bayesian perspective - the penalty is the negative log density of the weights under a Gaussian prior (see the equations after this list).
- It also helps to know that mixture models are often used for clustering in unsupervised learning.
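To spell out the connection between these two motivations (my notation, not the paper's): a single zero-mean Gaussian prior on each weight recovers plain weight decay, while a mixture of Gaussians gives the soft weight-sharing complexity term.

$$
-\log \mathcal{N}(w_i \mid 0, \sigma^2) = \frac{w_i^2}{2\sigma^2} + \text{const},
\qquad
-\log p(\mathbf{w}) = -\sum_i \log \sum_j \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2).
$$

Minimizing the first expression is exactly an L2 penalty on the weights; the second is the complexity cost the paper adds to the error function, with the mixing proportions $\pi_j$, means $\mu_j$, and variances $\sigma_j^2$ adapting as the network learns.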
A few quandaries then arise:
- How do we account for weights, gradients, and learning rates being more correlated within a layer than between layers?
- There may be other structure, so that the weights are not independent of each other:
  - In classifiers they act as continuous approximations of logic gates.
  - In regression settings their values approximate continuous variables?
- In many networks most of the weights are in the last layer, so we can use a different penalty for the last layer.
- Is there a way to impose an abstract symmetry on the weights of a neural network such that it is commensurate with the problem?
- Can we impose multiple such symmetries on the network to give it other advantages?
  - invariance to certain transformations,
  - using it for initialization,
  - making the model more interpretable.
- Once we learn this mixture distribution of the weights, can we use its parameters in batch normalization, layer norm, and with other regularization techniques like dropout?
The problem
The main problem in this paper is the central question of supervised ML: how do we train a model so that it generalizes well on unseen data?
In deep learning this problem is exacerbated by the fact that neural networks require fitting lots of parameters while the data for training is limited. This naturally leads to overfitting - memorizing the data and noise rather than learning the underlying data generating process.
The paper
Resources
- Article on using weight constraints to reduce generalization error
- The paper is available at https://www.cs.utoronto.ca/~hinton/absps/sunspots.pdf
An afterthought
Can we use Bayesian RL to tune the hyper-parameters of the model and the dataset? We could perhaps create an RL algorithm that controls the many aspects of training a model. It could explore/exploit different setups on subsets of the data, and find variants that converge faster and are more robust by adding constraints at different levels. It could also identify problems in the datasets (possibly bad labels, etc.), and manage ensembles, mixtures of experts, different regularization strategies, and different learning rates and schedules, globally or per layer.
Citation
@online{bochman2022,
author = {Bochman, Oren},
title = {Simplifying {Neural} {Networks} by Soft Weight Sharing},
date = {2022-06-22},
url = {https://orenbochman.github.io/reviews/1991/simplifing-NN-by-soft-weight-sharing/},
langid = {en}
}