Deep Neural Networks — Readings II for Lesson 10

For a course by Geoffrey Hinton on Coursera

Review & summary of — Improving neural networks by preventing co-adaptation of feature detectors
deep learning
neural networks
notes
coursera
dropout
paper
review
Author

Oren Bochman

Published

Tuesday, October 3, 2017

Reading Improving neural networks by preventing co-adaptation of feature detectors

In the paper (Hinton et al. 2012) the authors discuss using dropout as a regularization mechanism to reduce overfitting in deep neural networks.


Abstract

When a large feed forward neural network is trained on a small training set, it typically performs poorly on held-out test data. This “overfitting” is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

My notes

This is a paper about using dropout as a regularization tool to prevent co-adaptation of nodes within parts of the neural network. As I see it, if the network has sufficient capacity it will memorize all the training data and then perform rather poorly on the holdout data and in real-world inference. What happens during overfitting is that the network learns both the signal and the noise. In general the law of large numbers works in our favour: since the signal is stronger than the noise, we do not initially overfit. However, as the remaining unlearned signal becomes rarer, it becomes harder for the model to separate it from the noise, because rare signals will tend to appear less often than certain common noise patterns. Most regularization techniques try to boost the signal relative to the noise. Dropout does this by effectively reducing the network’s capacity during training and making the network overall less cohesive: it forces the network to create redundant components that rely less on other units.

A second regularization is also used: instead of an L2 penalty on the whole weight vector, an upper bound is placed on the L2 norm of the incoming weight vector of each hidden unit, and if a weight update violates the constraint the weights are renormalized. This is motivated by a wish to start with a high learning rate, which would otherwise lead to very large weights. Intuitively, this should allow the net to initially benefit from the stronger signal while reserving more opportunity for later epochs to leave their mark.
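
To make the two mechanisms concrete, here is a minimal NumPy sketch of one SGD step for a toy one-hidden-layer net, with 50% dropout on the hidden units and a max-norm constraint on each unit’s incoming weight vector. This is my own illustration, not the paper’s code; the shapes, learning rate and max_norm value are arbitrary assumptions.

```python
# Minimal sketch (not the paper's code): Bernoulli dropout on hidden units
# during training, plus a max-norm constraint on each hidden unit's
# incoming weight vector.
import numpy as np

rng = np.random.default_rng(0)

def train_step(W1, W2, x, y, lr=0.5, p_drop=0.5, max_norm=3.0):
    """One SGD step on a toy one-hidden-layer regression net with dropout.

    Shapes: x (d,), y (o,), W1 (h, d), W2 (o, h).
    """
    h = np.maximum(0.0, W1 @ x)            # hidden activations (ReLU)
    mask = rng.random(h.shape) > p_drop    # randomly omit ~half the units
    h_d = h * mask                         # dropped-out hidden vector
    y_hat = W2 @ h_d                       # linear output
    err = y_hat - y                        # gradient of 0.5 * squared error

    # Backpropagate through the dropped-out network only.
    gW2 = np.outer(err, h_d)
    gh = (W2.T @ err) * mask * (h > 0)
    gW1 = np.outer(gh, x)

    W1 -= lr * gW1
    W2 -= lr * gW2

    # Max-norm: cap the L2 norm of each hidden unit's incoming weights,
    # rescaling only the rows that violate the constraint.
    norms = np.linalg.norm(W1, axis=1, keepdims=True)
    W1 *= np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W1, W2
```
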
At test time the full network is used, but with the outgoing weights of each hidden unit halved. The authors claim that dropout is equivalent to averaging the predictions of a very large number of random networks that share weights. On how this relates to Bayesian model averaging, they note:
“Dropout is considerably simpler to implement than Bayesian model averaging which weights each model by its posterior probability given the training data. For complicated model classes, like feed forward neural networks, Bayesian methods typically use a Markov chain Monte Carlo method to sample models from the posterior distribution (14). By contrast, dropout with a probability of 0.5 assumes that all the models will eventually be given equal importance in the combination but the learning of the shared weights takes this into account.”
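
As a quick sanity check on the averaging claim, here is a small sketch (illustrative shapes and seed, my own construction) comparing the “mean network”, i.e. the full network with each hidden unit’s outgoing weights halved, against an explicit Monte Carlo average over many randomly dropped-out subnetworks:

```python
# Compare test-time weight halving with explicit averaging of random
# 50% subnetworks on a toy one-hidden-layer net.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(32, 10))   # input -> hidden
W2 = rng.normal(size=(1, 32))    # hidden -> output
x = rng.normal(size=10)

def forward(W1, W2, x, mask=None):
    h = np.maximum(0.0, W1 @ x)
    if mask is not None:
        h = h * mask                 # drop hidden units
    return W2 @ h

# "Mean network": keep every hidden unit, halve its outgoing weights.
mean_net = forward(W1, 0.5 * W2, x)

# Monte Carlo estimate: average many random 50% subnetworks.
samples = [forward(W1, W2, x, rng.random(32) > 0.5) for _ in range(10_000)]
mc_avg = np.mean(samples, axis=0)

# Nearly identical here (exact in expectation when a single dropout layer
# feeds a linear output); for deeper nets the halving trick is only an
# approximation to the average.
print(mean_net, mc_avg)
```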

My thoughts are that we should be able to do better than this version of dropout.

  • Shortcomings:

      • Dropout on whole units can render the net very poor.

      • Dropout slows training down, since we don’t update half the units and probably a large number of the weights.

      • For different architectures (CNNs, RNNs, etc.) dropout might work better on units that correspond to larger structures.

      • We should track dropout-related statistics to better understand the confidence of the model.

  • A second idea: the gated mixture of experts used a gating network to assign each training case to its expert. If we want the network to make better use of its capacity, perhaps we should introduce some correlation between the dropped-out nodes and the data. Could we develop a gated dropout?

  1. Start with some combination $\binom{n}{k}$ of dropout masks, where $k$ is the number of mini-batches (i.e. $|\text{training set}| / \text{minibatch size}$). We use the same dropout mask for each mini-batch, then switch (see the sketch after this list).
  2. Each epoch we should switch our mini-batches. We may want to start with maximally heterogeneous batches and, in subsequent epochs, pick ever more heterogeneous batches. We could do this by shuffling the batches: take out a portion of each mini-batch proportional to its error rate, shuffle, and return it, so that the worst mini-batches get changed most often.
  3. When we switch, we score the error for each mini-batch/dropout-mask combination and try to reduce it by shuffling between all mini-batches with similar error rates; the lower the error, the smaller the shuffles. In each epoch we want to assign a net to each combination.
  4. Ideally we would like to learn how to gate training cases to specific dropout masks, or to masks within certain symmetry groups of known masks (i.e. related to a large number of dropout combinations). In the spirit of “full Bayesian learning” we may want to learn a posterior distribution, i.e. build a correlation matrix between training cases and dropout combinations. If there were a structure like an orthogonal array over the masks, we might be able to collect this kind of data in a minimal number of steps.
  5. We could use abstract algebra, e.g. group theory, to design a network/dropout/mini-batching symmetry mechanism.
  6. We should construct a mini-batch shuffle group and a dropout group or ring. We could also select an architecture that makes sense for the chosen group structure.
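
A speculative sketch of ideas 1-3 above (entirely my own construction; train_step is a hypothetical callback that runs one update with a given dropout mask and returns the batch error):

```python
# Assign a fixed dropout mask to each mini-batch, reuse it for every case
# in that batch, and re-pair masks with batches between epochs.
import numpy as np

rng = np.random.default_rng(2)

def make_batch_masks(n_batches, n_hidden, p_drop=0.5):
    """One Bernoulli mask per mini-batch; every case in a batch shares it."""
    return rng.random((n_batches, n_hidden)) > p_drop

def run_epoch(batches, masks, train_step):
    """Pair mini-batch i with mask i for the whole epoch, recording errors."""
    return np.array([train_step(xb, yb, m)
                     for (xb, yb), m in zip(batches, masks)])

def reshuffle_masks(masks, errors):
    """Re-pair masks and batches between epochs. A biased scheme (ideas 2-3)
    would reshuffle high-error batches more aggressively; this placeholder
    simply applies a random permutation."""
    return masks[rng.permutation(len(masks))]
```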

Further Reading

c.f. Gal and Ghahramani (2016), which interprets dropout as an approximate Bayesian method and uses it to estimate model uncertainty.
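
A minimal sketch of that idea (Monte Carlo dropout), assuming a hypothetical predict_with_dropout function that runs a forward pass with dropout still turned on:

```python
# Keep dropout active at test time and use the spread of repeated
# stochastic forward passes as a rough uncertainty estimate.
import numpy as np

def mc_dropout_predict(predict_with_dropout, x, n_samples=100):
    """Return the mean prediction and its predictive spread under dropout."""
    samples = np.stack([predict_with_dropout(x) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)
```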

My wrap up 🎬

There has been a lot of progress in training single models for multiple tasks; c.f. Lukasz Kaiser et al. (2017), One Model To Learn Them All, covered in the video One Neural network learns EVERYTHING?!. That model uses a mixture-of-experts layer that comes from related work: Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean (2017), Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, in which a mixture of experts is used within large neural networks.

References

Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning, edited by Maria Florina Balcan and Kilian Q. Weinberger, 48:1050–59. Proceedings of Machine Learning Research. New York, New York, USA: PMLR. https://proceedings.mlr.press/v48/gal16.html.
Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors.” https://doi.org/10.48550/arXiv.1207.0580.

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:
@online{bochman2017,
  author = {Bochman, Oren},
  title = {Deep {Neural} {Networks} -\/-\/- {Readings} {II} for {Lesson}
    10},
  date = {2017-10-03},
  url = {https://orenbochman.github.io/notes/dnn/dnn-10/r2.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2017. “Deep Neural Networks --- Readings II for Lesson 10.” October 3, 2017. https://orenbochman.github.io/notes/dnn/dnn-10/r2.html.