Lecture 3e: Using the derivatives computed by backpropagation
The backpropagation algorithm is an efficient way of computing the error derivative $\frac{\partial E}{\partial w}$ for every weight on a single training case. Many decisions remain about how to derive new weights from these derivatives.
- Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights? (lecture 6)
- Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training? (lecture 7)
What follows is a very brief overview of these two sets of issues.
How often to update the weights? (each option is sketched in code after the list)
- Online - after every case.
- Mini Batch - after a small sample of training cases.
- Full Batch - after a full sweep of training data.
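A minimal sketch of the three update schedules on a toy linear model with squared error; the data, the model, and names such as `grad` and `lr` are my own illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 training cases, 5 inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    """Gradient of the mean squared error, dE/dw, on a batch of cases."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

lr = 0.05

# Online: update after every single training case
w = np.zeros(5)
for xi, yi in zip(X, y):
    w -= lr * grad(w, xi[None, :], np.array([yi]))

# Mini-batch: update after a small sample of training cases
w = np.zeros(5)
for start in range(0, len(X), 10):
    Xb, yb = X[start:start + 10], y[start:start + 10]
    w -= lr * grad(w, Xb, yb)

# Full batch: one update per full sweep through the training data
w = np.zeros(5)
for epoch in range(50):
    w -= lr * grad(w, X, y)
```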
How much to update? (cf. lecture 6; two of the options are sketched after the list)
- fixed learning rate
- adaptable global learning rate
- adaptable learning rate per weight
- don’t use steepest descent (velocity/momentum/second order methods)
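A minimal sketch, on a toy quadratic error surface of my own choosing, of a fixed global learning rate versus a momentum ("velocity") update; the function `grad_E` simply stands in for whatever computes $\frac{\partial E}{\partial w}$.

```python
import numpy as np

def grad_E(w):
    # toy error surface E(w) = ||w - 3||^2, so dE/dw = 2 (w - 3)
    return 2 * (w - 3.0)

lr = 0.1

# Fixed learning rate: step straight down the gradient
w = np.zeros(2)
for _ in range(100):
    w -= lr * grad_E(w)

# Momentum: accumulate a velocity, so the step is no longer pure steepest descent
w = np.zeros(2)
velocity = np.zeros(2)
momentum = 0.9
for _ in range(100):
    velocity = momentum * velocity - lr * grad_E(w)
    w += velocity
```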
Overfitting: The downside of using powerful models
- Regularization - How to ensure that learned weights work well for cases we did not see during training?
- The training data contains information about the regularities in the mapping from input to output. But it also contains two types of noise.
- The target values may be unreliable (usually only a minor worry).
- There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
- When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
- So it fits both kinds of regularity.
- If the model is very flexible it can model the sampling error really well. This is a disaster.
A simple example of overfitting
- The slide for this example shows a handful of training points fitted both by a straight line and by a much more flexible polynomial; which output value should you predict for a new test input? (a short code sketch after this list makes the example concrete)
- Which model do you trust?
- The complicated model fits the data better.
- But it is not economical.
- A model is convincing when it fits a lot of data surprisingly well.
- It is not surprising that a complicated model can fit a small amount of data well.
- Models fit both signal and noise.
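A minimal sketch of the same idea with made-up data, not the slide's actual figures: six noisy points from a linear relationship, fitted by a straight line and by a fifth-order polynomial. The flexible fit drives the training residuals to essentially zero, but it is fitting the sampling noise, and its prediction at a new test input can differ a lot from the line's.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 6)                       # six training inputs
y = 2 * x + 0.3 * rng.normal(size=x.shape)     # linear signal + sampling noise

line = np.polyfit(x, y, deg=1)                 # economical model
poly = np.polyfit(x, y, deg=5)                 # flexible model: one coefficient per point

print("line residual:", np.abs(np.polyval(line, x) - y).max())   # noticeably > 0
print("poly residual:", np.abs(np.polyval(poly, x) - y).max())   # ~0: fits the noise too

x_test = 1.5                                   # a test input outside the training range
print("line prediction:", np.polyval(line, x_test))
print("poly prediction:", np.polyval(poly, x_test))   # often far from the line's prediction
```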
How to reduce overfitting
- A large number of different methods have been developed to reduce overfitting.
- Weight decay - keep the weights small by penalizing their squared values (sketched in code at the end of this list)
- Weight sharing - reduce model flexibility by constraining groups of weights to have the same value
- Early stopping - stop training when the error on a held-out validation set starts to rise
- Model averaging - use an ensemble of models
- Bayesian fitting of neural nets - like model averaging, but the models are weighted by how probable they are given the data
- Dropout - randomly omit half of the hidden units on each training case (also sketched below)
- Generative pre-training - initialize the weights by first learning a generative model of the input, which can make use of unlabelled data
- Many of these methods will be described in lecture 7.
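A minimal sketch, in plain NumPy rather than anything from the lecture, of two of the methods above. Weight decay adds a penalty $\lambda \sum_i w_i^2$ to the error, which simply adds $2\lambda w$ to the gradient; dropout randomly zeroes roughly half of the hidden activations on each training case. The helper names `weight_decay_step` and `dropout_forward` are my own.

```python
import numpy as np

rng = np.random.default_rng(2)

def weight_decay_step(w, dE_dw, lr=0.1, lam=0.01):
    """One gradient step on the penalized error E(w) + lam * sum(w**2)."""
    return w - lr * (dE_dw + 2 * lam * w)

def dropout_forward(h, p_drop=0.5):
    """Zero out each hidden activation with probability p_drop (training time)."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask

w = rng.normal(size=10)
w = weight_decay_step(w, dE_dw=np.zeros(10))   # with a zero error gradient, decay just shrinks w

h = rng.normal(size=10)                        # some hidden-layer activations
h_dropped = dropout_forward(h)                 # roughly half of them are set to zero
```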