Lecture 6a: Overview of mini-batch gradient descent
Now we’re going to discuss numerical optimization: how best to adjust the weights and biases, using the gradient information from the backprop algorithm
. This video elaborates on the most standard neural net optimization algorithm (mini-batch gradient descent), which we’ve seen before. We’re elaborating on some issues introduced in video 3e.
Lecture 6b: A bag of tricks for mini-batch gradient descent
initializing weights: we must not initialize units with equal weights as they can never become different. we cannot use zero as it will remain zero we want to avoid explosion and vanishing weights fan in - the number of inputs
Part 1 is about transforming the data to make learning easier. At 1:10, there’s a comment about random weights and scaling. The “it” in that comment is the average size of the input to the unit. At 1:15, the “good principle”: what he means is INVERSELY proportional. At 4:38, Geoff says that the hyperbolic tangent is twice the logistic minus one. This is not true, but it’s almost true. As an exercise, find out’s missing in that equation. At 5:08, Geoffrey suggests that with a hyperbolic tangent unit, it’s more difficult to sweep things under the rug than with a logistic unit. I don’t understand his comment, so if you don’t either, don’t worry. This comment is not essential in this course: we’re never using hyperbolic tangents in this course. Part 2 is about changing the stochastic gradient descent algorithm in sophisticated ways. We’ll look into these four methods in more detail, later on in the course. Jargon: “stochastic gradient descent” is mini-batch or online gradient descent. The term emphasizes that it’s not full-batch gradient descent. “stochastic” means that it involves randomness. However, this algorithm typically does not involve randomness. However, it would be truly stochastic if we would randomly pick 100 training cases from the entire training set, every time we need the next mini-batch. We call traditional “stochastic gradient descent” stochastic because it is, in effect, very similar to that truly stochastic version. Jargon: a “running average” is a weighted average over the recent past, where the most recent past is weighted most heavily.
Lecture 6c: The momentum method
Drill down into momentum mentioned before.
The biggest challenge in this video is to think of the error surface as a mountain landscape. If you can do that, and you understand the analogy well, this video will be easy. You may have to go back to video 3b, which introduces the error surface. Important concepts in this analogy: “ravine”, “a low point on the surface”, “oscillations”, “reaching a low altitude”, “rolling ball”, “velocity”. All of those have meaning on the “mountain landscape” side of the analogy, as well as on the “neural network learning” side of the analogy. The meaning of “velocity” in the “neural network learning” side of the analogy is the main idea of the momentum method. Vocabulary: the word “momentum” can be used with three different meanings, so it’s easy to get confused. It can mean the momentum method for neural network learning, i.e. the idea that’s introduced in this video. This is the most appropriate meaning of the word. It can mean the viscosity constant (typically 0.9), sometimes called alpha, which is used to reduce the velocity. It can mean the velocity. This is not a common meaning of the word. Note that one may equivalently choose to include the learning rate in the calculation of the update from the velocity, instead of in the calculation of the velocity.
Lecture 6d: Adaptive learning rates for each connection
This is really “for each parameter”, i.e. biases as well as connection strengths. Vocabulary: a “gain” is a multiplier. This video introduces a basic idea (see the video title), with a simple implementation. In the next video, we’ll see a more sophisticated implementation. You might get the impression from this video that the details of how best to use such methods are not universally agreed on. That’s true. It’s research in progress.
Lecture 6e: Rmsprop: Divide the gradient by a running average of its recent magnitude
This is another method that treats every weight separately. rprop uses the method of video 6d, plus that it only looks at the sign of the gradient. Make sure to understand how momentum is like using a (weighted) average of past gradients. Synonyms: “moving average”, “running average”, “decaying average”. All of these describe the same method of getting a weighted average of past observations, where recent observations are weighted more heavily than older ones. That method is shown in video 6e at 5:04. (there, it’s a running average of the square of the gradient) “moving average” and “running average” are fairly generic. “running average” is the most commonly used phrase. “decaying average” emphasizes the method that’s used to compute it: there’s a decay factor in there, like the alpha in the momentum method.
Reuse
Citation
@online{bochman2017,
author = {Bochman, Oren},
title = {Deep {Neural} {Networks} - {Notes} for {Lesson} 6},
date = {2017-08-23},
url = {https://orenbochman.github.io/notes/dnn/dnn-06/l_06.html},
langid = {en}
}