Deep Neural Networks - Notes for Lesson 7b

Recurrent neural networks

Training RNNs with back propagation
Categories: deep learning, neural networks, notes, coursera, seq2seq, RNNs, LSTM

Author: Oren Bochman

Published: Sunday, September 3, 2017

{{< pdf lec8.pdf width="100%" class="ppSlide" >}}

Lecture 7b: Training RNNs with back propagation

The most important prerequisites, which you may want to review, are videos 3d and 5c (about backprop with weight sharing).

After watching the video, think about how such a system can be used to implement the brain of a robot as it’s producing a sentence of text, one letter at a time.

What would be the input, what would be the output, what would be the training signal, and which units at which time slices would represent the input and output?

The equivalence between feedforward nets and recurrent nets

Assume that there is a time delay of 1 in using each connection.

The recurrent net is just a layered net that keeps reusing the same weights.
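
To make the equivalence concrete, here is a minimal sketch in Python/NumPy (the function name and shapes are my own, not from the lecture): the recurrent net is run as a layered net in which every layer reuses the same weights, and the time delay of 1 shows up as each layer reading the hidden state produced by the previous layer.

```python
import numpy as np

def unrolled_forward(x_seq, h0, W_in, W_rec, b):
    """Run a simple tanh RNN as if it were a layered feed-forward net:
    every 'layer' (time step) reuses the same weights W_in, W_rec, b,
    and the 1-step time delay means layer t reads the hidden state
    produced by layer t-1."""
    h = h0
    hs = [h0]                       # activities of every layer / time step
    for x_t in x_seq:
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
        hs.append(h)
    return hs
```

Training this unrolled net is then ordinary backprop, except that all the copies of the weights are constrained to stay equal, which is the weight-sharing reminder below.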

Reminder: Backpropagation with weight constraints

  • It is easy to modify the backprop algorithm to incorporate linear constraints between the weights.
  • We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints.
    • So if the weights started off satisfying the constraints, they will continue to satisfy them, as the toy example below illustrates.
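
As a toy illustration of that recipe (my own example, not from the lecture), suppose we want to keep the constraint w1 = w2: compute both gradients as usual, then apply the same combined gradient to both weights.

```python
# Toy example: one training case, squared error, with the constraint w1 == w2.
x1, x2, t = 1.0, 2.0, 3.0
w1 = w2 = 0.1          # the weights start off satisfying the constraint
lr = 0.01

for _ in range(100):
    y = w1 * x1 + w2 * x2
    # compute the gradients as usual
    g1 = 2 * (y - t) * x1
    g2 = 2 * (y - t) * x2
    # then modify them so the constraint is preserved:
    # use the same combined gradient for both weights
    g = g1 + g2
    w1 -= lr * g
    w2 -= lr * g

assert w1 == w2        # the constraint is still satisfied exactly
```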

Backpropagation through time

  • We can think of the recurrent net as a layered, feed-forward net with shared weights and then train the feed-forward net with weight constraints.
  • We can also think of this training algorithm in the time domain:
    • The forward pass builds up a stack of the activities of all the units at each time step.
    • The backward pass peels activities off the stack to compute the error derivatives at each time step.
    • After the backward pass, we add together the derivatives at all the different time steps for each weight, as in the sketch below.
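
Here is a hedged sketch of backpropagation through time for the simple tanh RNN unrolled above, assuming a linear output layer and squared error at every time step (all names and shapes are my own): the forward pass builds the stack of activities, the backward pass peels them off, and the derivatives from every time step are summed into one gradient per shared weight.

```python
import numpy as np

def bptt(x_seq, targets, h0, W_in, W_rec, W_out, b, b_out):
    """Backprop through time for a tanh RNN with a linear output layer
    and squared error at every time step (shapes: x_t (I,), h (H,), y (O,))."""
    # ---- forward pass: build up a stack of activities, one per time step ----
    hs, ys = [h0], []
    h = h0
    for x_t in x_seq:
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
        hs.append(h)
        ys.append(W_out @ h + b_out)

    # ---- backward pass: peel activities off the stack ----
    dW_in, dW_rec, dW_out = (np.zeros_like(W) for W in (W_in, W_rec, W_out))
    db, db_out = np.zeros_like(b), np.zeros_like(b_out)
    dh_next = np.zeros_like(h0)                  # error flowing back from step t+1
    for t in reversed(range(len(x_seq))):
        dy = ys[t] - targets[t]                  # dE/dy_t for squared error
        dW_out += np.outer(dy, hs[t + 1])
        db_out += dy
        dh = W_out.T @ dy + dh_next              # total error arriving at h_t
        dz = dh * (1.0 - hs[t + 1] ** 2)         # back through the tanh
        # add together the derivatives at all the different times
        # for each (shared) weight
        dW_in += np.outer(dz, x_seq[t])
        dW_rec += np.outer(dz, hs[t])
        db += dz
        dh_next = W_rec.T @ dz
    return dW_in, dW_rec, dW_out, db, db_out, dh_next   # dh_next is now dE/dh0
```

The last value returned is the derivative with respect to the initial state, which is exactly what the next slide needs.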

An irritating extra issue

  • We need to specify the initial activity state of all the hidden and output units.
  • We could just fix these initial states to have some default value like 0.5.
  • But it is better to treat the initial states as learned parameters.
  • We learn them in the same way as we learn the weights.
    • Start off with an initial random guess for the initial states.
    • At the end of each training sequence, backpropagate through time all the way to the initial states to get the gradient of the error function with respect to each initial state.
    • Adjust the initial states by following the negative gradient, as in the sketch below.
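
A sketch of that procedure, reusing the hypothetical bptt() above (sizes, learning rate, and toy data are arbitrary): start from a random guess for the initial state, backpropagate through time all the way back to it, and follow the negative gradient. In a real training loop the weight gradients returned by bptt() would be used as well; only the initial-state update is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
I, H, O, lr = 3, 5, 2, 0.01
W_in  = rng.normal(scale=0.1, size=(H, I))
W_rec = rng.normal(scale=0.1, size=(H, H))
W_out = rng.normal(scale=0.1, size=(O, H))
b, b_out = np.zeros(H), np.zeros(O)

h0 = rng.normal(scale=0.1, size=H)       # initial random guess for the initial state

# one toy training sequence: inputs and per-step targets
x_seq   = [rng.normal(size=I) for _ in range(4)]
targets = [rng.normal(size=O) for _ in range(4)]

for _ in range(50):
    *_, dE_dh0 = bptt(x_seq, targets, h0, W_in, W_rec, W_out, b, b_out)
    h0 -= lr * dE_dh0                    # adjust the initial state by the negative gradient
```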

Providing input to recurrent networks

  • We can specify inputs in several ways:
    • Specify the initial states of all the units.
    • Specify the initial states of a subset of the units.
    • Specify the states of the same subset of the units at every time step.
      • This is the natural way to model most sequential data (see the sketch below).
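
To contrast these options, here is a small sketch reusing the hypothetical unrolled_forward() above (all sizes are arbitrary): in one case the whole input is packed into the initial state and the net then runs with no external input; in the other, the same input units are clamped to a fresh value at every time step, which is the natural choice for sequential data.

```python
import numpy as np

rng = np.random.default_rng(1)
I, H, T = 3, 5, 4
W_in  = rng.normal(scale=0.1, size=(H, I))
W_rec = rng.normal(scale=0.1, size=(H, H))
b = np.zeros(H)

# Option A: the whole input is encoded in the initial state; afterwards
# the net runs free, with zero external input at every step.
h0 = rng.normal(size=H)
hs_free = unrolled_forward([np.zeros(I)] * T, h0, W_in, W_rec, b)

# Option B: the same (input) units are given a fresh value at every
# time step, the natural way to model most sequential data.
x_seq = [rng.normal(size=I) for _ in range(T)]
hs_driven = unrolled_forward(x_seq, np.zeros(H), W_in, W_rec, b)
```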

Teaching signals for recurrent networks

  • We can specify targets in several ways:
    • Specify the desired final activities of all the units.
    • Specify the desired activities of all the units for the last few steps.
      • This is good for learning attractors.
      • It is easy to add in extra error derivatives as we backpropagate.
    • Specify the desired activity of a subset of the units (see the sketch below).
      • The other units are input or hidden units.
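
A hedged sketch of how these choices appear in the error derivatives (function and argument names are mine): with squared error you can penalize every time step, only the last few steps, or only a chosen subset of output units, and the resulting per-step derivatives are simply added in as you backpropagate.

```python
import numpy as np

def teaching_signal_derivs(ys, targets, last_k=None, target_units=None):
    """Per-step dE/dy for squared error, restricted to the chosen
    teaching signal: all steps, only the last_k steps, or only a
    subset of the (output) units; all other units get no target."""
    T = len(ys)
    derivs = []
    for t, (y, tgt) in enumerate(zip(ys, targets)):
        dy = y - tgt
        if last_k is not None and t < T - last_k:
            dy = np.zeros_like(dy)            # no teaching signal at this step
        if target_units is not None:
            mask = np.zeros_like(dy)
            mask[target_units] = 1.0          # only these units have desired activities
            dy = dy * mask
        derivs.append(dy)
    return derivs
```

In the bptt() sketch above, these per-step derivatives would replace the unconditional ys[t] - targets[t] term.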

Reuse

CC SA BY-NC-ND
