readings in rl

reinforcement learning

Oren Bochman


Tuesday, June 18, 2024

A list of RL-papers and resources that I have read or plan to read.

I think that RL is fascinating and I have been reading a lot about it recently.

My approach to the book is reductionist. I try to break the algorithms down into their innovative parts and then understand each part. This seems a good fit for the book and the courses that I took.

However, in papers the authors try to present their work in the best light. They typically present lots of references to other work, which seem like new ideas to all but the most expert reader. They also present lots of ideas that are not actually realized in the paper. The reality is that most papers are, in the end, about one new innovation.

I recently made a timeline of the RL and bandit algorithms that I had learned. This was an interesting exercise, and I got to see that once a new idea is introduced it is indeed quickly picked up by other researchers, who try to apply it to other problems, other algorithms, or different settings. There are many settings in RL and, as Adam White explains in “Fundamentals of Reinforcement Learning”, it is best to understand new ideas in the simplest setting possible.

The main thrusts of RL seem to be

  1. Finding formulations of problems that can be applied to many different settings
  2. Parametrizing these as different trade-offs for the algorithm
    • gamma - discounting factor
    • epsilon - to control exploration
    • alpha - learning rate (but adaptive schedules are better)
    • lambda - eligibility trace decay
    • n-step - how many steps to back up
    • Aggregation type in the update rule (e.g. sum, max, average, weighted average, minimax, moving average)
    • First/any-visit - for the Monte-Carlo methods
    • Behavioral policy (exploratory) vs. target policy (exploitative) - for off-policy learning
    • Weighted/unweighted/doubly robust - for importance sampling
  3. Finding algorithms that can be applied to many different settings
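To make a few of these knobs concrete, here is a minimal sketch of one tabular Q-learning step with an epsilon-greedy behavior policy. It is only an illustration: the function names `epsilon_greedy` and `q_learning_update`, the state labels, and the parameter values are my own assumptions, not something taken from the book or the course.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha, gamma, done=False):
    """One-step Q-learning; 'max' is the aggregation used in the update rule."""
    bootstrap = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * bootstrap          # gamma discounts the future
    q[(state, action)] += alpha * (td_target - q[(state, action)])  # alpha = step size
    return q[(state, action)]


# Hypothetical two-action problem with made-up state labels.
q = defaultdict(float)
actions = [0, 1]
rng = random.Random(0)
a = epsilon_greedy(q, "s0", actions, epsilon=0.1, rng=rng)
new_value = q_learning_update(q, "s0", a, reward=1.0, next_state="s1",
                              actions=actions, alpha=0.5, gamma=0.9)
```

Swapping `max` for an expectation over the policy would give an Expected Sarsa style update, which is one way to read the "aggregation type" item.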

However, a second circle of ideas is covered in the course and book. These too are worth listing here, if only to create a checklist of things to consider when implementing an RL system or coming up with a new algorithm.

  1. Intrinsic motivation in RL agents
  2. The reward hypothesis - that all goals can be described by the maximization of the expected reward, c.f. (Silver et al. 2021)
    • Alternatively, the idea that the agent might benefit from multiple reward signals.
    • Can I get a cookie-cutter function to
      • combine multiple reward signals
        • to a scalar reward signal
        • to a tensor reward signal with an attention mechanism that can retrieve the most relevant reward signal
        • to a Pareto optimal signal
      • learn using multiple reward signals?
  3. The notion of accelerating learning by prioritizing the updates in planning.
  4. The idea of using a model to simulate the environment and improve the policy.
  5. The idea of doubly robust estimators for correct importance sampling of small trajectory samples in off-policy learning. This could be the cornerstone of few-shot learning algorithms.
  6. Covariate shift - the idea that the distribution of states changes as the policy changes.
    • This is a problem for off-policy learning.
    • This may also be an issue in actor-critic methods where the policy is updated based on the value function.

Silver, David, Satinder Singh, Doina Precup, and Richard S. Sutton. 2021. “Reward Is Enough.” Artificial Intelligence 299: 103535.
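Since several items above lean on importance sampling for off-policy learning, here is a minimal sketch of the ordinary (unweighted) and weighted estimators. It is not a doubly robust estimator, which would also need a learned value model. The helper names, the two toy policies, and the numbers are assumptions made for illustration.

```python
def importance_ratio(trajectory, target_prob, behavior_prob):
    """Product over steps of pi(a|s) / b(a|s) for one trajectory."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho


def ordinary_is(returns, ratios):
    """Unweighted estimator: unbiased, but can have very high variance."""
    return sum(r * g for r, g in zip(ratios, returns)) / len(returns)


def weighted_is(returns, ratios):
    """Weighted estimator: biased, but usually much lower variance."""
    total = sum(ratios)
    if total == 0.0:
        return 0.0
    return sum(r * g for r, g in zip(ratios, returns)) / total


# Hypothetical policies: behavior is uniform over two actions,
# target picks action 1 with probability 0.9.
def behavior(state, action):
    return 0.5


def target(state, action):
    return 0.9 if action == 1 else 0.1


rho = importance_ratio([("s", 1), ("s", 1)], target, behavior)  # (0.9 / 0.5) ** 2

returns = [1.0, 0.0, 2.0]   # made-up returns from three trajectories
ratios = [2.0, 0.5, 0.0]    # made-up importance ratios
v_ordinary = ordinary_is(returns, ratios)
v_weighted = weighted_is(returns, ratios)
```

With only a handful of trajectories the two estimates can differ noticeably, which is where the small-sample concern above comes from.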

In his course and in (Nowlan and Hinton 1992), Hinton describes an approach in which initializing a neural network with zero values in effect creates a network that can grow in capacity as it learns. This can continue until all the weights are non-zero.

Nowlan, S., and Geoffrey E. Hinton. 1992. “Simplifying Neural Networks by Soft Weight-Sharing.” Neural Computation 4: 473–93.

Two other ideas, weight sharing and sparsity-inducing penalties and constraints, can also be used to control the capacity of the network.

This idea is also of some interest to RL, where different function approximation methods use neural networks to approximate the value functions V and Q or the policy pi. Model-based methods also use neural networks to approximate the model of the environment, which typically is a state transition function M(s’|a,s) and a reward function R(s’|a,s), though a four-part dynamics model p(s’,r|a,s) can also be used.
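As a sketch of how these pieces relate, the transition function can be recovered from the four-part dynamics by summing out the reward. Here I am assuming that R denotes the expected reward for a given transition, which is my reading rather than something stated above:

$$
M(s' \mid a, s) = \sum_{r} p(s', r \mid a, s),
\qquad
\mathbb{E}[\, r \mid s, a, s' \,] = \frac{\sum_{r} r \, p(s', r \mid a, s)}{\sum_{r} p(s', r \mid a, s)}
$$

So the four-part model carries at least as much information as the separate transition and reward functions.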

A central issue is that when approximating these on a continuous state space, the functions treat regions of the state space as equivalent. This creates generalization when states are very similar (same state, highly correlated, or have high mutual information). But so long as the function approximator fails to distinguish between states that are very different, there will be a problem. The above ideas on controlling the capacity of the network could be used to allow the capacity of the network to grow as the agent learns, thus perhaps allowing coarse features to be learned first and more fine-grained features to be learned later.

We see that this dependence between states is a breaking point in translating algorithms from tabular methods to settings with function approximation.

The V and Q functions are approximated by neural networks. In the course Adam White describes how the capacity of the network can be controlled by the number of hidden units in the network.

He also discusses how we can slow this down by using a sparsity-inducing penalty. These are L1 and L2 regularization terms on the weights.
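A minimal sketch of what such penalty terms look like when added to a loss, assuming a flat list of weights and hypothetical coefficients `l1` and `l2` (neither the names nor the values come from the course):

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus the weighted L1 and L2 penalty terms."""
    return data_loss + l1 * l1_penalty(weights) + l2 * l2_penalty(weights)


def penalty_gradient(weights, l1=0.0, l2=0.0):
    """(Sub)gradient of the penalty terms with respect to each weight."""
    def sign(w):
        return (w > 0) - (w < 0)
    return [l1 * sign(w) + 2.0 * l2 * w for w in weights]


weights = [0.5, -2.0, 0.0]   # made-up weights
loss = regularized_loss(1.0, weights, l1=0.1, l2=0.01)
grads = penalty_gradient(weights, l1=0.1, l2=0.01)
```

The L1 term pushes every non-zero weight toward zero with a constant force, which is why it is the one usually associated with sparsity. The L2 term shrinks weights in proportion to their size.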

As I recall weights and gradients in neural networks get smaller as the network gets deeper. This is because the gradients are multiplied by the weights in the backpropagation algorithm. This can lead to the vanishing gradient problem. One way to solve this is to use a sparsity inducing penalty on the weights. This can be done by adding a term to the loss function that encourages the weights to be zero. This is a form of regularization that is different from the L1 and L2 regularization that we have seen before.

But today I was thinking that we might use a per-layer multiplier on the weights so as to rescale the weights and their gradients. This could improve numerical stability and help to avoid the vanishing gradient problem.
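A toy sketch of this idea, assuming a deep chain of scalar linear layers where each layer computes `c * w * h` with a fixed per-layer multiplier `c`. Real networks have nonlinearities and matrices, so this only illustrates the scaling argument. The depth, weights, and multiplier values are made up.

```python
def forward(x, weights, multipliers):
    """Scalar linear chain: each layer applies h <- c * w * h."""
    h = x
    for w, c in zip(weights, multipliers):
        h = c * w * h
    return h


def grad_wrt_weight(x, weights, multipliers, k):
    """d(output)/d(w_k): the input times c_k times every other layer's c * w."""
    g = x * multipliers[k]
    for i, (w, c) in enumerate(zip(weights, multipliers)):
        if i != k:
            g *= c * w
    return g


depth = 10
weights = [0.1] * depth

# Without multipliers the gradient for the first layer shrinks like 0.1 ** 9.
plain = grad_wrt_weight(1.0, weights, [1.0] * depth, k=0)

# A hypothetical multiplier of 10 per layer keeps each c * w close to 1.
scaled = grad_wrt_weight(1.0, weights, [10.0] * depth, k=0)
```

In this toy setting the multiplier cancels the shrinkage from the small weights, so the first-layer gradient stays at a usable scale instead of collapsing toward zero.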

There are about 42 papers that are covered in the course.

Schulman, John, S. Levine, P. Abbeel, Michael I. Jordan, and Philipp Moritz. 2015. “Trust Region Policy Optimization.” ArXiv abs/1502.05477.
Russo, Daniel, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. 2017. “A Tutorial on Thompson Sampling.” ArXiv abs/1707.02038.

Some papers

Making algorithms that can transfer to different problems

  • IMPALA algorithm

  • Using Impala with Pixels

  • Rainbow - Combining Improvements in Deep Reinforcement Learning. Combines the following DQN variants to make the Rainbow variant:

    1. DDQN - fixes the overestimation bias in DQN by decoupling selection and evaluation of the bootstrap action
    2. Prioritized experience replay - improves data efficiency by replaying more often the transitions from which there is more to learn.
    3. The dueling network architecture - helps to generalize across actions by separately representing state values and action advantages
    4. Learning from multi-step bootstrap targets (as in A3C) - shifts the bias-variance trade-off and helps to propagate newly observed rewards faster to earlier visited states.
    5. Distributional RL (distributional Q-learning) - learns a categorical distribution of discounted returns instead of estimating only the mean. This can help to capture the uncertainty in the value estimates and can lead to better policies.
    6. Noisy DQN - uses stochastic network layers for exploration.
  • PopArt - Normalizing for multiple goals with multiple time scales

  • R2D2 - Recurrent Experience Replay in Distributed Reinforcement Learning

  • PAIRED - Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

  • Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning


  • JaxMARL: Multi-Agent RL Environments in JAX
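For the DDQN item in the Rainbow list above, here is a minimal sketch of how the bootstrap target differs from plain DQN. It assumes the Q-values of the next state are given as plain lists from an online network and a target network. The function names and the numbers are illustrative, not taken from the papers' code.

```python
def dqn_target(reward, gamma, q_target_next, done=False):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Made-up Q-values for three actions in the next state.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [1.5, 0.8, 3.0]

y_dqn = dqn_target(1.0, 0.9, q_target_next)
y_ddqn = double_dqn_target(1.0, 0.9, q_online_next, q_target_next)
```

In this example the DQN target bootstraps from the largest target-network value, while the Double DQN target uses the target network's value for the action the online network prefers, which is smaller here.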


some ideas


BibTeX citation:
  author = {Bochman, Oren},
  title = {Readings in Rl},
  date = {2024-06-18},
  url = {},
  langid = {en}
For attribution, please cite this work as:
Bochman, Oren. 2024. “Readings in Rl.” June 18, 2024.