A list of RL-papers and resources that I have read or plan to read.
I think that RL is fascinating and I have been reading a lot about it recently.
- Recently I took a number of courses in RL in the Coursera platform.
- The teachers were Martha and Adam White and the courses used Barto and Sutton’s book, which is a classic.
- Read a large part of the book and still plan to read some more.
- RL is currently a rapidly evolving field many modern algorithm are not covered in the book.
- The basic building blocks however are mostly the same and the course taught each in its simplest setting, which is the best way to understand them, according to Adam White.
- Later many of the ideas are combined in new and often more challenging problem settings.
- I have been reading papers and watching videos on the internet to try and keep up with the field.
My approach to the book is is reductionist. I try and break down the algorithms into their innovating parts and then try and understand the parts. This seems a good fit to the book and courses that I took.
However, in papers the authors are trying to present their work in the best light. They typically present lost of references to other work, which seem like new ideas to all but the most expert reader. They also present lots of ideas, which are not actually realized in the paper. The reality is that most papers in the end are about one new innovation.
I recently made a time line of the RL and bandit algorithms that I had learned. This was an interesting exercise and I got to see that indeed one a new idea is introduced it is quickly picked up by other researchers who try to apply it to other problems, other algorithms, or different settings. There are many setting in RL and as Adam White explains in “Fundamentals of Reinforcement Learning” it is best to understand new ideas in the simplest setting possible.
The main thrusts of RL seem to be
- Finding formulations of problems that can be applied to many different settings
- Parametrizing these as differnt trade-offs for the algorithm
- gamma - discounting factor
- epsilon - to control exploration
- alpha - learning rate (but adaptive schedules are better)
- lambda - eligibility trace ??
- n-step - how many steps to back up
- Aggregation type in the update rule (e.g. sum, max, average, weighted average,minimax, moving average)
- First/any-visit - for the Monte-Carlo methods
- Behavioral policy (exploratory) target policy (exploitative) - for off policy learning
- Weighted/Unweighted/doubly robust/ for importance sampling
- Finding algorithms that can be applied to many different settings
However a second circle of ideas is covered in the course and book. These too are worth listing here. If only to create a checklist of things to consider when implementing an RL system or coming up with a new algorithm.
- Intrinsic motivation in RL agents
- Use of Utility functions as motivation for agents based on preferences
- Use of Curiosity as motivation for agents based on novelty
- Use of Empowerment as motivation for agents based on control
- Use of Causal Influence in constructing intrinsic social motivation
- The reward hypothesis - that all goals can be described by the maximization of the expected reward c.f. (Silver et al. 2021)
- Alternatively the idea that the agent might benefit from multiple reward signals.
- Can I get a cookie cutter function to
- combine multiple reward signals
- to a scalar reward signal
- to a tensor reward signal with an attention mechanism that can retrieve the most relevant reward signal
- to a pareto optimal signal
- learn using multiple reward signals ?
- combine multiple reward signals
- The notion of accelerating learning by prioritizing the updates based in planning.
- can we do prioritized sweeps in non deterministic environments ?
- what about continuous state spaces ?
- The idea of using a model to simulate the environment and improve the policy.
- The idea of doubly robust estimators for correct importance sampling of small trajectory samlples in off policy learning. This could be the cornerstone of few shot learning algorithms.
- Covariate Shift - the idea that the distribution of states changes as the policy changes.
- This is a problem for off policy learning.
- This may also be an issue in actor critic methods where the policy is updated based on the value function.
Hinton’s in his course and in (Nowlan and Hinton 1992) at paper describes his approach where by initilizing neural networks with zero values this in effect creates a network that that can grow in capacity as it learns. This can continue until all the weights are non-zero.
Two other ideas - weight sharing and sparsity inducing penalties and constraints can also be used to control the capacity of the network.
This idea is also of some interest to RL where different function approximation methods use neural networks to approximate value functions, V,Q or the policy pi. Modle based methods also use neural networks to approximate the model of the environment which typicaly is a state transition function M(s’|a,s) and a reward function R(s’|a,s) though a four part dynamics model can also be used p(s’,r|a,s).
A central issue is that that when approximating these on continuous state the functions treat regions of the state space as equivalent. This can creates generalization when states are very similar (same state, highly correlated or have high mutual information) But so long as the function approximator fails to distinguish between states that are very different there will be a problem. The above ideas on controlling the capacity of the network could be used to allow the capacity of the network to grow as the agent learns thus perhaps allowing coarse features to be learned first and then more fine grained features to be learned later.
We see that this dependence between states is a breaking point to translating algorithms for tabular methods to setting with function approximation.
the V,Q functions are approximated by neural networks. In the course Adam White describes how the capacity of the network. In the course Adam White describes how the capacity of the network can be controlled by the number of hidden units in the network.
He also discuses how we can slow this down by using a sparsity inducing penalty. These are L1 and L2 regularizations terms on the weights.
As I recall weights and gradients in neural networks get smaller as the network gets deeper. This is because the gradients are multiplied by the weights in the backpropagation algorithm. This can lead to the vanishing gradient problem. One way to solve this is to use a sparsity inducing penalty on the weights. This can be done by adding a term to the loss function that encourages the weights to be zero. This is a form of regularization that is different from the L1 and L2 regularization that we have seen before.
But today I was thinking that we might use a per layers multiplier on the weights so as to rescale the weights and their gradients. This could improve numerical stability and help to avoid the vanishing gradient problem.
encourage them to be zero. This is a form of regularization that is different from
there are about 42 papers that are covered in the course.
some I have already read
others seen to be on robotics
others are on deep learning but may be dated by now.
is there a more recent version of the course ?
10-403 Deep RL - Spring 2024 seems less daunting and covers more algorithms I found missing from the Coursera specialization.
- Natural Policy Gradient - PG
- Proximal Policy Optimization - PPO
- Trust Region Policy Optimization - TRPO (Schulman et al. 2015) paper
OpenAI Gym tutorial notebook
Some papers
- DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker
- AlphaGoZero
- AlphaMuZero
- AlphaStar
Making algrothims that can transfer to different problems
IMPALA algorithm
Using Impala with Pixels
Rainbow - Combining Improvements in Deep Reinforcement Learning Combine the following DQN variants to make Rainbow variant.
- DDQN fix bias in DQN by decoupling selection and evaluation of the bootstrap action
- Prioritized experience replay - improves data efficiency, by replaying more often transitions from which there is more to learn.
- The dueling network architecture - helps to generalize across actions by separately representing state values and action advantages
- Learning from multi-step bootstrap targets A3C - shifts the bias-variance trade off and helps to propagate newly observed rewards faster to earlier visited states.
- Distributional RL - learns a distribution over returns rather than the expected return. This can help to capture the uncertainty in the value estimates and can lead to better policies.
- Noisy DQN uses stochastic network layers for exploration.
- Distributional Q-learning - learns a categorical distribution of discounted returns, instead of estimating the mean
PopArt - Normalizing for multiple goals with multiple time scales
R2D2 - Recurrent Experience Replay in Distributed Reinforcement Learning
PAIRED - Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design
social influence as intrinsic motivation for multiagent deep reinforce
https://openai.com/index/meta-learning-for-wrestling/
JaxMARL: Multi-Agent RL Environments in JAX
https://ai.stanford.edu/users/nir/Papers/DFR1.pdf
some ideas
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {Readings in Rl},
date = {2024-06-18},
url = {https://orenbochman.github.io/posts/2024/2024-06-18-readings-in-rl/},
langid = {en}
}