As I progressed through the RL specialization, I came to realize that the research is mostly about coming up with a new algorithm that works better. This is hard to do, so in many cases people end up considering some narrow setting, preferably a new one, so that they can publish a paper claiming novelty. On the other hand, almost all algorithms revolve around a rather small set of techniques. One might call these techniques ‘magic tricks’, though that glosses over the fact that they are also grounded in computer science and mathematics. Some researchers also show that their algorithms have better convergence properties.
Outside academia, RL is a tool. We don't have fast simulators, but we do have challenging problems. I notice that most people use a few existing algorithms rather than coding their own from scratch. This makes a lot of sense if our problem is very similar to the problems the algorithm has already been tested on. But what if our problem is different? We can try a bunch of algorithms and see which one works best. But what if they are all mediocre?
There is another approach, which I see as continuing the one espoused in *The Algorithm Design Manual* by Steven S. Skiena (citation needed) and *Algorithm Design* by Jon Kleinberg and Éva Tardos (citation needed). It starts from two observations:
- We have many bits and pieces of RL algorithms that perform all sorts of magic tricks.
- We have many different problem settings, each of which needs certain tricks to work and in which other tricks fail.
I imagine a research algorithm that has all of these tricks available as modular components and that can adaptively reconfigure itself to the problem at hand.
And one more thing: it would expose its hyperparameters so that they can be tuned by Optuna or some other hyperparameter-optimization tool.
All this power suggests ‘one ring to rule them all’, so perhaps this might be the master algorithm.
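To make this concrete, here is a minimal sketch of what the tuning loop might look like, assuming Optuna as the optimizer. Everything except the Optuna calls is a hypothetical placeholder: the component names and the stand-in `evaluate` function only illustrate the shape of the idea, not a real agent.

```python
import random

import optuna


def evaluate(config, n_episodes=20):
    """Stand-in evaluation so the sketch runs end to end.

    A real implementation would assemble an agent from the chosen
    components, roll it out on the target problem, and return the
    mean episodic return.
    """
    return random.random()


def objective(trial):
    # Structural choices ("which tricks to use") are exposed as categorical
    # hyperparameters, alongside the usual numeric ones.
    config = {
        "exploration": trial.suggest_categorical(
            "exploration", ["epsilon_greedy", "ucb", "thompson"]
        ),
        "target": trial.suggest_categorical(
            "target", ["monte_carlo", "sarsa", "q_learning", "expected_sarsa"]
        ),
        "model_based": trial.suggest_categorical("model_based", [False, True]),
        "step_size": trial.suggest_float("step_size", 1e-4, 1.0, log=True),
        "gamma": trial.suggest_float("gamma", 0.9, 0.999),
        "epsilon": trial.suggest_float("epsilon", 0.01, 0.3),
    }
    return evaluate(config)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```

The point is that the structural choices (which exploration rule, which target, model-based or not) are searched over in exactly the same way as the numeric hyperparameters.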
Magical Tricks:
From the domain of Bandits (a minimal exploration sketch follows this list):

- epsilon-greedy exploration
- upper confidence bound (UCB) exploration
- Thompson sampling exploration
- Gittins index exploration
- the ‘stateless’ setting
- note: it appears we might want the setting itself to be a configuration parameter
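To give a feel for how small these tricks are once written down, here is a minimal sketch of two of the exploration rules for a k-armed bandit. The function names and the exploration constant `c` are my own choices rather than anything from a specific library.

```python
import numpy as np

rng = np.random.default_rng(0)


def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random arm with probability epsilon, else the greedy arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def ucb(q_values, counts, t, c=2.0):
    """Upper confidence bound: greedy plus an optimism bonus for rarely
    pulled arms (the small constant avoids division by zero)."""
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(q_values + bonus))


def update(q_values, counts, action, reward):
    """Incremental value update shared by both strategies:
    Q[a] <- Q[a] + (1 / N[a]) * (reward - Q[a])."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]
```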
MDPs:

Bellman equations and reward formulations (written out in standard notation right after this list):

- Bellman equation for the state-value function V
- Bellman equation for the action-value function Q
- Bellman optimality equation for the state-value function V
- Bellman optimality equation for the action-value function Q
- Bellman equation for the average reward
- Bellman optimality equation for the average reward
- the discounted-rewards trick (the discount factor gamma)
- the average-reward formulation
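These equations are the backbone of everything that follows, so here are the four discounted-setting ones in the standard (Sutton and Barto) notation; the average-reward variants replace the gamma discount with subtraction of the average reward.

$$
\begin{aligned}
v_\pi(s)   &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr] \\
q_\pi(s,a) &= \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr] \\
v_*(s)     &= \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr] \\
q_*(s,a)   &= \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]
\end{aligned}
$$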
Exploration, algorithms, and settings:

- exploration using optimistic initialisation
- algorithmic tricks (a minimal value-iteration sketch follows this list):
  - value iteration
  - policy iteration
  - generalized value iteration
  - generalized policy iteration
- the tabular setting
- the finite-horizon setting
- the infinite-horizon setting
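A minimal tabular value-iteration sketch. The transition-model format `P[s][a] = [(prob, next_state, reward), ...]` is an assumption made for this example, and the inner update is exactly the Bellman optimality backup for V shown above.

```python
import numpy as np


def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """P[s][a] is a list of (prob, next_state, reward) transitions."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup:
            # V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]
            q = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            ]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = np.array([
        int(np.argmax([
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions)
        ]))
        for s in range(n_states)
    ])
    return V, policy
```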
Monte Carlo Methods (a first-visit prediction sketch follows this list):

- first-visit Monte Carlo
- every-visit Monte Carlo
- the Monte Carlo target (the full return)
- the episodic setting
- note: in the continuing setting, Monte Carlo requires eligibility traces, a replay buffer, or a time limit, i.e. temporal abstraction
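A minimal first-visit Monte Carlo prediction sketch, assuming episodes are given as lists of `(state, reward)` pairs; that data format is made up for this example.

```python
from collections import defaultdict


def first_visit_mc_prediction(episodes, gamma=0.99):
    """Estimate V(s) by averaging returns from the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        G = 0.0
        visited = {}
        # Walk backwards, accumulating the discounted return; because later
        # assignments correspond to earlier time steps, the value kept for
        # each state is the return from its first visit.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            visited[state] = G
        for state, G_first in visited.items():
            returns_sum[state] += G_first
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```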
Temporal Difference Learning (the three action-value updates are sketched after this list):

- TD targets for the state-value function V
- TD targets for the action-value function Q
- bootstrapping
- SARSA
- Q-Learning
- Expected SARSA
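Here are the three action-value updates side by side, assuming `Q` is a NumPy array indexed `[state, action]` (terminal transitions would use a zero bootstrap term). The only thing that changes between them is the target.

```python
import numpy as np


def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # On-policy target: bootstrap from the action actually taken next.
    target = r + gamma * Q[s2, a2]
    Q[s, a] += alpha * (target - Q[s, a])


def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap from the greedy action in s2.
    target = r + gamma * np.max(Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])


def expected_sarsa_update(Q, s, a, r, s2, policy_probs, alpha=0.1, gamma=0.99):
    # Target: expectation of Q(s2, .) under the current policy's action
    # probabilities in s2, instead of a single sampled action.
    target = r + gamma * np.dot(policy_probs, Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])
```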
Model-Based RL (a minimal Dyna-Q sketch follows this list):

- replay buffer
- prioritised replay buffer
- prioritised sweeping
- Dyna-Q
- Dyna-Q+
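A minimal Dyna-Q step, assuming a tabular `Q` array indexed `[state, action]` and a deterministic model stored as a dict; this is a sketch of the standard direct-RL / model-learning / planning loop, not a drop-in implementation.

```python
import random

import numpy as np


def dyna_q_step(Q, model, s, a, r, s2, n_planning=10, alpha=0.1, gamma=0.99):
    """One Dyna-Q step: direct RL update, model update, then planning."""
    # (a) Direct RL: ordinary Q-learning update from real experience.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
    # (b) Model learning: remember the observed transition.
    model[(s, a)] = (r, s2)
    # (c) Planning: replay simulated transitions sampled from the model.
    for _ in range(n_planning):
        ps, pa = random.choice(list(model.keys()))
        pr, ps2 = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
```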
Temporal Aggregation (a minimal options/SMDP sketch follows this list):

- SMDPs (semi-Markov decision processes)
- options (temporal abstraction)
- GVFs (general value functions)
- meta-gradients: learning rewards over multiple agent lifetimes
- the ‘lifelong’ setting
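A minimal sketch of an option and the corresponding SMDP Q-learning update, assuming integer states, a `Q` NumPy array indexed `[state, option]`, and deterministic termination for simplicity (the general form uses a termination probability beta); the field names are my own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Option:
    """An option: an initiation set, an intra-option policy, and termination."""
    can_start: Callable[[int], bool]    # initiation set I(s)
    policy: Callable[[int], int]        # intra-option policy pi(s) -> a
    should_stop: Callable[[int], bool]  # deterministic stand-in for beta(s)


def smdp_q_update(Q, s, option_id, cum_reward, k, s2, alpha=0.1, gamma=0.99):
    """SMDP Q-learning update after an option ran for k steps.

    cum_reward is the discounted reward accumulated while the option ran;
    gamma**k discounts the bootstrap because k steps of real time elapsed.
    """
    target = cum_reward + (gamma ** k) * Q[s2].max()
    Q[s, option_id] += alpha * (target - Q[s, option_id])
```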
Partial Observability:

- POMDPs

Multi-Agent RL
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {Engineering {Reinforcement} {Learning} {Algorithms}},
date = {2024-01-10},
url = {https://orenbochman.github.io/posts/2025/2025-01-10-Engineering-RL-Algs/},
langid = {en}
}