As I progressed through the RL specialization, I came to realize that the research is mostly about coming up with a new algorithm that works better. This is tricky, so in many cases people end up considering some narrow setting; better yet, if it is new they can publish a paper claiming novelty. On the other hand, almost all algorithms revolve around a rather small set of techniques. One might call these techniques ‘magic tricks’, glossing over the fact that they are also grounded in computer science and mathematics. Some researchers also show that their algorithms converge better.
Outside academia, RL is a tool. We don’t have quick simulators, but we do have challenging examples. I notice that most people use a few standard algorithms rather than coding their own from scratch. This makes a lot of sense if our problem is very similar to other problems the algorithm has been tested on. But what if our problem is different? We can try a bunch of algorithms and see which one works best. But what if they all turn out to be mediocre?
There is another approach, which I consider a continuation of the one espoused in (Skiena 2020) and (Kleinberg and Tardos 2006). It starts from two observations:
- We have many bits and pieces of RL algorithms that do all sorts of magic tricks.
- We have different problem settings, each of which needs certain tricks to work and in which other tricks don’t.
I imagine a research algorithm that has all these tricks available in a modular way and that can adaptively reconfigure itself to the problem at hand.
And one more thing: it would then expose its hyperparameters to be tuned by Optuna or some other hyperparameter optimization tool (as sketched below) …
All this power suggests the one ring to rule them all, so perhaps this might be the master algorithm.
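As a rough sketch of what that last step might look like, assuming (hypothetically) that the modular agent boils down to a config dict and that we can score a config by its average reward; here a toy 5-armed Bernoulli bandit stands in for the real problem:

```python
import random
import optuna


def evaluate(config, n_steps=2000, seed=0):
    """Stand-in evaluation: run an epsilon-greedy agent on a toy 5-armed
    Bernoulli bandit and return the average reward (a hypothetical proxy
    for evaluating the full modular agent on a real problem)."""
    rng = random.Random(seed)
    probs = [0.1, 0.3, 0.5, 0.7, 0.9]           # true arm payout probabilities
    q = [0.0] * len(probs)                      # action-value estimates
    n = [0] * len(probs)                        # pull counts
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < config["epsilon"]:
            a = rng.randrange(len(probs))       # explore
        else:
            a = max(range(len(probs)), key=lambda i: q[i])  # exploit
        r = 1.0 if rng.random() < probs[a] else 0.0
        n[a] += 1
        q[a] += (r - q[a]) / n[a]               # incremental sample average
        total += r
    return total / n_steps


def objective(trial):
    # the modular agent would expose many more knobs; here just one
    config = {"epsilon": trial.suggest_float("epsilon", 0.01, 0.5)}
    return evaluate(config)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```

In the full version the config would also select which of the tricks listed below are switched on, with Optuna searching over those categorical choices as well.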
Magic Tricks:
From the domain of Bandits
- ε-greedy exploration
- UCB (upper confidence bound) exploration (see the sketch after this list)
- Thompson sampling exploration
- Gittins index exploration
- ‘stateless’ setting
note: it appears we might want the setting to be a configuration parameter
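To make two of the exploration rules above concrete, here is a minimal sketch of UCB and Thompson sampling action selection for a Bernoulli bandit (the function names and the exploration constant `c` are my own; treat it as a sketch rather than a reference implementation):

```python
import math
import random


def ucb_action(counts, values, t, c=2.0):
    """UCB1-style selection: prefer high value estimates plus an
    exploration bonus that shrinks as an arm is pulled more often."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                            # pull every arm once first
    return max(range(len(counts)),
               key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))


def thompson_action(successes, failures, rng=random):
    """Thompson sampling for Bernoulli arms: sample a payout probability
    from each arm's Beta(successes + 1, failures + 1) posterior and act
    greedily with respect to the samples."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```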
MDPs
- Bellman Equation for V Value Function (the discounted equations are written out after this list)
- Bellman Equation for Q Action Value Function
- Bellman Optimality Equation for V Value Function
- Bellman Optimality Equation for Q Action Value Function
- Bellman Equation for Average Reward
- Bellman Optimality Equation for Average Reward
- discounted rewards (γ) trick
- average reward formulation
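For reference, the four discounted equations above written out in the standard notation (the average-reward versions are analogous, with the discount replaced by subtracting the average reward $\bar r$ from each reward, giving differential value functions):

$$
\begin{aligned}
v_\pi(s)   &= \sum_a \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_\pi(s')\,\bigr] \\
q_\pi(s,a) &= \sum_{s',\,r} p(s', r \mid s, a)\,\Bigl[\,r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\,\Bigr] \\
v_*(s)     &= \max_a \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr] \\
q_*(s,a)   &= \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\,r + \gamma \max_{a'} q_*(s', a')\,\bigr]
\end{aligned}
$$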
Exploration using
- Optimistic Initialization
- ε-greedy exploration
- UCB (upper confidence bound) exploration
- Thompson sampling exploration
- Gittins index exploration
- Curiosity
- Various decompositions
Algorithmic tricks
- Value iteration (sketched after this list)
- Policy iteration
- Generalized value iteration
- Generalized policy iteration
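As a concrete instance of the first item above, a minimal sketch of tabular value iteration; I assume the model is given as `P[s][a] = [(prob, next_state, reward, done), ...]`, the convention used by the classic Gym toy-text environments:

```python
def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Tabular value iteration over an explicit model.
    P[s][a] is a list of (prob, next_state, reward, done) tuples
    (an assumption matching the Gym toy-text convention)."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # one-step lookahead: Q(s, a) for every action under the model
            q = [sum(p * (r + gamma * V[s2] * (not done))
                     for p, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # extract the greedy policy with respect to the converged values
    policy = [max(range(n_actions),
                  key=lambda a: sum(p * (r + gamma * V[s2] * (not done))
                                    for p, s2, r, done in P[s][a]))
              for s in range(n_states)]
    return V, policy
```

Policy iteration instead alternates a full policy-evaluation sweep with greedy improvement; generalized policy iteration is the umbrella term for interleaving evaluation and improvement at any granularity.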
Tabular setting
Limited horizon setting
Infinite horizon setting
Monte Carlo Methods
- First-visit Monte Carlo (sketched after this list)
- Every-visit Monte Carlo
- Monte Carlo target
- episodic setting
- in the continuing (non-episodic) setting, Monte Carlo requires eligibility traces, a replay buffer, or a time limit, i.e. temporal abstraction
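A minimal sketch of first-visit Monte Carlo prediction, assuming episodes are provided as lists of `(state, reward)` pairs where the reward is the one received on leaving that state (a hypothetical data format):

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V under the data-generating policy by averaging the
    return observed after the *first* visit to each state per episode."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        first_visit = {}                  # state -> first time step it appears
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G             # return from time t onwards
            if first_visit[s] == t:       # only count the first visit
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return dict(V)
```

Every-visit Monte Carlo is the same loop with the `first_visit[s] == t` check removed.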
Temporal Difference Learning
- TD targets for V Value Function
- TD targets for Q Action Value Function
- bootstrapping
- SARSA
- Q-Learning
- Expected SARSA
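The three control methods above differ only in how the TD target is formed. A sketch, assuming a tabular `Q` where `Q[s]` is a list of action values and `probs` holds the policy's action probabilities in the next state:

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """SARSA: bootstrap from the action actually taken next (on-policy)."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma):
    """Q-learning: bootstrap from the greedy action (off-policy)."""
    return r + gamma * max(Q[s_next])


def expected_sarsa_target(Q, r, s_next, probs, gamma):
    """Expected SARSA: bootstrap from the expectation over the policy's
    action probabilities in the next state."""
    return r + gamma * sum(p * q for p, q in zip(probs, Q[s_next]))


def td_update(Q, s, a, target, alpha):
    """Generic TD(0) update toward whichever target was chosen."""
    Q[s][a] += alpha * (target - Q[s][a])
```

All three feed the same `td_update`; relying on the next state's own estimate rather than a full return is the bootstrapping mentioned above.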
Model Based RL
- replay buffer
- prioritised replay buffer
- prioritised sweeping
- Dyna-Q
- Dyna-Q+
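A sketch of one Dyna-Q step, assuming a tabular `Q` as above and a deterministic environment so the learned model can be a plain dict (the function name and signature are my own):

```python
import random


def dyna_q_step(Q, model, s, a, r, s_next,
                alpha=0.1, gamma=0.95, n_planning=10, rng=random):
    """One Dyna-Q step: learn from real experience, update the model,
    then replay n_planning simulated transitions drawn from the model.
    Assumes a deterministic environment, so model[(s, a)] = (r, s_next)."""
    # (1) direct RL: Q-learning update from the real transition
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    # (2) model learning: remember what this (s, a) produced
    model[(s, a)] = (r, s_next)
    # (3) planning: replay previously seen (s, a) pairs from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next]) - Q[ps][pa])
```

Dyna-Q+ adds a small bonus (growing with the time since a transition was last tried) to the planned rewards, so the agent keeps re-checking parts of the model that may have gone stale; prioritised sweeping replaces the uniform `rng.choice` with a priority queue ordered by the size of the expected update.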
Temporal Aggregation
- SMDPs (semi-Markov decision processes)
- options (temporal abstraction), sketched below
- GVFs (generalized value functions)
- meta-gradients: learning rewards over multiple agent lifetimes
- ‘lifelong’ setting
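Roughly, an option is a triple of an initiation set, an intra-option policy, and a termination condition, and the value function over options satisfies an SMDP-style Bellman equation in which discounting runs over the option's random duration $\tau$ (here $\mu$ is the policy over options):

$$
\omega = (\mathcal{I}_\omega, \pi_\omega, \beta_\omega), \qquad
v_\mu(s) = \sum_{\omega} \mu(\omega \mid s)\,
\mathbb{E}\Bigl[\, R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\tau-1} R_{t+\tau}
  + \gamma^{\tau}\, v_\mu(S_{t+\tau}) \,\Big|\, S_t = s,\ \omega \,\Bigr]
$$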
Partial Observability
- POMDPs
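The standard way to restore the Markov property under partial observability is to maintain a belief state $b$ over the hidden states and update it by Bayes' rule after taking action $a$ and receiving observation $o$ (with $T$ the transition model and $O$ the observation model):

$$
b'(s') \;=\; \frac{O(o \mid s', a)\,\sum_{s} T(s' \mid s, a)\, b(s)}
{\sum_{s''} O(o \mid s'', a)\,\sum_{s} T(s'' \mid s, a)\, b(s)}
$$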
Multi-Agent RL