Trajectory trace & replay - based on Mesa caching
TLDR:
If we can store state, action, reward trajectories (e.g. to a file; see the sketch after this list), this would facilitate:

- off-policy, sample-based learning
- model-based learning
- training the neural networks used in certain modern deep RL algorithms
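As a concrete sketch (not part of Mesa's caching API), a hypothetical `TrajectoryLogger` could append `(state, action, reward)` steps to a JSON-lines file for later replay; the class, field names, and file format here are my own assumptions.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class Step:
    """One (state, action, reward) step of an agent's trajectory."""
    agent_id: int
    t: int          # model tick
    state: Any      # any JSON-serialisable observation
    action: Any
    reward: float


class TrajectoryLogger:
    """Appends trajectory steps to a JSON-lines file for later replay."""

    def __init__(self, path: str):
        self.path = path

    def log(self, step: Step) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(asdict(step)) + "\n")

    def replay(self):
        """Yield logged steps back, e.g. to feed an off-policy learner."""
        with open(self.path) as f:
            for line in f:
                yield Step(**json.loads(line))
```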
RL Recap for ABM:
- In RL there is a Markov Decision Process (MDP) that generates a sequence

  $$s_0, a_0, r_0, s_1, a_1, r_1, \ldots$$

  where $s$ is the state, $a$ the action, and $r$ the reward.
  The goal of RL is to learn an optimal policy $\pi$, in the sense of generating the maximum expected total reward, by picking the best action at each step.
  Note: the reward is for getting to the next state $s'$ using the action $a$.
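To make "maximum expected total reward" concrete, the return and the optimal policy can be written as below; the discount factor $\gamma$ is an addition of mine (the undiscounted case is $\gamma = 1$):

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k},
\qquad
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_0 \right]
$$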
- In model-based RL, agents learn a model of the environment to facilitate many quick planning steps between the more expensive/risky interactions with the environment.
  The model is typically two functions (a tabular estimate from logged traces is sketched below):
  - $P(s' \mid s, a)$ - the Markov chain transitions, denoted as $T$
  - $R(r \mid s, a, s')$ - the reward for the above transition
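A minimal sketch of estimating such a model from logged transitions by counting and averaging; the `TabularModel` class and its dictionary-of-counts representation are my own illustration, not a specific library API.

```python
from collections import defaultdict


class TabularModel:
    """Empirical estimates of T(s'|s,a) and R(s,a,s') built from logged transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.reward_sum = defaultdict(float)                  # (s, a, s') -> total reward
        self.reward_n = defaultdict(int)                      # (s, a, s') -> n

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a, s_next)] += r
        self.reward_n[(s, a, s_next)] += 1

    def T(self, s, a, s_next):
        """Estimated transition probability P(s'|s,a)."""
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0

    def R(self, s, a, s_next):
        """Estimated expected reward for the transition (s, a) -> s'."""
        n = self.reward_n[(s, a, s_next)]
        return self.reward_sum[(s, a, s_next)] / n if n else 0.0
```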
- In deep RL, agents use ML techniques to approximate with a neural network the action-value function $Q_\pi(s,a)$, the policy $\pi$, or the model.
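For instance, a minimal Q-network might look like the following sketch, assuming PyTorch is available; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```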
- Off-policy learning.
  A more general setting in RL has two policies:
  - the behavioral policy, which determines the actions of the agents, e.g.
    - a uniform random policy - all possible actions have equal probability
    - an epsilon-soft policy - all possible actions have at least probability epsilon
  - the target policy, which is the better policy being learned, e.g.
    - $\pi^*$ - the optimal policy.

  (An epsilon-soft behavior policy and the importance-sampling ratio it induces are sketched below.)
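As a sketch, here is a hypothetical epsilon-soft behavior policy together with the per-step importance-sampling ratio $\rho = \pi(a \mid s)/b(a \mid s)$ for a greedy (deterministic) target policy; the function names are mine.

```python
import random


def epsilon_soft_action(q_values: dict, actions: list, epsilon: float = 0.1):
    """Pick an action: greedy w.p. 1 - epsilon, otherwise uniformly at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))


def behavior_prob(a, greedy_a, n_actions: int, epsilon: float = 0.1) -> float:
    """Probability of action a under the epsilon-soft behavior policy b(a|s)."""
    p = epsilon / n_actions
    return p + (1 - epsilon) if a == greedy_a else p


def importance_ratio(a, greedy_a, n_actions: int, epsilon: float = 0.1) -> float:
    """rho = pi(a|s) / b(a|s), where pi is the greedy (deterministic) target policy."""
    pi = 1.0 if a == greedy_a else 0.0
    return pi / behavior_prob(a, greedy_a, n_actions, epsilon)
```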
The point:
If we generate a trace of $s, a, r, \ldots$ from an ABM (plus some technical caveats: it is an MDP, the ABM's behaviour is ergodic, etc.), then RL agents can use these traces to train a more optimal agent using off-policy, sample-based learning methods like:

- first-visit MC with importance sampling
- every-visit MC with importance sampling
- Q-learning
- Expected Sarsa
- Dyna-Q+

They can also learn a model and use it for efficient planning - e.g. how to navigate in a maze which changes over time. (A sketch of Q-learning from a logged trace follows.)
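For instance, a minimal sketch of tabular Q-learning run over a logged trace of `(s, a, r, s_next)` transitions; the step size, discount factor, and transition format are assumptions of mine.

```python
from collections import defaultdict


def q_learning_from_trace(transitions, actions, alpha=0.1, gamma=0.95):
    """Off-policy Q-learning over logged (s, a, r, s_next) tuples.

    The trace can come from any behavior policy; the update bootstraps from
    max_a' Q(s', a'), i.e. the greedy target policy.
    """
    Q = defaultdict(float)  # (s, a) -> value
    for s, a, r, s_next in transitions:
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

The learned greedy policy is then `argmax` over `Q[(s, a)]` for each state.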
The API
Model
- I don't think it's worth converting every ABM model to RL:
- many ABMs don't have a reward structure.
- some don't have much state.
- doing this would mean being creative.
Say we have a forest fire sim, which we could extend as follows (see the sketch after this list).
Add two new agents:

- fire starter - reward: total trees burnt
  - can light a fire
- fire fighter - reward: total trees remaining
  - can cut/move a tree

The sequence of play:

- the fire fighter can cut down k trees before the start of the simulation
- the fire starter can light a fire at some location x
- the fire fighter can cut down k more trees

E.g. can we cut down some trees to stop a forest fire from spreading?
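A rough sketch of what these two agents might look like in Mesa; the model helpers used here (`light_fire`, `cut_tree`, `count_burnt`, `count_alive`, `width`, `height`) are hypothetical stand-ins for whatever a real forest-fire model exposes, and the constructor assumes the Mesa 2.x `Agent(unique_id, model)` signature.

```python
import mesa


class FireStarter(mesa.Agent):
    """RL-controlled agent whose reward is the total number of trees burnt."""

    def __init__(self, unique_id, model):
        super().__init__(unique_id, model)  # Mesa 2.x style constructor (assumed)
        self.total_reward = 0.0

    def step(self):
        # Hypothetical model helpers: a real forest-fire model would expose
        # its own way to ignite a cell and count burnt trees.
        x = self.random.randrange(self.model.width)
        y = self.random.randrange(self.model.height)
        self.model.light_fire((x, y))
        self.total_reward = self.model.count_burnt()


class FireFighter(mesa.Agent):
    """RL-controlled agent whose reward is the number of trees still standing."""

    def __init__(self, unique_id, model, k=5):
        super().__init__(unique_id, model)
        self.k = k  # trees it may cut per phase
        self.total_reward = 0.0

    def step(self):
        for _ in range(self.k):
            x = self.random.randrange(self.model.width)
            y = self.random.randrange(self.model.height)
            self.model.cut_tree((x, y))  # hypothetical helper
        self.total_reward = self.model.count_alive()
```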