Trajectory trace & replay - based on Mesa caching
TLDR:
If we can store state, action, reward trajectories (e.g. to a file; see the sketch after this list), this would facilitate:

- off-policy, sample-based learning
- model-based learning
- training the neural networks used in certain modern deep RL algorithms
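As a concrete sketch (not part of Mesa's caching API), a hypothetical `TrajectoryLogger` could append `(state, action, reward)` steps to a JSON-lines file for later replay; the class, field names, and file format here are my own assumptions.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class Step:
    """One (state, action, reward) step of an agent's trajectory."""
    agent_id: int
    t: int          # model tick
    state: Any      # any JSON-serialisable observation
    action: Any
    reward: float


class TrajectoryLogger:
    """Appends trajectory steps to a JSON-lines file for later replay."""

    def __init__(self, path: str):
        self.path = path

    def log(self, step: Step) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(asdict(step)) + "\n")

    def replay(self):
        """Yield logged steps back, e.g. to feed an off-policy learner."""
        with open(self.path) as f:
            for line in f:
                yield Step(**json.loads(line))
```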
RL Recap for ABM:
- In RL there is a Markov Decision Process (MDP) that generates a sequence

  $$s_0, a_0, r_0, s_1, a_1, r_1, \ldots$$

  where $s$ is the state, $a$ the action, and $r$ the reward.
  The goal of RL is to learn an optimal policy $\pi$, in the sense of generating the maximum expected total reward, by picking the best action at each step.
  Note: the reward is for getting to the next state $s'$ using the action $a$.
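To make "maximum expected total reward" concrete, the return and the optimal policy can be written as below; the discount factor $\gamma$ is an addition of mine (the undiscounted case is $\gamma = 1$):

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k},
\qquad
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_0 \right]
$$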
- In model-based RL, agents learn a model of the environment to facilitate many quick planning steps between the more expensive/risky interactions with the environment.
  The model is typically two functions (a tabular estimate from logged traces is sketched below):
  - $P(s' \mid s, a)$ - the Markov chain transitions, denoted as $T$
  - $R(r \mid s, a, s')$ - the reward for the above transition
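A minimal sketch of estimating such a model from logged transitions by counting and averaging; the `TabularModel` class and its dictionary-of-counts representation are my own illustration, not a specific library API.

```python
from collections import defaultdict


class TabularModel:
    """Empirical estimates of T(s'|s,a) and R(s,a,s') built from logged transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.reward_sum = defaultdict(float)                  # (s, a, s') -> total reward
        self.reward_n = defaultdict(int)                      # (s, a, s') -> n

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a, s_next)] += r
        self.reward_n[(s, a, s_next)] += 1

    def T(self, s, a, s_next):
        """Estimated transition probability P(s'|s,a)."""
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0

    def R(self, s, a, s_next):
        """Estimated expected reward for the transition (s, a) -> s'."""
        n = self.reward_n[(s, a, s_next)]
        return self.reward_sum[(s, a, s_next)] / n if n else 0.0
```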
- In deep RL, agents use ML techniques to approximate with a neural network the action-value function $Q_\pi(s,a)$, the policy $\pi$, or the model.
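For instance, a minimal Q-network might look like the following sketch, assuming PyTorch is available; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```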
- Off-policy learning.
  A more general setting in RL has two policies:
  - the behavioral policy, which determines the actions of the agents, e.g.
    - a uniform random policy - all possible actions have equal probability
    - an epsilon-soft policy - all possible actions have at least probability epsilon
  - the target policy, which is the better policy being learned, e.g.
    - $\pi^*$ - the optimal policy.

  (An epsilon-soft behavior policy and the importance-sampling ratio it induces are sketched below.)
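As a sketch, here is a hypothetical epsilon-soft behavior policy together with the per-step importance-sampling ratio $\rho = \pi(a \mid s)/b(a \mid s)$ for a greedy (deterministic) target policy; the function names are mine.

```python
import random


def epsilon_soft_action(q_values: dict, actions: list, epsilon: float = 0.1):
    """Pick an action: greedy w.p. 1 - epsilon, otherwise uniformly at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))


def behavior_prob(a, greedy_a, n_actions: int, epsilon: float = 0.1) -> float:
    """Probability of action a under the epsilon-soft behavior policy b(a|s)."""
    p = epsilon / n_actions
    return p + (1 - epsilon) if a == greedy_a else p


def importance_ratio(a, greedy_a, n_actions: int, epsilon: float = 0.1) -> float:
    """rho = pi(a|s) / b(a|s), where pi is the greedy (deterministic) target policy."""
    pi = 1.0 if a == greedy_a else 0.0
    return pi / behavior_prob(a, greedy_a, n_actions, epsilon)
```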
The point:
If we generate a trace of $s, a, r, \ldots$ from an ABM (plus some technical caveats: it is an MDP, the ABM's behaviour is ergodic, etc.), then RL agents can use these traces to train a more optimal agent using off-policy, sample-based learning methods like:

- first-visit MC with importance sampling
- every-visit MC with importance sampling
- Q-learning
- Expected Sarsa
- Dyna-Q+

They can also learn a model and use it for efficient planning - e.g. how to navigate in a maze which changes over time. (A sketch of Q-learning from a logged trace follows.)
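For instance, a minimal sketch of tabular Q-learning run over a logged trace of `(s, a, r, s_next)` transitions; the step size, discount factor, and transition format are assumptions of mine.

```python
from collections import defaultdict


def q_learning_from_trace(transitions, actions, alpha=0.1, gamma=0.95):
    """Off-policy Q-learning over logged (s, a, r, s_next) tuples.

    The trace can come from any behavior policy; the update bootstraps from
    max_a' Q(s', a'), i.e. the greedy target policy.
    """
    Q = defaultdict(float)  # (s, a) -> value
    for s, a, r, s_next in transitions:
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

The learned greedy policy is then `argmax` over `Q[(s, a)]` for each state.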
The API
Model
- I don't think it's worth converting every ABM model to RL:
- many ABMs don't have a reward structure.
- some don't have much state.
- doing this would mean being creative.
Say we have a forest fire sim, which we could extend as follows (see the sketch after this list).
Add two new agents:

- fire starter - reward: total trees burnt
  - can light a fire
- fire fighter - reward: total trees remaining
  - can cut/move a tree

The sequence of play:

- the fire fighter can cut down k trees before the start of the simulation
- the fire starter can light a fire at some location x
- the fire fighter can cut down k more trees

E.g. can we cut down some trees to stop a forest fire from spreading?
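A rough sketch of what these two agents might look like in Mesa; the model helpers used here (`light_fire`, `cut_tree`, `count_burnt`, `count_alive`, `width`, `height`) are hypothetical stand-ins for whatever a real forest-fire model exposes, and the constructor assumes the Mesa 2.x `Agent(unique_id, model)` signature.

```python
import mesa


class FireStarter(mesa.Agent):
    """RL-controlled agent whose reward is the total number of trees burnt."""

    def __init__(self, unique_id, model):
        super().__init__(unique_id, model)  # Mesa 2.x style constructor (assumed)
        self.total_reward = 0.0

    def step(self):
        # Hypothetical model helpers: a real forest-fire model would expose
        # its own way to ignite a cell and count burnt trees.
        x = self.random.randrange(self.model.width)
        y = self.random.randrange(self.model.height)
        self.model.light_fire((x, y))
        self.total_reward = self.model.count_burnt()


class FireFighter(mesa.Agent):
    """RL-controlled agent whose reward is the number of trees still standing."""

    def __init__(self, unique_id, model, k=5):
        super().__init__(unique_id, model)
        self.k = k  # trees it may cut per phase
        self.total_reward = 0.0

    def step(self):
        for _ in range(self.k):
            x = self.random.randrange(self.model.width)
            y = self.random.randrange(self.model.height)
            self.model.cut_tree((x, y))  # hypothetical helper
        self.total_reward = self.model.count_alive()
```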