In this post I want to meditate on how different RL and ML algorithms learn, and how one might access what is learned in these algorithms. Access in the sense of using it to generalize to new tasks, or to understand what the agent has learned about the world. But also how one might engineer such a capability into other algorithms and into agents in general.
Models in RL approximate the MDP's dynamics, i.e. its transition and reward functions.
- In ML we often use boosting and bagging to aggregate very simple models.
- In RL we often replace the model by sampling from a replay buffer of the agent's past experiences.
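To make the two points above concrete, here is a minimal sketch of a replay buffer together with a tabular maximum-likelihood model estimated from it; all names (`ReplayBuffer`, `estimate_model`) are hypothetical illustrations, not a reference implementation.

```python
import random
from collections import defaultdict

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s') transitions."""
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.data = []

    def add(self, s, a, r, s_next):
        if len(self.data) >= self.capacity:
            self.data.pop(0)  # drop the oldest transition
        self.data.append((s, a, r, s_next))

    def sample(self, k):
        return random.sample(self.data, k)

def estimate_model(buffer):
    """Tabular maximum-likelihood estimates of P(s'|s,a) and E[r|s,a]."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in buffer.data:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    P = {sa: {s2: n / visits[sa] for s2, n in nexts.items()}
         for sa, nexts in counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R
```

Sampling minibatches from the buffer (model-free) and fitting `P` and `R` from it (model-based) are two uses of the same experience store.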
The problem of general AI is in large part the problem of transfer learning in RL.
- Agents learn a very specific policy for a very specific task; the learned representation cannot be mapped to other tasks, or even to other states in the same task.
- What if agent learning were decomposed into:
- learning very general policies that solve more abstract problems, and then
- learning a good composition of these policies to solve the specific problem.
- Only after getting to this point would the agent try to optimize the policy for the specific task.
- e.g. chess
- learn the basic moves and average value of pieces
- learning tactics - short-term goals
- learning about the endgame
- update the value of pieces based on the ending
- learning about strategy
- positional play
- learn about pawn formations and weak squares
- value of pawn formations
- how they can be used with learned tactics.
- the center
- add value to pieces based on their position on the board
- open files and diagonals
- long term plans
- minority attack, king side attack, central breakthrough
- creating a passed pawn
- exchanging to win in the end game
- sacrificing material to get a better position
- attacking the king
- castling
- piece development and the center
- tempo
- positional play
- localize the value of pieces in different positions on the board using the learned tactics and strategy.
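The decomposition above can be sketched as a meta-policy choosing among reusable sub-policies (options). This is a minimal illustration under assumed names (`Option`, `run_meta_policy`); it is not any particular options framework implementation.

```python
class Option:
    """A reusable sub-policy with an initiation test and a termination test."""
    def __init__(self, name, policy, can_start, should_stop):
        self.name = name
        self.policy = policy            # state -> action
        self.can_start = can_start      # state -> bool: initiation set
        self.should_stop = should_stop  # state -> bool: termination condition

def run_meta_policy(state, options, choose, step, max_steps=1000):
    """Repeatedly pick an applicable option and follow it until it terminates.

    `choose` is the meta-level decision: composing general sub-policies
    into a solution for the specific task.  `step` is the environment.
    """
    for _ in range(max_steps):
        applicable = [o for o in options if o.can_start(state)]
        if not applicable:
            break
        option = choose(state, applicable)
        while not option.should_stop(state):
            state = step(state, option.policy(state))
    return state
```

In the chess analogy, each option would be a general skill (a tactic, an endgame technique, a pawn-formation plan) and `choose` is the learned composition; task-specific optimization would only tune these pieces afterwards.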
BNP knowledge patterns
Bayesian nonparametric and hierarchical models encode knowledge in some noteworthy ways:
Priors encode expert beliefs and can be tested using a prior predictive check. Posteriors can be queried in different ways to gain insight into what the model has learned with respect to specific parameters.
Conjugate priors maintain a fixed structure while updating beliefs based on incoming evidence. These can often be interpreted in terms of the weight of the prior and the weight of the evidence.
Hierarchical models allow us to encode dependencies between parameters directly into the model structure.
Hierarchical models use partial pooling to share statistical strength across related groups of data points. This allows the model to initially rely on a prior but shift towards the evidence as it accrues.
Covariance matrices can encode correlations between parameters directly.
- Learning in Bayesian models is about updating initial beliefs based on incoming evidence.
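The conjugate-update idea above can be shown with the standard Beta-Binomial pair, where the posterior mean is literally a weighted average of the prior mean and the empirical rate; a minimal sketch, with hypothetical function names:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus Binomial evidence
    yields a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures

def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution.

    After an update this is a weighted average of the prior mean and the
    empirical success rate: the prior's weight is its pseudo-count
    (alpha + beta), the evidence's weight is the number of observations.
    """
    return alpha / (alpha + beta)
```

With a weak Beta(2, 2) prior (mean 0.5) and 80 successes out of 100 trials, the posterior mean lands close to the empirical 0.8: the model initially relies on the prior but shifts towards the evidence as it accrues.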
Causal inference (CI) may be useful here.
- It is in a big way about mapping knowledge into:
- statistical joint probabilities,
- causal concepts that are not in the joint distribution, like interventions and counterfactuals, latents, missingness, mediators, confounders, etc.
- hypothesizing a causal structural model, deriving a statistical model, and testing it against the data.
- Interventions take the form of actions and options.
- Many key ideas in RL are forms of counterfactual reasoning:
- Off-policy learning is about learning from data generated by a different policy.
- Options are like do operations (interventions)
- Choosing between actions and options is like counterfactual reasoning.
- Using and verifying CI models could be a way to unify spatial and temporal abstraction in RL.
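The difference between observing and intervening can be demonstrated on a toy structural causal model with a confounder: conditioning on X = 1 is not the same as do(X = 1), because the intervention cuts the edge from the confounder into X. This is only an illustrative simulation; all names and parameter values are made up.

```python
import random

def mean_y_given_x1(do_x=None, n=100_000, seed=0):
    """Toy SCM with confounder Z -> X, Z -> Y, and X -> Y.

    With do_x=None we observe the system and average Y over samples
    where X happens to equal 1.  With do_x=1 we force X=1, severing
    the Z -> X edge: the do-operator of causal inference.
    """
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        x = z if do_x is None else do_x      # observation vs intervention
        y = 2.0 * x + 3.0 * z + rng.gauss(0, 0.1)
        if x == 1:
            ys.append(y)
    return sum(ys) / len(ys)
```

Here E[Y | X=1] comes out near 5.0 (X=1 implies Z=1 observationally), while E[Y | do(X=1)] comes out near 3.5 (Z stays at its base rate). In RL terms, an option executed by the agent is a do-operation, so its value must be estimated interventionally, not from passively observed correlations.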
Game theory
In game-theoretic settings, reaching an equilibrium can also be a form of learning, since different equilibria encode different strategies, which generalize the notion of a policy from RL.
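As a small illustration of an equilibrium as a learned strategy profile, here is a brute-force check for pure-strategy Nash equilibria in a two-player, two-action game; the encoding (payoff dict keyed by action pairs) and function names are my own sketch.

```python
def best_responses(payoffs, player, opponent_action):
    """Actions maximizing `player`'s payoff against a fixed opponent action."""
    if player == 0:
        vals = {a: payoffs[(a, opponent_action)][0] for a in (0, 1)}
    else:
        vals = {a: payoffs[(opponent_action, a)][1] for a in (0, 1)}
    m = max(vals.values())
    return {a for a, v in vals.items() if v == m}

def pure_nash_equilibria(payoffs):
    """All action profiles where each player's action is a best response."""
    return [(a0, a1)
            for a0 in (0, 1) for a1 in (0, 1)
            if a0 in best_responses(payoffs, 0, a1)
            and a1 in best_responses(payoffs, 1, a0)]
```

For the Prisoner's Dilemma (0 = cooperate, 1 = defect) the unique pure equilibrium is mutual defection, a "policy" that no unilateral deviation improves; which equilibrium a learning dynamic converges to is itself part of what the agents learn.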

