Policy Gradient

Prediction and Control with Function Approximation

Coursera
notes
rl
reinforcement learning
the k-armed bandit problem
bandit algorithms
exploration
exploitation
epsilon greedy algorithm
sample average method
Author

Oren Bochman

Published

Thursday, April 4, 2024

RL algorithms

Lesson 1: Learning Parameterized Policies

Learning Objectives
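
This lesson motivates learning the policy directly as a parameterized, differentiable function π(a|s,θ), rather than deriving it from learned action values. For discrete actions the standard choice is a softmax over action preferences, π(a|s,θ) = exp(h(s,a,θ)) / Σ_b exp(h(s,b,θ)) (Sutton and Barto 2018, ch. 13). Below is a minimal sketch, assuming linear action preferences h(s,a,θ) = θᵀx(s,a); the feature function x and the argument names are illustrative stand-ins, not names from the course.

import numpy as np

def softmax_policy(theta, x, s, actions):
    """Probability of each action under a softmax over action preferences.

    Assumes linear preferences h(s, a) = theta @ x(s, a); `x` is a
    caller-supplied feature function (an illustrative stand-in).
    """
    prefs = np.array([theta @ x(s, a) for a in actions])
    prefs -= prefs.max()              # shift preferences for numerical stability
    weights = np.exp(prefs)
    return weights / weights.sum()

The softmax assigns every action nonzero probability, so the policy explores naturally, yet it can still approach a deterministic policy as the preferences separate; and because it is differentiable in θ, the policy can be improved by gradient ascent, which is where the next lesson picks up.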

Lesson 2: Policy Gradient for Continuing Tasks

Learning Objectives
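
For continuing tasks the natural objective is the average reward rate, and the policy gradient theorem gives its gradient in a form that does not require differentiating the state distribution (Sutton and Barto 2018, ch. 13). Writing μ for the steady-state distribution under π, the objective and its gradient are, in LaTeX:

r(\pi) \doteq \sum_s \mu(s) \sum_a \pi(a \mid s, \theta) \sum_{s', r} p(s', r \mid s, a)\, r

\nabla r(\pi) = \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)

The significance is that ∇μ(s) never appears: states visited while following π are already weighted by μ, so writing ∇π = π ∇ln π turns the right-hand side into an expectation, E[q_π(S,A) ∇ln π(A|S,θ)], that can be estimated directly from experience.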

Lesson 3: Actor-Critic for Continuing Tasks

Learning Objectives
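
The actor-critic algorithm turns that expectation into an incremental update. A critic learns a differential value estimate v̂(s,w); its TD error δ = R − R̄ + v̂(S′,w) − v̂(S,w) is computed relative to the average-reward estimate R̄, and the same δ drives the critic, the actor, and the reward-rate estimate (Sutton and Barto 2018, ch. 13). Here is a minimal sketch of one step, assuming the caller supplies the function approximators; the helper names v_hat, grad_v, and grad_ln_pi are mine, not the course's.

def actor_critic_step(s, a, r, s_next, theta, w, avg_r,
                      v_hat, grad_v, grad_ln_pi,
                      alpha_theta, alpha_w, alpha_r):
    """One step of average-reward actor-critic (after Sutton & Barto, ch. 13).

    v_hat(s, w), grad_v(s, w), and grad_ln_pi(s, a, theta) are
    caller-supplied function approximators; the names are illustrative.
    """
    # TD error relative to the current average-reward estimate
    delta = r - avg_r + v_hat(s_next, w) - v_hat(s, w)
    avg_r = avg_r + alpha_r * delta                       # update reward-rate estimate
    w = w + alpha_w * delta * grad_v(s, w)                # critic: semi-gradient TD
    theta = theta + alpha_theta * delta * grad_ln_pi(s, a, theta)  # actor
    return theta, w, avg_r

Using δ rather than a raw value estimate means the state value v̂(S,w) acts as a baseline, which substantially reduces the variance of the actor's updates.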

Lesson 4: Policy Parameterizations

Learning Objectives
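
The course closes by deriving the eligibility vector ∇ln π(a|s,θ) for concrete parameterizations. For the softmax with linear action preferences it is x(s,a) − Σ_b π(b|s,θ) x(s,b); for continuous actions a Gaussian policy is used instead, with mean μ(s) = θ_μᵀx(s) and standard deviation σ(s) = exp(θ_σᵀx(s)), the exponential keeping σ positive (Sutton and Barto 2018, ch. 13). A sketch of the Gaussian eligibility vectors, with the feature vector x_s and all names assumed for illustration:

import numpy as np

def gaussian_eligibility(theta_mu, theta_sigma, x_s, a):
    """Eligibility vectors (gradients of ln pi) for a linear Gaussian policy.

    mu(s)    = theta_mu @ x_s
    sigma(s) = exp(theta_sigma @ x_s)   # exp keeps sigma positive
    All argument names are illustrative stand-ins.
    """
    mu = theta_mu @ x_s
    sigma = np.exp(theta_sigma @ x_s)
    grad_mu = (a - mu) / sigma**2 * x_s                 # d ln pi / d theta_mu
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_s   # d ln pi / d theta_sigma
    return grad_mu, grad_sigma

These two vectors plug directly into the actor update from the previous lesson, giving an average-reward actor-critic that handles continuous action spaces.
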
Discussion Prompt

Are tasks ever really continuing? Everything eventually breaks or dies, so no agent truly lives forever; and it's clear that individual people do not learn from their own death. Why, then, might the continuing problem formulation be a reasonable model for long-lived agents?

References

Sutton, R. S., and A. G. Barto. 2018. Reinforcement Learning, Second Edition: An Introduction. Adaptive Computation and Machine Learning Series. MIT Press. http://incompleteideas.net/book/RLbook2020.pdf.

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:
@online{bochman2024,
  author = {Bochman, Oren},
  title = {Policy {Gradient}},
  date = {2024-04-04},
  url = {https://orenbochman.github.io/notes/RL/c3-w4.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2024. “Policy Gradient.” April 4, 2024. https://orenbochman.github.io/notes/RL/c3-w4.html.