In week 1 we define some key concepts like rewards, states, actions, value functions, and action-value functions. We consider the multi-armed bandit problem, which leads to the exploration-exploitation dilemma and the epsilon-greedy algorithm.
In reinforcement learning, the agent generates its own training data by interacting with the world. The agent must learn the consequences of its own actions through trial and error, rather than being told the correct action. [@white2020fundamental]
K-armed bandits
In the k-armed bandit problem there is an agent who is assigned a state by the environment and must learn, through the reward signal, which action a from the set of possible actions A has the greatest expected reward.
One way this can be achieved is using a Bayesian updating scheme starting from a uniform prior.
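For instance (a minimal sketch, assuming Bernoulli rewards, which the lecture does not specify), a uniform Beta(1, 1) prior over an arm's success probability can be updated in closed form after every pull:

```python
import numpy as np

# Minimal sketch: Bayesian updating for a single Bernoulli arm, starting
# from a uniform Beta(1, 1) prior. The true success probability is a
# hypothetical value chosen for illustration.
rng = np.random.default_rng(0)
true_p = 0.6                 # hypothetical true reward probability
alpha, beta = 1.0, 1.0       # uniform prior

for _ in range(100):
    reward = int(rng.random() < true_p)   # pull the arm: Bernoulli reward
    alpha += reward                       # count successes
    beta += 1 - reward                    # count failures

print(f"posterior mean: {alpha / (alpha + beta):.2f}")   # close to true_p
```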
Temporal nature of the bandit problem
The bandit problem can be a static problem with a fixed reward distribution. More generally, however, it is a temporal problem in which the reward distribution changes over time and the agent must learn to adapt to these changes.
Difference between bandits and RL
In the typical bandit setting there is only one state. So after we pull the arm nothing in the problem changes.
Bandit problems where agents can discriminate between states are called contextual bandits.
However, bandits embody one of the main themes of RL - that of estimating an expected reward for different actions.
In the more general RL setting, actions lead the agent to new states and the goal is some specific state we need to reach.
bandit
Example 1 (Using a multi-armed bandit to randomize a medical trial)
Consider a randomized trial in which a doctor needs to pick one of three treatments:
- the agent is the doctor
- the actions are the treatments {blue, yellow, red}, so k = 3
- the rewards are the patients' health, measured by their blood pressure
- q(a) is the mean effect of treatment a on a patient's blood pressure
clinical trial
Action Values and Greedy Action Selection
The value of an action is its expected reward, which can be expressed mathematically as:
q(a) \doteq \mathbb{E}[r \vert a] = \sum_r p(r \vert a) \, r
\tag{1}
where \mathbb{E}[r \vert a] means the expectation of the reward given some action a; the second equality follows from the definition of expectation.
The goal of the agent is to maximize the expected reward, which we can express mathematically as:
\arg\max_a q(a) = \arg\max_a \sum_r p(r \vert a) \, r \qquad \text{(Greedification)}
\tag{2}
where:
\arg\max_a means "the argument a that maximizes", so the agent is looking for the action that maximizes the expected reward, and the outcome is an action.
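As a small worked sketch of equation (2), with a made-up reward distribution p(r \vert a), the greedy action is just the argmax of the expected rewards:

```python
import numpy as np

# Hypothetical reward distributions p(r | a) over rewards r = 0, 1, 2
# for three actions; rows are actions, columns are reward values.
rewards = np.array([0.0, 1.0, 2.0])
p_r_given_a = np.array([
    [0.5, 0.3, 0.2],   # action 0
    [0.1, 0.4, 0.5],   # action 1
    [0.3, 0.4, 0.3],   # action 2
])

q = p_r_given_a @ rewards          # q(a) = sum_r p(r | a) * r
greedy_action = np.argmax(q)       # greedification: pick argmax_a q(a)
print(q, greedy_action)            # [0.7 1.4 1.0] -> action 1
```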
Reward, Return, and Value Functions
The reward r is the immediate feedback from the environment after the agent takes an action.
The return G_t is the total discounted reward from time-step t.
The value function v(s) of an MRP is the expected return starting from state s.
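In the standard notation (with discount factor \gamma \in [0, 1], not yet defined above), these definitions read:

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

v(s) \doteq \mathbb{E}[G_t \mid S_t = s]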
decisions
Examples of decisions under uncertainty:
movie recommendation.
clinical trials.
music recommendation.
food ordering at a restaurant.
why discuss bandits
It is best to consider issues and algorithm design choices in the simplest setting first. The bandit problem is the simplest setting in RL. More advanced algorithms will incorporate the parts we use to solve this simple setting.
Maximizing rewards, balancing exploration and exploitation, and estimating expected rewards for different actions are all problems we will encounter in both the bandit setting and the more general RL setting.
Lesson 2: What to learn: understanding Action Values
Goals
What are action-value estimation methods?
estimating action values
In tabular RL settings, the action-value function q is nothing more than a table with one {state, action} pair per row together with its value. More generally, as when we consider function approximation in course 3, it is a mapping from a {state, action} pair to an expected reward.
| State s | Action a | Action value q(s,a) |
|---------|------------------|---------------------|
| 0 | red treatment | 0.25 |
| 0 | yellow treatment | 0.75 |
| 0 | blue treatment | 0.5 |
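As a minimal sketch (not code from the course), the toy table above could be represented as a Python dictionary keyed by (state, action) pairs, with greedy selection reading off the best action in a given state:

```python
# Minimal sketch of a tabular action-value function, using the toy values
# from the table above (state 0, three treatments).
q_table = {
    (0, "red treatment"): 0.25,
    (0, "yellow treatment"): 0.75,
    (0, "blue treatment"): 0.5,
}

def greedy_action(q_table, state):
    # Pick the action with the highest q(s, a) for the given state.
    candidates = {a: v for (s, a), v in q_table.items() if s == state}
    return max(candidates, key=candidates.get)

print(greedy_action(q_table, 0))  # "yellow treatment"
```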
The higher the action value q(a) of an action a, the more likely it is to lead us to a better state, one closer to the objective. By choosing for each state the best (or one of the best) actions, we obtain a plan for navigating the state space to the goal state.
Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{I}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{I}_{A_i=a}} \qquad
\tag{3}
The main idea of RL is that we can propagate values from one state to an adjacent one. We can start with the uniform stochastic policy and use it to estimate/learn the action values. Action values will decrease for actions that lead to a dead end, and they will increase in the direction of the goal, but only once the influence of the goal has propagated. A continuing theme in RL is trying to increase the efficiency of propagating rewards across the action values.
Knowing the minimum number of actions needed to reach a goal can be an approximate indicator of the action value.
A second idea is that once we have let the influence of dead ends and goals spread enough, we may have enough information to improve the initial action values to the point where, in each state, the chosen action is one of the best choices. We call picking one of the best actions greedy selection, and it leads to a deterministic policy. This is an optimal policy; it might not be unique, since some actions might be tied in terms of their rewards, but none of these policies can be improved upon.
Exploration and Exploitation definition and dilemma
In the bandit setting we can define:
Exploration
Testing any action that might be better than our best.
Exploitation
Using the best action.
Should the doctor explore new treatments that might harm their patients, or exploit the current best treatment? In real life bacteria gain immunity to antibiotics, so there is merit to exploring new treatments. However, a new treatment can be harmful to some patients. Ideally we would enjoy the benefits of the best treatment while remaining open to new and better alternatives, but we can only do one at a time.
Since exploitation is by definition mutually exclusive with exploration, we must choose one and give up the benefits of the other. This is the dilemma of exploration and exploitation. How an agent resolves this dilemma in practice depends on the agent's preferences and the type of state space it inhabits: if it has just started, or encounters a changing landscape, it should make an effort to explore; on the other hand, if it has explored enough to be certain of a global maximum, it would prefer to exploit.
Defining online learning
Online learning
Learning by updating the agent's value function or action-value function step by step as the agent traverses the states seeking the goal. Online learning is important for handling MDPs that can change over time.
One simple way an agent can use online learning is to try actions at random and keep track of the subsequent states and rewards. Eventually we should reach the goal state. If we repeat this many times we can estimate the expected reward for each action.
Sample Average Method for estimating Action Values Incrementally
Action values help us make decisions. Let's make estimating action values more formal using the following method:
q_t(a)=\frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}
=\frac{\sum_{i=1}^{t-1} R_i \,\mathbb{I}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb{I}_{A_i=a}} \qquad
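The same sample average can be maintained incrementally, so the full reward history never needs to be stored. A minimal sketch of this update (the names are my own, not from the course):

```python
import numpy as np

# Minimal sketch: incremental form of the sample-average estimate above.
# Instead of re-summing the reward history, each reward nudges Q(a) by
# 1/N(a) of the error: Q_new = Q_old + (R - Q_old) / N.
num_arms = 3
Q = np.zeros(num_arms)   # action-value estimates
N = np.zeros(num_arms)   # how many times each action was taken

def update(action, reward):
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

update(1, 0.75)   # e.g. the yellow treatment returned a reward of 0.75
print(Q)          # [0.   0.75 0.  ]
```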
The ε-greedy policy uses a simple heuristic to balance exploration with exploitation. The idea is to choose the best action with probability 1-\epsilon and to choose a random action with probability \epsilon.
A problem with ε-greedy is that it is not optimal in the long run.
Even after it has found the best course of action it will continue to explore with probability \epsilon.
This is because the policy is not adaptive.
One method is to reduce \epsilon over time. However, unless there is feedback from the environment, this will likely stop exploring too soon or too late, thus providing sub-optimal returns.
The following is a simple implementation of the ε-greedy algorithm in Python, from geeksforgeeks.org
```python
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt


# Define Action class
class Actions:
    def __init__(self, m):
        self.m = m         # true mean reward of this arm
        self.mean = 0      # current estimate of the arm's mean reward
        self.N = 0         # number of times this arm has been pulled

    # Choose a random action
    def choose(self):
        return np.random.randn() + self.m

    # Update the action-value estimate
    def update(self, x):
        self.N += 1
        self.mean = (1 - 1.0 / self.N) * self.mean + 1.0 / self.N * x


def run_experiment(m1, m2, m3, eps, N):
    actions = [Actions(m1), Actions(m2), Actions(m3)]
    data = np.empty(N)

    for i in range(N):
        # epsilon greedy
        p = np.random.random()
        if p < eps:
            j = np.random.choice(3)
        else:
            j = np.argmax([a.mean for a in actions])
        x = actions[j].choose()
        actions[j].update(x)

        # for the plot
        data[i] = x
    cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

    # plot moving average ctr
    plt.plot(cumulative_average)
    plt.plot(np.ones(N) * m1)
    plt.plot(np.ones(N) * m2)
    plt.plot(np.ones(N) * m3)
    plt.xscale('log')
    plt.show()

    for a in actions:
        print(a.mean)

    return cumulative_average
```
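As a usage sketch of the run_experiment function above (the arm means, \epsilon and horizon are arbitrary choices, not values from the source):

```python
# Run the three-armed experiment: true means 1.5, 2.5 and 3.5, epsilon = 0.1,
# 100,000 pulls. The cumulative average reward should approach the best mean (3.5).
if __name__ == "__main__":
    run_experiment(1.5, 2.5, 3.5, eps=0.1, N=100000)
```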
In the short term we may maximize rewards by following the best known course of action. However, this may represent only a local maximum.
In the long term, agents that explore different options will keep uncovering better ones until they find the best course of action, corresponding to the global maximum.
To get the best of both worlds we need to balance exploration and exploitation, ideally using a policy that uses feedback to adapt to its environment.
Optimistic initial values
Optimistic initial values
Setting all initial action values greater than any reward the environment can actually provide (e.g., above the reward range [0, 1]).
The methods we have discussed are dependent on the initial action-value estimates, Q_1(a). In the language of statistics, we call these methods biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once. For methods with constant \alpha, the bias is permanent, though decreasing over time.
Benefits of optimistic initial values for early exploration
Setting the initial action values higher than the true values causes various bandit algorithms to try to exploit them, only to find that most actions are not as rewarding as the optimistic estimates suggested.
What happens is that the algorithm will initially explore more than it would have otherwise, possibly even trying every action at least once.
In the short term it will perform worse than ε-greedy, which tends to exploit. But as more of the state space is explored at least once, the algorithm will beat an ε-greedy policy, which can take far longer to explore the space and find the optimal options.
The effect of optimistic initial action-value estimates
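A minimal sketch of the idea on a toy bandit (the true means, the optimistic constant and the step size are my own assumptions): initialize every estimate well above any achievable reward, act purely greedily, and let a constant step size wash the optimism out.

```python
import numpy as np

# Minimal sketch of optimistic initial values on a toy 3-armed bandit.
# The true means, the optimistic constant (5.0) and the step size are assumptions.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])
Q = np.full(3, 5.0)     # start far above any reward the arms can give
alpha = 0.1             # constant step size

for _ in range(1000):
    a = int(np.argmax(Q))                       # purely greedy selection
    reward = rng.normal(true_means[a], 1.0)     # noisy reward
    Q[a] += alpha * (reward - Q[a])             # estimate drifts down toward truth

print(Q.round(2))   # every arm was tried; the best arm ends up preferred
```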
Criticisms of optimistic initial values
Optimistic initial values only drive early exploration. The agent will stop exploring once this is done.
For non-stationary problems this is inadequate.
In real-world problems the maximum reward is an unknown quantity.
The UCB action selection method
UCB is an acronym for Upper Confidence Bound. The idea behind it is to select the action that has the highest upper confidence bound. This has the advantage over epsilon-greedy that it explores more in the beginning and then exploits more as the algorithm progresses.
A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
where:
- c is a constant that determines the degree of exploration
- N_t(a) is the number of times action a has been selected prior to time t
UCB intuition
The idea is that we pick the action for which the action value plus the largest plausible uncertainty gives the highest sum. We are being optimistic in assuming this choice will give the highest reward; in reality any value in the confidence interval could be the true value. Each time we select an action we reduce the uncertainty in the exploration term, tempering our optimism about the upper confidence bound by the number of times we have selected that action. This means we will prefer to visit the actions that have not been visited as often.
The main advantage of UCB is that it is more efficient than epsilon-greedy in the long run: if we measure the cost of learning in terms of regret, the difference between the expected reward of the optimal action and the expected reward of the action we chose, UCB has lower regret than epsilon-greedy. The downside is that it is more complex and requires more computation.
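A minimal sketch of UCB action selection on a toy bandit (the true means and c are assumptions, not values from the course):

```python
import numpy as np

# Minimal sketch of UCB action selection on a toy 3-armed bandit.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])
Q = np.zeros(3)     # action-value estimates
N = np.zeros(3)     # selection counts
c = 2.0             # exploration constant

for t in range(1, 1001):
    if np.any(N == 0):
        a = int(np.argmin(N))      # try every untried arm once first
    else:
        a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
    reward = rng.normal(true_means[a], 1.0)
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]   # incremental sample average

print(Q.round(2), N)   # most pulls concentrate on the best arm
```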
Thompson sampling is basically like UCB but taking the Bayesian approach to the bandit problem. We start with a prior distribution over the action values and then update this distribution as we take actions. The action we choose is then sampled from the posterior distribution. This has the advantage that it is more robust to non-stationary problems than UCB. The downside is that it is more computationally expensive.
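A minimal sketch of Thompson sampling for Bernoulli arms (the true success probabilities are made up), with a Beta posterior per arm starting from a uniform prior:

```python
import numpy as np

# Minimal sketch of Thompson sampling on a Bernoulli bandit.
# Each arm keeps a Beta(alpha, beta) posterior over its success probability.
rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.5, 0.8])   # hypothetical success probabilities
alpha = np.ones(3)                   # uniform Beta(1, 1) priors
beta = np.ones(3)

for _ in range(1000):
    samples = rng.beta(alpha, beta)          # one draw per arm from its posterior
    a = int(np.argmax(samples))              # act greedily w.r.t. the sampled values
    reward = int(rng.random() < true_p[a])   # Bernoulli reward
    alpha[a] += reward                       # posterior update: successes
    beta[a] += 1 - reward                    # posterior update: failures

print((alpha / (alpha + beta)).round(2))   # posterior means; best arm gets most pulls
```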
Optimistic initial values, then, are a heuristic to ensure initial exploration of all actions by assuming that untried actions have a high expected reward. We then try to exploit them but end up successively downgrading their expected reward when they do not match our initial optimistic assessment.
The downside of this approach shows up when the space of actions is continuous, since then we can never finish trying all the actions and reap the benefits of that initial exploration.
```python
from tqdm import tqdm
from mesa import Model, Agent
from mesa.time import RandomActivation
import numpy as np


class EpsilonGreedyAgent(Agent):
    """This agent implements the epsilon-greedy"""

    def __init__(self, unique_id, model, num_arms, epsilon=0.1):
        super().__init__(unique_id, model)
        self.num_arms = num_arms
        self.epsilon = epsilon
        self.q_values = np.zeros(num_arms)       # Initialize Q-value estimates
        self.action_counts = np.zeros(num_arms)  # Track action counts

    def choose_action(self):
        if np.random.rand() < self.epsilon:
            # Exploration: Choose random arm
            return np.random.randint(0, self.num_arms)
        else:
            # Exploitation: Choose arm with highest Q-value
            return np.argmax(self.q_values)

    def step(self, model):
        chosen_arm = self.choose_action()
        reward = model.get_reward(chosen_arm)
        assert reward is not None, "Reward is not provided by the model"
        self.action_counts[chosen_arm] += 1
        self.q_values[chosen_arm] = (
            self.q_values[chosen_arm] * self.action_counts[chosen_arm] + reward
        ) / (self.action_counts[chosen_arm] + 1)


class TestbedModel(Model):
    """This model represents the 10-armed bandit testbed environment."""

    def __init__(self, num_arms, mean_reward, std_dev, num_agents=1):
        super().__init__()
        self.num_agents = num_agents
        self.num_arms = num_arms
        self.mean_reward = mean_reward
        self.std_dev = std_dev
        self.env_init()
        # self.arms = [None] * num_arms  # List to store arm rewards
        self.schedule = RandomActivation(self)
        for i in range(self.num_agents):
            self.create_agent(EpsilonGreedyAgent, i, 0.1)

    def env_init(self, env_info={}):
        self.arms = np.random.randn(self.num_arms)  # Initialize arm rewards

    def create_agent(self, agent_class, agent_id, epsilon):
        """Create an RL agent instance with the specified class and parameters."""
        agent = agent_class(agent_id, self, self.num_arms, epsilon)
        self.schedule.add(agent)
        return agent

    def step(self):
        for agent in self.schedule.agents:
            chosen_arm = agent.choose_action()
            reward = np.random.normal(self.mean_reward, self.std_dev)
            self.arms[chosen_arm] = reward  # Update arm reward in the model
            agent.step(self)  # Pass the model instance to the agent for reward access

    def get_reward(self, arm_id):
        # Access reward from the stored list
        return self.arms[arm_id]


# Example usage
model = TestbedModel(10, 0, 1)  # Create model with 10 arms
num_runs = 200    # The number of times we run the experiment
num_steps = 1000  # The number of pulls of each arm the agent takes

# Run simulation for multiple steps
for _ in tqdm(range(num_runs)):
    for _ in range(num_steps):
        model.step()
    model.step()
```