
reference:
UCL Course on RL
Introduction
Policy-Based Reinforcement Learning
- In the last lecture we approximated the value or action-value function using parameter $theta$
- A policy was generated directly from the value function
- In this lecture, we will directly parametrise the policy
$$ pi_{theta} (s,a) = P(a|s,theta) $$ - We will focus again on model-free reinforcement learning
Value-Based and Policy-Based RL
- Value Based
- Learnt Value Function
- Implicit policy (e.g. $epsilon$-greedy)
- Policy Based
- No Value Function
- Learnt Policy
- Actor-Critic
- Learnt Value Function
- Learnt Policy
Advantages of Policy-Based RL
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages: - Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance
Policy Search
Policy Objective Functions
- Goal: given policy $pi_{theta} (s,a) $ with parameters $theta$, find best $theta$.
- But how do we measure the quality of a policy $pi_theta$?
- In episodic environments we can use the start value.
$$ J_1 (theta) = V^{pi_theta} (s_1) = E_{pi_theta} (v_1) $$ - In continuing environments we can use the average value
$$ J_{avV} (theta) = sum_{s} d^{pi_theta} (s) V^{pi_theta} (s) $$ - Or the average reward per time-step
$$ J_{avR} (theta) = sum_{s} d^{pi_theta} (s) sum_{a} pi_theta (s,a) R_s^a $$ - where $d^{pi_theta} (s)$ is stationary distribution of Markov chain for $pi_theta$
Policy Optimisation
- Policy based reinforcement learning is an optimisation problem
- Find $theta$ that maximises $J(theta)$
- Some approaches do not use gradient
- Hill climbing
- Simplex/amoeba/Nelder Mead
- Genetic algorithms
- Greater efficiency often possible using gradient
- Gradient descent
- Conjugate gradient
- Quasi-Newton
- We focus on gradient descent, many extensions possible
- And on methods that exploit sequential structure
Finite Difference Policy Gradient
Policy Gradient
- Let $J(theta)$ be any policy objective function
- Policy gradient algorithms search for a local maximum in $J(theta)$ by ascending the gradient of the policy, w.r.t parameters $theta$
$$ Delta theta = alpha nabla_theta J(theta) $$ - Where $Delta_theta J(theta)$ is the policy gradient, and $alpha$ is a step-size parameter
Computing Gradients By Finite Differences
- To evaluate policy gradient of $pi_theta (s,a)$
- For each dimension $k in [1,n]$
- Estimate $k$th partial derivative of objective function w.r.t $theta$
- By perturbing $theta$ by small amount $epsilon$ in $k$th dimension
- Uses $n$ evaluations to compute policy gradient in $n$ dimensions
- Simple, noisy, inefficient - but sometimes effective
- Works for arbitrary policies, even if policy is not differentiable
Monte-Carlo Policy Gradient
Likelihood Ratios
Score Function
- We now compute the policy gradient analytically
- Assume policy $pi_theta$ is differentiable whenever it is non-zero
- and we know the gradient $nabla_theta pi_theta (s,a) $
- Likelihood ratios exploit the following identity
$$ nabla_theta pi_theta (s,a) = pi_theta (s,a) frac{nabla_theta pi_theta (s,a)}{pi_theta (s,a)} = pi_{theta} (s,a) nabla_theta log pi_theta (s,a) $$ - The score function is $nabla_theta log pi_{theta} (s,a)$
Softmax Policy
- We will use a softmax policy as a running example
- Weight actions using linear combination of features $phi(s,a)^T theta$
- Probability of action is proportional to exponential weight
$$ pi_theta (s,a) propto e^{phi(s,a)^T theta} $$ - The score function is
$$ nabla_theta log pi_theta (s,a) = phi(s,a) - E_{pi_theta} (phi(s,cdot)) $$
Gaussian Policy
- In continuous action spaces, a Gaussian policy is natural
- Mean is a linear combination of state feature $mu(s) = phi(s)^T theta$
- Variance may be fixed $sigma^2$, or can also parametrised
- Policy is Gaussian, $a sim N(mu(s),sigma^2)$
- The score function is
$$ nabla_theta log pi_{theta} (s,a) = frac{(a-mu(s))phi(s)}{sigma^2} $$
Policy Gradient Theorem
One-Step MDPs
- Consider a simple class of one-step MDPs
- Starting in state $ssim d(s)$
- Terminating after one time-step with reward $r=R_{s,a}$
- Use likelihood ratios to compute the policy gradient
$$ J(theta) = E_{pi_theta} (r) = sum_{s in S} d(s) sum_{a in A} pi_theta (s,a) R_{s,a} \ nabla_theta J(theta) = sum_{s in S} d(s) sum_{a in A} pi_theta (s,a) nabla_theta log pi_theta (s,a) R_{s,a} = E_{pi_theta} (nabla_theta log pi_theta (s,a) r) $$
Policy Gradient Theorem
- The policy gradient theorem generalised the likelihood ratio approach to multi-step MDPs
- Replaces instantaneous reward $r$ with long-term value $Q^pi (s,a)$
- Policy gradient theorem applies to start state objective, average reward and average value objective, average reward and average value objective
Theorem
For any differentiable policy $pi_theta(s,a)$, for any of the policy objective functions $J=J_1,J_{avR}$ or $frac{1}{1-gamma} J_{avV}$, the policy gradient is
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q^{pi_theta} (s,a)right] $$
Monte-Carlo Policy Gradient(REINFORCE)
- Update parameters by stochastic gradient ascent
- Using policy gradient theorem
- Using return $v_t$ as an unbiased sample of $Q^{pi_theta} (s_t,a_t)$
$$ Delta theta_t = alpha nabla_theta log pi_theta (s_t,a_t)v_t $$
Actor-Critic Policy Gradient
Introduction to AC
Reducing Variance Using a Critic
- Monte-Carlo policy gradient still has high variance
- We use a critic to estimate the action-value function, $ Q_w (s,a) approx Q^{pi_theta} (s,a)$
- Actor-critic algorithms maintain two sets of parameters
- Critic Updates action-value function parameters $w$
- Actor Updates policy parameters $theta$, in direction suggested by critic
- Actor-critic algorithms follow an approximate policy gradient
$$ nabla_theta J(theta) approx E_{pi_theta} (nabla_theta log pi_theta (s,a) Q_w (s,a) ) \ Delta theta = alpha nabla_theta log pi_theta (s,a) Q_w (s,a) $$
Estimating the Action-Value Function
- The critic is solving a familiar problem: policy evaluation
- How good is policy $pi_theta$ for current parameters $theta$?
- This problem was explored in previous two lectures, e.g.
- Monte-Carlo policy evaluation
- Temporal-Difference learning
- TD($lambda$)
- Could also use e.g least-squares policy evaluation
Action-value Actor-Critic
- Simple actor-critic algorithm based on action-value critic
- Using linear value fn approx $Q_w (s,a) = phi(s,a)^T w $
- Critic Updates $w$ by linear TD(0)
- Actor Updates $theta$ by policy gradient

Compatible Function Approximation
Bias in Actor-Critic Algorithms
- Approximating the policy gradient introduces bias
- A biased policy gradient may not find the right solution
- e.g if $Q_w(s,a)$ uses aliased features, can we solve gridworld example?
- Luckily, if we choose value function approximation carefully
- Then we can avoid introducing any bias
- i.e We can still follow the exact policy gradient
Compatible Function Approximation
Theorem(Compatible Function Approximation Theorem)
If the following two conditions are satisfied
- Value function approximator is compatible to the policy
$$ nabla_w Q_w(s,a) = nabla_theta log pi_theta (s,a) $$- Value function parameters $w$ minimise the mean-squared error
$$ epsilon = E_{pi_theta} left[ (Q^{pi_theta}(s,a) -Q_w(s,a))^2right] $$
Then the policy gradient is exact
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q_w (s,a) right] $$
Proof of Compatible Function Approximation Theorem
If $w$ is chosen to minimise mean-squared error, gradient of $epsilon$ w.r.t $w$ must be zero
So $Q_w (s,a)$ can be substituted directly into the policy gradient
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q_w (s,a) right] $$
Advantage Function Critic
Reducing Variance Using a Baseline
- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce variance, without changing expectation
$$ E_{pi_theta} left[ nabla_theta log pi_theta (s,a) B(s) right] = sum_{s in S} d^{pi_theta} (s) sum_s nabla_theta pi_theta (s,a) B(s) \ = sum_{s in S} d^{pi_theta} B(s) nabla_theta sum_{a in A} pi_theta (s,a) = 0 $$ - A good baseline is the state value function $B(s) = V^{pi_theta} (s)$
- So we can rewrite the policy gradient using the advantage function $A^{pi_theta} (s,a)$
$$ A^{pi_theta} (s,a) = Q^{pi_theta} (s,a) - V^{pi_theta} (s) \ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) A^{pi_theta} (s,a) right] $$
Estimating the Advantage Function
- The advantage function can significantly reduce variance of policy gradient
- So the critic should really estimate the advantage function
- For example, by estimating both $V^{pi_theta} (s)$ and $Q^{pi_theta} (s,a)$
- Using two function approximating and two parameter vectors
$$ V_v (s) approx V^{pi_theta} (s) \ Q_w (s,a) approx Q^{pi_theta} (s,a) \ A(s,a) = Q_w(s,a) - V_v (s) $$ - And updating both value functions by e.g TD learning
- For the true value function $V^{pi_theta} (s)$, the TD error $delta^{pi_theta}$
$$ delta^{pi_theta} = r + gamma V^{pi_theta} (s’) - V^{pi_theta} (s)$$ - is an unbiased estimate of the advantage function
$$ E_{pi_theta} left[ delta^{pi_theta} | s,aright]= E_{pi_theta} left[r+gamma V^{pi_theta} (s’) | s,a right] - V^{pi_theta} (s) \ = Q^{pi_theta} (s,a) -V^{pi_theta} (s) = A^{pi_theta} (s,a) $$ - So we can use the TD error to compute the policy gradient
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) delta^{pi_theta} right] $$ - In practice we can use an approximate TD error
$$ delta_v = r + gamma V_v (s’) - V_v (s) $$ - This approach only requires one set of critic parameters $v$
Eligibility Traces
Critics at Different Time-Scales
- Critic can estimate value function $V_theta(s)$ from many targets at different time-scales
- For MC, the target is the return $v_t$
$$ Delta theta = alpha (v_t - V_theta(s)) phi(s) $$ - For TD(0), the target is the TD target $r+ gamma V(s’)$
$$ Delta theta = alpha (r + gamma V(s’) - V_theta(s)) phi(s) $$ - For forward-view TD($lambda$), the target is the $lambda$-return $v_t^lambda$
$$ Delta theta =alpha (v_t^lambda - V_theta (s)) phi(s) $$ - For backward-view TD($lambda$), we use eligibility traces
$$ delta_t = r_{t+1} + gamma V(s_{t+1}) - V(s_t) \ e_t = gamma lambda e_{t-1} + phi(s_t) \ Delta theta = alpha delta_t e_t $$
- For MC, the target is the return $v_t$
Policy Gradient with Eligibility Traces
- Just like forward-view TD($lambda$), we can mix over time-scales
$$ Delta theta = alpha (v_t^lambda - V_v (s_t) ) nabla_theta log pi_theta (s_t,a_t) $$ - where $v_t^lambda -V_v(s_t)$ is a biased estimate of advantage fn
- Like backward-view TD($lambda$), we can also use eligibility traces
- By equivalence with TD($lambda$), substituting $phi(s) = nabla_theta log pi_theta (s,a)$
$$ delta_t = r_{t+1} + gamma V(s_{t+1}) - V(s_t) \ e_{t+1} = lambda e_{t} + nabla_theta log pi_theta (s,a) \ Delta theta = alpha delta_t e_t $$
- By equivalence with TD($lambda$), substituting $phi(s) = nabla_theta log pi_theta (s,a)$
- This update can be applied online, to incomplete sequences
Natural Policy Gradient
Alternative Policy Gradient Direction
- Gradient ascent algorithm can follow any ascent direction
- A good ascent direction can significantly speed convergence
- Also, a policy can often be reparametrised without changing action probabilities
- For example, increasing score of all actions in a softmax policy
- The vanilla gradient is sensitive to these reparametrisations
Natural Policy Gradient
- The natural policy gradient is parametrisation independent
- It finds ascent direction that is closet to vanilla gradient, when changing policy by a small, fixed amount
$$ nabla_theta^{nat} pi_theta (s,a) = G_theta^{-1} nabla_theta pi_theta (s,a) $$ - where $G_theta$ is the Fisher information matrix
$$ G_theta = E_theta left[ nabla_theta log pi_theta (s,a) nabla_theta log pi_theta (s,a)^T right] $$
Natural Actor-Critic
- Using compatible function approximation
$$ nabla_w A_w (s,a) = nabla_theta log pi_theta (s,a) $$ - So the natural policy gradient simplifies,
$$ nabla_theta J(theta) = E_{theta_pi} left[ nabla_theta log pi_theta (s,a) A^{pi_theta} (s,a) right] \ = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) nabla_theta log pi_theta (s,a)^T w right] = G_theta w $$
so we can get $ nabla_theta^{nat} J(theta) = w $ - update actor parameters in direction of critic parameters
Summary of Policy Gradient Algorithms
- The policy gradient has many equivalent forms

- Each leads a stochastic gradient ascent algorithm
- Critic uses policy evaluation (e.g MC or TD learning) to estimate $Q^pi (s,a), A^pi (s,a)$ or $V^pi (s)$




近期评论