rl7 policy gradient

reference:
UCL Course on RL

Introduction

Policy-Based Reinforcement Learning

In the last lecture we approximated the value or action-value function using parameter $theta$
A policy was generated directly from the value function
In this lecture, we will directly parametrise the policy
$$ pi_{theta} (s,a) = P(a|s,theta) $$
We will focus again on model-free reinforcement learning

Value-Based and Policy-Based RL

Value Based
- Learnt Value Function
- Implicit policy (e.g. $epsilon$-greedy)
Policy Based
- No Value Function
- Learnt Policy
Actor-Critic
- Learnt Value Function
- Learnt Policy

Advantages of Policy-Based RL
Advantages:

Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies
Disadvantages:
Typically converge to a local rather than global optimum
Evaluating a policy is typically inefficient and high variance

Policy Search

Policy Objective Functions

Goal: given policy $pi_{theta} (s,a) $ with parameters $theta$, find best $theta$.
But how do we measure the quality of a policy $pi_theta$?
In episodic environments we can use the start value.
$$ J_1 (theta) = V^{pi_theta} (s_1) = E_{pi_theta} (v_1) $$
In continuing environments we can use the average value
$$ J_{avV} (theta) = sum_{s} d^{pi_theta} (s) V^{pi_theta} (s) $$
Or the average reward per time-step
$$ J_{avR} (theta) = sum_{s} d^{pi_theta} (s) sum_{a} pi_theta (s,a) R_s^a $$
where $d^{pi_theta} (s)$ is stationary distribution of Markov chain for $pi_theta$

Policy Optimisation

Policy based reinforcement learning is an optimisation problem
Find $theta$ that maximises $J(theta)$
Some approaches do not use gradient
- Hill climbing
- Simplex/amoeba/Nelder Mead
- Genetic algorithms
Greater efficiency often possible using gradient
- Gradient descent
- Conjugate gradient
- Quasi-Newton
We focus on gradient descent, many extensions possible
And on methods that exploit sequential structure

Finite Difference Policy Gradient

Policy Gradient

Let $J(theta)$ be any policy objective function
Policy gradient algorithms search for a local maximum in $J(theta)$ by ascending the gradient of the policy, w.r.t parameters $theta$
$$ Delta theta = alpha nabla_theta J(theta) $$
Where $Delta_theta J(theta)$ is the policy gradient, and $alpha$ is a step-size parameter

Computing Gradients By Finite Differences

To evaluate policy gradient of $pi_theta (s,a)$
For each dimension $k in [1,n]$
- Estimate $k$th partial derivative of objective function w.r.t $theta$
- By perturbing $theta$ by small amount $epsilon$ in $k$th dimension
Uses $n$ evaluations to compute policy gradient in $n$ dimensions
Simple, noisy, inefficient - but sometimes effective
Works for arbitrary policies, even if policy is not differentiable

Monte-Carlo Policy Gradient

Likelihood Ratios

Score Function

We now compute the policy gradient analytically
Assume policy $pi_theta$ is differentiable whenever it is non-zero
and we know the gradient $nabla_theta pi_theta (s,a) $
Likelihood ratios exploit the following identity
$$ nabla_theta pi_theta (s,a) = pi_theta (s,a) frac{nabla_theta pi_theta (s,a)}{pi_theta (s,a)} = pi_{theta} (s,a) nabla_theta log pi_theta (s,a) $$
The score function is $nabla_theta log pi_{theta} (s,a)$

Softmax Policy

We will use a softmax policy as a running example
Weight actions using linear combination of features $phi(s,a)^T theta$
Probability of action is proportional to exponential weight
$$ pi_theta (s,a) propto e^{phi(s,a)^T theta} $$
The score function is
$$ nabla_theta log pi_theta (s,a) = phi(s,a) - E_{pi_theta} (phi(s,cdot)) $$

Gaussian Policy

In continuous action spaces, a Gaussian policy is natural
Mean is a linear combination of state feature $mu(s) = phi(s)^T theta$
Variance may be fixed $sigma^2$, or can also parametrised
Policy is Gaussian, $a sim N(mu(s),sigma^2)$
The score function is
$$ nabla_theta log pi_{theta} (s,a) = frac{(a-mu(s))phi(s)}{sigma^2} $$

Policy Gradient Theorem

One-Step MDPs

Consider a simple class of one-step MDPs
- Starting in state $ssim d(s)$
- Terminating after one time-step with reward $r=R_{s,a}$
Use likelihood ratios to compute the policy gradient
$$ J(theta) = E_{pi_theta} (r) = sum_{s in S} d(s) sum_{a in A} pi_theta (s,a) R_{s,a} \ nabla_theta J(theta) = sum_{s in S} d(s) sum_{a in A} pi_theta (s,a) nabla_theta log pi_theta (s,a) R_{s,a} = E_{pi_theta} (nabla_theta log pi_theta (s,a) r) $$

Policy Gradient Theorem

The policy gradient theorem generalised the likelihood ratio approach to multi-step MDPs
Replaces instantaneous reward $r$ with long-term value $Q^pi (s,a)$
Policy gradient theorem applies to start state objective, average reward and average value objective, average reward and average value objective

Theorem
For any differentiable policy $pi_theta(s,a)$, for any of the policy objective functions $J=J_1,J_{avR}$ or $frac{1}{1-gamma} J_{avV}$, the policy gradient is
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q^{pi_theta} (s,a)right] $$

Monte-Carlo Policy Gradient(REINFORCE)

Update parameters by stochastic gradient ascent
Using policy gradient theorem
Using return $v_t$ as an unbiased sample of $Q^{pi_theta} (s_t,a_t)$
$$ Delta theta_t = alpha nabla_theta log pi_theta (s_t,a_t)v_t $$

Actor-Critic Policy Gradient

Introduction to AC

Reducing Variance Using a Critic

Monte-Carlo policy gradient still has high variance
We use a critic to estimate the action-value function, $ Q_w (s,a) approx Q^{pi_theta} (s,a)$
Actor-critic algorithms maintain two sets of parameters
- Critic Updates action-value function parameters $w$
- Actor Updates policy parameters $theta$, in direction suggested by critic
Actor-critic algorithms follow an approximate policy gradient
$$ nabla_theta J(theta) approx E_{pi_theta} (nabla_theta log pi_theta (s,a) Q_w (s,a) ) \ Delta theta = alpha nabla_theta log pi_theta (s,a) Q_w (s,a) $$

Estimating the Action-Value Function

The critic is solving a familiar problem: policy evaluation
How good is policy $pi_theta$ for current parameters $theta$?
This problem was explored in previous two lectures, e.g.
- Monte-Carlo policy evaluation
- Temporal-Difference learning
- TD($lambda$)
Could also use e.g least-squares policy evaluation

Action-value Actor-Critic

Simple actor-critic algorithm based on action-value critic
Using linear value fn approx $Q_w (s,a) = phi(s,a)^T w $
- Critic Updates $w$ by linear TD(0)
- Actor Updates $theta$ by policy gradient

Compatible Function Approximation

Bias in Actor-Critic Algorithms

Approximating the policy gradient introduces bias
A biased policy gradient may not find the right solution
- e.g if $Q_w(s,a)$ uses aliased features, can we solve gridworld example?
Luckily, if we choose value function approximation carefully
Then we can avoid introducing any bias
i.e We can still follow the exact policy gradient

Compatible Function Approximation

Theorem(Compatible Function Approximation Theorem)
If the following two conditions are satisfied

Value function approximator is compatible to the policy
$$ nabla_w Q_w(s,a) = nabla_theta log pi_theta (s,a) $$

Value function parameters $w$ minimise the mean-squared error
$$ epsilon = E_{pi_theta} left[ (Q^{pi_theta}(s,a) -Q_w(s,a))^2right] $$
Then the policy gradient is exact
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q_w (s,a) right] $$

Proof of Compatible Function Approximation Theorem
If $w$ is chosen to minimise mean-squared error, gradient of $epsilon$ w.r.t $w$ must be zero

So $Q_w (s,a)$ can be substituted directly into the policy gradient
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q_w (s,a) right] $$

Advantage Function Critic

Reducing Variance Using a Baseline

We subtract a baseline function $B(s)$ from the policy gradient
This can reduce variance, without changing expectation
$$ E_{pi_theta} left[ nabla_theta log pi_theta (s,a) B(s) right] = sum_{s in S} d^{pi_theta} (s) sum_s nabla_theta pi_theta (s,a) B(s) \ = sum_{s in S} d^{pi_theta} B(s) nabla_theta sum_{a in A} pi_theta (s,a) = 0 $$
A good baseline is the state value function $B(s) = V^{pi_theta} (s)$
So we can rewrite the policy gradient using the advantage function $A^{pi_theta} (s,a)$
$$ A^{pi_theta} (s,a) = Q^{pi_theta} (s,a) - V^{pi_theta} (s) \ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) A^{pi_theta} (s,a) right] $$

Estimating the Advantage Function

The advantage function can significantly reduce variance of policy gradient
So the critic should really estimate the advantage function
For example, by estimating both $V^{pi_theta} (s)$ and $Q^{pi_theta} (s,a)$
Using two function approximating and two parameter vectors
$$ V_v (s) approx V^{pi_theta} (s) \ Q_w (s,a) approx Q^{pi_theta} (s,a) \ A(s,a) = Q_w(s,a) - V_v (s) $$
And updating both value functions by e.g TD learning
For the true value function $V^{pi_theta} (s)$, the TD error $delta^{pi_theta}$
$$ delta^{pi_theta} = r + gamma V^{pi_theta} (s’) - V^{pi_theta} (s)$$
is an unbiased estimate of the advantage function
$$ E_{pi_theta} left[ delta^{pi_theta} | s,aright]= E_{pi_theta} left[r+gamma V^{pi_theta} (s’) | s,a right] - V^{pi_theta} (s) \ = Q^{pi_theta} (s,a) -V^{pi_theta} (s) = A^{pi_theta} (s,a) $$
So we can use the TD error to compute the policy gradient
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) delta^{pi_theta} right] $$
In practice we can use an approximate TD error
$$ delta_v = r + gamma V_v (s’) - V_v (s) $$
This approach only requires one set of critic parameters $v$

Eligibility Traces

Critics at Different Time-Scales

Critic can estimate value function $V_theta(s)$ from many targets at different time-scales
- For MC, the target is the return $v_t$
  $$ Delta theta = alpha (v_t - V_theta(s)) phi(s) $$
- For TD(0), the target is the TD target $r+ gamma V(s’)$
  $$ Delta theta = alpha (r + gamma V(s’) - V_theta(s)) phi(s) $$
- For forward-view TD($lambda$), the target is the $lambda$-return $v_t^lambda$
  $$ Delta theta =alpha (v_t^lambda - V_theta (s)) phi(s) $$
- For backward-view TD($lambda$), we use eligibility traces
  $$ delta_t = r_{t+1} + gamma V(s_{t+1}) - V(s_t) \ e_t = gamma lambda e_{t-1} + phi(s_t) \ Delta theta = alpha delta_t e_t $$

Policy Gradient with Eligibility Traces

Just like forward-view TD($lambda$), we can mix over time-scales
$$ Delta theta = alpha (v_t^lambda - V_v (s_t) ) nabla_theta log pi_theta (s_t,a_t) $$
where $v_t^lambda -V_v(s_t)$ is a biased estimate of advantage fn
Like backward-view TD($lambda$), we can also use eligibility traces
- By equivalence with TD($lambda$), substituting $phi(s) = nabla_theta log pi_theta (s,a)$
  $$ delta_t = r_{t+1} + gamma V(s_{t+1}) - V(s_t) \ e_{t+1} = lambda e_{t} + nabla_theta log pi_theta (s,a) \ Delta theta = alpha delta_t e_t $$
This update can be applied online, to incomplete sequences

Natural Policy Gradient

Alternative Policy Gradient Direction

Gradient ascent algorithm can follow any ascent direction
A good ascent direction can significantly speed convergence
Also, a policy can often be reparametrised without changing action probabilities
For example, increasing score of all actions in a softmax policy
The vanilla gradient is sensitive to these reparametrisations

Natural Policy Gradient

The natural policy gradient is parametrisation independent
It finds ascent direction that is closet to vanilla gradient, when changing policy by a small, fixed amount
$$ nabla_theta^{nat} pi_theta (s,a) = G_theta^{-1} nabla_theta pi_theta (s,a) $$
where $G_theta$ is the Fisher information matrix
$$ G_theta = E_theta left[ nabla_theta log pi_theta (s,a) nabla_theta log pi_theta (s,a)^T right] $$

Natural Actor-Critic

Using compatible function approximation
$$ nabla_w A_w (s,a) = nabla_theta log pi_theta (s,a) $$
So the natural policy gradient simplifies,
$$ nabla_theta J(theta) = E_{theta_pi} left[ nabla_theta log pi_theta (s,a) A^{pi_theta} (s,a) right] \ = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) nabla_theta log pi_theta (s,a)^T w right] = G_theta w $$
so we can get $ nabla_theta^{nat} J(theta) = w $
update actor parameters in direction of critic parameters

Summary of Policy Gradient Algorithms

The policy gradient has many equivalent forms
Each leads a stochastic gradient ascent algorithm
Critic uses policy evaluation (e.g MC or TD learning) to estimate $Q^pi (s,a), A^pi (s,a)$ or $V^pi (s)$