rl7 policy gradient

reference:
UCL Course on RL

Introduction

Policy-Based Reinforcement Learning

  • In the last lecture we approximated the value or action-value function using parameter $theta$
  • A policy was generated directly from the value function
  • In this lecture, we will directly parametrise the policy
    $$ pi_{theta} (s,a) = P(a|s,theta) $$
  • We will focus again on model-free reinforcement learning

Value-Based and Policy-Based RL

  • Value Based
    • Learnt Value Function
    • Implicit policy (e.g. $epsilon$-greedy)
  • Policy Based
    • No Value Function
    • Learnt Policy
  • Actor-Critic
    • Learnt Value Function
    • Learnt Policy

Advantages of Policy-Based RL
Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies
    Disadvantages:
  • Typically converge to a local rather than global optimum
  • Evaluating a policy is typically inefficient and high variance

Policy Objective Functions

  • Goal: given policy $pi_{theta} (s,a) $ with parameters $theta$, find best $theta$.
  • But how do we measure the quality of a policy $pi_theta$?
  • In episodic environments we can use the start value.
    $$ J_1 (theta) = V^{pi_theta} (s_1) = E_{pi_theta} (v_1) $$
  • In continuing environments we can use the average value
    $$ J_{avV} (theta) = sum_{s} d^{pi_theta} (s) V^{pi_theta} (s) $$
  • Or the average reward per time-step
    $$ J_{avR} (theta) = sum_{s} d^{pi_theta} (s) sum_{a} pi_theta (s,a) R_s^a $$
  • where $d^{pi_theta} (s)$ is stationary distribution of Markov chain for $pi_theta$

Policy Optimisation

  • Policy based reinforcement learning is an optimisation problem
  • Find $theta$ that maximises $J(theta)$
  • Some approaches do not use gradient
    • Hill climbing
    • Simplex/amoeba/Nelder Mead
    • Genetic algorithms
  • Greater efficiency often possible using gradient
    • Gradient descent
    • Conjugate gradient
    • Quasi-Newton
  • We focus on gradient descent, many extensions possible
  • And on methods that exploit sequential structure

Finite Difference Policy Gradient

Policy Gradient

  • Let $J(theta)$ be any policy objective function
  • Policy gradient algorithms search for a local maximum in $J(theta)$ by ascending the gradient of the policy, w.r.t parameters $theta$
    $$ Delta theta = alpha nabla_theta J(theta) $$
  • Where $Delta_theta J(theta)$ is the policy gradient, and $alpha$ is a step-size parameter

Computing Gradients By Finite Differences

  • To evaluate policy gradient of $pi_theta (s,a)$
  • For each dimension $k in [1,n]$
    • Estimate $k$th partial derivative of objective function w.r.t $theta$
    • By perturbing $theta$ by small amount $epsilon$ in $k$th dimension
  • Uses $n$ evaluations to compute policy gradient in $n$ dimensions
  • Simple, noisy, inefficient - but sometimes effective
  • Works for arbitrary policies, even if policy is not differentiable

Monte-Carlo Policy Gradient

Likelihood Ratios

Score Function

  • We now compute the policy gradient analytically
  • Assume policy $pi_theta$ is differentiable whenever it is non-zero
  • and we know the gradient $nabla_theta pi_theta (s,a) $
  • Likelihood ratios exploit the following identity
    $$ nabla_theta pi_theta (s,a) = pi_theta (s,a) frac{nabla_theta pi_theta (s,a)}{pi_theta (s,a)} = pi_{theta} (s,a) nabla_theta log pi_theta (s,a) $$
  • The score function is $nabla_theta log pi_{theta} (s,a)$

Softmax Policy

  • We will use a softmax policy as a running example
  • Weight actions using linear combination of features $phi(s,a)^T theta$
  • Probability of action is proportional to exponential weight
    $$ pi_theta (s,a) propto e^{phi(s,a)^T theta} $$
  • The score function is
    $$ nabla_theta log pi_theta (s,a) = phi(s,a) - E_{pi_theta} (phi(s,cdot)) $$

Gaussian Policy

  • In continuous action spaces, a Gaussian policy is natural
  • Mean is a linear combination of state feature $mu(s) = phi(s)^T theta$
  • Variance may be fixed $sigma^2$, or can also parametrised
  • Policy is Gaussian, $a sim N(mu(s),sigma^2)$
  • The score function is
    $$ nabla_theta log pi_{theta} (s,a) = frac{(a-mu(s))phi(s)}{sigma^2} $$

Policy Gradient Theorem

One-Step MDPs

  • Consider a simple class of one-step MDPs
    • Starting in state $ssim d(s)$
    • Terminating after one time-step with reward $r=R_{s,a}$
  • Use likelihood ratios to compute the policy gradient
    $$ J(theta) = E_{pi_theta} (r) = sum_{s in S} d(s) sum_{a in A} pi_theta (s,a) R_{s,a} \ nabla_theta J(theta) = sum_{s in S} d(s) sum_{a in A} pi_theta (s,a) nabla_theta log pi_theta (s,a) R_{s,a} = E_{pi_theta} (nabla_theta log pi_theta (s,a) r) $$

Policy Gradient Theorem

  • The policy gradient theorem generalised the likelihood ratio approach to multi-step MDPs
  • Replaces instantaneous reward $r$ with long-term value $Q^pi (s,a)$
  • Policy gradient theorem applies to start state objective, average reward and average value objective, average reward and average value objective

    Theorem
    For any differentiable policy $pi_theta(s,a)$, for any of the policy objective functions $J=J_1,J_{avR}$ or $frac{1}{1-gamma} J_{avV}$, the policy gradient is
    $$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q^{pi_theta} (s,a)right] $$

Monte-Carlo Policy Gradient(REINFORCE)

  • Update parameters by stochastic gradient ascent
  • Using policy gradient theorem
  • Using return $v_t$ as an unbiased sample of $Q^{pi_theta} (s_t,a_t)$
    $$ Delta theta_t = alpha nabla_theta log pi_theta (s_t,a_t)v_t $$
    PG algorithm

Actor-Critic Policy Gradient

Introduction to AC

Reducing Variance Using a Critic

  • Monte-Carlo policy gradient still has high variance
  • We use a critic to estimate the action-value function, $ Q_w (s,a) approx Q^{pi_theta} (s,a)$
  • Actor-critic algorithms maintain two sets of parameters
    • Critic Updates action-value function parameters $w$
    • Actor Updates policy parameters $theta$, in direction suggested by critic
  • Actor-critic algorithms follow an approximate policy gradient
    $$ nabla_theta J(theta) approx E_{pi_theta} (nabla_theta log pi_theta (s,a) Q_w (s,a) ) \ Delta theta = alpha nabla_theta log pi_theta (s,a) Q_w (s,a) $$

Estimating the Action-Value Function

  • The critic is solving a familiar problem: policy evaluation
  • How good is policy $pi_theta$ for current parameters $theta$?
  • This problem was explored in previous two lectures, e.g.
    • Monte-Carlo policy evaluation
    • Temporal-Difference learning
    • TD($lambda$)
  • Could also use e.g least-squares policy evaluation

Action-value Actor-Critic

  • Simple actor-critic algorithm based on action-value critic
  • Using linear value fn approx $Q_w (s,a) = phi(s,a)^T w $
    • Critic Updates $w$ by linear TD(0)
    • Actor Updates $theta$ by policy gradient
      Simple actor-critic algorithm

Compatible Function Approximation

Bias in Actor-Critic Algorithms

  • Approximating the policy gradient introduces bias
  • A biased policy gradient may not find the right solution
    • e.g if $Q_w(s,a)$ uses aliased features, can we solve gridworld example?
  • Luckily, if we choose value function approximation carefully
  • Then we can avoid introducing any bias
  • i.e We can still follow the exact policy gradient

Compatible Function Approximation

Theorem(Compatible Function Approximation Theorem)
If the following two conditions are satisfied

  1. Value function approximator is compatible to the policy
    $$ nabla_w Q_w(s,a) = nabla_theta log pi_theta (s,a) $$
  2. Value function parameters $w$ minimise the mean-squared error
    $$ epsilon = E_{pi_theta} left[ (Q^{pi_theta}(s,a) -Q_w(s,a))^2right] $$
    Then the policy gradient is exact
    $$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q_w (s,a) right] $$

Proof of Compatible Function Approximation Theorem
If $w$ is chosen to minimise mean-squared error, gradient of $epsilon$ w.r.t $w$ must be zero
Proof
So $Q_w (s,a)$ can be substituted directly into the policy gradient
$$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) Q_w (s,a) right] $$

Advantage Function Critic

Reducing Variance Using a Baseline

  • We subtract a baseline function $B(s)$ from the policy gradient
  • This can reduce variance, without changing expectation
    $$ E_{pi_theta} left[ nabla_theta log pi_theta (s,a) B(s) right] = sum_{s in S} d^{pi_theta} (s) sum_s nabla_theta pi_theta (s,a) B(s) \ = sum_{s in S} d^{pi_theta} B(s) nabla_theta sum_{a in A} pi_theta (s,a) = 0 $$
  • A good baseline is the state value function $B(s) = V^{pi_theta} (s)$
  • So we can rewrite the policy gradient using the advantage function $A^{pi_theta} (s,a)$
    $$ A^{pi_theta} (s,a) = Q^{pi_theta} (s,a) - V^{pi_theta} (s) \ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) A^{pi_theta} (s,a) right] $$

Estimating the Advantage Function

  • The advantage function can significantly reduce variance of policy gradient
  • So the critic should really estimate the advantage function
  • For example, by estimating both $V^{pi_theta} (s)$ and $Q^{pi_theta} (s,a)$
  • Using two function approximating and two parameter vectors
    $$ V_v (s) approx V^{pi_theta} (s) \ Q_w (s,a) approx Q^{pi_theta} (s,a) \ A(s,a) = Q_w(s,a) - V_v (s) $$
  • And updating both value functions by e.g TD learning
  • For the true value function $V^{pi_theta} (s)$, the TD error $delta^{pi_theta}$
    $$ delta^{pi_theta} = r + gamma V^{pi_theta} (s’) - V^{pi_theta} (s)$$
  • is an unbiased estimate of the advantage function
    $$ E_{pi_theta} left[ delta^{pi_theta} | s,aright]= E_{pi_theta} left[r+gamma V^{pi_theta} (s’) | s,a right] - V^{pi_theta} (s) \ = Q^{pi_theta} (s,a) -V^{pi_theta} (s) = A^{pi_theta} (s,a) $$
  • So we can use the TD error to compute the policy gradient
    $$ nabla_theta J(theta) = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) delta^{pi_theta} right] $$
  • In practice we can use an approximate TD error
    $$ delta_v = r + gamma V_v (s’) - V_v (s) $$
  • This approach only requires one set of critic parameters $v$

Eligibility Traces

Critics at Different Time-Scales

  • Critic can estimate value function $V_theta(s)$ from many targets at different time-scales
    • For MC, the target is the return $v_t$
      $$ Delta theta = alpha (v_t - V_theta(s)) phi(s) $$
    • For TD(0), the target is the TD target $r+ gamma V(s’)$
      $$ Delta theta = alpha (r + gamma V(s’) - V_theta(s)) phi(s) $$
    • For forward-view TD($lambda$), the target is the $lambda$-return $v_t^lambda$
      $$ Delta theta =alpha (v_t^lambda - V_theta (s)) phi(s) $$
    • For backward-view TD($lambda$), we use eligibility traces
      $$ delta_t = r_{t+1} + gamma V(s_{t+1}) - V(s_t) \ e_t = gamma lambda e_{t-1} + phi(s_t) \ Delta theta = alpha delta_t e_t $$

Policy Gradient with Eligibility Traces

  • Just like forward-view TD($lambda$), we can mix over time-scales
    $$ Delta theta = alpha (v_t^lambda - V_v (s_t) ) nabla_theta log pi_theta (s_t,a_t) $$
  • where $v_t^lambda -V_v(s_t)$ is a biased estimate of advantage fn
  • Like backward-view TD($lambda$), we can also use eligibility traces
    • By equivalence with TD($lambda$), substituting $phi(s) = nabla_theta log pi_theta (s,a)$
      $$ delta_t = r_{t+1} + gamma V(s_{t+1}) - V(s_t) \ e_{t+1} = lambda e_{t} + nabla_theta log pi_theta (s,a) \ Delta theta = alpha delta_t e_t $$
  • This update can be applied online, to incomplete sequences

Natural Policy Gradient

Alternative Policy Gradient Direction

  • Gradient ascent algorithm can follow any ascent direction
  • A good ascent direction can significantly speed convergence
  • Also, a policy can often be reparametrised without changing action probabilities
  • For example, increasing score of all actions in a softmax policy
  • The vanilla gradient is sensitive to these reparametrisations

Natural Policy Gradient

  • The natural policy gradient is parametrisation independent
  • It finds ascent direction that is closet to vanilla gradient, when changing policy by a small, fixed amount
    $$ nabla_theta^{nat} pi_theta (s,a) = G_theta^{-1} nabla_theta pi_theta (s,a) $$
  • where $G_theta$ is the Fisher information matrix
    $$ G_theta = E_theta left[ nabla_theta log pi_theta (s,a) nabla_theta log pi_theta (s,a)^T right] $$

Natural Actor-Critic

  • Using compatible function approximation
    $$ nabla_w A_w (s,a) = nabla_theta log pi_theta (s,a) $$
  • So the natural policy gradient simplifies,
    $$ nabla_theta J(theta) = E_{theta_pi} left[ nabla_theta log pi_theta (s,a) A^{pi_theta} (s,a) right] \ = E_{pi_theta} left[ nabla_theta log pi_theta (s,a) nabla_theta log pi_theta (s,a)^T w right] = G_theta w $$
    so we can get $ nabla_theta^{nat} J(theta) = w $
  • update actor parameters in direction of critic parameters

Summary of Policy Gradient Algorithms

  • The policy gradient has many equivalent forms
    PG equivalent forms
  • Each leads a stochastic gradient ascent algorithm
  • Critic uses policy evaluation (e.g MC or TD learning) to estimate $Q^pi (s,a), A^pi (s,a)$ or $V^pi (s)$