Lecture Notes for Reinforcement Learning (MDP)

Reference: David Silver, UCL Reinforcement Learning, Lecture 2; CS 294 Deep Reinforcement Learning, Fall 2017

Markov Process (or Markov Chain)

Here we assume that the environment is fully observable, which means “the current state completely characterises the process”.

A Markov process is a tuple $\langle S, P \rangle$, where $S$ denotes the state space (discrete or continuous) and $P$ denotes the state transition probability matrix, with $P_{ss'} = p(s_{t+1} = s' \mid s_t = s)$.

Sometimes we use the transition operator $T$ instead of $P$, but why is it an operator? Suppose we have a vector of probabilities $\vec{\mu}_t$ (meaning we are not completely sure which state we are actually in), with $\mu_{t,i} = p(s_t = i)$. Then $\vec{\mu}_{t+1}$ can be obtained as $T\vec{\mu}_t$, so $T$ is a linear operator. If the states are continuous, $T$ is still a linear operator, only infinite-dimensional. A small numerical sketch of this update follows.
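Below is a minimal sketch (using NumPy and a made-up 3-state transition matrix, chosen purely for illustration) of propagating a belief vector one step with the transition operator. With rows of `P` indexed by the current state, the update is $\vec{\mu}_{t+1} = P^\top \vec{\mu}_t$.

```python
import numpy as np

# Hypothetical 3-state Markov chain; P[i, j] = p(s_{t+1} = j | s_t = i).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
])

# Belief over the current state: mu_t[i] = p(s_t = i).
mu_t = np.array([1.0, 0.0, 0.0])

# One application of the transition operator: mu_{t+1} = P^T mu_t.
mu_next = P.T @ mu_t
print(mu_next)  # [0.9 0.1 0. ]
```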

Therefore, an agent has nothing to do in a Markov process: there are no actions to choose, and it simply moves from one state to the next according to the transition probabilities.

Markov Reward Process

A Markov process becomes a Markov reward process when a reward function $R$ and a discount factor $\gamma$ are introduced. The agent still just wanders around (its movement is entirely determined by the environment), but the environment now gives it a signal (a reward) at each step. The value of a state is the expected discounted return, $v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]$ with $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, and since the Bellman equation $v = R + \gamma P v$ is linear, it admits the closed-form solution $v = (I - \gamma P)^{-1} R$; a small sketch follows.
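A minimal sketch of the closed-form MRP solution, reusing the illustrative 3-state chain from above and attaching a made-up reward vector:

```python
import numpy as np

# Same hypothetical 3-state chain as above, now with rewards attached.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
])
R = np.array([1.0, 0.0, 5.0])  # expected immediate reward in each state (illustrative)
gamma = 0.9

# Closed-form solution of the MRP Bellman equation: v = (I - gamma * P)^{-1} R.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```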

Markov Decision Process

A Markov decision process adds a set of actions $A$ to the Markov reward process: the agent now chooses an action in each state according to a policy $\pi(a \mid s)$, and this choice influences both the reward it receives and the next state.

  • Bellman Expectation Equation

    The state-value function: $v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]$

    The action-value function: $q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]$

    We can rewrite the Bellman expectation equation by replacing $v_\pi(s)$ with $q_\pi(s, a)$, or $q_\pi(s, a)$ with $v_\pi(s)$, on the right-hand side: $v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$ and $q_\pi(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_\pi(s')$. It is essential to understand the relationship between $v_\pi(s)$ and $q_\pi(s, a)$ here, because these two equations reappear throughout the following lectures. As with the Markov reward process, the Bellman expectation equation has a direct closed-form solution once $\gamma$, $R_\pi$ and $P_\pi$ are given (see the first sketch after this list).

  • Bellman Optimality Equation

    In the same way, we can rewrite the Bellman optimality equation in terms of $v_*$ or $q_*$: $v_*(s) = \max_a q_*(s, a)$ and $q_*(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s')$. The action $a$ in $q_*(s, a)$ is treated as a given input; after taking it, the agent follows $\pi^*$ and collects the resulting return. Hence $v_*(s)$ equals $q_*(s, a)$ only when $a$ is the optimal action $a^*$. Unlike the Bellman expectation equation, the Bellman optimality equation has no closed-form solution: the $\max$ operator makes it non-linear. Instead, we use iterative approaches such as policy iteration, value iteration, Q-learning and SARSA (a value-iteration sketch appears after this list).
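Referring back to the Bellman expectation equation above, here is a minimal sketch of its closed-form solution for a fixed policy: the policy is folded into $P_\pi$ and $R_\pi$, and then $v_\pi = (I - \gamma P_\pi)^{-1} R_\pi$. The 2-state, 2-action MDP and the uniform random policy are made up purely for illustration.

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions.
# P[a, s, s'] = p(s' | s, a); R[a, s] = expected reward for taking a in s.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.5, 0.5], [0.6, 0.4]],   # action 1
])
R = np.array([
    [1.0, 0.0],                 # action 0
    [0.0, 2.0],                 # action 1
])
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi[s, a]: uniform random policy

# Fold the policy into the MDP.
P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
R_pi = np.einsum('sa,as->s', pi, R)     # R_pi[s]     = sum_a pi(a|s) R(s, a)

# Closed-form policy evaluation: v_pi = (I - gamma * P_pi)^{-1} R_pi.
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(v_pi)
```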
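And here is a minimal value-iteration sketch on the same made-up MDP, repeatedly applying the Bellman optimality backup $v(s) \leftarrow \max_a \big( R_s^a + \gamma \sum_{s'} P_{ss'}^a v(s') \big)$ until the values stop changing:

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the previous sketch.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0: P[a, s, s'] = p(s' | s, a)
    [[0.5, 0.5], [0.6, 0.4]],   # action 1
])
R = np.array([
    [1.0, 0.0],                 # action 0: R[a, s]
    [0.0, 2.0],                 # action 1
])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup.
v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * P @ v        # q[a, s] = R[a, s] + gamma * sum_{s'} P[a, s, s'] v[s']
    v_new = q.max(axis=0)        # v(s) <- max_a q(a, s)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

pi_star = q.argmax(axis=0)       # greedy (optimal) policy with respect to v_*
print(v, pi_star)
```

Because the max operator makes the backup non-linear, there is nothing to solve in closed form here; the loop simply iterates the contraction until it converges.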