Reference: David Silver, UCL reinforcement learning, lecture 2; CS 294 Deep Reinforcement Learning, Fall 2017
Markov Process (or Markov Chain)
Here we assume that the environment is fully observable, which means “the current state completely characterises the process”.
A Markov process is a tuple $\left< S, P \right>$, where $S$ denotes the state space (discrete or continuous) and $P$ denotes the state transition probability matrix.
Sometimes we use the transition operator $T$ instead of $P$, but why is it an operator? Suppose we have a vector of probabilities $\vec{\mu}_t$ (meaning we are not completely sure which state we are actually in), with $\mu_{t,i} = p(s_t = i)$. Then $\vec{\mu}_{t+1}$ can be obtained as $T\vec{\mu}_t$, so $T$ is a linear operator. If the states are continuous, $T$ is still a linear operator, just an infinitely large one.
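To make the operator view concrete, here is a minimal sketch (the two-state chain and its numbers are made up for illustration): the transition operator is just a stochastic matrix, and one step of the chain is a plain matrix-vector product.

```python
# Hypothetical two-state Markov chain. Columns are "from" states, rows are
# "to" states: T[i][j] = p(s_{t+1} = i | s_t = j), so each column sums to 1.
T = [[0.9, 0.5],
     [0.1, 0.5]]

def step(T, mu):
    """One application of the transition operator: mu_{t+1} = T mu_t."""
    return [sum(T[i][j] * mu[j] for j in range(len(mu))) for i in range(len(T))]

mu = [1.0, 0.0]   # we start in state 0 with certainty
mu = step(T, mu)  # now [0.9, 0.1]: state 0 with probability 0.9
```

Iterating `step` propagates the belief over states forward in time; nothing about the agent appears anywhere, which is exactly the point of the next paragraph.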
Therefore, an agent has nothing to do in a Markov process: it simply moves from one state to the next according to the transition probabilities.
Markov Reward Process
A Markov process becomes a Markov reward process when a reward function $R$ and a discount factor $\gamma$ are introduced, giving the tuple $\left< S, P, R, \gamma \right>$. The agent still just walks around (its behaviour totally depends on the environment), but the environment now gives it a signal (a reward) at every step.
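The rewards along a trajectory are summed into the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$; a tiny sketch with made-up numbers:

```python
# Discounted return G_t for a hypothetical reward sequence; gamma < 1 keeps
# the (potentially infinite) sum finite and weights near-term rewards more.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]  # R_{t+1}, R_{t+2}, R_{t+3} (made-up numbers)

G = sum(gamma ** k * r for k, r in enumerate(rewards))
# G = 1.0 + 0.9 * 0.0 + 0.81 * 2.0 ≈ 2.62
```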
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions: a tuple $\left< S, A, P, R, \gamma \right>$, where $A$ is a set of actions, the transitions become $P^a_{ss'} = p(s_{t+1} = s' \mid s_t = s, a_t = a)$, and $R$ may depend on the action as well. The agent now chooses actions according to a policy $\pi(a \mid s)$, so it finally has something to do.
Bellman Expectation Equation
The state-value function:

$$v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]$$

The action-value function:

$$q_\pi(s, a) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]$$
We can rewrite the Bellman expectation equation by replacing $v_\pi(s)$ with $q_\pi(s, a)$, or $q_\pi(s, a)$ with $v_\pi(s)$, on the right-hand side. It is very important to understand the relationship between $v_\pi(s)$ and $q_\pi(s, a)$ here: this pair of equations appears everywhere in the following lectures. As with a Markov reward process, the Bellman expectation equation has a direct closed-form solution once $\gamma$, $R_\pi$ and $P_\pi$ are given: $v_\pi = (I - \gamma P_\pi)^{-1} R_\pi$.
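As a sketch of that closed form, consider a made-up 2-state MRP under a fixed policy; for a $2 \times 2$ system the matrix inverse in $v_\pi = (I - \gamma P_\pi)^{-1} R_\pi$ can even be written out by hand.

```python
# Closed-form policy evaluation on a hypothetical 2-state MRP:
# v = (I - gamma * P)^{-1} R, using the explicit 2x2 inverse.
gamma = 0.9
P = [[0.8, 0.2],   # P[i][j] = p(s' = j | s = i); rows sum to 1
     [0.3, 0.7]]
R = [1.0, 2.0]     # expected immediate reward in each state (made up)

# A = I - gamma * P
A = [[1 - gamma * P[0][0],    -gamma * P[0][1]],
     [   -gamma * P[1][0], 1 - gamma * P[1][1]]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# v = A^{-1} R via the 2x2 inverse formula
v = [(A[1][1] * R[0] - A[0][1] * R[1]) / det,
     (A[0][0] * R[1] - A[1][0] * R[0]) / det]
```

The resulting $v$ satisfies the Bellman expectation equation $v = R + \gamma P v$ exactly, which is a handy sanity check; for more than a handful of states one would use a proper linear solver instead of inverting the matrix.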
Bellman Optimality Equation
In the same way, we can rewrite the Bellman optimality equation in terms of $v_*$ or $q_*$. The action $a$ in $q_*(s, a)$ is treated as an input; after taking it, the agent follows $\pi^*$ and collects the optimal return from then on. $v_*(s)$ equals $q_*(s, a)$ only when $a = a^*$. There is no closed-form solution of the Bellman optimality equation: the $\max$ operator makes it non-linear. Instead, we use iterative approaches: policy iteration, value iteration, Q-learning, SARSA, etc.
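Value iteration, for instance, just applies the Bellman optimality backup $v(s) \leftarrow \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s') \right]$ repeatedly until it converges; a minimal sketch on a made-up 2-state, 2-action MDP:

```python
# Value iteration on a tiny hypothetical MDP (all numbers are made up).
gamma = 0.9
# P[s][a][s'] = p(s' | s, a); each P[s][a] sums to 1 over s'.
P = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.6, 0.4], [0.0, 1.0]]]
R = [[0.0, 1.0],   # R[s][a]: expected immediate reward
     [2.0, 0.5]]

def backup(s, a, v):
    """One Bellman optimality term: R(s,a) + gamma * E[v(s')]."""
    return R[s][a] + gamma * sum(P[s][a][sp] * v[sp] for sp in range(2))

v = [0.0, 0.0]
for _ in range(1000):
    v_new = [max(backup(s, a, v) for a in range(2)) for s in range(2)]
    if max(abs(v_new[s] - v[s]) for s in range(2)) < 1e-10:
        v = v_new
        break
    v = v_new

# The greedy policy with respect to v* is an optimal policy.
policy = [max(range(2), key=lambda a: backup(s, a, v)) for s in range(2)]
```

Because the backup is a $\gamma$-contraction, the loop converges geometrically from any starting $v$; at the fixed point, $v$ satisfies the Bellman optimality equation.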