
“we introduce the formal problem of finite MDPs which we try to solve in the rest of the book” What???
trade-off between immediate and delayed reward
3.1 The Agent-Environment Interface
3.2 Goals and Rewards
3.3 Returns and Episodes
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 Policies and Value Functions
policy: $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$.
value function for policy $\pi$: $v_\pi(s)$, the expected return when starting in state $s$ and following $\pi$ thereafter.
action-value function for policy $\pi$: $q_\pi(s,a)$, the value of taking action $a$ in state $s$ under policy $\pi$.
The value functions $v_\pi$ and $q_\pi$ can be estimated from experience, e.g. with Monte Carlo methods that average many sample returns.
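A minimal sketch of first-visit Monte Carlo estimation of $v_\pi$. The `env.reset()` / `env.step(action)` interface and the `policy[state]` dict of action probabilities are assumptions for illustration, not from the book:

```python
import random
from collections import defaultdict

def mc_estimate_v(env, policy, num_episodes=1000, gamma=0.9):
    """First-visit Monte Carlo estimate of v_pi by averaging sample returns.

    Assumes a hypothetical `env` with reset() -> state and
    step(action) -> (next_state, reward, done), and `policy[state]`
    as a dict mapping action -> probability.
    """
    returns = defaultdict(list)  # state -> list of observed returns G
    for _ in range(num_episodes):
        # Generate one episode following pi.
        episode = []
        state, done = env.reset(), False
        while not done:
            actions, probs = zip(*policy[state].items())
            action = random.choices(actions, weights=probs)[0]
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk backwards through the episode, accumulating the return G_t.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = gamma * g + r
            # First-visit check: only the first occurrence of s contributes.
            if s not in (x[0] for x in episode[:t]):
                returns[s].append(g)
    # v_pi(s) is the average of the sampled returns from s.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```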
Example 3.6 Gridworld
Example 3.7 Golf
3.6 Optimal Policies and Optimal Value Functions
optimal policies $\pi_*$
optimal state-value function: $v_*(s) = \max_\pi v_\pi(s)$
optimal action-value function: $q_*(s,a) = \max_\pi q_\pi(s,a)$
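For reference, $v_*$ and $q_*$ satisfy the Bellman optimality equations:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]$$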
Because an optimal policy is greedy with respect to the optimal value function, is it possible to learn it with an EM-like alternation (see the sketch below)?
given the current value function, improve the policy (act greedily).
given the current policy, update the value function (evaluate it).
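That alternation is essentially policy iteration (evaluation plus greedy improvement). A minimal tabular sketch, assuming the dynamics are given as a hypothetical `p[(s, a)]` list of `(prob, next_state, reward)` triples:

```python
def policy_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Tabular policy iteration: alternate evaluation and greedy improvement.

    `p[(s, a)]` is assumed to be a list of (prob, next_state, reward) triples,
    a hypothetical encoding of the MDP dynamics p(s', r | s, a).
    """
    v = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}  # arbitrary initial policy

    def q(s, a):
        # One-step lookahead: expected reward plus discounted value of s'.
        return sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])

    while True:
        # Policy evaluation: for the current policy, update the value function.
        while True:
            delta = 0.0
            for s in states:
                old = v[s]
                v[s] = q(s, policy[s])
                delta = max(delta, abs(old - v[s]))
            if delta < theta:
                break
        # Policy improvement: for the current value function, act greedily.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q(s, a))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:
            return policy, v
```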