Deep Reinforcement Learning Notes

Basic ideas of Reinforcement learning

Interaction between an active decision-making agent and its environment
Used for many sequential decision making and control problems
Reinforcement learning problems involve learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.
Discover which actions yield the most reward by trying them

RL in short: given an observation, the agent outputs an action and receives a reward whose dependence on the action and observation is unknown and stochastic. In addition, the agent performs a sequence of actions, and the states it encounters depend on its previous actions.
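The interaction loop described above can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: the `ToyEnv` corridor, its `reset`/`step` methods, the 0.8 success probability, and the `run_episode` helper are assumptions, not part of the original notes.

```python
import random


class ToyEnv:
    """Hypothetical 5-cell corridor: the agent starts in cell 0, the goal is cell 4."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right. The move only succeeds with
        # probability 0.8, so the outcome of an action is stochastic.
        move = 1 if action == 1 else -1
        if self.rng.random() < 0.8:
            self.state = min(4, max(0, self.state + move))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Observe, act, receive a reward, repeat; return (total reward, steps taken)."""
    obs = env.reset()
    total, steps = 0.0, 0
    while steps < max_steps:
        action = policy(obs)                  # the agent maps observation to action
        obs, reward, done = env.step(action)  # the next state depends on this action
        total += reward
        steps += 1
        if done:
            break
    return total, steps
```

For example, `run_episode(ToyEnv(), lambda obs: 1)` runs a fixed "always move right" policy. A learning agent would replace that lambda with a policy it improves from the rewards it receives.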

Some important characteristics of RL:

The learning system's actions influence its later inputs.
The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them out.
Actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.
It uses training information that evaluates the actions taken rather than instructing by giving the correct actions.

Challenges of RL: the trade-off between exploration and exploitation

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward.

Explore: to discover such actions, it has to try actions that it has not selected before in order to make better action selections in the future
Exploit: use what it already knows in order to obtain reward
Exploitation is the right thing to do to maximize the expected reward on a single step, but exploration may produce a greater total reward in the long run.

Four main elements of RL:

Policy: a mapping from perceived states of the environment to actions to be taken when in those states
Reward: the immediate, intrinsic desirability of environmental states on each time step
Value function: the total amount of reward an agent can expect to accumulate over the future, starting from a given state
Model of environment: something that mimics the behavior of the environment and allows inferences to be made about how the environment will behave

A basic example of RL:

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
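A minimal sketch of such a k-armed bandit follows. The notes do not say which reward distribution is used, so the Gaussian rewards, the `KArmedBandit` class, its `pull` method, and the `noise_std` parameter are all assumptions made for illustration.

```python
import random


class KArmedBandit:
    """Hypothetical k-armed bandit with Gaussian rewards (an assumption)."""

    def __init__(self, k=10, noise_std=1.0, seed=0):
        self.rng = random.Random(seed)
        self.noise_std = noise_std
        # The true action values are drawn once and never change,
        # so each action's reward distribution is stationary.
        self.true_values = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        # The reward depends only on the selected action, plus noise.
        return self.rng.gauss(self.true_values[action], self.noise_std)


# Total reward of a purely random player over 1000 time steps.
bandit = KArmedBandit(k=10, seed=0)
player = random.Random(1)
total_reward = sum(bandit.pull(player.randrange(10)) for _ in range(1000))
```

A random player like this gives a baseline. The algorithms in the following sections try to beat it by estimating which action has the highest value.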

Bandit algorithms

Greedy algorithm:

Select the action (or one of the actions) with the highest estimated action value.
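As a rough sketch, greedy selection needs an estimate per action and a rule for ties. The notes do not specify how the estimates are formed, so the incremental sample-average update below is an assumption, and the names `greedy_action` and `update_estimate` are hypothetical.

```python
import random


def greedy_action(estimates, rng=random):
    """Return an action with the highest estimate, breaking ties at random."""
    best = max(estimates)
    candidates = [a for a, q in enumerate(estimates) if q == best]
    return rng.choice(candidates)


def update_estimate(estimates, counts, action, reward):
    """Assumed estimator: running average of the rewards seen for `action`."""
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]
```

The incremental form avoids storing every past reward: after n rewards for an action, the estimate equals their mean.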

Epsilon-greedy algorithm:

Behave greedily most of the time, but every once in a while, say with small probability epsilon, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
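The selection rule just described could look like this in Python. The function name `epsilon_greedy_action` and its parameters are hypothetical, and ties among the best estimates are broken at random, which is one possible choice.

```python
import random


def epsilon_greedy_action(estimates, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, regardless of its estimate.
        return rng.randrange(len(estimates))
    # Exploit: pick an action with the highest current estimate.
    best = max(estimates)
    return rng.choice([a for a, q in enumerate(estimates) if q == best])
```

Setting `epsilon=0` recovers the purely greedy rule, while `epsilon=1` ignores the estimates entirely and always picks at random.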

Which algorithm to choose:

The advantage of epsilon-greedy over greedy methods depends on the task.
With noisier rewards it takes more exploration to find the optimal action, and epsilon-greedy methods should fare even better relative to the greedy method.
On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once, so further exploration would add little and the greedy method might actually perform best.
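One way to check these claims is a small simulation that varies the reward noise. The sketch below is an assumed setup, not the author's experiment: Gaussian rewards, sample-average estimates starting at zero, and the hypothetical helpers `run` and `average`. No results are quoted here; running it shows how often each method picks the truly best action.

```python
import random


def run(epsilon, noise_std, k=10, steps=1000, seed=0):
    """Return the fraction of steps on which the truly best action was chosen."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(k)]
    best_action = max(range(k), key=lambda a: true_values[a])
    estimates = [0.0] * k
    counts = [0] * k
    optimal_picks = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)  # explore
        else:
            top = max(estimates)       # exploit, ties broken at random
            action = rng.choice([a for a in range(k) if estimates[a] == top])
        reward = rng.gauss(true_values[action], noise_std)
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        optimal_picks += action == best_action
    return optimal_picks / steps


def average(epsilon, noise_std, runs=50):
    """Average the optimal-action fraction over several random bandits."""
    return sum(run(epsilon, noise_std, seed=s) for s in range(runs)) / runs


if __name__ == "__main__":
    for noise_std in (0.0, 1.0, 3.0):
        print(noise_std, average(0.0, noise_std), average(0.1, noise_std))
```

Each printed row gives the noise level, then the greedy score (`epsilon=0.0`), then the epsilon-greedy score (`epsilon=0.1`). Note that with all estimates starting at zero, the greedy method only keeps trying new actions while the rewards it sees are not positive, so its result depends on this initialization.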
