Deep Reinforcement Learning Notes

Basic ideas of reinforcement learning

  • Interaction between an active decision-making agent and its environment
  • Used for many sequential decision making and control problems
  • Reinforcement learning problems involve learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.
  • Discover which actions yield the most reward by trying them

RL in short: given an observation, the agent outputs an action and receives a reward that depends on the action and the observation in an unknown, stochastic way. The agent performs a sequence of actions, and each state depends on the previous actions.
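
The loop described above can be sketched in a few lines. This is only an illustration: ToyEnv, its five-state corridor, the 20% slip probability, and the two policies are assumptions made up for the example and are not part of these notes.

```python
import random


class ToyEnv:
    """Hypothetical 1-D corridor with states 0..4; reward is paid in state 4."""

    def __init__(self, rng):
        self.rng = rng
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right). The move is reversed 20% of the
        # time, so the outcome depends stochastically on the action.
        if self.rng.random() < 0.2:
            action = -action
        self.state = min(4, max(0, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward


def always_right(observation, rng):
    return +1


def random_walk(observation, rng):
    return rng.choice([-1, +1])


def run_episode(policy, steps=20, seed=0):
    """Observe, act, receive reward; the next state depends on past actions."""
    rng = random.Random(seed)
    env = ToyEnv(rng)
    observation, total = env.state, 0.0
    for _ in range(steps):
        action = policy(observation, rng)
        observation, reward = env.step(action)
        total += reward
    return total


print("always_right:", run_episode(always_right))
print("random_walk: ", run_episode(random_walk))
```

The agent never sees the slip probability or the reward rule; it only sees observations and rewards, which is what makes the dependence "unknown" from its point of view.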

Some important characteristics of RL:

  • the learning system’s actions influence its later inputs
  • the learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them out.
  • actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards
  • it uses training information that evaluates the actions taken rather than instructs by giving correct actions

Challenges of RL: the trade-off between exploration and exploitation

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward.

  • Explore: to discover such actions, it has to try actions that it has not selected before in order to make better action selections in the future
  • Exploit: exploit what it already knows in order to obtain reward
    Exploitation is the right thing to do to maximize the expected reward on a single step, but exploration may produce greater total reward in the long run.

Four main elements of RL:

  1. Policy: a mapping from perceived states of the environment to actions to be taken when in those states
  2. Reward: the immediate, intrinsic desirability of environmental states on each time step
  3. Value function: the total amount of reward an agent can expect to accumulate over the future, starting from a given state
  4. Model of the environment: something that mimics the behavior of the environment and so allows inferences to be made about how the environment will behave

A basic example of RL:

You are faced repeatedly with a choice among k different options, or actions.
After each choice you receive a numerical reward chosen from a stationary
probability distribution that depends on the action you selected.
Your objective is to maximize the expected total reward over some time period,
for example, over 1000 action selections, or time steps.
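
A small simulation of this setup might look like the following. The notes only require a stationary reward distribution per action; the Gaussian rewards, the unit variance, and the name KArmedBandit are assumptions chosen for the sketch.

```python
import random


class KArmedBandit:
    """k actions; each pays a reward drawn from a fixed (stationary) Gaussian."""

    def __init__(self, k=10, seed=0):
        self.k = k
        self.rng = random.Random(seed)
        # Assumed for illustration: true mean of each action drawn from N(0, 1).
        self.true_means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        # The distribution depends only on the chosen action and never changes.
        return self.rng.gauss(self.true_means[action], 1.0)


bandit = KArmedBandit(k=10, seed=0)

# Baseline: choose uniformly at random for 1000 time steps.
total_reward = sum(bandit.pull(bandit.rng.randrange(bandit.k)) for _ in range(1000))
print("total reward with random choices:", round(total_reward, 2))
```

Any strategy that learns which actions pay well should beat this random baseline on average.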

Bandit algorithms

Greedy algorithm:

select the action (or one of the actions) with highest estimated action value
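
A minimal sketch of greedy selection. The notes do not say how the action values are estimated; the incremental sample average used here is an assumption, and the names greedy_action and update_estimate are illustrative.

```python
import random


def greedy_action(estimates, rng=random):
    """Return an action with the highest estimate, breaking ties at random."""
    best = max(estimates)
    candidates = [a for a, q in enumerate(estimates) if q == best]
    return rng.choice(candidates)


def update_estimate(estimates, counts, action, reward):
    """Incremental sample average: Q <- Q + (R - Q) / N."""
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]


estimates, counts = [0.0, 0.0, 0.0], [0, 0, 0]
update_estimate(estimates, counts, 1, 2.0)
update_estimate(estimates, counts, 1, 4.0)
print(estimates, greedy_action(estimates))
```

Random tie-breaking covers the "or one of the actions" case when several actions share the highest estimate.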

Epsilon-greedy algorithm:

behave greedily most of the time, but every once in a while, say with small probability epsilon, instead select randomly from among all the actions with equal probability, independently of the action-value estimates
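
The rule above can be sketched as a single function. The name epsilon_greedy_action and the random tie-breaking in the greedy branch are illustrative choices, not something the notes prescribe.

```python
import random


def epsilon_greedy_action(estimates, epsilon, rng=random):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, regardless of its estimate.
        return rng.randrange(len(estimates))
    # Exploit: pick an action with the highest estimate (ties broken at random).
    best = max(estimates)
    return rng.choice([a for a, q in enumerate(estimates) if q == best])


rng = random.Random(0)
picks = [epsilon_greedy_action([0.1, 0.9, 0.3], 0.1, rng) for _ in range(1000)]
print("fraction of greedy picks:", picks.count(1) / len(picks))
```

Setting epsilon to 0 recovers the purely greedy rule, and setting it to 1 gives purely random selection.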

Which algorithm to choose:

The advantage of epsilon-greedy over greedy methods depends on the task.
With noisier rewards it takes more exploration to find the optimal action, and epsilon-greedy methods should fare even better relative to the greedy method.
On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once.
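
One way to probe this dependence on reward noise is a small experiment like the one below. Everything specific here is an assumption for the sketch: Gaussian rewards, sample-average estimates starting at 0, 10 actions, 1000 steps, 50 runs, and epsilon = 0.1. No results are claimed; the numbers depend on these choices.

```python
import random


def run(epsilon, noise_std, steps=1000, k=10, seed=0):
    """Average reward per step of an epsilon-greedy agent (epsilon=0 is greedy)."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0.0, 1.0) for _ in range(k)]
    estimates, counts, total = [0.0] * k, [0] * k, 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)  # explore
        else:
            best = max(estimates)      # exploit, ties broken at random
            action = rng.choice([a for a in range(k) if estimates[a] == best])
        # noise_std = 0.0 gives zero-variance rewards: one pull reveals the
        # true value of that action.
        reward = rng.gauss(true_means[action], noise_std)
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return total / steps


runs = 50
for noise_std in (0.0, 1.0, 3.0):
    greedy = sum(run(0.0, noise_std, seed=s) for s in range(runs)) / runs
    eps = sum(run(0.1, noise_std, seed=s) for s in range(runs)) / runs
    print(f"noise_std={noise_std}: greedy={greedy:.3f}  epsilon-greedy={eps:.3f}")
```

One caveat of this sketch: because the estimates start at 0, the greedy agent stops trying new actions as soon as it finds one with a positive estimate, so even with zero-variance rewards it only learns the true value of the actions it actually tries.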