
Basic ideas of reinforcement learning
- Interaction between an active decision-making agent and its environment
- Used for many sequential decision making and control problems
- Reinforcement learning problems involve learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.
- Discover which actions yield the most reward by trying them
RL: given an observation, the agent outputs an action and receives a reward, where the reward depends on the action and the observation in an unknown and stochastic way. In addition, the agent performs a sequence of actions, and the states it sees depend on its previous actions.
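The observe, act, receive-reward loop described above can be sketched in a few lines. This is only an illustration: `TwoStateEnv`, `run_episode`, and the probabilities 0.8 and 0.9 are made up for the example and do not come from these notes.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with two states (0, 1) and two actions (0, 1)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # The reward depends stochastically on the current state and the action:
        # choosing the action equal to the state pays off 80% of the time.
        p_reward = 0.8 if action == self.state else 0.2
        reward = 1.0 if self.rng.random() < p_reward else 0.0
        # The next state depends on the action just taken (with some noise).
        self.state = action if self.rng.random() < 0.9 else 1 - action
        return self.state, reward


def run_episode(env, policy, n_steps=100):
    """Run the observe -> act -> reward loop and return the total reward."""
    observation = env.state
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy(observation)            # agent maps observation to action
        observation, reward = env.step(action)  # environment answers with state and reward
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    policy_rng = random.Random(1)
    print("random policy:  ", run_episode(TwoStateEnv(), lambda obs: policy_rng.choice([0, 1])))
    print("matching policy:", run_episode(TwoStateEnv(), lambda obs: obs))
```

Note that the agent never sees the 0.8 or 0.9 inside `step`; it only sees observations and rewards, which is what "unknown and stochastic dependence" means here.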
Some important characteristics of RL:
- the learning system's actions influence its later inputs
- the learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them out.
- actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards
- it uses training information that evaluates the actions taken rather than instructs by giving correct actions
Challenges of RL: the trade-off between exploration and exploitation
To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward.
- Explore: to discover such actions, it has to try actions that it has not selected before in order to make better action selections in the future
- Exploit: it has to use what it already knows in order to obtain reward
Exploitation is the right thing to do to maximize the expected reward on a single step, but exploration may produce a greater total reward in the long run.
Four main elements of RL:
- Policy: a mapping from perceived states of the environment to actions to be taken when in those states
- Reward: the immediate, intrinsic desirability of environmental states on each time step
- Value function: the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state
- Model of environment: something that mimics the behavior of the environment and allows inferences to be made about how the environment will behave
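The difference between reward and value is easier to see when the value function is written out. The form below is a common textbook one and is not stated in these notes: the discount factor \gamma and the symbols v_\pi, R_t, and S_t are assumed notation.

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right], \qquad 0 \le \gamma \le 1
```

The reward R is what the environment hands out on one step, while v_\pi(s) looks ahead over all later steps when the agent follows policy \pi. With \gamma < 1 the sum stays finite for bounded rewards; \gamma = 1 only makes sense when the interaction eventually terminates.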
A basic example of RL: bandit algorithms
Greedy algorithm:
select the action (or one of the actions) with highest estimated action value
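A minimal sketch of greedy selection follows. The notes do not say how the action values are estimated, so the sample-average update here is an assumption, and the names `greedy_action` and `update_estimate` are hypothetical.

```python
import random


def greedy_action(q_estimates, rng=random):
    """Return an action with the highest estimated value, breaking ties at random."""
    best = max(q_estimates)
    candidates = [a for a, q in enumerate(q_estimates) if q == best]
    return rng.choice(candidates)


def update_estimate(q_estimates, counts, action, reward):
    """Sample-average update (assumed): move the estimate toward the new reward."""
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]


if __name__ == "__main__":
    q = [0.0, 0.0, 0.0]
    n = [0, 0, 0]
    update_estimate(q, n, action=1, reward=1.0)
    print(greedy_action(q))  # action 1 now has the highest estimate
```

The random tie-break covers the "or one of the actions" case: when several actions share the highest estimate, any of them may be picked.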
Epsilon-greedy algorithm:
behave greedily most of the time, but every once in a while, say with small probability epsilon, instead select randomly from among all the actions with equal probability, independently of the action-value estimates
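The rule above fits in one small function. This is a sketch; the name `epsilon_greedy_action` and the random tie-breaking in the greedy branch are choices made for the example.

```python
import random


def epsilon_greedy_action(q_estimates, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, regardless of its estimate.
        return rng.randrange(len(q_estimates))
    # Exploit: pick an action with the highest estimate, ties broken at random.
    best = max(q_estimates)
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [epsilon_greedy_action([0.2, 0.9, 0.1], epsilon=0.1, rng=rng) for _ in range(1000)]
    print("share of greedy picks:", picks.count(1) / len(picks))
```

Setting `epsilon=0` gives back the plain greedy algorithm, and `epsilon=1` ignores the estimates entirely.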
Which algorithm to choose:
The advantage of epsilon-greedy over greedy methods depends on the task.
With noisier rewards it takes more exploration to find the optimal action, and epsilon-greedy methods should fare even better relative to the greedy method.
On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once, so further exploration would have little to offer and the greedy method might perform best.
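One way to check this dependence on reward noise is a small simulation. The setup below is an assumption, not part of these notes: 10 arms whose true values are drawn from a standard normal distribution, Gaussian reward noise with standard deviation `reward_std`, sample-average estimates starting at zero, and the hypothetical helpers `run_bandit` and `compare`.

```python
import random


def run_bandit(epsilon, reward_std, n_arms=10, n_steps=1000, seed=0):
    """Average reward per step of an epsilon-greedy agent on one random bandit."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]  # hidden from the agent
    q = [0.0] * n_arms      # action-value estimates
    counts = [0] * n_arms   # how often each arm was pulled
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_arms)
        else:
            best = max(q)
            action = rng.choice([a for a in range(n_arms) if q[a] == best])
        reward = rng.gauss(true_values[action], reward_std)
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # sample average
        total += reward
    return total / n_steps


def compare(reward_std, n_runs=200):
    """Average over several bandits; equal seeds give both agents the same arms."""
    result = {}
    for epsilon in (0.0, 0.1):  # 0.0 is the plain greedy method
        scores = [run_bandit(epsilon, reward_std, seed=s) for s in range(n_runs)]
        result[epsilon] = sum(scores) / n_runs
    return result


if __name__ == "__main__":
    print("noisy rewards (std 1):", compare(reward_std=1.0))
    print("no noise (std 0):     ", compare(reward_std=0.0))
```

The numbers depend on the seeds, the zero initial estimates, and the number of steps, so they are illustrative only. Raising `reward_std` further is a quick way to see how the gap between the two methods changes with noisier rewards.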



