game trees - R&N 5.2-5.5
- utilities / decision theory – R&N 16.1-16.3
- utility functions
decision theory / VPI – R&N 16.5 & 16.6
- the value of information
mdps and rl - R&N 17.1-17.4
reinforcement learning – R&N 21.1-21.6

From “Artificial Intelligence” Russel & Norvig 3rd Edition

like search (adversarial search)
minimax algorithm
- ply - half a move in a tree
- for multiplayer, the backed-up value of a node n is the vector of the successor state with the highest value for the player choosing at n
- time complexity - $O(b^m)$
- space complexity - $O(bm)$ or even $O(m)$
alpha-beta pruning cuts in half the exponential depth
- once we have found out enough about n, we can prune it
- depends on move-ordering
  - might want to explore best moves = killer moves first
- transposition table can hash different movesets that are just transpositions of each other
imperfect real-time decisions
- can evaluate nodes with a heuristic and cutoff before reaching goal
- heuristic uses features
- want quiescent search - consider if something dramatic will happen in the next ply
  - horizon effect - a position is bad but isn’t apparent for a few moves
  - singular extension - allow searching for certain specific moves that are always good at deeper depths
- forward pruning - ignore some branches
  - beam search - consider only n best moves
  - PROBCUT prunes some more
search vs lookup
- often just use lookup in the beginning
- program can solve and just lookup endgames
stochastic games
- include chance nodes
- change minimax to expectiminimax
- $O(b^m numRolls^m)$
- cutoff evaluation function is sensitive to scaling - evaluation function must be a positive linear transformation of the probability of winning from a position
- can do alpha-beta pruning analog if we assume evaluation function is bounded in some range
- alternatively, could simulate games with Monte Carlo simulation

utilities / decision theory – R&N 16.1-16.3

goal: maximize utility by taking actions (focus on single actions)
- utility function U(s) gives utility of a state
- actions are probabilistic: $P[RESULT(a)=s’ vert a,e]$
  - s - state, e - observations, a - action
soln: pick action with maximum expected utility
- expected utility $EU(avert e) = sum_{s’} P(RESULT(a)=s’ vert a,e) U(s’)$
notation
- A>B - agent prefers A over B
- A~B - agent is indifferent between A and B
preference relation has 6 axioms of utility theory
1. orderability - A>B, A~B, or A<B
2. transitivity
3. continuity
4. substitutability - can do algebra with preference eqns
5. monotonicity - if A>B then must prefer higher probability of A than B
6. decomposability - 2 consecutive lotteries can be compressed into single equivalent lottery
these axioms yield a utility function
- isn’t unique (ex. affine transformation yields new utility function)
- value function = ordinal utility function - sometimes ranking, numbers not needed
- agent might not be explicitly maximizing the utility function

utility functions

preference elicitation - finds utility function
- normalized utility to have min and max value
- assess utility of s by asking agent to choose between s and $(p: min, (1-p): max)$
people have complicated utility functions
- ex. micromort - one in a million chance of death
- ex. QALY - quality-adjusted life year
money
- agents exhibits monotonic preference for more money
- gambling has expected monetary value = EMV
- risk averse = when utility of money is sublinear
  - value agent will accept in lieu of lottery = certainty equivalent= insurance premium
- risk-neutral = linear
- risk-seeking = supralinear
optimizer’s curse - tendency for E[utility] to be too high because we keep picking high utility randomness
normative theory - how idealized agents work
descriptive theory - how actual agents work
- certainty effect - people are drawn to things that are certain
- ambiguity aversion
- framing effect - wording can influence people’s judgements
- anchoring effect - buy middle-tier wine because expensive is there

decision theory / VPI – R&N 16.5 & 16.6

note: here we are just making 1 decision
decision network (sometimes called influence diagram)
1. chance nodes - represent RVs (like BN)
2. decision nodes - points where decision maker has a choice of actions
3. utility nodes - represent agent’s utility function
can ignore chance nodes
- then action-utility function = Q-function maps directly from actions to utility

decision_nets

evaluation
1. set evidence
2. for each possible value of decision node
  - set decision node to that value
  - calculate probabilities of parents of utility node
  - calculate resulting utility
3. return action with highest utility

the value of information

information value theory - enables agent to choose what info to acquire
- observations only affect agent’s belief state
- value of info = difference in best expected value with/without info
- maximum $EU(alpha e) = underset{a}{max} sum_{s’} P(Result(a)=s’ a, e) U(s’)$
value of perfect information VPI - assume we can obtain exact evidence for a variable (ex. variable $T=t$)
- $VPI(T) = mathbb{E}_{T}left[ EU(alpha e, T) right] - underbrace{EU(alpha vert e)}_{text{original EU}}$
- first term expands to $sum_t P(T=t vert e) cdot EU(alpha vert e, T=t) $
  - within each of these EU, we take a max over actions
- VPI not linearly additive, but is order-independent
- intuition
  - info is more valuable when it is likely to cause a change of plan
  - info is more valuable when the new plan will be much better than the old plan
information-gathering agent
- myopic - greedily obtain evidence which yields highest VPI until some threshold
- conditional plan - considers more things

mdps and rl - R&N 17.1-17.4

sequences of actions
fully observable - agent knows its state
markov decision process - all these things are given
- set of states s
- set of actions a
- stochastic transition model $P(s’ vert s,a)$
- reward function R(s)
  - utility aggregates rewards, for models more complex than mdps reward can be a function of past sequences of actions / observations
want policy $pi (s)$ - what action to do in state s
- optimal policy yields highest expected utlity
optimizing MDP - multiattribute utility theory
- could sum rewards, but results are infinite
- instead define objective function (maps infinite sequences of rewards to single real numbers)
  - ex. discounting to prefer earlier rewards (most common)
    - discount reward n steps away by $gamma^n, 0<gamma<1$
  - ex. set a finite horizon and sum rewards
    - optimal action in a given state could change over time = nonstationary
  - ex. average reward rate per time step
  - ex. agent is guaranteed to get to terminal state eventually - proper policy
expected utility executing $pi$: $U^pi (s) = mathbb E_{s_1,…,s_t}left[sum_t gamma^t R(s_t)right]$
- when we use discounted utilities, $pi$ is independent of starting state
- $pi^*(s) = underset{pi}{argmax} : U^pi (s) = underset{a}{argmax} sum_{s’} P(s’ vert s,a) U’(s)$

value iteration

value iteration - calculates utility of each state and uses utilities to find optimal policy
- bellman eqn: $U(s) = R(s) + gamma : underset{a}{max} sum_{s’} P(s’ vert s, a) U(s’)$
- start with arbitrary utilities
- recalculate several times with Bellman update to approximate solns to bellman eqn
value iteration eventually converges
- contraction - function that brings variables together
  - contraction only has 1 fixed point
  - Bellman update is a contraction on the space of utility vectors and therefore converges
  - error is reduced by factor of $gamma$ each iteration
- also, terminating condition - if $ vert vert U_{i+1}-U_i vert vert < epsilon (1-gamma) / gamma$ then $ vert vert U_{i+1}-U vert vert <epsilon$
- what actually matters is policy loss $ vert vert U^{pi_i}-U vert vert $ - the most the agent can lose by executing $pi_i$ instead of the optimal policy $pi^*$
  - if $ vert vert U_i -U vert vert < epsilon$ then $ vert vert U^{pi_i} - U vert vert < 2epsilon gamma / (1-gamma)$

policy iteration

another way to find optimal policies
1. policy evaluation - given a policy $pi_i$, calculate $U_i=U^{pi_i}$, the utility of each state if $pi_i$ were to be executed
  - like value iteration, but with a set policy so there’s no max
  - $U_i(s) = R(s) + gamma : sum_{s’} P(s’ vert s, pi_i(s)) U_i(s’)$
  - can solve exactly for small spaces, or approximate (set of lin. eqs.)
2. policy improvement - calculate a new MEU policy $pi_{i+1}$ using $U_i$
  - same as above, just $pi^*(s) = underset{pi}{argmax} : U^pi (s) = underset{a}{argmax} sum_{s’} P(s’ vert s,a) U’(s)$
asynchronous policy iteration - don’t have to update all states at once

partially observable markov decision processes (POMDP)

agent is not sure what state it’s in
same elements but add sensor model $P(e vert s)$
have distr $b(s)$ for belief states
- updates like the HMM: $b’(s’) = alpha P(e vert s’) sum_s P(s’ vert s, a) b(s)$
- changes based on observations
optimal action depends only on the agent’s current belief state
- use belief states as the states of an MDP and solve as before
- changes because state space is now continuous
value iteration
1. expected utility of executing p in belief state is just $b cdot alpha_p$ (dot product)
2. $U(b) = U^{pi^*}(b)=underset{p}{max} : b cdot alpha_p$
  - belief space is continuous [0, 1] so we represent it as piecewise linear, and store these discrete lines in memory
    - do this by iterating and keeping any values that are optimal at some point
  - remove dominated plans
    - generally this is far too inefficient
dynamic decision network - online agent

reinforcement learning – R&N 21.1-21.6

reinforcement learning - use observed rewards to learn optimal policy for the environment
- in ch 17, agent had model of environment (P(s’ s, a) and R(s))
2 problems
- passive - given $pi$, learn $U^pi (s)$
- active - explore states to find utilities and exploit to get highest reward
2 model types, 3 agent designs
- model-based: can predict next state/reward before taking action (for MDP, requires learning $P(s’ s,a)$)
  - utility-based agent - learns utility function on states
    - requires model of the environment
- model-free
  - Q-learning agent: learns action-utility function = Q-function maps actions $to$ utility
  - reflex agent: learns policy that maps directly from states to actions

passive reinforcement learning

given policy $pi$, learn $U^pi (s) = mathbb Eleft[ sum_{t=0}^{infty} gamma^t R(S_t)right]$
- like policy evaluation, but transition model / reward function are unknown
direct utility estimation: treat states independently
- run trials to sample utility
- average to get expected total reward for each state = expected total reward from each state
adaptive dynamic programming (ADP) - sample to estimate transition model $P(s’ s, a)$ and rewards $R(s)$, then plug into Bellman eqn to find $U^pi(s)$ (plug in at each step)
- we might want to enforce a prior on the model (two ways)
  1. Bayesian reinforcement learning - assume a prior $P(h)$ on transition model h
    - use prior to calculate $P(h vert e)$
    - use $P(h e)$ to calculate optimal policy: $pi^* = underset{pi}{argmax} sum_h P(h vert e) u_h^pi$
    - $u_h^pi$= expected utility over all possible start states, obtained by executing policy $pi$ in model h
  2. give best outcome in the worst case over H (from robust control theory)
    - $pi^* = underset{pi}{argmax}: underset{h}{min} : u_h^pi$
temporal-difference learning - adjust utility estimates towards local equilibrium for correct utilities
- like an approximation of ADP
- when we transition $s to s’$, update $U^pi(s) = U^pi (s) + alpha left[R(s) - U^pi (s) + gamma :U^pi (s’) right]$
  - $alpha$ should decrease over time to converge
- prioritized sweeping - prefers to make adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates
  - speeds things up

active reinforcement learning

no longer following set policy
- explore states to find their utilities and exploit model to get highest reward
- must explore all actions, not just those in the policy
bandit problems - determining exploration policy
- n-armed bandit - pulling n levelers on a slot machine, each with different distr.
- Gittins index - function of number of pulls / payoff
coorect schemes should be GLIE - greedy in the limit of infinite exploration - visits all states infinitely, but eventually become greedy

agent examples

ex. choose random action $1/t$ of the time
ex. active adp agent
- give optimistic utility to relatively unexplored states
- uses exploration function f(u, numTimesVisited) around the sum in the bellman eqn
  - high utilities will propagate
ex. active TD agent
- now must learn transitions (same as adp)
- update rule same as passive TD

learning action-utility function

$U(s) = underset{a}{max} : Q(s,a)$
- ADP version: $Q(s, a) = R(s) + gamma sum_{s’} P(s’ s, a) underset{a’}{max} Q(s’, a’)$
- TD version: $Q(s,a) = Q(s,a) + alpha [R(s) - Q(s,a) + gamma : underset{a’}{max} Q(s’, a’)]$ - this is what is usually referred to as Q-learning
SARSA (state-action-reward-state-action) is related: $Q(s,a) = Q(s,a) + alpha [R(s) - Q(s,a) + gamma : Q(s’, a’)]$
- here, a’ is action actually taken
Q-learning is off-policy (only uses best Q-value)
- more flexible
SARSA is on-policy (pays attention to actual policy being followed)
can approximate Q-function with something other than a lookup table
- ex. linear function of parameters $hat{U}_theta(s) = theta_1f_1(s) + … + theta_n f_n(s)$
  - can learn params online with delta rule = wildrow-hoff rule: $theta_i = theta - alpha : frac{partial Loss}{partial theta_i}$

policy search

keep twiddling the policy as long as it improves, then stop
- store one Q-function (parameterized by $theta$) for each action
- ex. $pi(s) = underset{a}{max} : hat{Q}_theta (s,a)$
  - this is discontinunous, instead often use stochastic policy representation (ex. softmax for $pi_theta (s,a)$)
- learn $theta$ that results in good performance
  - Q-learning learns actual Q* function - could be different (scaling factor etc.)
to find $pi$ maximize policy value $p(theta) = $ expected reward executing $pi_theta$
- could do this with sgd using policy gradient
- when environment/policy is stochastic, more difficult
  1. could sample mutiple times to compute gradient
  2. REINFORCE algorithm - could approximate gradient at $theta$ by just sampling at $theta$: $nabla_theta p(theta) approx frac{1}{N} sum_{j=1}^N frac{(nabla_theta pi_theta (s, a_j)) R_j (s)}{pi_theta (s, a_j)}$
  3. PEGASUS - correlated sampling - ex. 2 blackjack programs would both be dealt same hands - want to see different policies on same things

gong qingfeng decision theory / VPI – R&N 16.5 & 16.6 mdps and rl – R&N 17.1-17.4 reinforcement learning – R&N 21.1-21.6

utilities / decision theory – R&N 16.1-16.3

utility functions

decision theory / VPI – R&N 16.5 & 16.6

the value of information

mdps and rl - R&N 17.1-17.4

value iteration

policy iteration

partially observable markov decision processes (POMDP)

reinforcement learning – R&N 21.1-21.6

passive reinforcement learning

active reinforcement learning

agent examples

learning action-utility function

policy search

近期文章

近期评论

标签

热门

文章归档

分类目录

功能