AI Planning for Autonomy

Search algorithms

  • BrFS (breadth-first search): uses a FIFO queue, expands the search tree layer by layer
  • ID (iterative deepening): repeated depth-limited DFS with an increasing depth bound
  • DFS: uses a LIFO stack
  • Greedy Best-First Search: orders the open list by the heuristic $h$ only
  • Dijkstra: orders the open list by the accumulated cost $g$ only
  • A*: orders a priority queue by $f = g + h$, i.e. cost so far plus heuristic (see the sketch after this list)
  • Hill climbing: local search that always moves to the best-valued neighbor
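
A minimal sketch of how several of these methods differ only in the priority used to order the open list; the search loop is otherwise the same. The variables $g$, $h$, and $d$ are assumed to be tracked per node.

```python
# Hedged sketch: the same best-first loop realizes different algorithms
# purely by changing the priority that orders the open list.
# g = cost so far, h = heuristic estimate, d = depth (assumed tracked per node).
PRIORITIES = {
    "BrFS": lambda g, h, d: d,           # shallowest node first
    "Dijkstra": lambda g, h, d: g,       # cheapest accumulated cost first
    "Greedy BFS": lambda g, h, d: h,     # heuristic value only
    "A*": lambda g, h, d: g + h,         # cost so far plus heuristic
}
```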

Heuristic

  • Admissible: A* returns an optimal solution, but closed nodes may need re-opening
  • Admissible and consistent: A* returns an optimal solution without re-opening
  • Consistent + goal-aware -> admissible (see the definitions below)
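
For reference, the standard definitions, writing $h^*(s)$ for the optimal cost-to-go and $c(s,a,s')$ for the cost of a transition from $s$ to $s'$ via action $a$:

$$
\begin{aligned}
\text{admissible: } & h(s) \le h^*(s) \ \text{for all } s \\
\text{consistent: } & h(s) \le c(s,a,s') + h(s') \ \text{for every transition } s \xrightarrow{a} s' \\
\text{goal-aware: } & h(s) = 0 \ \text{for every goal state } s
\end{aligned}
$$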

Exploration vs Exploitation

  • UCT (Upper Confidence bounds applied to Trees), used in MCTS to balance exploration and exploitation; see the UCB1 rule below
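
One common formulation of the selection rule: pick the child $j$ maximizing UCB1, where $\bar{X}_j$ is the average reward of child $j$, $n_j$ its visit count, $N$ the visit count of the parent, and $C$ an exploration constant:

$$
UCB1(j) = \bar{X}_j + C\sqrt{\frac{2\ln N}{n_j}}
$$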

Iterative Deepening
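
A minimal sketch of iterative deepening, assuming hypothetical `successors(state)` and `is_goal(state)` functions:

```python
def depth_limited_dfs(state, limit, successors, is_goal, path=()):
    """Depth-first search that gives up once the depth limit is reached."""
    if is_goal(state):
        return path + (state,)
    if limit == 0:
        return None
    for child in successors(state):
        result = depth_limited_dfs(child, limit - 1, successors, is_goal, path + (state,))
        if result is not None:
            return result
    return None

def iterative_deepening(start, successors, is_goal, max_depth=50):
    """Run depth-limited DFS with an increasing depth bound."""
    for limit in range(max_depth + 1):
        result = depth_limited_dfs(start, limit, successors, is_goal)
        if result is not None:
            return result
    return None
```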

A*

  • Initialize a priority queue ordered by $f = g + h$ (cost so far + heuristic)
  • Insert the initial state
  • Repeatedly dequeue the node with the lowest $f$ and expand it, until a goal node is dequeued (see the sketch below)
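
A minimal A* sketch, assuming a hypothetical `successors(state)` that yields `(child, step_cost)` pairs, a heuristic `h(state)`, and hashable states:

```python
import heapq
from itertools import count

def a_star(start, successors, h, is_goal):
    """Best-first search ordered by f = g + h; returns the cost of a cheapest goal path (or None)."""
    tie = count()                                  # tie-breaker so states are never compared
    open_list = [(h(start), next(tie), 0, start)]  # entries are (f, tie, g, state)
    best_g = {start: 0}
    while open_list:
        f, _, g, state = heapq.heappop(open_list)
        if is_goal(state):
            return g
        if g > best_g.get(state, float("inf")):
            continue                               # stale queue entry, skip
        for child, step_cost in successors(state):
            child_g = g + step_cost
            if child_g < best_g.get(child, float("inf")):
                best_g[child] = child_g
                heapq.heappush(open_list, (child_g + h(child), next(tie), child_g, child))
    return None
```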

STRIPS Modeling

$P = \langle F, O, I, G \rangle$, where $F$ is the set of fluents (atoms), $O$ the set of operators (actions, each with precondition, add, and delete lists), $I \subseteq F$ the initial state, and $G \subseteq F$ the goal; a minimal encoding is sketched below.
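
As an illustration (the names and representation below are made up for this sketch), a STRIPS operator can be encoded as a record of precondition/add/delete sets:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StripsAction:
    name: str
    pre: frozenset       # atoms that must hold before the action
    add: frozenset       # atoms made true by the action
    delete: frozenset    # atoms made false by the action
    cost: int = 1

# Hypothetical example: move a package from A to B.
move_pkg = StripsAction(
    name="move(pkg, A, B)",
    pre=frozenset({"at(pkg, A)"}),
    add=frozenset({"at(pkg, B)"}),
    delete=frozenset({"at(pkg, A)"}),
)

def apply(state: frozenset, action: StripsAction) -> frozenset:
    """Progression: applicable iff pre is a subset of state; drop deletes, add adds."""
    assert action.pre <= state
    return (state - action.delete) | action.add
```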

Bellman-Ford

  • Initialize a table with a first row
    • any atom that holds in the initial state is initialized to 0, otherwise $\infty$
  • Go row by row, updating each atom's value with the minimum of
    • its value in the previous row
    • (precondition value + action cost) for each achieving action; if an action has more than one precondition, take the max or the sum of their values depending on whether you are computing $h^{max}$ or $h^{add}$
  • Repeat until no cell changes from the previous row.
  • Compute $h^{add}$ or $h^{max}$ by taking the sum or the max of the goal atoms' values in the last row (see the sketch below).
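
A rough sketch of this fixpoint computation, assuming actions shaped like the hypothetical `StripsAction` record above (fields `pre`, `add`, `cost`):

```python
from math import inf

def h_add_or_max(atoms, actions, init, goal, combine=sum):
    """Fixpoint computation of h^add (combine=sum) or h^max (combine=max)."""
    h = {p: (0 if p in init else inf) for p in atoms}
    changed = True
    while changed:
        changed = False
        for a in actions:
            pre_vals = [h[p] for p in a.pre]
            if inf in pre_vals:
                continue                          # action not yet reachable
            support = combine(pre_vals) if pre_vals else 0
            for p in a.add:
                new_val = support + a.cost
                if new_val < h[p]:
                    h[p] = new_val
                    changed = True
    goal_vals = [h[g] for g in goal]
    return combine(goal_vals) if goal_vals else 0
```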

Best supporter

  • Backtrack from the goal: for each open atom, choose its best supporter, i.e. the achieving action that minimizes its value (action cost plus the cost of reaching its preconditions), add that action's preconditions to the open set, and repeat until the open set is empty.
    • Under $h^{max}$, an action's value is the maximum of its precondition heuristics plus the action cost
    • Under $h^{add}$, an action's value is the sum of its precondition heuristics plus the action cost
  • Sum the costs of the collected actions to get $h^{FF}$ (see the sketch below)
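
A rough sketch of the extraction, reusing the hypothetical action records from above; the atom values `h` are assumed precomputed (e.g. with the $h^{add}$ sketch), and supporters are ranked with a sum (under $h^{max}$ one would use max instead):

```python
def h_ff(actions, init, goal, h):
    """Relaxed-plan extraction sketch: returns the summed cost of the chosen supporters."""
    plan = []                                    # chosen best supporters
    open_atoms = set(goal) - set(init)
    closed = set(init)
    while open_atoms:
        p = open_atoms.pop()
        if any(p in a.add for a in plan):
            continue                             # already supported by a chosen action
        achievers = [a for a in actions if p in a.add]
        # best supporter: achiever minimizing (combined precondition values) + cost
        best = min(achievers, key=lambda a: sum(h[q] for q in a.pre) + a.cost)
        plan.append(best)
        closed.add(p)
        open_atoms |= set(best.pre) - closed
    return sum(a.cost for a in plan)
```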

IW

$IW(k)$ runs a breadth-first search that prunes any newly generated state whose novelty is greater than $k$; $k$ is increased iteratively until the problem is solved or a limit is reached.

  • Expand nodes exactly as in BrFS
  • Compute each newly generated state's novelty and prune the state if its novelty is larger than $k$ (see the sketch below)
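
A sketch of $IW(1)$, assuming states are frozensets of atoms and a hypothetical `successors(state)` function; a state has novelty 1 iff it contains at least one atom not seen before:

```python
from collections import deque

def iw1(start, successors, is_goal):
    """IW(1): breadth-first search that prunes any state with no first-seen atom."""
    seen_atoms = set(start)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if is_goal(state):
            return state
        for child in successors(state):
            new_atoms = child - seen_atoms
            if not new_atoms:
                continue              # novelty > 1: prune
            seen_atoms |= new_atoms
            queue.append(child)
    return None
```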

MDP

When the MDP (transition and reward model) is known, we can use value iteration or policy iteration to compute the optimal policy.

Value Iteration

  • Given a discount factor $\gamma$
  • $V_{i+1}(s) = \max_{a}\sum_{s'} P(s'\mid s,a)\,[r(s,a,s') + \gamma V_i(s')]$
    • For each action, compute the expected immediate reward plus the discounted value of the successor states
    • Take the maximum over actions (see the sketch below)
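
A sketch of value iteration, assuming a made-up model representation `P[s][a] = [(prob, next_state, reward), ...]`; states missing from `P` are treated as terminal with value 0:

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Iterate the Bellman optimality backup until the largest change drops below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(
                sum(prob * (reward + gamma * V.get(s2, 0.0))   # terminal states default to 0
                    for prob, s2, reward in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V
```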

Policy Iteration

  • Initialize an arbitrary policy
  • Policy Evaluation: compute $V^{\pi}(s)$ for all states under the current policy, using $V^{\pi}(s) = \sum_{s'} P(s'\mid s,\pi(s))\,[r(s,\pi(s),s') + \gamma V^{\pi}(s')]$
  • Policy Improvement: update the policy at each state via $\pi(s) = \arg\max_{a\in A(s)} Q^{\pi}(s,a)$
  • If the policy changed, evaluate and improve again until it is stable (see the sketch below)
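
A sketch of policy iteration over the same hypothetical `P[s][a] = [(prob, next_state, reward), ...]` model used in the value-iteration sketch:

```python
def policy_iteration(P, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(P[s])) for s in P}     # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        # Policy evaluation: iterate the Bellman equation for the fixed policy.
        while True:
            delta = 0.0
            for s in P:
                v = sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated value function.
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: sum(p * (r + gamma * V.get(s2, 0.0))
                                                 for p, s2, r in P[s][a]))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:
            return policy, V
```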

Reinforcement Learning

Reinforcement learning is used when the model of the problem (the transition and reward functions) is unknown to us.

  • Q-Learning (off-policy): $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$

  • SARSA (on-policy): $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma Q(s',a') - Q(s,a)]$, where $a'$ is the action actually taken at $s'$

Q-Learning differs from SARSA in that its update assumes the optimal (greedy) action will be taken in the next state, while the agent may in fact take a non-optimal action in order to explore. SARSA instead uses the Q-value of the action actually taken at $s'$ to update the current Q, so a non-optimal action at $s'$ affects the update at the current state. As a result, under $\epsilon$-greedy exploration SARSA tends to learn a more conservative policy, whereas Q-Learning learns a more aggressive one (see the sketch below).
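
The two update rules side by side, assuming a Q table stored as a dict of dicts (a representation chosen for this sketch):

```python
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstraps from the greedy value of the next state."""
    target = r + gamma * max(Q[s2].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy update: bootstraps from the action a2 actually taken at s2."""
    target = r + gamma * Q[s2][a2]
    Q[s][a] += alpha * (target - Q[s][a])
```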

  • N-Step SARSA
    Replace the one-step target $r + \gamma Q(s',a')$ with the n-step return
    $r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} Q(s_{t+n}, a_{t+n})$,
    where all the rewards and the bootstrapped Q value are read off the trajectory given in the problem statement (see the sketch below).
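
A small helper illustrating how the n-step target is assembled from the observed rewards and the bootstrapped Q value (both assumed given, as in the exam-style problems above):

```python
def n_step_sarsa_target(rewards, q_boot, gamma=0.9):
    """Discounted sum of the n observed rewards plus the discounted bootstrap Q value."""
    n = len(rewards)
    return sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** n * q_boot

# e.g. with 3 observed rewards and a bootstrap value of 5.0:
# n_step_sarsa_target([1.0, 0.0, 2.0], q_boot=5.0)
```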