參考作者在Quora上的解釋
Introduction
self-learning algorithm
- learns strategy by repeatedly playing against itself
- initialized with uniformly random
- play every action at every decision point with equal probability
- simulates playing games against itself
- it gets closer and closer towards an optimal strategy: a strategy that can do no worse than tie against any opponent
Implementation
- Summing total regret for each action at each decision point
- regret: how much better if just always played this one action at this decision, instead of mixture actions?
- Positive regret means that we would have done better if we had taken that action more often
- Negative regret means that we would have done better by not taking that action at all
- After each game, it update new regret values
- actions with probabilities proportional to their positive regret
- regret: how much better if just always played this one action at this decision, instead of mixture actions?
- Counter-intuitively, sequence of strategies does not necessarily converge to anything useful
- in a two-player zero-sum game, if you compute the average strategy over those billions of strategies in the sequence, then that average strategy will converge towards Nash equilibrium of the game
Nash equilibrium(納許均衡)
- It can do no worse than tie against any other strategy
- Plays perfect defence
- Just wins when the opponent makes mistakes
- since attempting to find and exploit an opponent’s mistakes usually makes it possible for an even smarter opponent to exploit your new strategy
- Just wins when the opponent makes mistakes
- exploitability(利用度)
- maximum expectation that a perfect counter-strategy could win
- exploitability = 0 when Nash equilibrium
- CFR can make average strategy’s exploitability converges towards zero
Result
- best poker programs started beating the world’s best human players in heads-up limit hold’em in 2008, even though there were still massively exploitable by this worst-case measure
- In January 2015, we’ve essentially weakly solved the game
- a strategy with such a low exploitability (0.000986 big blinds per game) that it would take more than a human lifetime of play
- for anyone to have 95% statistical confidence that they were actually winning against it
- a strategy with such a low exploitability (0.000986 big blinds per game) that it would take more than a human lifetime of play
近期评论