A Brief Introduction to the CFR (Counterfactual Regret Minimization) Algorithm

Based on the author's explanation on Quora

Introduction

CFR is a self-learning algorithm

  • learns a strategy by repeatedly playing against itself
  • initialized with a uniformly random strategy
    • plays every action at every decision point with equal probability
  • simulates playing games against itself
  • over time it gets closer and closer to an optimal strategy: a strategy that can do no worse than tie against any opponent

Implementation

  • Sums the total regret for each action at each decision point
    • regret: how much better would we have done if we had always played this one action at this decision point, instead of our mixture of actions?
      • Positive regret means we would have done better by taking that action more often
      • Negative regret means we would have done better by not taking that action at all
    • After each game, it updates the regret values and computes a new strategy
      • the new strategy plays actions with probabilities proportional to their positive regret (regret matching; see the sketch after this list)
  • Counter-intuitively, this sequence of strategies does not necessarily converge to anything useful
    • but in a two-player zero-sum game, if you compute the average strategy over the billions of strategies in the sequence, that average strategy converges towards a Nash equilibrium of the game
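
A minimal sketch of the regret-matching step described above, in Python. The function name and the example regret values are illustrative assumptions, not part of the original explanation:

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Turn the summed regrets at one decision point into a strategy:
    play each action with probability proportional to its positive regret."""
    positive_regret = np.maximum(cumulative_regret, 0.0)
    total = positive_regret.sum()
    if total > 0:
        return positive_regret / total
    # No positive regret yet: fall back to the uniform random strategy.
    return np.ones(len(cumulative_regret)) / len(cumulative_regret)

# Example: three actions with regrets summed over many simulated games.
print(regret_matching(np.array([5.0, -2.0, 1.0])))  # -> [0.833..., 0., 0.166...]
```

In a full implementation this step runs at every decision point, and the strategies it produces are also accumulated over iterations, so the average strategy can be reported at the end.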

Nash Equilibrium

  • It can do no worse than tie against any other strategy
  • Plays perfect defence
    • It only wins when the opponent makes mistakes
      • attempting to find and exploit an opponent's mistakes usually opens your own strategy up to exploitation by an even smarter opponent
  • exploitability
    • the maximum expected amount that a perfect counter-strategy can win against a given strategy
    • exploitability = 0 exactly when the strategy is a Nash equilibrium
    • CFR makes the average strategy's exploitability converge towards zero (see the sketch after this list)
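
To make exploitability concrete, here is a sketch on rock-paper-scissors rather than the poker setting; the payoff matrix, game choice, and function name are illustrative assumptions. Exploitability is how much a best-responding opponent wins in expectation, and it is 0 exactly at a Nash equilibrium:

```python
import numpy as np

# Rock-paper-scissors payoffs from the row player's perspective:
# rows = our action, columns = opponent action, entries = our payoff.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def exploitability(strategy):
    """Expected winnings of a perfect counter-strategy against `strategy`:
    the opponent picks the pure action that maximizes their payoff."""
    opponent_values = -(strategy @ PAYOFF)
    return opponent_values.max()

print(exploitability(np.array([1/3, 1/3, 1/3])))  # 0.0 -> Nash equilibrium
print(exploitability(np.array([1.0, 0.0, 0.0])))  # 1.0 -> always rock loses to paper
```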

Result

  • The best poker programs started beating the world's best human players in heads-up limit hold'em in 2008, even though they were still massively exploitable by this worst-case measure
  • In January 2015, the game was essentially weakly solved
    • a strategy with such low exploitability (0.000986 big blinds per game) that it would take more than a human lifetime of play for anyone to have 95% statistical confidence that they were actually winning against it

Example Code
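
The original example code is not reproduced here. Below is a hedged sketch that ties the pieces above together on rock-paper-scissors, the standard warm-up game for CFR: two players start from the uniform random strategy, repeatedly play against each other, sum regrets, play in proportion to positive regret, and average their strategies. All class and function names are illustrative assumptions.

```python
import numpy as np

NUM_ACTIONS = 3  # rock, paper, scissors
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])  # row player's payoff


class RegretMatcher:
    def __init__(self):
        self.regret_sum = np.zeros(NUM_ACTIONS)
        self.strategy_sum = np.zeros(NUM_ACTIONS)

    def strategy(self):
        """Current strategy: probabilities proportional to positive regret."""
        positive = np.maximum(self.regret_sum, 0.0)
        if positive.sum() > 0:
            strat = positive / positive.sum()
        else:
            strat = np.ones(NUM_ACTIONS) / NUM_ACTIONS  # uniform at the start
        self.strategy_sum += strat
        return strat

    def average_strategy(self):
        """The average over all strategies in the sequence -- this is what
        converges towards a Nash equilibrium, not the current strategy."""
        return self.strategy_sum / self.strategy_sum.sum()

    def update_regrets(self, my_action, opponent_action):
        """Regret for each action: how much better it would have done than
        the action actually played, against the opponent's action."""
        payoff_played = PAYOFF[my_action, opponent_action]
        self.regret_sum += PAYOFF[:, opponent_action] - payoff_played


def train(iterations=100_000, seed=0):
    rng = np.random.default_rng(seed)
    p1, p2 = RegretMatcher(), RegretMatcher()
    for _ in range(iterations):
        a1 = rng.choice(NUM_ACTIONS, p=p1.strategy())
        a2 = rng.choice(NUM_ACTIONS, p=p2.strategy())
        p1.update_regrets(a1, a2)
        p2.update_regrets(a2, a1)
    return p1.average_strategy(), p2.average_strategy()


if __name__ == "__main__":
    avg1, avg2 = train()
    print("player 1 average strategy:", np.round(avg1, 3))  # close to [0.333, 0.333, 0.333]
    print("player 2 average strategy:", np.round(avg2, 3))
```

The average strategies should end up close to uniform random, which is the Nash equilibrium of rock-paper-scissors, illustrating the claim that the average (not the current) strategy is what converges.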

Summary

References