Overview
The aim of this project is ultimately to “solve” (i.e. find a Nash Equilibrium for) the two-person card game Bluff Poker. The game is similar to Liar’s Poker, a game associated with Wall Street traders, but is more complex, allowing claims of straights, flushes, and so on, as in the Texas Hold’em variant. A full set of rules can be found in the GitHub repository linked below.
The overall approach is to simplify, solve, and then add back complexity.
Approach & Iterations
Step 1: One Card Each
The first sub-goal was to solve a one-card-each version of the game. The state space here is very small: indeed, brute-force value iteration over the entire state space (taking care to allow stochastic strategies) is enough. This provided a ground-truth baseline against which I tried out various Reinforcement Learning techniques:
– Q-Value Tables
– Hyperparameter tuning (learning rate, exploration–exploitation, etc.)
– Best-Response / Mixing Loop (as described in “Fictitious Self-Play in Extensive-Form Games”)
– Monte Carlo vs SARSA methods
These methods were strong enough to give near-optimal performance for the 1-card version of the game. My full (somewhat meandering) thoughts can be found here.
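As a rough illustration of the tabular setup, a SARSA update with epsilon-greedy exploration looks like the sketch below. This is a minimal sketch only: the environment interface (`reset`, `step`, `legal_actions`) and the hyperparameter values are placeholders I have assumed for illustration, not the project's actual code.

```python
import random
from collections import defaultdict

# Minimal tabular SARSA sketch with epsilon-greedy exploration.
# The environment interface (reset / step / legal_actions) and the
# hyperparameter values are illustrative placeholders.

def epsilon_greedy(Q, state, actions, epsilon):
    # Explore with probability epsilon, otherwise pick the greedy action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, episodes=10_000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # (state, action) -> value estimate
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, env.legal_actions(state), epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            target = reward
            if not done:
                # On-policy bootstrap: evaluate the action the policy would actually take.
                next_action = epsilon_greedy(
                    Q, next_state, env.legal_actions(next_state), epsilon)
                target += gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            if not done:
                state, action = next_state, next_action
    return Q
```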
Encouraged, I moved on to the 2-card version of the game.
Step 2: Two Cards Each
The state space is now large enough that exhaustive Q-Tables no longer work: function approximation is needed. I first tried linear function approximation, representing states with feature vectors and strategies with a weight matrix. This restricted the class of strategies that could be learned, but made gradient descent easy to implement. Unfortunately, while this approach had modest success on small decks (n=2 => deck size 12), it failed to learn even basic strategies for larger n, suggesting that scaling up will require more effort. Full thoughts can be read here (coming soon).
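For concreteness, a semi-gradient update under linear function approximation looks roughly like the following sketch. The feature extractor `phi`, the greedy target, and the step sizes are assumptions for illustration rather than the project's actual implementation.

```python
import numpy as np

# Sketch of linear function approximation for action values:
# Q(s, a) ~= w . phi(s, a). The feature extractor `phi` and its
# dimensionality are placeholders.

def q_value(w, phi, state, action):
    return float(np.dot(w, phi(state, action)))

def semi_gradient_update(w, phi, state, action, reward, next_state,
                         next_actions, alpha=0.01, gamma=1.0, done=False):
    # TD target bootstraps from the greedy value of the next state.
    target = reward
    if not done:
        target += gamma * max(q_value(w, phi, next_state, a) for a in next_actions)
    td_error = target - q_value(w, phi, state, action)
    # Semi-gradient step: move the weights along the feature vector.
    return w + alpha * td_error * phi(state, action)
```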
Next Steps
– Assess the feature and output representations: the model should easily be able to associate, e.g., trip 4s with a pair of 4s (see the sketch after this list).
– Move from linear function approximation to full-fledged Neural Networks.
– Explore traditional non-RL methods to establish a baseline (Counterfactual Regret Minimization?)
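As an illustration of the first point, a per-rank count encoding is one possible representation (purely hypothetical, not the project's current features) under which trip 4s and a pair of 4s light up the same feature slot and differ only in magnitude:

```python
from collections import Counter

# One possible hand encoding (illustrative): counting cards per rank means
# "trip 4s" naturally subsumes "pair of 4s", since both use the same rank
# slot with different magnitudes.

RANKS = list(range(2, 15))  # ranks 2..14, with 14 = Ace

def rank_count_features(cards):
    """Map a list of card ranks to a per-rank count vector."""
    counts = Counter(cards)
    return [counts.get(rank, 0) for rank in RANKS]

print(rank_count_features([4, 4]))      # 2 in the 4s slot, 0 elsewhere
print(rank_count_features([4, 4, 4]))   # 3 in the 4s slot, 0 elsewhere
```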