Peter Simone

Deep Reinforcement Learning - Blackjack

peter-simone@newsletter.paragraph.com (Peter Simone) — Thu, 15 Jun 2023 03:14:46 GMT

In my previous post, I conducted a fairly thorough walkthrough of the game of blackjack, reinforcement learning, Q learning, and various analyses. In this post, I’ll introduce the concept of Deep Reinforcement Learning (or, Deep Q Learning).

Why Deep Q Learning?

In traditional frameworks for Q Learning, we store state-action pairs and their associated Q value in memory. In the game of blackjack, this is a tractable solution, only if we exclude card count entirely. To review, let’s take an arbitrary state in blackjack as below:

(player_total, house_shows, useable_ace)

player_total can range from 4-21, house_shows can range from 2-11 (we’ll count Ace as 11), and useable_ace is a binary variable. For each of these, we have associated actions: Hit, Stay, Double, Split, Surrender (excluding insurance for now). For a computer, storing all these state-action pairs in memory is feasible, and can be accomplished as a simple dictionary in python.
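As a rough sketch of that dictionary (the names ACTIONS and q_table are illustrative and not taken from the original code), the full table can be enumerated directly. The count it prints is an upper bound, since some combinations, such as a total of 4 with a useable ace, can never occur in play.

```python
from itertools import product

# Illustrative names; the original implementation may structure this differently.
ACTIONS = ["hit", "stay", "double", "split", "surrender"]

player_totals = range(4, 22)   # 4 through 21
house_shows = range(2, 12)     # 2 through 11 (Ace counted as 11)
useable_ace = (False, True)

# One entry per state, each holding a Q value per action (initialized to zero).
q_table = {
    state: {action: 0.0 for action in ACTIONS}
    for state in product(player_totals, house_shows, useable_ace)
}

n_states = len(q_table)
n_pairs = n_states * len(ACTIONS)
print(n_states, n_pairs)
```

Even with every action stored for every state, the table stays in the low thousands of entries, which is why the tabular approach is feasible here.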

However, what if we have more states, more actions, or what if some of the states are continuous values? The number of state-action pairs can grow drastically, to the point where storing all of these possibilities in memory is completely intractable. That’s where Deep Q Learning comes in.

What is Deep Q Learning?

In Deep Q Learning, rather than learning the Q function directly, we are able to learn an approximation for this Q function. Deep Neural Networks can theoretically learn any function, through a combination of linear and non-linear layers. The best depiction I’ve seen on Q Learning vs. Deep Q Learning is the following:

Q Learning vs. Deep Q Learning.

Consider Q learning as a look-up table. The input is the state-action pair as the key, and the resulting value is the Q value of interest for that action in the given state. In Deep Q Learning, the state is our input, and the outputs are the Q values for all actions. We are able to learn this approximation for the Q value, without having to store each state-action pair in memory.

Q Learning: Look up Q(s,a)
Deep Q Learning: Compute Q ≈ f(state) → Q[action_index]

Training

Training Neural Networks for Reinforcement Learning is not a straightforward task. I’ll save you the maths behind it (for now…), and touch on Experience Replay and Online vs. Target networks. I’ll also touch on Splitting and Action Masking, at a high level, which are more specific to my implementation.

Experience Replay

In order to minimize correlations in the training process, we incorporate what is called experience replay during training. This experience replay “buffer” stores our recent memory of gameplay. For each training iteration, we simulate blackjack gameplay by taking actions according to our neural network, and store (state, action, reward, next_state) pairs in a buffer (deque data structure). In reality, I expand this tuple out a bit further, and store the following in the buffer:

(state, action_space, action, reward, done, next_state, next_action_space)

state and next_state = (player_total, house_show, useable_ace, can_split, can_double)

state and next_state are the inputs to the neural network.

For non-terminal states, we observe a state, take an action, and receive a reward of zero. This is important, as when playing blackjack, we aren’t penalized for taking more moves; we only care about the final outcome. For terminal states, we receive the actual reward from the round of blackjack.

Example: Player has 9, house shows a 10. Player elects to hit, getting 17, then stay. The house draws 20, so the player loses 1 unit. The following elements are added to the replay buffer:
((9, 10, False, False, True), ["hit", "stay", "double", "surrender"], "hit", 0, False, (17, 10, False, False, False), ["hit","stay"])

((17, 10, False, False, False), ["hit","stay"], "stay", -1, True, None, None)

action_space represents the possible action space for the given state. next_action_space represents the possible action space for the following state, after taking the action. I’ll touch on this more later…

During the training process, we sample mini-batches from the buffer, decoupling the rounds of blackjack and reducing correlations during training, and learn on these.
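A minimal sketch of such a buffer, using only the standard library. The buffer and batch sizes match the training settings listed later in this post; the helper name sample_batch is hypothetical.

```python
import random
from collections import deque

BUFFER_SIZE = 10_000
BATCH_SIZE = 32

# A deque with maxlen silently drops the oldest transition once it is full.
replay_buffer = deque(maxlen=BUFFER_SIZE)

# The two transitions from the example round above, in the order
# (state, action_space, action, reward, done, next_state, next_action_space).
replay_buffer.append((
    (9, 10, False, False, True), ["hit", "stay", "double", "surrender"],
    "hit", 0, False,
    (17, 10, False, False, False), ["hit", "stay"],
))
replay_buffer.append((
    (17, 10, False, False, False), ["hit", "stay"],
    "stay", -1, True, None, None,
))

def sample_batch(buffer, batch_size, rng=random):
    """Uniformly sample transitions without replacement (hypothetical helper)."""
    return rng.sample(list(buffer), min(batch_size, len(buffer)))

batch = sample_batch(replay_buffer, BATCH_SIZE)
```

Because sampling is uniform over the whole buffer, consecutive moves from the same round are unlikely to land in the same mini-batch once the buffer holds many rounds.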

Online Network vs. Target Network

In Reinforcement Learning, we don’t have a “target value”, we just care to maximize our expected returns. However, in order for the neural network to learn in this feed-forward manner, we need a target to evaluate against when computing our loss. We can take our target to be the best available Q value in the next state, which intuitively makes sense. However, this will lead to us chasing a non-stationary target as we constantly update the network. This is not stable and won’t help lead to convergence of the network.

This is where the target network comes in. In Deep Q Learning, we have two networks: the Online Network and the Target Network. The Online Network is what computes our estimates based off the state and actually receives parameter updates. The Target Network is initialized with the same parameters as the Online Network. The difference is, the parameters for the Target Network are frozen during training, and periodically, the Online Network’s parameters are copied over to it. To correct for the “chasing a non-stationary target” problem, we compute losses using outputs of the online network for the current state compared to the outputs of the frozen target network for the future state.

For each iteration during training, we compute the Q values for a given observed state (sampled from the replay buffer), and evaluate it against the temporal difference target of the following state. Here’s a rough outline of what that looks like:

f_o = online network
f_t = target network

Q_s = f_o(s)
Q_s` = f_t(s`)
TD_t = reward_s + gamma * (1 - done) * max(Q_s`)

loss = loss_fct(Q_s[action_index], TD_t)

Let’s break this down… In each state, we take an action, and this either gets us to a terminal state or not.

terminal state: done = 1 , next_state is “empty”, reward_s is observed
non-terminal state: done = 0, next_state is not “empty”, reward_s is not observed

(1-done) is basically just a hacky solution around needing an if/else statement. If done=1, then the entire term on the right evaluates to zero, and TD_t is equal to the reward_s observed in that terminal state (reward received from the round of blackjack). If done=0, then TD_t is equal to the term on the right, as reward_s=0 in non-terminal states, so we’re left with the discount factor gamma multiplied by the maximum Q value available in the next state s`.

Q_s is the output tensor from our online network, where each element corresponds to the Q value of an action. We evaluate loss only on the Q_s corresponding to the index of the action taken, according to our replay buffer (hence the [action_index]). Q_s` is the output tensor from our target network, again, based off the next state s` input values. In TD_t, we only care about the maximum available value in this next state (as this corresponds to our “best move”).
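To make the arithmetic concrete, here is a small sketch of the target and loss for a single transition, written in plain Python rather than tensors. The function names and the example Q values are made up for illustration, and next_q_values is assumed to already be restricted to valid actions.

```python
def td_target(reward, done, next_q_values, gamma=0.99):
    """TD_t = reward + gamma * (1 - done) * max(Q_s`)."""
    # next_q_values stands in for the target network's output on s`;
    # for terminal transitions it may be None or empty and is ignored.
    best_next = max(next_q_values) if (not done and next_q_values) else 0.0
    return reward + gamma * (1 - int(done)) * best_next

def huber_loss(prediction, target, delta=1.0):
    """Huber (Smooth L1) loss for a single scalar prediction."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)

q_s = [0.10, -0.20]  # hypothetical online-network output; index 1 = "stay"
terminal_target = td_target(-1, True, None)
nonterminal_target = td_target(0, False, [-0.40, -0.30])
loss = huber_loss(q_s[1], terminal_target)
```

In the terminal case the target collapses to the observed reward, and in the non-terminal case it is the discounted best Q value of the next state, exactly as described above.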

Splitting

When I say “splitting”, I mean the event of splitting cards in blackjack. This is a rather confusing action, as it emulates an additional hand for a given player. When a split occurs, the replay buffer gets an additional observation added to it, to account for both new hands resulting from the split, as shown below. Splitting cards rarely leads to a terminal state, so as far as the experience replay buffer knows, there are rarely rewards associated with it. Splitting will only lead to a terminal state if a player splits Aces (and the newly dealt cards aren’t also Aces), or if the newly dealt cards lead to a total of 21. Remember, the training process samples from the replay buffer, so the same round of blackjack and the sequential series of moves are de-correlated, and there’s no knowledge that these “splits” were sourced from the same hand.

I found that not additionally accounting for a special treatment of splitting led to the network thinking that splitting was always the best move, if available. I experimented with several solutions, but the one I came up with is the following:

If the split results in a terminal state, use that reward in the buffer. If the split doesn’t lead to a terminal state, take the average reward at the end of each hand’s gameplay, across all hands, and use that as the reward for the “split” action.

Example:
- Player splits 8’s, while the house shows a 7.
- In Hand_1, the player receives a 7, then doubles, and ends up with 19.
- In Hand_2, the player receives a 5, then hits to get a 16, and stays.
- The house ends up drawing 17. Hand_1 wins 2 units, Hand_2 loses 1 unit. Average reward is +0.5 units.

The replay buffer gets the following observations added.
From Hand_1:
((16, 7, False, True, True), ["hit", "stay", "double", "surrender", "split"], "split", 0.5, False, (11, 7, False, False, True), ["hit","stay","double"])

((11, 7, False, False, True), ["hit","stay","double"], "double", 2, True, None, None)

From Hand_2:
((16, 7, False, True, True), ["hit", "stay", "double", "surrender", "split"], "split", 0.5, False, (9, 7, False, False, True), ["hit","stay","double"])

((9, 7, False, False, True), ["hit","stay","double"], "hit", 0, False, (16, 7, False, False, False), ["hit","stay"])

((16, 7, False, False, False), ["hit","stay"], "stay", -1, True, None, None)

Note the +0.5 reward given immediately to the “split” action, despite it not being a terminal state.
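The averaging step itself is small. A sketch, assuming the final reward of every hand produced by the split has already been collected into a list (the function name is hypothetical):

```python
def split_reward(hand_rewards):
    """Average the final rewards of all hands produced by a split."""
    if not hand_rewards:
        raise ValueError("a split must produce at least one finished hand")
    return sum(hand_rewards) / len(hand_rewards)

# Hand_1 doubled and won (+2), Hand_2 lost (-1).
reward = split_reward([2, -1])
```

The same value is then written into the "split" transition of every hand that came from that split.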

Action Masking

In the game of blackjack, not every action is possible in every state. For example, if you don’t have a pair, you can’t split. If you’ve already hit, you can’t then double. You can’t just make the network have different output dimensions for different scenarios. You can imagine that if an agent has thousands of possible actions, and a majority are invalid actions for any given state, masking is incredibly important. I attempted two methods of accounting for this.

Implicit Action Masking

In implicit action masking, we can have the network learn which moves are invalid given the state. Since the state includes can_double and can_split, the network should be able to figure out which actions are invalid. During training, and while generating the replay buffer, if an invalid move is encountered, I can heavily penalize this by immediately stopping gameplay and forcing a non-positive reward. The “non-positive” reward introduces another hyper-parameter to the learning process.

Explicit Action Masking

In explicit action masking, I can have the network generate outputs for all actions, but mask the outputs for invalid actions explicitly, by forcing the resulting Q values to be large negative values. This is why I introduced action_space and next_action_space earlier. This means that the network won’t implicitly learn that they are bad actions by computing low Q values; instead, we simply never select them, as a “post-processing” step. So, versus inducing action masking via the rewards and learning process, we’re inducing it by masking the Q values themselves.
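A minimal sketch of explicit masking on a plain list of Q values. The output ordering in ACTIONS, the value of MASK_VALUE, and the example Q values are all assumptions made for illustration.

```python
ACTIONS = ["hit", "stay", "double", "split", "surrender"]  # assumed output order
MASK_VALUE = -1e9  # assumed "large negative" placeholder

def mask_q_values(q_values, action_space):
    """Replace the Q values of invalid actions with a large negative number."""
    return [
        q if action in action_space else MASK_VALUE
        for action, q in zip(ACTIONS, q_values)
    ]

def best_action(q_values, action_space):
    """Greedy choice among valid actions only."""
    masked = mask_q_values(q_values, action_space)
    return ACTIONS[max(range(len(ACTIONS)), key=masked.__getitem__)]

# Hypothetical raw outputs where invalid actions happen to score highly.
raw_q = [-0.30, -0.45, 0.80, 0.95, -0.50]
choice = best_action(raw_q, ["hit", "stay"])
```

Note that the raw outputs for "double" and "split" stay high; masking only guarantees they are never selected when they are not in the action space.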

I implemented both, but elected to go with Explicit Action Masking, against my initial plans. I didn’t like that the network could learn high Q values for invalid actions through explicit action masking. However, explicitly masking guarantees that we’ll never select them, and I found that it led to better convergence and performance. Also, implicit action masking introduced an additional hyper-parameter that I had to account for, as determining this “invalid penalty” or “non-positive reward” was something that required tweaking and the model was quite sensitive to it. I’m sure more work can be done here…

Results

Training

To train the network, I use the following:

Replay Buffer Size = 10,000 observations (deque, its max length is this size)
Minimum Replay Buffer Size = 1,000 observations (don’t begin training until the buffer reaches this length. Initially, the buffer is filled by simulating gameplay using random moves)
Gamma = 0.99
Learning Rate = 0.0001
Target Update Frequency = 1,000 (every 1,000 epochs, copy the online network’s parameters over to the target network, then freeze the target network again)
Number Epochs = 1,000,000
Batch Size = 32 (mini-batch sampling of the replay buffer, in each epoch)
Smooth L1 Loss (Huber Loss). I experimented with MSE Loss, but found Huber loss to lead to more stable convergence.
Adam Optimizer
Greedy Epsilon (in each epoch, take random actions with probability EPS, otherwise take best action. Exploration vs. Exploitation)
Explicit Action Masking

I decay my epsilon value according to the following:

EPS_MIN = 0.1
EPS_DECAY = -log(EPS_MIN) / (N_EPOCHS * 0.75)

EPS = max(EPS_MIN, exp(-EPS_DECAY * epoch_number))
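Written out as runnable Python, using the epoch count from the settings above (the function name epsilon is just for illustration):

```python
import math

N_EPOCHS = 1_000_000
EPS_MIN = 0.1
EPS_DECAY = -math.log(EPS_MIN) / (N_EPOCHS * 0.75)

def epsilon(epoch_number):
    """Exponential decay from 1.0, floored at EPS_MIN."""
    return max(EPS_MIN, math.exp(-EPS_DECAY * epoch_number))

for epoch in (0, 375_000, 750_000, 1_000_000):
    print(epoch, round(epsilon(epoch), 4))
```

With this choice of EPS_DECAY, the exponential reaches EPS_MIN after 75% of the epochs, and the floor holds epsilon at that value for the remaining 25%.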

Pseudocode

Training is performed as the following:

online_net = initialize_net()
target_net = initialize_net()
copy_online_to_target()
replay_buffer = initialize_replay_buffer()

for epoch in range(n_epochs):
  update_replay_buffer()
  if len(replay_buffer) < min_allowed: continue

  samples = sample_from_replay_buffer()
  train_online_network(samples)

  if epoch % copy_frequency == 0:
    copy_online_to_target()

While len(replay_buffer) < min_allowed , we take random actions to update the buffer. After this, we use the online_net to take actions (unless EPS tells us to take random actions).

Value Function

Once the network is trained, I can compute the value function for 3 mutually exclusive events:

No Useable Ace, Cannot Split
Useable Ace, Cannot Split
Can Split

The value function is simply the maximum Q value for each valid state. We assume that there is no stochasticity in actions: if there is an available best action, it is guaranteed to take it. For these plots, I assume that doubling is possible (and doubling is allowed after a split).

Value function for each distinct type of state. I use 2-dimensional linear interpolation for visual purposes.

We achieve performance similar to that seen in my previous post, and the value functions seem quite similar.

At the end of training, I simulate 100 different games, where each game is played for 50 rounds. On average, we achieve -0.0132 units of profit per hand.

Optimal Play

Similar to the logic behind generating the value function, I can explicitly list the optimal play according to the learned network. For each, if “double” was the optimal play, I also depict the next best alternative move if double was no longer available to us.

Optimal Moves for each mutually exclusive state. On the Y-axis is the total that the player shows. On the X-axis the card that the house shows (11 == "Ace").

While there is a great amount of intuition behind “No useable ace” and likely “Can split”, the “Useable ace” category is not as intuitive. The “Can Split” seems to grossly underestimate the value of doubling on a pair of 5’s.

Conclusion

While blackjack, WITHOUT card count, is a tractable problem in the standard reinforcement learning framework, I provide a basic overview of how to adapt it to the Deep Q Learning framework. I show the value functions and optimal gameplay based off this learned strategy. Admittedly, I don’t place a terribly big emphasis on optimizing the model and achieving high performance; however, results are sufficient for evaluating some baseline performance. By the end of training, the model was able to achieve -0.0132 units won per round. This framework will be incredibly important for incorporating card count, which I’ll dive into much further in my next blog post!

Github is linked below :)

https://github.com/petersim1/Blackjack_RL

Reinforcement Learning - Blackjack

peter-simone@newsletter.paragraph.com (Peter Simone) — Mon, 02 Jan 2023 23:14:55 GMT

I’ve recently begun experimenting within the field of Reinforcement Learning (RL). Incorporating real-world experiences into machine learning seems intuitive in many scenarios. A fun problem that I wanted to tackle was the game of Blackjack. I’ve always felt that this is a rather simple game to understand, but digging deeper poses an interesting probabilistic problem worth exploring. We hear about gaining an edge over the house, and I wanted to understand: to what extent is that possible? Anyone can google “optimal blackjack gameplay”, and see a chart of what moves you should take at a given moment. Actually, if playing live, you can even ask the dealer what the optimal move is, and they’re obliged to tell you. But I wanted to generate this optimal gameplay myself, and further understand how “optimal” a specific gameplay is.

The game of Blackjack

I’m going to skip over a thorough outline of the rules of Blackjack. Basically, have a higher total than the house, without going over 21. I will, however, call out specific rules of play my modules and RL training processes follow. Specifically, I have 2 separate modules: One controlling the Player(s), and one controlling the overall Game.

I assume that 6 decks of cards are used. I “cut” the deck 2/3 of the way through these 6 decks, dictating when the deck should be reshuffled. I allow for Hit, Stay, Split, Double, and Surrender. For the purpose of creating this RL agent, I don’t care for Insurance or any other side bets, as they don’t impact gameplay directly. In blackjack, possible actions are conditional on the state that you are in, and the number of moves you have already taken. For example, after electing to hit, you can no longer double, surrender, or split, so the action space changes.

By default, I use the following rules, which might vary across casinos, but are important to define during training / inference. Training with different “rule” hyper-parameters will likely impact visualizations shown later in this post, and are generally accepted to give the house or player an edge, depending on the rule.

Dealer Hits on Soft 17. I found this is a typical “Vegas rule” known as “H17” (versus “S17”). Shown to be more favorable for the house.
Doubling after split is allowed. In some variations of blackjack, this option is not given to a player.
Multiple Splits allowed. Some variations will limit you to one split, but I allow for however many splits possible.
Player CANNOT hit after splitting Aces. Some variations of blackjack might allow for this, but generally, after splitting Aces, the player is dealt one card and is not allowed to hit again. However, if dealt another Ace, the player can (and should) split.
Blackjack pays 3:2. This is a common payout for natural blackjack. However, on many low minimum tables, blackjack payout might be reduced to 6:5 or even 1:1.
Don’t allow for Surrender: Surrender allows a player to forfeit half their wager and immediately end gameplay. I purposefully exclude this play, for both early and late surrenders. I’ll explain more later in this post.

Reinforcement Learning

In the field of Machine Learning, specifically supervised machine learning, we observe some prediction, compare it to ground truth, and learn from it. Think of regression models, where we try and predict a real number from some data; we have a ground truth real number during training, and learn to best fit data to predict that ground truth. At each set of datapoints, we can predict some outcome, and directly observe whether it was a good prediction or not and learn from it.

Reinforcement Learning is a rather different approach. Rather than predicting a direct result, we try and maximize our cumulative rewards from a series of states and actions taken. Think of trying to create an agent that moves through a maze. We don’t have “ground truth” or direct “labels” about what the immediate action should be. However, we know that if the agent completes the maze efficiently without human interference, then it was properly trained. Maybe each step receives a reward of -1, and reaching the goal receives a reward of 10. Surely, reaching the goal in the least number of steps leads to the highest cumulative reward. But how do we train this, since we don’t observe immediate feedback from a given action? That’s where Reinforcement Learning comes in.

In the game of blackjack, we have an environment. Let’s think of this as the rules of the game, the objective, and generally just the boundaries of what’s possible in the game. At each point during the game, we observe a state. Simply put, this includes a player’s current cards and the face-up house card (let’s assume we have no idea what the card count is, as we easily forget what cards we’ve seen previously). We are presented with a set of possible actions to take. A policy tells us which of these actions to choose in each state. What is the optimal action? Can we learn a policy that selects it? We don’t necessarily care what the immediate action is, we simply care about maximizing our reward from the series of actions taken. Maybe we take a seemingly sub-optimal immediate action, because it actually leads to a higher long term reward.

Q Learning

Blackjack is an episodic task; observe a state, take an action, and repeat until the gameplay ends, meaning we’ve reached a terminal state. At each step, given a current state, we can take a specific action that is available in that state. Upon each action, we receive some reward. In some instances, we can observe the probability of moving to that next state, given the action taken. In other instances, we don’t know this transition probability, so we can consider this model-free learning. In this blackjack reinforcement learning agent, I’ll use a model-free approach to learn an optimal policy, maximizing our cumulative rewards from gameplay.

Traditionally in Reinforcement Learning, we can observe the Value of being in a given state. Ideally, we’d want to take an action that leads to the maximum Value of the next state. Newer approaches abstract this further, and introduce the concept of Q values, which denote a measure of state-action pairs. So, rather than simply observing the Value of a state, we observe a Q value of a state-action pair, and can select our action accordingly.

Through reinforcement learning, we aim to learn the Q function through iterative approaches.

Q(s,a) = Q(s,a) + lr*[R + gamma * max(Q(s`,a)) - Q(s,a)]

At each state s, we take an action a, receive a reward R, and end up in state s`. Assuming that s` is not a terminal state, we can evaluate the optimal Q value in this new state as well (if it is terminal, we’ll assume it’s 0). Essentially, we are taking a weighted average of the current Q value and new information.

Q(s,a) : Q value of a state-action pair
max(Q(s`,a)) : Maximum Q value across all actions in the following state (the state in which our current action brings us to)
lr: learning rate. How quickly the agent learns, or how large the “jumps” are in Q values between learning iterations
R : reward received
gamma: discount factor. Importance of future rewards. Must be in the range [0,1). Helps ensure convergence of otherwise infinite sums. Larger values mean we care about future rewards more.
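The update rule above can be sketched for a dictionary-based Q table as follows. The helper names are hypothetical, and the default lr and gamma are the values listed later in this post.

```python
from collections import defaultdict

ACTIONS = ["hit", "stay", "double", "split"]

def make_q_table():
    """Q table that lazily creates zero-initialized entries per state."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def q_update(q, s, a, reward, s_next, valid_next_actions, lr=0.001, gamma=0.95):
    """Q(s,a) = Q(s,a) + lr * [R + gamma * max(Q(s`,a)) - Q(s,a)]."""
    # A terminal next state (s_next is None) contributes a value of 0.
    if s_next is None or not valid_next_actions:
        best_next = 0.0
    else:
        best_next = max(q[s_next][a2] for a2 in valid_next_actions)
    q[s][a] += lr * (reward + gamma * best_next - q[s][a])
    return q[s][a]

q = make_q_table()
# Player stays on 17 against a 10 and loses 1 unit (terminal transition).
after_stay = q_update(q, (17, 10, False), "stay", -1, None, None)
# Player hits on 9 against a 10 and lands on 17 (non-terminal, reward 0).
after_hit = q_update(q, (9, 10, False), "hit", 0, (17, 10, False), ["hit", "stay"])
```

Restricting the max to valid_next_actions is one way to apply the action masking discussed below; it keeps impossible moves out of the bootstrap target.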

This requires storing the Q values of all state-action pairs in memory. In our case of Blackjack, without accounting for card count, this is a tractable problem computationally, which is why simple Q learning is a valid approach.

The “states” I use are: Player Card Total, House Card Shown, Useable Ace. The “actions” are: Hit, Stay, Double, Split. Each pair of these will have an associated Q-value. As mentioned earlier, not every action is feasible given the current state, so these infeasible actions are “masked” given the valid moves determined by the Player module.

Exploration vs. Exploitation

Exploration vs. Exploitation is a core problem of reinforcement learning. How much new exploration of the space do we do, versus how much do we exploit our current knowledge of what is “good” or “bad”? There are ways to handle this during the training process, such as:

Greedy policies: Every single time, we take the action that we know is best given our current information.
- Not good. If by chance we take a suboptimal action, and it turns out to be decent, we’ll continue taking it in each iteration, without exploring other actions.
e-Greedy policies: Each episode, with probability epsilon, we randomly take an action, otherwise we take the best action. Through learning, we can decay the epsilon value, such that we transition from high exploration initially, to more exploitation later in learning.
Posterior Sampling: Each episode, we sample from the Q-space to determine an action. State-action Q values that are “good” are more likely to be selected, but we never allow for zero probability of selecting a “less good” action given the policy and the current state.

I found that posterior sampling leads to my best agent performance in blackjack. It allows me to sample the space accordingly, and have well refined Q values for each state-action pair, even if an action is sub-optimal, which I consider important at inference.
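For illustration, here is how two of these selection rules might look in code. The post does not spell out how the sampling from the Q-space is done, so the softmax (Boltzmann) weighting below is only an assumed stand-in for it, and the temperature parameter is likewise an assumption.

```python
import math
import random

def epsilon_greedy(q_values, eps, rng=random):
    """q_values: dict mapping each valid action to its Q value."""
    if rng.random() < eps:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

def softmax_sample(q_values, temperature=1.0, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    actions = list(q_values)
    top = max(q_values.values())  # subtract the max for numerical stability
    weights = [math.exp((q_values[a] - top) / temperature) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

q = {"hit": -0.4, "stay": -0.2}  # hypothetical Q values for one state
greedy_pick = epsilon_greedy(q, eps=0.0)
sampled_pick = softmax_sample(q, temperature=0.5)
```

Under this kind of sampling, every valid action keeps a non-zero probability, so even sub-optimal actions continue to have their Q values refined.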

My Approach

I developed some python modules to simulate a Player and overall Gameplay. Next, I created some logic for generation of actions given a state and policy, the generation of episodes given these state-action pairs and the current policy, the Q-learning process, and policy evaluation. Some important hyper-parameters / logic I use for the gameplay modules are given below.

I assume that only 1 Player is used during training, who wagers 1 unit each round.

Cards are depleted during gameplay until the “stop card” is reached. I place this stop card 2/3 of the way through the deck. This is a solution that casinos came up with to avoid perfect card counting towards the end of decks. Some casinos re-shuffle after every used card, but I don’t do this.

I evaluate the Q-learning process every 50,000 rounds, where I evaluate 200 separate games for 50 rounds each. This allows me to get nice bootstrapped confidence intervals and mostly normal distributions of cumulative rewards.

I don’t do any early stopping during training, and I don’t do any backtracking of optimal Q values either.

If the house is dealt a natural blackjack, I skip the learning process for that round. There’s nothing a player can do (sort of).

I intentionally exclude “surrender” as being a valid move. I initially had this, but I don’t have a valid way of handling it properly without counting it as a side bet. Since it’s not a true side bet (as “insurance” is), and surrendering is a function of your current cards and the house card, I might re-visit this. However, I was observing that I was improperly accounting for surrenders given that I skip training if the house shows a natural blackjack.

Learning Rate = 0.001
Gamma = 0.95
Train for 2,500,000 episodes

Action Masking

In order to properly sample from the Q space and determine the optimal policy, each player’s turn has the action space masked according to their current state. For example, “double” is only available as a first move. This means that after the player’s first move, “double” is masked in the action space and will never be sampled from or be selected as an optimal play. Doing this, versus adding additional partitions to the state-action pairs, allows me to share more information across iterations.

For example, I’ve seen implementations where the state-action pair includes player_total, house_shows, useable_ace, can_split. Actually, this was my first implementation as well. But abstracting away “can_split”, and using action masking instead, led to more efficient training and more shared information between states.

Training

I initialize my Q values using a python dictionary, setting all state-action Q values to zero. I noticed that I am able to differentiate all states simply by using a Q value dictionary of the following form:

Q key: (player_total, house_shows, useable_ace)
Q value: {
  "hit": 0,
  "stay": 0,
  "double": 0,
  "split": 0
}

The most confusing state is when the player total is 12. This can mean [8,4], for example, or a pair of 6’s, or a pair of Ace’s. The “useable_ace” value allows me to differentiate between pair of 6’s and Ace’s.

As I mask the action space based off the current state, we are able to share information between states where a split is possible, versus when it is not (ie, share information between “hit” for an 8,4 and a 6,6), as “split” would be masked for the former, while the latter will still have access to the Q value for “hit” learned from the former.

I can use monte-carlo methods to simulate gameplay. I currently don’t review the impact of including multiple players during training, although I don’t think this should impact performance, only complexity. Again, Q-values are updated each iteration, but performance is only evaluated every 50,000 iterations. Further, I can evaluate the learning process against a Baseline policy, which was found online (and adapted to fit the data structure used above), by comparing the % of moves that align with the baseline policy during inference. Results of training are shown below:

Top: Average unit rewards per round played during training. Each point represents the mean reward of 200 games, each played for 50 rounds. Bottom: Compared to a baseline policy, showing the percent of optimal actions that match the baseline optimal action.

Evaluation

With my learned policy, I can now evaluate the agent against other benchmarks. Note that during training, I sample the state-action space to determine the action selected. During inference, I deterministically select the optimal action. I have 4 total policies that I evaluate:

Learned: the learned policy from training, shown above.
Accepted: I construct this policy based off internet searches of optimal policies. Takes into account useable Ace, doubling down, surrendering, and splitting. Remember that I purposefully exclude surrendering, but keep it in for this baseline policy.
Random: Simulate completely random gameplay, while still using action masking given the current state.
Mimic House: Emulate the house, meaning stay on anything >= 17 (even if it’s a soft total), hit otherwise.
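The two simplest benchmarks can be sketched as plain functions of the state and its masked action space. The function names and the (state, action_space) signature are assumptions for illustration, not the interface of the original modules.

```python
import random

def mimic_house_policy(state, action_space):
    """Stay on 17 or more (soft or hard), otherwise hit."""
    player_total = state[0]
    return "stay" if player_total >= 17 else "hit"

def random_policy(state, action_space, rng=random):
    """Pick uniformly among the actions that are valid in this state."""
    return rng.choice(action_space)

state = (16, 10, False)  # (player_total, house_shows, useable_ace)
house_like = mimic_house_policy(state, ["hit", "stay"])
random_move = random_policy(state, ["hit", "stay"])
```

Both policies only ever return actions from the masked action space, so they can be dropped into the same evaluation loop as the learned policy.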

Bootstrapped distributions of gameplay for various policies. For each policy, I simulate 2,500 games, each played for 50 rounds. I take the mean cumulative rewards per round and plot their distributions.

I can state the expected return of these different styles of gameplay, in terms of units:

Random: -0.4243 net units per round. ie, play 100 rounds, and you’re expected to lose 42 units. I expected this to be closer to -0.5
Meh: -0.0628 net units per round
“Good”: -0.0094 net units per round. Nearly even expected value
Learned: -0.0082 net units per round. Nearly even expected value.

To show the impact of the randomness of blackjack, I’ll explicitly show 10 different games played, over 100 rounds, and show the cumulative reward achieved in each.

Cumulative rewards achieved using the learned policy, across 10 different players, to show volatility in performance.

Value Function

I can display the value function by collapsing the Q space over all actions. To achieve this, for each state, we take the maximum Q value across all actions. This allows me to visualize the value of being in a current state. I think these are rather neat visualizations. It shows the value of holding a 10 or 11, as expected, but value steeply drops off with card totals higher than that until you have an ~18 or higher. The house showing a 2 is actually not more advantageous for the player than the house having a 3 or 4, for example. Blackjack (21) has a value of 1.5, which is expected given the higher payout, regardless of the card that the house shows (although not shown). It becomes clear where surrendering might be advantageous to a player, where the Value drops steeply.

3D value functions across 3 distinct types of states, for clarity. Note that these are mutually exclusive plots, given by their labels.

Optimal Move

As I’m sure it’d be of interest, I include color-coded tables displaying the optimal action for a given state. Note that an optimal play here was not necessarily the best action by a wide margin; it is simply the action with the highest Q value. It could’ve narrowly edged out the next best choice, and I’m sure many of these close calls could be resolved through additional training steps.
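To see how narrow such a call can be, one option is to report the gap between the best and second-best Q value alongside the chosen action. This is a sketch under the assumption that the action values for a state are held in a plain dictionary; the function name and the numbers are hypothetical.

```python
from typing import Dict, Tuple


def best_action_with_margin(action_values: Dict[str, float]) -> Tuple[str, float]:
    """Return the highest-valued action and its Q-value lead over the runner-up."""
    ranked = sorted(action_values.items(), key=lambda item: item[1], reverse=True)
    best_action, best_q = ranked[0]
    # With a single available action there is no runner-up to compare against.
    margin = best_q - ranked[1][1] if len(ranked) > 1 else float("inf")
    return best_action, margin


# Hypothetical Q values: "hit" wins, but only by about 0.02 units.
action, margin = best_action_with_margin({"hit": -0.55, "stay": -0.57})
```

A small margin marks a cell in the table that could flip with more training or a different random seed.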

For clarity, I also provide the baseline policy table. These are mutually exclusive tables. Also note that (“A”, “A”) is not included in the “useable ace” table, even though it does have a useable ace; it appears in “can split” instead.

The options are as follows:

Hi: Hit (white)
St: Stay (yellow)
Su: Surrender (gray, only in baseline policy)
Sp: Split (blue)
Do / Hi: Double if allowed, else hit (green)
Do / St: Double if allowed, else stay (green)

Baseline optimal policy.

Learned optimal policy.

There is clearly substantial overlap in the “No Useable Ace” category. Surrender is the one exception: it appears only in the baseline policy, and the learned policy never has it as an optimal move. In the “Useable Ace” category, the learned policy also overlaps heavily with the baseline, except for some differences in when to double. “Can Split” has the least overlap, as the learned policy elects to split far more often than the baseline policy, mainly for pairs where the house shows a high card.

Typical House Results

To provide further insights into how the house might perform given a face-up card being shown, I can further simulate gameplay to capture this information. Note that I don’t care about how the player plays. The house always plays independently of the player’s moves, which allows me to provide the following information more easily.

Since in every round the house can either end with 17-21 or bust, I can estimate the likelihood of these events through Monte Carlo methods, given the card that the house shows. It seems quite clear that the house showing a 6 leads to the highest probability of the house going over 21, and is likely the most favorable card to be up against as a player. Here, 21 is inclusive of natural blackjacks, which are only possible if the house shows a 10 or 11. Being up against a house 2 is less favorable than a 3-6.

Given the House Card, these are the probabilities of the house's outcome.

When the house does bust, I can gather the probability of each final total. I was interested in this as I saw some low-minimum tables PUSH when the dealer ends on 22, which is clearly the most probable total when the dealer busts (stay away from that game…).

Given that the Dealer Busts:
Total 22: 25.72%
Total 23: 23.04%
Total 24: 20.07%
Total 25: 17.23%
Total 26: 13.94%

Bankroll

Bankroll is an important component to consider during Blackjack. In games like poker, the concept is likely more intuitive. If your bankroll is significantly higher than someone else’s, you can wager a small portion of your bankroll, while forcing someone else all-in, drastically changing the style of play and risk tolerance.

In blackjack, it has a slightly different connotation, as you are playing against the house and not against other players. From everything I’ve shown in this post, we know that the RL agent I trained is able to achieve roughly 0 Expected Value: Play forever, with infinite bankroll, and achieve roughly 0 rewards per round. However, let’s assume we do have a bankroll, and the minimum wager consists of some percentage of our bankroll. Due to the randomness of blackjack, even using our “optimal” agent, there will be instances where our cumulative rewards dip below our bankroll threshold, forcing us to stop playing. We’ll never have a chance to continue gameplay to approach the roughly 0 EV over time.

Let’s break this down more visually. We’ll simulate 500 games, each allowed to play for a maximum of 500 rounds. We can look at the probability of going broke in fewer than N rounds of gameplay (CDF), given our initial bankroll and a wager of 1 unit per round. As our bankroll increases with respect to the minimum wager, our probability of going broke decreases, as we can play more rounds and approach our roughly 0 EV.

Probability of going broke in fewer than N rounds (CDF), assuming 1 unit is wagered per round. The blackjack environment and policy allow for splitting and doubling. The CDF will not reach 1 within 500 rounds, as we cap each game at 500 rounds and some agents are still playing at that point.
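The shape of this experiment can be reproduced with a toy model. The sketch below replaces the blackjack environment with a fair coin flip of +1 or -1 unit per round (zero expected value, no doubling or splitting), which is an assumption made purely for illustration; the function names are hypothetical and the resulting numbers are not the ones plotted above.

```python
import random
from typing import Optional


def rounds_until_broke(bankroll: int, max_rounds: int, rng: random.Random) -> Optional[int]:
    """Return the round on which the bankroll hits zero, or None if it survives."""
    for round_number in range(1, max_rounds + 1):
        # Toy stand-in for one round of blackjack: win or lose 1 unit.
        bankroll += rng.choice((-1, 1))
        if bankroll <= 0:
            return round_number
    return None


def ruin_probability(bankroll: int, n_games: int = 500,
                     max_rounds: int = 500, seed: int = 0) -> float:
    """Fraction of simulated games that go broke within max_rounds rounds."""
    rng = random.Random(seed)
    results = [rounds_until_broke(bankroll, max_rounds, rng) for _ in range(n_games)]
    return sum(result is not None for result in results) / n_games


small_bankroll_ruin = ruin_probability(bankroll=5)
large_bankroll_ruin = ruin_probability(bankroll=50)
```

Even with zero expected value per round, the 5-unit bankroll goes broke far more often than the 50-unit bankroll within the same 500 rounds.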

We can also look at the percent of hands that are profitable across each of these simulated gameplays. With different bankrolls, we don't observe great differences between the percent of profitable hands during that gameplay. This holds except for very low percent of profitable hands, where we see a large initial spike for 0% profitable hands for small bankrolls.

CDF of the % of hands where the agent was profitable.

Most interesting, we can observe the effects of bankroll with respect to minimum wager on the unit profit per hand. Remember, if we have infinite bankroll, our expected value should hover around zero. However, it becomes clear that as our bankroll decreases with respect to the minimum wager of 1 unit, our likelihood of observing meaningfully negative returns per hand will increase. This is due to the randomness of blackjack, and enforcing constraints of a player’s initial bankroll. For example, let’s say we have a bankroll of 5 units, and wager 1 unit per hand. If we lose 5 hands straight (without doubling or splitting, which increases our wager), then we’ll run out of money after 5 rounds. Our net profit per hand is -1. We’ll never have a chance to continue gameplay to approach our expected value. With a small bankroll, we are more at the mercy of the randomness of blackjack. As our bankroll continues to increase, the plot will shift towards our expected value of around 0, with high probability.

CDF of the unit profit per hand, partitioned by bankroll units.

Conclusion

I was able to create a blackjack agent that learns the value of taking a certain move given the player and house states. Using Q-Learning, we learn an “optimal” policy that achieves roughly even expected value over time.

While some of the learned optimal actions might intuitively seem sub-optimal, the learned policy is able to perform on par with a baseline policy when looking at simulated cumulative rewards. This could be due to the low likelihood of reaching these states. When looking at the specific Q values where the optimal action differed from the baseline action (i.e., our learned policy said to split, while the baseline said to hit), a large majority of these optimal actions could’ve differed just through chance of sampling, given how similar their Q values are. Some were quite drastically different, which is interesting.

Sampling from the Q space during training seemed to lead to quicker convergence, and a good balance of exploration / exploitation. Switching to deterministic action selection during inference led to roughly break-even cumulative rewards, on average.
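One common way to sample from the Q space is a softmax over the Q values of the currently allowed actions, with a plain argmax at inference time. The sketch below assumes that scheme; the post does not specify the exact sampling rule, so the temperature parameter, the function names, and the Q values are all illustrative assumptions.

```python
import math
import random
from typing import Dict, List


def sample_action(action_values: Dict[str, float], allowed: List[str],
                  rng: random.Random, temperature: float = 1.0) -> str:
    """Sample an allowed action with probability proportional to exp(Q / temperature)."""
    # Subtract the largest Q value before exponentiating, for numerical stability.
    max_q = max(action_values[action] for action in allowed)
    weights = [math.exp((action_values[action] - max_q) / temperature)
               for action in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def greedy_action(action_values: Dict[str, float], allowed: List[str]) -> str:
    """Deterministically pick the allowed action with the highest Q value."""
    return max(allowed, key=lambda action: action_values[action])


# Hypothetical Q values; "double" is masked out once it is no longer allowed.
q_values = {"hit": 0.30, "stay": -0.15, "double": 0.60}
rng = random.Random(0)
training_action = sample_action(q_values, ["hit", "stay", "double"], rng)
inference_action = greedy_action(q_values, ["hit", "stay"])
```

Passing only the allowed actions is what enforces action masking in both functions. A lower temperature makes sampling behave more like the greedy choice.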

Observing real-world bankroll constraints for a player leads to an interesting case study: while following the exact same optimal policy, decreasing your bankroll with respect to the unit wager makes a player much less profitable per hand, on average. If you walk up to a $20 minimum table with $100 (willing to lose it all), and you play the table minimum each hand, you are much worse off per hand, on average, than a player who walks up to the same table with $1000 and plays the table minimum. The player with $1000 is less impacted by the randomness of blackjack compared to the player with $100.

Clearly this is not a learned policy able to guarantee profit. In fact, on average, the expected value is essentially zero. How can we push this even further? Well, card count could be of use… What if we could remember which cards we’ve seen? Or, less demanding: instead of remembering the exact cards we’ve seen, what if we could remember how many “high” vs. “low” cards we’ve seen, and use that to our advantage?

Including card count in this tabular framework is infeasible. There are far too many states to store if we take it into account. Also, the range of the card count will differ with the number of decks in play, and the model would have to be adjusted accordingly. Our current implementation has 270 states. Let’s assume a 6 deck shoe, cut 2/3 of the way through. I’ve seen card count range from -20 to 20 in this deck size. This could lead to over 10,000 states. This poses an issue because 1) that requires a lot of memory, and 2) states will be visited so infrequently that you’d need to train far longer to get any meaningful result. Deep Q Networks are a perfect fit for this problem.
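The state count above follows from simple multiplication, assuming every existing state can be paired with every integer count from -20 to 20 (an upper bound, since not every combination is necessarily reachable):

```python
base_states = 270              # states in the current tabular implementation
count_values = range(-20, 21)  # integer card counts from -20 to 20, inclusive

# Each existing state is paired with each possible count value.
total_states = base_states * len(count_values)
```

That is 270 states times 41 count values, or 11,070 states, before multiplying again by the number of actions per state.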

In a future article, I’ll share my work on Deep Reinforcement Learning, where rather than explicitly storing Q values in memory, we train a neural network to approximate our Q function, which lets us take card count into account.

While the GitHub repo is still a work in progress, you can find it below. Of course, feel free to fork it and experiment! I haven’t seen as thorough an implementation that allows for card splitting, or one whose performance comes this close to non-negative rewards.

https://github.com/petersim1/Blackjack_RL