
Building a Tic-Tac-Toe Game with Reinforcement Learning in Python: A Step-by-Step Tutorial

Welcome to this step-by-step tutorial on how to build a Tic-Tac-Toe game using reinforcement learning in Python. In this tutorial, we will learn how to create an agent that learns to play the game by trial and error, taking actions and receiving rewards or penalties depending on whether the action led to a win, loss, or draw.

Before we start, here are the prerequisites for this tutorial:

  • Basic knowledge of Python programming
  • Familiarity with the fundamentals of machine learning, specifically reinforcement learning
  • Familiarity with the TensorFlow library

In this tutorial, we will cover the following topics:

  1. Setting up the environment
  2. Defining the Tic-Tac-Toe game
  3. Implementing the reinforcement learning agent
  4. Training the agent
  5. Testing the agent
  6. Running the code and analyzing the results

Now, let’s get started with the first step.

Step 1: Set up the Environment

The first step is to set up the environment. We need to install the necessary libraries and import them into our Python script.

We will be using the following libraries:

  • TensorFlow: an open-source machine learning library for dataflow and differentiable programming across a range of tasks.
  • NumPy: a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • random: a built-in Python module used for generating random numbers.

To install these libraries, you can use pip, the package installer for Python. Open your terminal or command prompt and run the following commands:

pip install tensorflow
pip install numpy

Now that we have installed the necessary libraries, let’s import them into our Python script. Open your Python IDE or text editor, create a new Python file, and start by importing the required libraries. (Note: the tabular Q-learning agent we build in this tutorial relies only on NumPy and Python’s built-in random module; TensorFlow is included so you can later extend the agent with a neural network, but none of the code below depends on it.)

import tensorflow as tf
import numpy as np
import random
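
Optionally, you can also seed Python’s random module so that training runs are reproducible from one execution to the next (the seed value below is arbitrary):

# Optional: seed the random number generator for reproducible runs
random.seed(42)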

Step 2: Define the Tic-Tac-Toe Game

In this step, we will define the Tic-Tac-Toe game. The game is played on a 3x3 grid, with two players taking turns to mark a square with their symbol (either X or O).

The objective of the game is to get three of your symbols in a row, either horizontally, vertically, or diagonally.

To define the game, we need to create a class that represents the game board and the players. Here’s the code:

class TicTacToe:
    def __init__(self):
        self.board = np.zeros((3, 3))
        self.players = ['X', 'O']
        self.current_player = None
        self.winner = None
        self.game_over = False

    def reset(self):
        self.board = np.zeros((3, 3))
        self.current_player = None
        self.winner = None
        self.game_over = False

    def available_moves(self):
        moves = []
        for i in range(3):
            for j in range(3):
                if self.board[i][j] == 0:
                    moves.append((i, j))
        return moves

    def make_move(self, move):
        if self.board[move[0]][move[1]] != 0:
            return False
        self.board[move[0]][move[1]] = self.players.index(self.current_player) + 1
        self.check_winner()
        self.switch_player()
        return True

    def switch_player(self):
        if self.current_player == self.players[0]:
            self.current_player = self.players[1]
        else:
            self.current_player = self.players[0]

    def check_winner(self):
        # Check rows
        for i in range(3):
            if self.board[i][0] == self.board[i][1] == self.board[i][2] != 0:
                self.winner = self.players[int(self.board[i][0] - 1)]
                self.game_over = True
        # Check columns
        for j in range(3):
            if self.board[0][j] == self.board[1][j] == self.board[2][j] != 0:
                self.winner = self.players[int(self.board[0][j] - 1)]
                self.game_over = True
        # Check diagonals
        if self.board[0][0] == self.board[1][1] == self.board[2][2] != 0:
            self.winner = self.players[int(self.board[0][0] - 1)]
            self.game_over = True
        if self.board[0][2] == self.board[1][1] == self.board[2][0] != 0:
            self.winner = self.players[int(self.board[0][2] - 1)]
            self.game_over = True
        # Check for a draw: board is full and nobody has won
        if not self.game_over and not self.available_moves():
            self.game_over = True

    def print_board(self):
        print("-------------")
        for i in range(3):
            print("|", end='' '')
            for j in range(3):
                print(self.players[int(self.board[i][j] - 1)] if self.board[i][j] != 0 else " ", end=' | ')
            print()
            print("-------------")

Let’s go over the methods of the TicTacToe class:

  • __init__: This method initializes the game board, players, current player, winner, and game over status.
  • reset: This method resets the game board, current player, winner, and game over status to their initial values.
  • available_moves: This method returns a list of available moves on the current board. It loops through all the squares of the board, and if a square is empty, it adds the coordinates of the square to the list of available moves.
  • make_move: This method makes a move on the board. It takes a tuple representing the coordinates of the square where the move should be made. If the square is already occupied, it returns False, indicating that the move is invalid. Otherwise, it updates the board with the current player’s symbol, checks if the move resulted in a win, and switches to the other player’s turn.
  • switch_player: This method switches the current player to the other player.
  • check_winner: This method checks whether the game has ended. It checks all the rows, columns, and diagonals to see if either player has three symbols in a row; if so, it sets the winner and the game over status. It also marks the game as over when the board is full and there is no winner (a draw).
  • print_board: This method prints the current state of the board.

Now that we have defined the TicTacToe class, let’s test it out by creating an instance of the class and playing a game manually. Here’s the code:

game = TicTacToe()
game.current_player = game.players[0]
game.print_board()

while not game.game_over:
    move = input(f"{game.current_player}''s turn. Enter row and column (e.g. 0 0): ")
    move = tuple(map(int, move.split()))
    while move not in game.available_moves():
        move = input("Invalid move. Try again: ")
        move = tuple(map(int, move.split()))
    game.make_move(move)
    game.print_board()

if game.winner:
    print(f"{game.winner} wins!")
else:
    print("It''s a tie!")

This code creates an instance of the TicTacToe class, sets the current player to X, and prints the initial state of the board. Then, it enters a loop that continues until the game is over.

In each iteration of the loop, it prompts the current player to enter their move, checks if the move is valid, and makes the move. After each move, it prints the updated state of the board. Finally, when the game is over, it prints the winner or a tie message.
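
To make the board format concrete, here is a small scripted example using the class directly (the two moves are arbitrary, chosen just for illustration):

game = TicTacToe()
game.current_player = game.players[0]
game.make_move((0, 0))   # X takes the top-left corner
game.make_move((1, 1))   # O takes the center
game.print_board()
# Expected output (roughly):
# -------------
# | X |   |   |
# -------------
# |   | O |   |
# -------------
# |   |   |   |
# -------------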

Now that we have verified that our TicTacToe class works correctly, let’s move on to the next step, which is to implement the reinforcement learning agent.

Step 3: Implement the Reinforcement Learning Agent

In this step, we will implement the reinforcement learning agent that will learn to play Tic-Tac-Toe by trial and error. The agent will use the Q-learning algorithm to learn the optimal policy for each state-action pair.

The Q-learning algorithm is a form of temporal difference learning: after each move, it nudges its estimate of a state-action value toward a target made up of the immediate reward plus the discounted value of the best action available in the next state.

We will represent the Q-values as a dictionary of state-action pairs, where each state is a tuple representing the current state of the board, and each action is a tuple representing the coordinates of the move. The initial Q-values will be set to zero.
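
One practical detail: dictionary keys must be hashable, so whenever we look up or store a Q-value we will flatten the NumPy board into a plain tuple. A quick sketch of that encoding:

game = TicTacToe()
state = tuple(game.board.flatten())   # a 9-element tuple, all zeros for an empty board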

Here’s the code to implement the reinforcement learning agent:

import random

class QLearningAgent:
    def __init__(self, alpha, epsilon, discount_factor):
        self.Q = {}
        self.alpha = alpha
        self.epsilon = epsilon
        self.discount_factor = discount_factor

    def get_Q_value(self, state, action):
        if (state, action) not in self.Q:
            self.Q[(state, action)] = 0.0
        return self.Q[(state, action)]

    def choose_action(self, state, available_moves):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(available_moves)
        else:
            Q_values = [self.get_Q_value(state, action) for action in available_moves]
            max_Q = max(Q_values)
            if Q_values.count(max_Q) > 1:
                best_moves = [i for i in range(len(available_moves)) if Q_values[i] == max_Q]
                i = random.choice(best_moves)
            else:
                i = Q_values.index(max_Q)
            return available_moves[i]

    def update_Q_value(self, state, action, reward, next_state, next_available_moves):
        # Estimate the best Q-value reachable from the next state
        next_Q_values = [self.get_Q_value(next_state, next_action) for next_action in next_available_moves]
        max_next_Q = max(next_Q_values) if next_Q_values else 0.0
        # Q-learning update rule
        current_Q = self.get_Q_value(state, action)
        self.Q[(state, action)] = current_Q + self.alpha * (reward + self.discount_factor * max_next_Q - current_Q)

The QLearningAgent class has four instance variables:

  • Q: a dictionary of state-action pairs representing the Q-values.
  • alpha: the learning rate, which controls how much the Q-values are updated at each step.
  • epsilon: the exploration rate, which controls the probability of choosing a random action instead of the best-known action.
  • discount_factor: the discount rate (often called gamma), which controls how much future rewards count relative to immediate rewards.

The QLearningAgent class also has three methods:

  • get_Q_value: This method returns the Q-value for a given state-action pair. If the Q-value is not yet in the dictionary, it initializes it to zero.
  • choose_action: This method chooses an action based on the epsilon-greedy policy. If a random number is less than epsilon, it chooses a random action from the available actions. Otherwise, it chooses the action with the highest Q-value.
  • update_Q_value: This method updates the Q-value for a given state-action pair using the Q-learning update rule. Because the agent cannot reconstruct a game from a bare state tuple, it also receives the list of moves available in the next state.

The update_Q_value method is the heart of the Q-learning algorithm. It moves the current estimate Q(state, action) toward the target reward + discount_factor * max Q(next_state, a'), stepping by a fraction alpha of the difference between that target and the current estimate.
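
As a quick sanity check of the update rule, here is a toy walkthrough with made-up values (the state tuples and reward below are purely illustrative):

agent = QLearningAgent(alpha=0.5, epsilon=0.1, discount_factor=1.0)
state = (0.0,) * 9                 # empty board, flattened to a tuple
action = (0, 0)                    # agent plays the top-left corner
next_state = (1.0,) + (0.0,) * 8   # board after the move
# Pretend this move ended the game with a win (reward = 1, no moves left):
agent.update_Q_value(state, action, 1, next_state, [])
print(agent.Q[(state, action)])    # 0.5, i.e. 0 + 0.5 * (1 + 1.0 * 0 - 0)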

Step 4: Train the Reinforcement Learning Agent

Now that we have implemented the Q-learning agent, we can train it to play Tic-Tac-Toe. During training, the agent will play X against a random opponent and learn from the rewards it receives: +1 for a win, -1 for a loss, and 0 for a draw.

We will train the agent using a function called train. The train function takes as input the number of episodes to run, the learning rate (alpha), the exploration rate (epsilon), and the discount factor (gamma). It returns the trained Q-learning agent.

Here’s the code for the train function:

def train(num_episodes, alpha, epsilon, discount_factor):
    agent = QLearningAgent(alpha, epsilon, discount_factor)
    for i in range(num_episodes):
        game = TicTacToe()
        game.current_player = game.players[0]
        prev_state, prev_action = None, None
        while not game.game_over:
            state = tuple(game.board.flatten())
            if game.current_player == game.players[0]:
                # On the agent's turn, first learn from its previous move
                if prev_state is not None:
                    agent.update_Q_value(prev_state, prev_action, 0, state, game.available_moves())
                prev_state, prev_action = state, agent.choose_action(state, game.available_moves())
                game.make_move(prev_action)
            else:
                # The random opponent makes a move
                game.make_move(random.choice(game.available_moves()))
        # Terminal update: +1 for a win, -1 for a loss, 0 for a draw
        reward = 1 if game.winner == game.players[0] else (-1 if game.winner else 0)
        agent.update_Q_value(prev_state, prev_action, reward, tuple(game.board.flatten()), [])
    return agent

The train function creates a new Q-learning agent with the given alpha, epsilon, and discount_factor. Then, for each episode, it starts a fresh game with the agent playing X against a random opponent and plays until the game is over.

On the agent’s turns it chooses a move with the choose_action method; on the opponent’s turns a random move is made. After each of the agent’s moves, the Q-value of its previous state-action pair is updated with update_Q_value, and one final update is made at the end of the episode using the terminal reward (+1, -1, or 0).
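
Before launching a long training run, you can do a quick smoke test to confirm everything is wired together (the episode count and hyperparameters here are just illustrative):

# Quick smoke test: a short training run
small_agent = train(num_episodes=1000, alpha=0.5, epsilon=0.2, discount_factor=0.9)
print(f"Learned Q-values for {len(small_agent.Q)} state-action pairs")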

Step 5: Test the Reinforcement Learning Agent

After training the Q-learning agent, we can test its performance against a random player.

We will create a function called test that takes as input the trained Q-learning agent and the number of games to play. It will return the percentage of games won by the Q-learning agent.

Here’s the code for the test function:

def test(agent, num_games):
    num_wins = 0
    for i in range(num_games):
        game = TicTacToe()
        game.current_player = game.players[0]
        while not game.game_over:
            state = tuple(game.board.flatten())
            if game.current_player == game.players[0]:
                # The Q-learning agent plays X
                action = agent.choose_action(state, game.available_moves())
            else:
                # The random opponent plays O
                action = random.choice(game.available_moves())
            game.make_move(action)
        if game.winner == game.players[0]:
            num_wins += 1
    return num_wins / num_games * 100

The test function plays the specified number of games against a random player, with the trained agent playing X. On the agent’s turns it chooses a move with the choose_action method; on the opponent’s turns a random move is made. A game counts as a win when the agent’s symbol ends up as the winner.

Step 6: Run the Code and Analyze the Results

Now that we have implemented the Tic-Tac-Toe game and the Q-learning agent, we can run the code and analyze the results. Here’s an example code to train and test the Q-learning agent:

# Train the Q-learning agent
agent = train(num_episodes=100000, alpha=0.5, epsilon=0.1, discount_factor=1.0)

# Turn off exploration, then test the agent against a random player
agent.epsilon = 0
win_percentage = test(agent, num_games=1000)
print("Win percentage: {:.2f}%".format(win_percentage))

In this example, we train the Q-learning agent for 100,000 episodes with a learning rate of 0.5, an exploration rate of 0.1, and a discount factor of 1.0. We then set epsilon to 0 so the agent always plays its best-known move, and test it over 1,000 games against a random player, reporting the percentage of games the agent wins.

After running the code, we can see the win percentage of the Q-learning agent. We can try experimenting with different hyperparameters to see how they affect the performance of the agent.

That’s It!

In this tutorial, we learned how to build a Tic-Tac-Toe game using reinforcement learning. We implemented a Q-learning agent that learns to play the game by trial and error. The agent takes actions (i.e., makes moves) and receives feedback in the form of rewards or penalties depending on whether the action led to a win, loss, or draw. Over time, the agent learns to make better moves and improve its performance.

We also learned how to train and test the Q-learning agent using the train and test functions. We can use these functions to experiment with different hyperparameters and analyze the performance of the agent.

Reinforcement learning is a powerful technique that can be applied to many different problems, including game playing, robotics, and self-driving cars. By understanding the principles of reinforcement learning, we can develop intelligent systems that can learn from experience and improve their performance over time.



