## Background

(TL;DR if you are not interested in knowing *why* I wrote this page, you can skip reading the background and jump directly to the implementation part below)

In the 90s, I had my first fascination into artificial intelligence. I had seen movies like Wargames and read absolutely everything written by Isaac Asimov and the thought of creating something that showed signs of intelligence in a computer was fascinating to me. I had some programming experience from the C64 and had also written some games in C for an early PC that my mother had, but this was before I had decided to work in the field of computer science and I was only 16.

If one wants to simplify the state of artificial intelligence back then one could say that it was rather traditional – text processing was a lot about pattern matching, matching against huge databases (so called expert systems) was a thing and board games AI (like chess or Othello) was mostly about deep searching and evaluating the next move by exploring all possible moves as deep as the hardware allowed and often some special custom-made hardware was required in order to compete with the world champions, like Deep Blue from IBM. Since I did not have access to powerful computers I never got interested in any of those methods for brute-force searching all possible moves (like doing min-max search for instance, which is a method described in any of the more classic AI books at the time). The thought of encoding expert knowledge into machine code and using brute force to solve a problem felt like cheating as I had this romantic thought about what AI was which did not fit into that picture. But then came the advances in artificial neural networks (ANN) and the back-propagation algorithm which, to me, felt like, finally, the “real thing”. Perhaps this was also emphasised to me by the fact that all articles about ANN started with some information about real neurons, axons and synapses and how the ANN technology was about mimicking real neurons.

My interest in programming grew and I started work on making different games. We had a programming class which taught Pascal (using Borland’s Turbo Pascal) and I built an Othello board game that my class mates liked to play during class as many of them were not as interested in programming as I was. The Othello programming effort focused on the visual aspects of Othello though, as it had a very fancy, animated ways of flipping the checkers in the air as they turned over (Othello is about flipping checkers to your colour and the one with most checkers in the end wins – for further info about the rules, see the game rules at worldothello.org) and its AI was very similar to an Othello beginner’s strategy – it tried to, in each move, to capture as many of the opponent’s checkers, where it used a simple weight on the positions when it compared the different possible moves (in Othello it is (often – not always) a very good idea to capture the corners of the board, thus it is possible to describe a board with weights that could lead to a state where you might be able to capture the corners).

Even though some of my class mates enjoyed playing against the above mentioned AI, I had, unfortunately, during the process of developing the game, learnt the game quite well myself, so I always thought the opponent was too easy to defeat. Therefore I started to read up on different methods to enhance the game-play, but, as described above, I never became interested in writing a method that did a min-max deep search or something similar. I had started to learn about artificial neural networks and I had used its, at that time relatively newly invented, back-propagation algorithm in various projects (in one attempt I tried to predict the outcome of premier league football games in order to beat my dad in Stryktipset – a Swedish football lottery thing). Anyway, to cut the story short (as I have already written a lot of irrelevant stuff already here), I found an article by Gerard Tesauro where he used a method called TD(λ) to train an artificial neural network to play backgammon, so I decided to implement that (in C, which was the language I had learnt after Pascal).

Since my math skills were limited (things like gradient decent and Newton-Raphson’s method was something I learnt about later when I grew up) it took me a lot of time just to try and work out what the math formulas meant and to apply that to some data model of a neural net. Since this was before the Internet I did not have any easy way to grab some ready-made ANN-library (if one existed back then even?), so most of the effort was spent on the math and figuring out how to implement that in C. The thing worked in the end, and the system could learn some basic knowledge about Othello by playing against itself, but it never really managed to beat me in a game, which was a bummer. The whole purpose for me was to explore if it is possible to create something that learns to become better than me without giving it lots of expert knowledge or brute force power, so this was a failure.

Fast forward to now – after having worked on many other fields in computer science except AI/ML for quite some time – I decided to, just out of nostalgia and curiosity to return to my old Othello interest and also to refresh myself on the status of AI/ML at the same time. So much has happened during these (soon) 30 years of development. The Internet has happened and the amount of available datasets for fitting models is all of a sudden huge, the open source movement has *happened* which often removes the need for someone to sit alone in the bedroom trying to re-build the wheel, computer performance has improved a lot and it does no longer take weeks to fit a model to some data on a basic laptop and, for some bizarre reason, Python was invented and that became the platform for the AI/ML community with open-source tools like Tensorflow, PyTorch, Numpy and Keras being developed to just name a few of the amazing advances in available AI/ML libraries. You no longer need to fiddle around with the data model on how to represent your neural network, and you no longer need to debug your strange implementation of taking the gradient of some complex function – the open-source libraries does all of that for you and you can focus on the real problem – what you want to achieve with your AI. So in the rest of this post, I will just try to document my newbie attempt of trying to reproduce my old Othello experiment in TD(λ), by making use of what is available today. Mind you, I have not worked a lot with Python (except for the name – I love Monty Python – there are many things with Python that does not agree well with me), and I have never worked with tensorflow, Keras or the other tools before, so there are likely a lot of beginner’s mistakes below. But it is fun to be a beginner in something again I think.

## Implementation

First we need to represent the environment – the game board. You can do that in many ways, such as arrays of arrays of integers, where an integer would have any of three values – occupied, white or black, but I chose to implement the board as two 64 bit integers (the board in Othello is 8×8, so it fits into 64 bits) where the first integer has a one where there is a corresponding white piece and the second integer has a one where there is a corresponding black piece. Something like this:

UNOCCUPIED = 0 WHITE = 1 BLACK = 2 def color_to_string(color): c = ' · ' if color == WHITE: c = ' x ' if color == BLACK: c = ' o ' return c class Board: def __init__(self): self.board = [np.uint64(0), np.uint64(0)] self.set(3, 3, WHITE) self.set(4, 4, WHITE) self.set(3, 4, BLACK) self.set(4, 3, BLACK) def set(self, x, y, color): index = -1 if color == WHITE: index = 0 else: index = 1 if index != -1: mask = np.uint64(1)<<np.uint64(y * 8 + x) self.board[index] = np.bitwise_or(self.board[index], mask) def clear(self, x, y): mask = np.uint64(1)<<np.uint64(y * 8 + x) self.board[0] = np.bitwise_and(self.board[0], ~mask) self.board[1] = np.bitwise_and(self.board[1], ~mask) def get(self, x, y): mask = np.uint64(1)<<np.uint64(y * 8 + x) if np.bitwise_and(self.board[0], mask) > 0: return WHITE if np.bitwise_and(self.board[1], mask) > 0: return BLACK return UNOCCUPIED def print(self): print(' a b c d e f g h') for y in range(8): print('{} '.format(y+1), end='') for x in range(8): color = self.get(x, y) c = color_to_string(color) print(c, end = '') print('')

After that we implement how pieces get flipped as you place a piece in a certain place in a board together with some helpful functions to generate all possible moves a given colour can do on a given board.

def other_color(color): if color == WHITE: return BLACK if color == BLACK: return WHITE def make_move(x, y, color, b): if x < 0 or x > 7 or y < 0 or y > 7 or b.get(x, y) != UNOCCUPIED: return 0 dx = [-1, 0, 1, -1, 1, -1, 0, 1] dy = [-1, -1, -1, 0, 0, 1, 1, 1] flipped = 0 otherColor = other_color(color) for d in range(8): cx = x + dx[d] cy = y + dy[d] hasFoundMyOwnAtEnd = False while cx >= 0 and cy >= 0 and cx < 8 and cy < 8 and b.get(cx, cy) == otherColor: cx += dx[d] cy += dy[d] hasFoundMyOwnAtEnd = True if cx >= 0 and cy >= 0 and cx < 8 and cy < 8 and b.get(cx, cy) == color and hasFoundMyOwnAtEnd: # this is a valid move, we found our own color after # step all the way back to start and flip all of the other color cx -= dx[d] cy -= dy[d] while cx != x or cy != y: b.clear(cx, cy) b.set(cx, cy, color) cx -= dx[d] cy -= dy[d] flipped += 1 if flipped > 0: b.set(x, y, color) return flipped def to_coord(x, y): return '{}{}'.format("abcdefgh"[x], y+1) def get_possible_boards(b, color): boards = [] for y in range(8): for x in range(8): clone = deepcopy(b) flipped = make_move(x, y, color, clone) if flipped > 0: boards.append((clone, flipped, to_coord(x, y))) return sorted(boards, key=lambda a: a[1], reverse = True) def game_over(b): return len(get_possible_boards(b, WHITE)) == 0 and len(get_possible_boards(b, BLACK)) == 0

In order to simulate games between different opponents we create a representation of a Player, like so:

class Player(ABC): def __init__(self, c): self.mycolor = c self.verbose = True @abstractmethod def make_move(self, b) -> Tuple[Board, str]: pass def reset(self): pass def contemplate(self, b, game_over=False): pass def setVerbosity(self, verbose): self.verbose = verbose

, where *make_move* is the player’s method to make a move given a board, *reset* is a way to reset any player state between consecutive games, and *contemplate* is for giving a player a chance to reflect upon the game after having made a move (most player types does not need to reflect upon the game but this is where one can hook in some AI/ML code to give any AI player a chance to learn something about the game.

A very naive player that just make any random move at a given board state would then be implemented like so:

class RandomPlayer(Player): def __init__(self, c): super().__init__(c) def make_move(self, b) -> Tuple[Board, str]: boards = get_possible_boards(b, self.mycolor) if len(boards) > 0: b, _, coord = random.choice(boards) return (b, coord) return (None, '')

…or an interactive player that ask the user in front of the computer what game to play would be implemented like so:

class InteractivePlayer(Player): def __init__(self, c): super().__init__(c) def make_move(self, b) -> Tuple[Board, str]: boards = get_possible_boards(b, self.mycolor) if len(boards) > 0: done = False while not done: b.print() move = input("enter your move: ") if move == 'q': done = True if len(move) == 2: col = move[0] row = move[1] x = ord(col) - ord('a') y = ord(row) - ord('0') - 1 if make_move(x, y, self.mycolor, b) > 0: return (b, move) else: print("not a valid move") else: print("not a valid move") return (None, '')

Any beginner of Othello will likely also go through a phase where they eagerly try to, in every move, try to capture as many pieces as possible (which is usually not a very successful strategy in Othello):

class GreedyPlayer(Player): def __init__(self, c): super().__init__(c) def make_move(self, b) -> Tuple[Board, str]: boards = get_possible_boards(b, self.mycolor) if len(boards) > 0: b, _, coord = boards[0] return (b, coord) return (None, '')

So, with some player types implemented we can define how a game is played (an Othello game is played until the entire board is full or that no player can make a move):

def play_game(blackPlayer, whitePlayer, verbose=True, allowContemplation=True): b = Board() blackPlayer.reset() whitePlayer.reset() blackPlayer.setVerbosity(verbose) whitePlayer.setVerbosity(verbose) coords = [] while True: # black player makes a move (if possible) blackMadeMove = False move, coord = blackPlayer.make_move(b) if move is not None: b = move blackMadeMove = True coords.append(coord) if verbose: print('black plays {}'.format(coord)) b.print() print("---") if allowContemplation: blackPlayer.contemplate(b) # white player makes a move (if possible) whiteMadeMove = False move, coord = whitePlayer.make_move(b) if move is not None: b = move whiteMadeMove = True coords.append(coord) if verbose: print('white plays {}'.format(coord)) b.print() print('') print("---") if allowContemplation: whitePlayer.contemplate(b) # if no player could make a move we are done if not blackMadeMove and not whiteMadeMove: if verbose: print("no player can make a move - game over") break # count final standing whites, blacks = b.count() if allowContemplation: blackPlayer.contemplate(b, True) whitePlayer.contemplate(b, True) return (whites, blacks, coords)

A game can then be played, for instance between the Greedy player and the Random player, like so:

whites, blacks, coords = play_game(GreedyPlayer(BLACK), RandomPlayer(WHITE), verbose = True) if whites > blacks: print("WHITE WON!") else: if blacks > whites: print("BLACK WON!") else: print("THERE WAS A TIE!") print('({} whites vs {} blacks)'.format(whites, blacks))

So, then, if we want to make a player that uses temporal difference learning we can build it like so if we use the keras library to represent the neural network:

from keras.models import Sequential from keras.layers import Dense from keras.layers import InputLayer import tensorflow as tf def get_model(n_inputs, n_outputs, n_hidden): model = Sequential() model.add(InputLayer(input_shape=(n_inputs, ), name='input_layer')) model.add(Dense(n_hidden, kernel_initializer='he_uniform', activation='sigmoid', name='hidden_layer')) model.add(Dense(n_outputs, name='output_layer', activation='sigmoid')) model.compile(loss='mae', optimizer='adam') return model import pickle class NN: def __init__(self, params): self.model = get_model(**params) def evaluate(self, board) -> float: X_values = board.toInputVector() return self.model.predict(X_values, verbose=0)[0][0] def trainable_variables(self): return self.model.trainable_variables ALPHA = 0.1 LAMBDA = 0.7 class TDPlayer(Player): def __init__(self, c, model = None): super().__init__(c) params = {'n_inputs': 128, 'n_hidden': 60, 'n_outputs': 1} if model is not None: self.model = model else: self.model = NN(params) self.predictions = [] # predictions holds tuples of (board, prediction), from first move to last self.t = 0 self.grads = [] self.gamesPlayed = 0 def save(self, fname): with open(fname, 'wb') as file: persistedState = {'gamesPlayed': self.gamesPlayed, 'model': self.model} pickle.dump(persistedState, file) def load(self, fname): with open(fname, 'rb') as file: self.reset() persistedState = pickle.load(file) print(persistedState) self.gamesPlayed = persistedState['gamesPlayed'] self.model = persistedState['model'] def reset(self): self.predictions = [] self.t = 1 self.grads = [] self.gamesPlayed += 1 def make_move(self, b) -> Tuple[Board, str]: boards = get_possible_boards(b, self.mycolor) bestCandidate = None bestCoord = '' bestSoFar = -np.Inf if len(boards) > 0: for bs in boards: candidate, _, coord = bs p = self.model.evaluate(candidate) if self.mycolor == BLACK: p = 1 - p if p > bestSoFar: bestCandidate = candidate bestCoord = coord bestSoFar = p return (bestCandidate, bestCoord) def contemplate(self, b, game_over=False): # adjust the weights according to TD(λ) X_values = b.toInputVector() with tf.GradientTape() as tape: p = self.model.model(X_values) trainable_vars = self.model.trainable_variables() gradients = tape.gradient(p, trainable_vars) self.predictions.append(p[0][0]) self.grads.append(gradients) t = self.t if len(self.predictions) > 1: # update weights in all layers (len(trainable_vars) == len(grads), but each trainable_vars[i] is a tensor but grads[i] is a list) for layer in range(len(trainable_vars)): layer_trainable_vars = trainable_vars[layer] weighted_gradient_sum = tf.zeros(shape=layer_trainable_vars.shape) for k in range(1, t): layer_gradients_k = tf.Variable(self.grads[k-1][layer], shape=layer_trainable_vars.shape) weighted_gradient_sum = tf.math.add( weighted_gradient_sum, tf.math.scalar_mul(LAMBDA ** (t - k), layer_gradients_k)) Y_t_plus_1 = self.predictions[t-1] #this would be the real outcome if we reach end of game if game_over: # count final standing whites, blacks = b.count() if whites > blacks: Y_t_plus_1 = 1.0 else: if whites == blacks: Y_t_plus_1 = 0.5 else: Y_t_plus_1 = 0.0 Y_t = self.predictions[t-2] reward = ALPHA * (Y_t_plus_1 - Y_t) weight_delta = tf.math.scalar_mul(reward, weighted_gradient_sum) layer_trainable_vars.assign_add(weight_delta) self.t += 1

So, in every step it will make a prediction of the outcome by encoding the board state for all possible moves and running it through the neural network. The single output neuron represents the likelihood that white will win, so a black TDPlayer will chose the move that has the minimum output value in the output neuron and the opposite for a white TDPlayer. The contemplate function will be called after each move and it will update the weights according to the TD(λ) formula suggested by Sutton et.al., which, on some higher level, means that the weights are modified in a manner to make any prediction about the final game state be more close to the next prediction made in the game (which is intuitive in a way as any later prediction is likely more correct than any previous prediction).

, where t is the move number, P is the likelihood of white winning and ????w is the gradient of the weight when doing the prediction at any given game state. So in order to update the weights in every state one needs to do some bookkeeping of predictions and gradients while playing the game.

Once the TDPlayer is implemented one can tell it to play against itself in order to learn some stuff about the game, like so:

# self training F_NAME = 'tdPlayer.pickle' # initialize white player whitePlayer = TDPlayer(WHITE) # load model from file whitePlayer.load(F_NAME) # initialize black player with white's model (so they use the same model) blackPlayer = TDPlayer(BLACK, model=whitePlayer.model) N_EPOCHS = 200 N_GAMES_PER_EPOCH = 500 N_TESTS_PER_EPOCH = 100 for epoch in range(N_EPOCHS): print('training epoch {}'.format(epoch+1)) for g in range(N_GAMES_PER_EPOCH): if g % 5 == 0: print('.', end = '') play_game(blackPlayer, whitePlayer, verbose=False) whiteWins = 0 blackWins = 0 print('\ntesting after epoch {}'.format(epoch+1)) for t in range(N_TESTS_PER_EPOCH): if t % 5 == 0: print('.', end = '') whites, blacks, coords = play_game(RandomPlayer(BLACK), whitePlayer, verbose=False) if whites > blacks: whiteWins += 1 else: if blacks > whites: blackWins += 1 print('\ntest results after epoch {}'.format(epoch+1)) print('white wins: {}, black wins: {} ({}%)'.format(whiteWins, blackWins, round(100.0 * whiteWins / N_TESTS_PER_EPOCH, 2))) whitePlayer.save(F_NAME)

So, in the example above, a white TDPlayer will play against a black TDPlayer 200 epochs of 500 games each. After each epoch of 500 games we will do a verification of its skills by letting a white TDPlayer play against a black RandomPlayer. After a few thousands of games played, where weights are adjusted after each move, one can see that the TDPlayer is capable of beating the RandomPlayer more than 95% of the time. But what does that mean really? The RandomPlayer is an extremely bad player (except for when it gets very lucky obviously). A reasonably good player should be able to beat it hundred times out of hundred and I still have not managed to train it well enough for it to beat me, which was the real goal of the project. Its learning rate seems to slow down also. The first hours it went quickly above 60% wins, but now, for the last couple of days, it is struggling to push it further and closer to the 100% wins. So if anyone has any input on anything written above, if I have made any errors for instance, please don’t be shy to let me know as I am eager to improve this model. If anyone has more input about the TD-learning approach I would be interested to hear also. I am particularly curious about how the weight update formula came about and what it really means. Normally, in neural networks you have, a loss function which you try to minimise by some gradient descent method, but in this TD approach it is not obvious to me what the loss function we are trying to minimise really is. Since the comments fields of blogs like these are typically just filled with bot content I can be reached at this address if you have any comments.

The code for the experiment is posted at https://github.com/peffis/othello