## Friday, March 30, 2018

### Monte Carlo tree search

This is an attempt to explain MCTS to myself. If I made any mistakes let me know in the comments.

Let's say we have to make three sets of binary decisions (e.g. left/right). This leads to $2^3$ = 8 possible outcomes (A-H) and let's say three of these (A, E, and F) are desirable (i.e. "winners"). We'd like to find one or more of these, without having to go through all the options, which can be represented by a tree-graph as shown here (the circles as called nodes and the lines are called edges).

We start at the top of the tree and pick one path on each side of the tree at random, basically by flipping a coin at each node to decide wether we go left or right (hence Monte Carlo). Let's say we end up at point A and G.

We now go to the second layer of the graph and on each node record whether we found a winner ($w_i$ = 1) or not ($w_j$ = -1) and the number of times we visited the node ($n_i = n_j$ = 1). In the top node we add the wins and losses (1 + -1 = 0) and the total number of attempts ($N$ = 2).

Now we compute the "action" ($a_i$) for each edge in the first branch point according to
$$a_i = \frac{w_i}{n_i}+c \sqrt{\frac{\ln (N)}{n_i}}$$
$c$ is the exploration parameter that, theoretically, is $\sqrt{2}$ but in practise is an empirical parameter. Here I'll use $c$ = 4.

Now we start at the top node and choose the node with the highest action, which in this case it the left one. From this node we choose a direction at random and repeat this process until we hit the end of the tree. Let's say we ended up at C. C is not a winner so we update the $w_i$s and $n_i$s on this path and add  $w_i$ and $n_i$ to the first node we chose at random.

Now we update the actions. Note that because $N$ changes, the action on the right hand side of the tree also changes. In fact, that action is now larger than on the left, so in our fourth exploration we'll now start exploring the right side of the tree further.

We now repeat this process as many times as we can afford. The path with the most action at the end of the simulation is chosen as the best one.

You'll notice that paths A and B will never be explored again because we (arbitrarily) chose to branch to the right at some point. So another option that's often used, is to chose random paths leading in both directions as we did in the very first step.