The Problem With Random Playouts

Introduction

Monte carlo techniques have started to get good results at 9x9 go in the past two or three years. Quick summary: you can evaluate any board position by playing hundreds or thousands of random games from it. Even though each game contains some insanely stupid moves, taken as a whole intelligence emerges: the proportion of games each side wins can be taken as the evaluation of the position (and so, by trying each move and evaluating the resulting position in this way, you can choose the best move). You can also merge the terminal positions of all those random games to build up a probability map of territory ownership.
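
To make that concrete, here is a minimal sketch of monte carlo evaluation and move selection. The board API (position.play, and the playout function you pass in) is hypothetical and stands in for whatever your own program provides; it is not taken from any particular engine.

OPPONENT = {"black": "white", "white": "black"}

def monte_carlo_evaluate(position, to_move, playout, n=1000):
    """Play n random games from `position` with the supplied
    playout(position, to_move) -> winner function and return the
    fraction of them that `to_move` wins."""
    wins = sum(1 for _ in range(n) if playout(position, to_move) == to_move)
    return wins / n

def choose_move(position, legal_moves, to_move, playout, n_per_move=500):
    """Move selection as described above: evaluate the position after
    every legal move and keep the best one.  `position.play` is a
    hypothetical board method, not any particular program's API."""
    def score_after(move):
        child = position.play(move, to_move)
        # The child position is evaluated from the opponent's point of
        # view, so our score is one minus theirs.
        return 1.0 - monte_carlo_evaluate(child, OPPONENT[to_move],
                                          playout, n_per_move)
    return max(legal_moves, key=score_after)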

Monte carlo by itself (with random playouts as described above) only ever beats other weak programs, as far as I know (I do not believe it can beat GnuGo, for instance). It was the advent of UCT that made monte carlo strong. MoGo, and other UCT programs, are clearly the world's strongest at 9x9 go. At 19x19 the situation is less clear: they seem to be about the same level as the strong traditional programs, but use considerably more computer power.

My first reaction to all this was that it cannot be true: how can random play create a strong player?! But the results the strong UCT programs have been getting are irrefutable. Now that I have spent some time studying and experimenting with the algorithms, I have come full circle: this article will discuss why I believe that random play cannot create a strong [1] player.

(This article does not discuss UCT directly. For more on that see Sensei's Library or Wikipedia. See the libego code for a free (GPL) UCT implementation.)

(Minor update, 15 months on: what I refer to as UCT in this article is now more commonly called MCTS (monte-carlo tree search). Also MCTS programs now dominate computer go even at 19x19.)

Types Of Playout

Playouts come in two sizes: light and heavy. Light is usually used as a synonym for uniform random [2], though some simple heuristics also get classed as light. The basic idea is that heavy playouts contain some go knowledge but take longer to run. Heavy playouts can (and, I believe, should) still contain a random element.

Generally heavy playouts make for a stronger program. Each playout takes longer to run, meaning that fewer playouts can be run in the same amount of time, but the increase in focus more than makes up for this. All the strong UCT programs are using heavy playouts.

Don Dailey has performed a very interesting experiment comparing GnuGo, a light-playout UCT program and a heavy-playout UCT program. (See the Compgo mailing list thread). Unfortunately the graph seems to be off-line at the moment. But there were two important points the graph showed. First, each doubling of playouts showed an improvement in strength, even for light playouts: it didn't flatten out within 2 million playouts. Second, heavy playouts were considerably stronger at all playout levels.

It may or may not be important that Don's light playouts are not completely random. They prefer captures: N moves are tested, and if one of them is a capture it is played; otherwise a non-eye-filling random move is chosen. Suicide is never allowed, and simple ko is tested for in the playouts, but not superko.
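
As I read that description, the playout policy is roughly the sketch below. The value of N and the board helpers (legal_moves, is_capture, fills_own_eye) are my guesses for illustration, not Don's actual code.

import random

def light_playout_move(board, to_move, n_capture_tests=4):
    """Choose one move of a capture-preferring 'light' playout:
    sample a few candidate moves and play the first capture found,
    otherwise fall back to a uniform random non-eye-filling move."""
    candidates = board.legal_moves(to_move)   # hypothetical board helper
    if not candidates:
        return None                           # no legal move: pass
    sample = random.sample(candidates, min(n_capture_tests, len(candidates)))
    for move in sample:
        if board.is_capture(move, to_move):   # prefer captures
            return move
    non_eye = [m for m in candidates if not board.fills_own_eye(m, to_move)]
    return random.choice(non_eye) if non_eye else None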

Why Random Sucks: The Theory

Random playouts work well as an evaluation technique when, at every level in the search tree, the proportion of good moves to bad moves is the same for both players. They also work well when there are a large number of good moves to choose from. They work poorly when more precise play is required.

As our first example, consider the board after 20 moves by two random players. There are no patterns or connections, and few touching stones. Let's say that the black stones are in a slightly better position, enough that if two strong players took over the game, black would be likely to win. It is likely that enough random playouts from this position will also discover that black has an advantage.

As our second example, suppose there is a single living/killing move for a large group, one big enough to decide the result of the game. If there are 24 other legal moves, then the random playouts will only find the correct move 4% of the time. If the opponent were perfect, the side to move would lose 96% of the time, even though it has a winning position. However, in random playouts the opponent is also random. So what happens is that in most lines they will both keep missing the key move, until one player hits upon it accidentally. Random is still not doing too badly: the player to move first should win 100% of the playouts, and he still manages to win about 50% of them. (See [8] for an artificial position that shows this clearly.)

As our third example, what if there is one living move and 24 other legal moves, but this time there are 10 killing moves? The player to move first still only has a 4% chance of hitting upon the living move. But now the second player has a 40% chance of killing and winning the game. Very roughly, the second player (the one who should lose the game) is going to win 90% of the playouts. So, instead of the first player winning 100% of the playouts (as he would with perfect play), he only manages to win about 10%.

In our third example the ratio of good to bad moves is not the same for both players. This is when random playouts fare worst.
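
To check the arithmetic in the second and third examples, here is a rough simulation of the race to the key move. It treats each turn as an independent chance of hitting a decisive move, so it ignores the shrinking pool of legal moves; the figures are therefore only approximate, but they match the rough numbers above.

import random

def race_to_key_move(p_first, p_second, trials=100_000, seed=1):
    """Players alternate; each turn the player to move finds their
    game-deciding move with the given probability.  Returns the
    fraction of games won by the first player."""
    rng = random.Random(seed)
    first_wins = 0
    for _ in range(trials):
        while True:
            if rng.random() < p_first:    # first player finds the living move
                first_wins += 1
                break
            if rng.random() < p_second:   # second player finds a killing move
                break
    return first_wins / trials

# Second example: one key move out of 25 for each side.
print(race_to_key_move(1/25, 1/25))    # roughly 0.51
# Third example: 1 living move out of 25 versus 10 killing moves out of 25.
print(race_to_key_move(1/25, 10/25))   # roughly 0.09-0.10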

[Figure: example terminal position]
A position of the third type is shown in the figure. This is a finished game: both players have passed, and it is a 2.5pt win for white (at 5.5pt komi). But white's position is more fragile than black's, and random playouts exploit that: more often than not black will end up killing either the top or the bottom-right white group, and the playouts declare this position a win for black!

In games between weak players such positions are quite rare. But between strong players they are the norm. Why?

Strong players are being forced by the strong play of the opponent to choose more complex lines. More precisely, when the players come out of the early middle game (for instance, after 15 moves on a 9x9 board), both can see that with quiet play one side has won. Let's assume it is black that will win with quiet play. White will therefore be trying to complicate the game, while black will be playing defensive moves, trying to defend the win. So although white really wants to choose a less complex line (the "10 ways to live"), if that line looks like a loss by 1pt he will instead choose the more complex line (the "just one way to live"). For his part, black will try to choose clearer lines, and avoid complexity. (Strong players are still human: they have not read out every line perfectly, and are mostly playing by patterns and heuristics. So I do not mean they are literally seeing the 1 way to live versus the 10 ways to live, just that they are either encouraging or avoiding complexity.)

So, two strong players will tend to create positions that are exactly the kind that random playouts will get wrong: delicately balanced positions where the ease of play is different for black and white. Weak players on the other hand tend to create positions where random playouts can give a more accurate reading.

Why Random Sucks: Experimental Evidence

I collected a number of terminal positions from professional 9x9 games as a test suite [3]. It is a mix of games: some went right to the endgame, some were mid-game resignations. Some were variations discussed by a commentator, and so just stop once it has been shown that the alternative line is slightly stronger or weaker, even though the exact result may still not be very clear.

I then used both GnuGo and Crazy Stone to score the final positions. Only when they agreed [4] on the result was the position used. They agree about 75% of the time. I have not checked, but it seems fair to assume that the rate of agreement is much higher on endgames that were fully played out, and that agreement was poorer where one side resigned (i.e. due to a life/death mistake that perhaps only a professional player can see).

So, the final test suite (of 103 positions) is made up of positions that two reasonable computer go programs (with different styles) seem to understand well enough to score accurately, and I assume they are mostly finished games.

On average the random player played another 52 moves, and got the result correct only 53.25% of the time (by definition GnuGo, and a heavy-playout UCT program like Crazy Stone, score 100%). This was with 100 playouts per end position, which is enough to be within 1% of the true value [5][6]. (Important point: though the UCT algorithms get better with more playouts, monte carlo evaluation only gets more statistically meaningful; it does not get a higher score.)

So How Come...

So a UCT program with random playouts is getting the evaluation wrong half the time even 50 moves deep in the tree, and all that bad information is being fed up the tree. Obviously it is going to play bad moves and be very weak. So how come UCT with random playouts plays so strongly? Don's experiments show that even light playouts can beat GnuGo if you increase the number of playouts.

One aspect is that those "light" playouts were not completely random. But I think the main reason is that GnuGo is not that strong, and can be beaten with approximate play. Like any expert system GnuGo is brittle, and it makes mistakes at all stages of the game. A third reason is that UCT programs tend to be stronger in the endgame than the opening. So once they are doing enough random playouts to get through the opening without falling too far behind, they can exploit GnuGo's endgame weaknesses. And in the opening and early middle game there are many options, the positions are still flexible, and the kind of positions where random playouts go wrong have yet to be created.

Conclusion

Random playouts are a quick and easy way to write a weak go program, but they will never make a really strong go program, no matter how clever your UCT algorithm. Never is a strong word: I am assuming reasonable CPU resources. It is already known that UCT will eventually explore every branch exhaustively and so produce perfect play. But the resources needed to do that exceed those of good old-fashioned alpha-beta. If perfect play is your goal you may as well forget UCT with random playouts, just do an exhaustive alpha-beta, and save some CPU and memory resources.

My other point with this article is that we are back where we started before UCT and monte carlo came along: all the action in computer go is still about clever use of heuristics/patterns/tactical search/whatever to narrow the search. Producing research on the UCT algorithm is wasting time if you are using random playouts, as the same results may not apply to more intelligent playouts. Similarly, time spent optimizing random playouts is wasted effort, as those same optimizations may actually slow down heavy playouts and may make the development of heavy playouts harder [7].

But UCT Is Still Great.

My criticism is of random playouts, not UCT. I like the UCT algorithm and think it will be an essential part of a strong computer go program. What it brings to the table is that our clever use of heuristics/patterns/tactical search/etc. to narrow the search no longer needs to be exact. It can make mistakes. We just need to add enough random noise (or different heuristics that are good in different situations) so that it will be correct some of the time. UCT can take it from there. Or, rephrasing, we can weaken our brittle algorithms with a bit of randomness and UCT will make them strong while also making them adaptable.

Notes

[1]: By a strong player I mean a dan-level player. Specifically, if the opponent is strong enough to accurately count a (9x9) game at move 10-15 (assuming quiet play and standard moves by both sides for the rest of the game), he will adjust his style of play accordingly, and I do not think a UCT program with uniform random playouts will be able to beat him without so many playouts that almost the whole tree is being expanded.

[2]: What I am calling random playouts for the purposes of this article give all legal moves equal weight and randomly choose one of them, and this process is used for both players all the way to the end of the game. Single-point eyes are not filled. Different programs allow single- or multi-stone suicide, or multi-stone suicide only, or no suicide at all. For the thesis presented here, this difference does not matter.
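
For concreteness, here is a minimal sketch of such a playout (the sort of playout function the evaluation sketch in the introduction assumed). The board helpers and the pass/termination handling are assumptions for illustration, not any particular program's rules.

import random

def uniform_random_playout(board, to_move, max_moves=400):
    """Play uniform random, non-eye-filling legal moves for both sides
    until two passes in a row (or a safety limit), then report the winner.
    `legal_moves`, `fills_own_eye`, `play`, `opponent` and `score` are
    hypothetical board helpers."""
    consecutive_passes = 0
    for _ in range(max_moves):
        moves = [m for m in board.legal_moves(to_move)
                 if not board.fills_own_eye(m, to_move)]
        if moves:
            board.play(random.choice(moves), to_move)
            consecutive_passes = 0
        else:
            consecutive_passes += 1        # nothing sensible left: pass
            if consecutive_passes == 2:
                break
        to_move = board.opponent(to_move)
    # board.score() is assumed to return the final margin from black's
    # point of view, komi included; positive means black wins.
    return "black" if board.score() > 0 else "white"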

[3]: In a recent paper, Combining Online and Offline Knowledge in UCT (Sylvain Gelly, David Silver), section 5, the authors describe a test suite of 200 hand-labeled positions that is similar to my test suite above: "In our experiments with MoGo the [score on this test suite] appears to have a close relationship with playing strength." I know some people believe the only way to measure algorithm improvements is by playing full games, but this supports my view that scores on test suites can correlate strongly with actual playing strength. Which is good, because they are so much quicker to run.

[4]: Score agreement definition. Scores ignore komi, and a negative score means black is ahead on the board (so more negative is better for black). In other words, -6 is a win for white at 6.5pt komi but a win for black at 5.5pt komi. For scores of -4, -5, -6, -7 and -8 I require an exact match. But for -3 or higher I just require that both opinions say -3 or higher (i.e. both agree that it is a clear white win). Similarly, for -9 or less I only require that both opinions say -9 or less (i.e. they both say it is a clear black win). This same comparison method was used in the experiment when deciding whether an answer was correct or not.
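
The same rule restated as a small function (my own restatement of the note, not the script actually used):

def scores_agree(a, b):
    """Agreement rule from note [4].  Scores are raw board scores with komi
    ignored; negative means black is ahead on the board.  At -3 or above both
    scores just have to be -3 or above (a clear white win at the komi used);
    at -9 or below both just have to be -9 or below (a clear black win);
    in the band -8..-4 an exact match is required."""
    if a >= -3 and b >= -3:
        return True
    if a <= -9 and b <= -9:
        return True
    return a == b    # the in-between band (and mixed cases): exact match only

# A few examples of how it behaves:
assert scores_agree(-2, 5)        # both a clear white win
assert scores_agree(-10, -15)     # both a clear black win
assert scores_agree(-6, -6)       # exact match inside the band
assert not scores_agree(-6, -7)   # inside the band but different: disagree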

[5]: Don Dailey reported he gets improvements in a non-UCT, nearly-uniform-random monte carlo program up to about 5000 playouts (i.e. a measurable increase in playing strength compared to 2500 playouts). However this is over the whole game. We can assume many more playouts are needed to get a statistically reasonable estimate in the opening, compared to later stages of the game. By the time the game is mostly finished and fairly stable, the experimental evidence shown here says that 30-60 playouts is sufficient to get a result about as good as random playouts are going to give.

[6]: In fact just 30 playouts is enough to get a fair idea. At 30 playouts it was correct 51.74% of the time, at 60 playouts 54.06%, at 200 playouts 53.87%, and at 1000 playouts 54.17%.

[7]: There is no strict definition of what a heavy playout is; it just means more intelligent than random. Here are some suggestions I picked up from the compgo mailing list (unintentional, but all three posts are by Valkyria author Magnus Persson): general advice, general advice, on ladders.

[8]: Heikki Levanto posted these in a Dec '07 compgo mailing list thread. The evaluations are by libego, which uses random playouts. In the first position black thinks he is losing (only 19% of playouts lead to a win), though in fact white in the bottom-left is clearly dead. (It looks like black will always capture white, whatever happens (libego plays all legal moves, including throw-ins). But if black plays H1 or J2 at any time in any playout, the tables turn and he dies. White can never play H1 or J2, so it is probable black will play one of those moves in every playout! The exception is that black will not play them once E1, E2 and E3 have been filled in, but that is less likely than playing H1 or J2 first!)

The second position is of even more interest in relation to this article. It has a critical point at A1: if white plays there he wins. If white plays anywhere else and black plays A1 then white loses. (I assume black plays first in the playouts, which explains the 77% chance of winning.)

playout_benchmark 10000
= Initial board:
komi 7.5
   A B C D E F G H J
 9 . . . . . O O O O 9
 8 O O O O O O O O O 8
 7 O O O O O O O O O 7
 6 O O O O O O O O O 6
 5 # # # # # # # # # 5
 4 O O O # # # # # # 4
 3 O O O O . # # # # 3
 2 . O O O . # # # . 2
 1 # . O O . # # . # 1
   A B C D E F G H J
Performance:
  10000 playouts
  0.032002 seconds
  312.481 kpps
Black wins = 1937
White wins = 8063
P(black win) = 0.1937


playout_benchmark 10000
= Initial board:
komi 7.5
   A B C D E F G H J
 9 . # . . . O O O O 9
 8 O O O O O O O O O 8
 7 O O O O O O O O O 7
 6 O O O O O O # # # 6
 5 # # # # # # # # # 5
 4 O O O # # # # # # 4
 3 O O O O . # # # # 3
 2 . O O O . # # # . 2
 1 . . O O . # # . # 1
   A B C D E F G H J
Performance:
  10000 playouts
  0.084006 seconds
  119.039 kpps
Black wins = 7746
White wins = 2254
P(black win) = 0.7746


© Copyright 2007 Darren Cook <darren@dcook.org>

Revision History
2007-07-17: First release
2007-07-26: Minor revisions based on reviewer feedback
2007-08-02: Revisions based on feedback on computer-go mailing list. Some content moved to notes, and notes expanded.
2007-12-10: Added note [8]. Fixed broken link.
2008-10-15: Minor updates.