AlphaGo - What did they do?

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves.

Technique

AlphaGo introduces a new approach to computer Go that uses value networks to evaluate board positions, policy networks to select moves, and a new search algorithm that combines Monte Carlo simulation with the value and policy networks.

These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play.

Supervised learning of policy networks

The SL policy network alternates between convolutional layers and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves. The input to the policy network is a simple representation of the board state.
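
To make the architecture concrete, here is a minimal PyTorch sketch of such a network. It follows the paper's rough dimensions (a 19x19 board encoded as 48 feature planes, 192 filters per hidden layer, 13 layers in total), but the exact feature planes, padding choices, and legality masking of the output are simplified assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IN_PLANES = 48   # feature planes encoding the board state (48 in the paper)
FILTERS = 192    # filters per hidden layer (192 in the strongest configuration)

class PolicyNet(nn.Module):
    """13-layer convolutional policy network: alternating convolutions and
    ReLUs, then a 1x1 convolution and a softmax over the 19x19 move grid."""

    def __init__(self):
        super().__init__()
        self.conv_in = nn.Conv2d(IN_PLANES, FILTERS, kernel_size=5, padding=2)
        self.hidden = nn.ModuleList(
            [nn.Conv2d(FILTERS, FILTERS, kernel_size=3, padding=1) for _ in range(11)]
        )
        self.conv_out = nn.Conv2d(FILTERS, 1, kernel_size=1)  # one logit per board point

    def forward(self, x):                     # x: (batch, 48, 19, 19)
        x = F.relu(self.conv_in(x))
        for conv in self.hidden:
            x = F.relu(conv(x))
        logits = self.conv_out(x).flatten(1)  # (batch, 361)
        return F.softmax(logits, dim=1)       # probability over all board points
```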

The policy network is trained on randomly sampled state-action pairs, using stochastic gradient ascent to maximize the likelihood of the human move being selected in that state.

The result is a 13-layer policy network, called the SL policy network, trained on 30 million positions from the KGS Go Server.
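
As a hedged sketch of that training step: sample a minibatch of (state, human move) pairs and take a gradient step that increases the log-likelihood of the human move, which is equivalent to minimizing a standard cross-entropy loss. The snippet reuses the PolicyNet sketch above; the learning rate and the random stand-in data are illustrative, not the paper's settings.

```python
import torch

policy = PolicyNet()                                  # the 13-layer sketch above
opt = torch.optim.SGD(policy.parameters(), lr=0.003)  # illustrative learning rate

def sl_step(states, human_moves):
    """One stochastic-gradient step on a minibatch of (state, human move) pairs.

    Maximizing the log-likelihood of the human move is the same as minimizing
    cross-entropy: the loss is the negative log probability the network
    assigns to the move the human actually played.
    """
    probs = policy(states)                                         # (batch, 361)
    log_likelihood = torch.log(probs.gather(1, human_moves.unsqueeze(1)) + 1e-12)
    loss = -log_likelihood.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage with random stand-in data (the real training set was ~30M KGS positions)
states = torch.randn(8, 48, 19, 19)
human_moves = torch.randint(0, 361, (8,))
sl_step(states, human_moves)
```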

Reinforcement learning of policy networks

The RL policy network is identical in structure to the SL policy network, and its weights are initialized to the same values. During training, games are played between the current policy network and a randomly selected previous iteration of the policy network; randomizing over a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. The reward function is zero for all non-terminal time steps; the outcome is the terminal reward at the end of the game from the perspective of the current player: +1 for winning and -1 for losing. Weights are then updated at each time step by stochastic gradient ascent in the direction that maximizes the expected outcome.
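
This is a REINFORCE-style policy-gradient update: each move the current network played is reinforced in proportion to the final outcome z (+1 for a win, -1 for a loss, from that player's perspective). Below is a minimal sketch under stated assumptions: it reuses the PolicyNet from above, and `play_game` is a hypothetical helper that plays one game against an opponent sampled from the pool and returns the current player's states, chosen moves, and outcome.

```python
import copy
import random

import torch

rl_policy = PolicyNet()                        # in the paper, initialized from the SL weights
opponent_pool = [copy.deepcopy(rl_policy)]     # frozen earlier iterations of the policy
opt = torch.optim.SGD(rl_policy.parameters(), lr=0.001)

def rl_step(play_game):
    """One REINFORCE-style update from a single self-play game.

    `play_game(current, opponent)` is a hypothetical helper that plays one game
    and returns the current player's states as a (T, 48, 19, 19) tensor, its
    chosen moves as a (T,) long tensor, and the outcome z in {+1, -1} from the
    current player's perspective.
    """
    opponent = random.choice(opponent_pool)    # randomizing opponents stabilizes training
    states, moves, z = play_game(rl_policy, opponent)

    probs = rl_policy(states)                                      # (T, 361)
    log_probs = torch.log(probs.gather(1, moves.unsqueeze(1)) + 1e-12)
    loss = -(z * log_probs).mean()             # gradient ascent on the expected outcome
    opt.zero_grad()
    loss.backward()
    opt.step()

    # periodically freeze a copy of the current network into the opponent pool
    if random.random() < 0.05:
        opponent_pool.append(copy.deepcopy(rl_policy))
```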


Reinforcement learning of value networks

The final stage of the training pipeline focuses on position evaluation: estimating a value function that predicts the outcome from a given position, with games played using the same policy for both players.

The weights of the value network are trained by regression on state-outcome pairs, using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value and the corresponding outcome.

To mitigate overfitting, a new self-play data set was created, consisting of 30 million distinct positions, each sampled from a separate game.

Each game was played between the RL policy network and itself until the game terminated.
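
A minimal sketch of this stage is shown below: a value network with a convolutional trunk like the policy network but a single tanh output, fit by SGD on (position, outcome) pairs with an MSE loss. The paper's value network differs in small ways (for example, an extra input feature for the colour to play); those details are simplified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNet(nn.Module):
    """Convolutional trunk similar to the policy network, ending in a fully
    connected layer and a single tanh output in [-1, 1]."""

    def __init__(self, planes=48, filters=192, board=19):
        super().__init__()
        self.conv_in = nn.Conv2d(planes, filters, kernel_size=5, padding=2)
        self.hidden = nn.ModuleList(
            [nn.Conv2d(filters, filters, kernel_size=3, padding=1) for _ in range(11)]
        )
        self.fc1 = nn.Linear(filters * board * board, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, x):
        x = F.relu(self.conv_in(x))
        for conv in self.hidden:
            x = F.relu(conv(x))
        x = F.relu(self.fc1(x.flatten(1)))
        return torch.tanh(self.fc2(x))        # predicted outcome, in [-1, 1]

value_net = ValueNet()
opt = torch.optim.SGD(value_net.parameters(), lr=0.001)

def value_step(positions, outcomes):
    """One SGD step minimizing the MSE between predicted value and outcome.

    Each (position, outcome) pair comes from a *different* self-play game,
    which keeps training examples largely uncorrelated and mitigates overfitting.
    """
    pred = value_net(positions).squeeze(1)    # (batch,)
    loss = F.mse_loss(pred, outcomes)         # outcomes are +1 (win) or -1 (loss)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```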

Searching with policy and value networks

AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search. Evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics, so to combine MCTS with deep neural networks efficiently, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs and computes the policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.
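
To illustrate how the pieces fit together, here is a heavily simplified, single-threaded sketch of one MCTS simulation in the spirit of the paper: selection by an action value Q plus a prior-weighted exploration bonus u(P), expansion of the leaf with the policy network's move priors, evaluation of the leaf with the value network, and backup along the visited path. `policy_net`, `value_net`, `legal_moves`, and `apply_move` are assumed, hypothetical helpers (a state-to-priors function, a state-to-scalar function, and two game-rule utilities); the asynchronous CPU/GPU pipeline, the fast rollout mixed into the leaf evaluation, and the sign flip between the two players' perspectives are all omitted.

```python
import math

C_PUCT = 5.0                                  # exploration constant (illustrative)

class Node:
    """One state in the search tree; each child edge stores its statistics."""

    def __init__(self, prior=1.0):
        self.prior = prior                    # P(s, a) from the policy network
        self.visits = 0                       # N(s, a)
        self.value_sum = 0.0                  # W(s, a)
        self.children = {}                    # move -> Node

    @property
    def q(self):                              # mean action value Q(s, a)
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node):
    """Pick the move maximizing Q(s, a) + u(s, a), where the exploration bonus
    u is proportional to the prior and decays as the edge is visited."""
    sqrt_total = math.sqrt(sum(c.visits for c in node.children.values()) + 1)
    return max(
        node.children.items(),
        key=lambda mc: mc[1].q + C_PUCT * mc[1].prior * sqrt_total / (1 + mc[1].visits),
    )

def simulate(root, state, policy_net, value_net, legal_moves, apply_move):
    """One simulation: select down to a leaf, expand it with policy priors,
    evaluate it with the value network, and back the value up the path."""
    node, path = root, [root]
    while node.children:                      # 1. selection by Q + u(P)
        move, node = select_child(node)
        state = apply_move(state, move)
        path.append(node)

    priors = policy_net(state)                # 2. expansion with policy-network priors
    for move in legal_moves(state):
        node.children[move] = Node(prior=priors[move])

    value = value_net(state)                  # 3. leaf evaluation by the value network
    for visited in path:                      # 4. backup along the visited path
        visited.visits += 1
        visited.value_sum += value
```

After many such simulations, the move actually played is the one with the highest visit count at the root.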

Conclusion

AlphaGo achieved a 99.8% winning rate against other Go programs and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

If you found this article interesting, please join our mailing list so you don't miss any updates, and schedule a free, no-obligation meeting with us to discuss your Analytics and AI needs.

Cheers, Andrea