
Mastering the Game of Go with Deep Neural Networks and Tree Search

2022-07-08 02:07:00 Wwwilling


Authors: David Silver*, Aja Huang*, Chris J. Maddison, et al.
Paper title: Mastering the Game of Go with Deep Neural Networks and Tree Search
Year: 2016
Journal: Nature
https://github.com/jmgilmer/GoCNN

Abstract

  • The game of Go has long been viewed as the most challenging of the classic games for artificial intelligence, owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses "value networks" to evaluate board positions and "policy networks" to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs and defeated the human European Go champion by 5 games to 0. This is the first time a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

  • All games of perfect information have an optimal value function $v^*(s)$, which determines the outcome of the game from every board position or state $s$ under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately $b^d$ possible sequences of moves, where $b$ is the game's breadth (number of legal moves per position) and $d$ is its depth (game length). In large games such as chess ($b \approx 35$, $d \approx 80$) and especially Go ($b \approx 250$, $d \approx 150$), exhaustive search is infeasible, but two general principles can reduce the effective search space. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state $s$ and replacing the subtree below $s$ by an approximate value function $v(s) \approx v^*(s)$ that predicts the outcome from state $s$. This approach has led to superhuman performance in chess, checkers and Othello, but was believed to be intractable in Go owing to the complexity of the game. Second, the breadth of the search may be reduced by sampling actions from a policy $p(a|s)$, a probability distribution over possible moves $a$ in position $s$. For example, Monte Carlo rollouts search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy $p$. Averaging over such rollouts can provide an effective position evaluation, achieving superhuman performance in backgammon and Scrabble, and weak amateur-level play in Go.
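To make the breadth-reduction principle concrete, here is a minimal, hypothetical sketch of rollout-based position evaluation: the value of a state is estimated by sampling complete games from a policy $p(a|s)$ for both players and averaging the outcomes. The `state` interface (`copy`, `legal_moves`, `play`, `is_terminal`, `outcome`) and the `policy` callable are illustrative assumptions, not part of the paper.

```python
import random

def rollout_value(state, policy, num_rollouts=100):
    """Estimate v(s) by averaging the outcomes of games played out from
    `state`, sampling both players' moves from `policy`. The state object
    is assumed to expose copy(), legal_moves(), play(move), is_terminal()
    and outcome() (+1/-1 from the root player's perspective)."""
    total = 0.0
    for _ in range(num_rollouts):
        s = state.copy()
        while not s.is_terminal():
            moves = s.legal_moves()
            weights = [policy(s, a) for a in moves]
            s.play(random.choices(moves, weights=weights, k=1)[0])
        total += s.outcome()
    return total / num_rollouts
```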

  • Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search also improves over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play and the evaluations converge to the optimal value function. The strongest current Go programs are based on MCTS, enhanced by policies trained to predict human expert moves. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play. However, prior work has been limited to shallow policies or value functions based on linear combinations of input features.

  • We use a training pipeline to train the neural networks (Figure 1). We begin by training a supervised learning (SL) policy network $p_\sigma$ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work, we also train a fast policy $p_\pi$ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network $p_\rho$ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network $v_\theta$ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

  • Figure 1 | Neural network training pipeline and architecture.
    a, A fast rollout policy $p_\pi$ and a supervised learning (SL) policy network $p_\sigma$ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network $p_\rho$ is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (that is, win more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network $v_\theta$ is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set.
    b, Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position $s$ as its input, passes it through many convolutional layers with parameters $\sigma$ (SL policy network) or $\rho$ (RL policy network), and outputs a probability distribution $p_\sigma(a|s)$ or $p_\rho(a|s)$ over legal moves $a$, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters $\theta$, but outputs a scalar value $v_\theta(s')$ that predicts the expected outcome in position $s'$.

Supervised learning of policy networks

  • For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning. The SL policy network $p_\sigma(a|s)$ alternates between convolutional layers with weights $\sigma$ and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves $a$. The input $s$ to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly sampled state-action pairs $(s, a)$, using stochastic gradient ascent to maximize the likelihood of the human move $a$ selected in state $s$:
    $$\Delta\sigma \propto \frac{\partial \log p_\sigma(a|s)}{\partial \sigma}$$
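As a rough illustration of this supervised stage, the sketch below performs one stochastic-gradient step that maximizes the log-likelihood of expert moves (equivalently, minimizes cross-entropy). The `policy_net` module, tensor shapes and optimizer are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def sl_policy_step(policy_net, optimizer, states, expert_moves):
    """One update on a minibatch of (state, expert move) pairs.
    states: float tensor [batch, planes, 19, 19];
    expert_moves: long tensor [batch] with the index (0..360) of the human move."""
    logits = policy_net(states)                   # [batch, 361]
    loss = F.cross_entropy(logits, expert_moves)  # -log p_sigma(a|s), averaged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```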
  • We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only the raw board position and move history as inputs, compared with the state of the art of 44.4% from other research groups at the date of submission (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Figure 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy $p_\pi(a|s)$, using a linear softmax of small pattern features (see Extended Data Table 4) with weights $\pi$; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.
  • Figure 2 | Strength and accuracy of policy and value networks.
    a, Plot showing the playing strength of policy networks as a function of their training accuracy. Policy networks with 128, 192, 256 and 384 convolutional filters per layer were evaluated periodically during training; the plot shows the winning rate of AlphaGo using that policy network against the match version of AlphaGo.
    b, Comparison of evaluation accuracy between the value network and rollouts with different policies. Positions and outcomes were sampled from human expert games. Each position was evaluated by a single forward pass of the value network $v_\theta$, or by the mean outcome of 100 rollouts, played out using either uniform random rollouts, the fast rollout policy $p_\pi$, the SL policy network $p_\sigma$ or the RL policy network $p_\rho$. The mean squared error between the predicted value and the actual game outcome is plotted against the stage of the game (how many moves had been played in the given position).

Reinforcement learning of policy networks

  • The second stage of the training pipeline aims to improve the policy network by policy gradient reinforcement learning (RL). The RL policy network $p_\rho$ is identical in structure to the SL policy network, and its weights $\rho$ are initialized to the same values, $\rho = \sigma$. We play games between the current policy network $p_\rho$ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. We use a reward function $r(s)$ that is zero for all non-terminal time steps $t < T$. The outcome $z_t = \pm r(s_T)$ is the terminal reward at the end of the game from the perspective of the current player at time step $t$: +1 for winning and −1 for losing. Weights are then updated at each time step $t$ by stochastic gradient ascent in the direction that maximizes the expected outcome:
    $$\Delta\rho \propto \frac{\partial \log p_\rho(a_t|s_t)}{\partial \rho}\, z_t$$
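A minimal sketch of this policy-gradient update, assuming one finished self-play game is available as tensors of states, chosen actions and per-step outcomes $z_t$; the network and optimizer are hypothetical stand-ins rather than AlphaGo's actual training setup.

```python
import torch
import torch.nn.functional as F

def rl_policy_step(policy_net, optimizer, states, actions, outcomes):
    """REINFORCE-style update: push log-probabilities of played moves up when
    the game was won (z=+1) and down when lost (z=-1).
    states: [T, planes, 19, 19]; actions: long [T]; outcomes: float [T] of +-1."""
    log_probs = F.log_softmax(policy_net(states), dim=1)            # [T, 361]
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log p_rho(a_t|s_t)
    loss = -(chosen * outcomes).mean()   # gradient ascent on E[z * log p]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```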
  • We evaluated the performance of the RL policy network in game play, sampling each move $a_t \sim p_\rho(\cdot|s_t)$ from its output probability distribution. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi, a sophisticated Monte Carlo search program ranked at 2 amateur dan on KGS that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi. In comparison, the previous state of the art, based only on supervised learning of convolutional networks, won 11% of games against Pachi and 12% against the slightly weaker program Fuego.

Reinforcement learning of value networks

  • The final stage of the training pipeline focuses on position evaluation, estimating a value function $v^p(s)$ that predicts the outcome from position $s$ of games played by using policy $p$ for both players:
    $$v^p(s) = \mathbb{E}\left[z_t \mid s_t = s,\; a_{t\ldots T} \sim p\right]$$
  • Ideally, we would like to know the optimal value function under perfect play, $v^*(s)$; in practice, we instead estimate the value function $v^{p_\rho}$ of our strongest policy, the RL policy network $p_\rho$. We approximate the value function using a value network $v_\theta(s)$ with weights $\theta$, $v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$. This neural network has a similar architecture to the policy network, but outputs a single prediction instead of a probability distribution. We train the weights of the value network on state-outcome pairs $(s, z)$, using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value $v_\theta(s)$ and the corresponding outcome $z$:
    $$\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta}\,\bigl(z - v_\theta(s)\bigr)$$
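A minimal sketch of this regression step, assuming a `value_net` that maps a board tensor to a scalar in [-1, 1]; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def value_net_step(value_net, optimizer, states, outcomes):
    """One step minimizing the MSE between v_theta(s) and the game outcome z.
    states: [batch, planes, 19, 19]; outcomes: float [batch] of +-1."""
    preds = value_net(states).squeeze(1)   # [batch], tanh output in [-1, 1]
    loss = F.mse_loss(preds, outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```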
  • The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting. The problem is that successive positions are strongly correlated, differing by just one stone, yet the regression target is shared for the entire game. When trained on the KGS data set in this way, the value network memorized the game outcomes rather than generalizing to new positions, achieving a minimum MSE of 0.37 on the test set, compared with 0.19 on the training set. To mitigate this problem, we generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated. Training on this data set led to MSEs of 0.226 and 0.234 on the training and test set respectively, indicating minimal overfitting. Figure 2b shows the position evaluation accuracy of the value network, compared with Monte Carlo rollouts using the fast rollout policy $p_\pi$; the value function was consistently more accurate. A single evaluation of $v_\theta(s)$ also approached the accuracy of Monte Carlo rollouts using the RL policy network $p_\rho$, but using 15,000 times less computation.

Searching with policy and value networks

  • AlphaGo combines the policy and value networks in an MCTS algorithm (Figure 3) that selects actions by lookahead search.

  • Figure 3 | Monte Carlo tree search in AlphaGo.
    a, Each simulation traverses the tree by selecting the edge with the maximum action value $Q$, plus a bonus $u(P)$ that depends on a prior probability $P$ stored for that edge.
    b, The leaf node may be expanded; the new node is processed once by the policy network $p_\sigma$ and the output probabilities are stored as prior probabilities $P$ for each action.
    c, At the end of a simulation, the leaf node is evaluated in two ways: using the value network $v_\theta$; and by running a rollout to the end of the game with the fast rollout policy $p_\pi$, then computing the winner with function $r$.
    d, Action values $Q$ are updated to track the mean value of all evaluations $r(\cdot)$ and $v_\theta(\cdot)$ in the subtree below that action.

  • Each edge $(s, a)$ of the search tree stores an action value $Q(s, a)$, visit count $N(s, a)$ and prior probability $P(s, a)$. The tree is traversed by simulation (that is, descending the tree in complete games without backup), starting from the root state. At each time step $t$ of each simulation, an action $a_t$ is selected from state $s_t$,
    $$a_t = \arg\max_a \bigl(Q(s_t, a) + u(s_t, a)\bigr)$$

  • so as to maximize the action value plus a bonus
    $$u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$

  • that is proportional to the prior probability but decays with repeated visits, to encourage exploration. When the traversal reaches a leaf node $s_L$ at step $L$, the leaf node may be expanded. The leaf position $s_L$ is processed just once by the SL policy network $p_\sigma$. The output probabilities are stored as prior probabilities $P$ for each legal action $a$, $P(s, a) = p_\sigma(a|s)$. The leaf node is evaluated in two very different ways: first, by the value network $v_\theta(s_L)$; and second, by the outcome $z_L$ of a random rollout played out until terminal step $T$ using the fast rollout policy $p_\pi$; these evaluations are combined, using a mixing parameter $\lambda$, into a leaf evaluation $V(s_L)$
    $$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

  • At the end of simulation, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge,
    $$N(s, a) = \sum_{i=1}^{n} 1(s, a, i), \qquad Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} 1(s, a, i)\, V\!\left(s_L^i\right)$$

  • where $s_L^i$ is the leaf node from the $i$th simulation, and $1(s, a, i)$ indicates whether an edge $(s, a)$ was traversed during the $i$th simulation. Once the search is complete, the algorithm chooses the most visited move from the root position.
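A compact, hypothetical sketch of one such simulation, tying together the selection rule, the single policy-network expansion, the mixed leaf evaluation and the backup of mean values. The `state`, `policy_net`, `value_net` and `rollout` callables are placeholders; the real AlphaGo search is asynchronous and considerably more elaborate (see Methods).

```python
import math

class Node:
    def __init__(self):
        self.edges = {}      # move -> Edge
        self.children = {}   # move -> Node

class Edge:
    def __init__(self, prior):
        self.P, self.N, self.W = prior, 0, 0.0   # prior, visit count, total value
    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0

def simulate(node, state, policy_net, value_net, rollout, lam=0.5, c=5.0):
    """One simulation: descend by Q + u, expand the leaf once with the policy
    network, evaluate it as a mix of value network and rollout outcome, and
    back the value up the traversed path. Returns the leaf value from the
    perspective of the player to move at `node`."""
    if not node.edges:                                   # leaf: expand
        for a, p in policy_net(state).items():           # p_sigma(a|s) as priors
            node.edges[a] = Edge(p)
            node.children[a] = Node()
        return (1 - lam) * value_net(state) + lam * rollout(state)
    total = sum(e.N for e in node.edges.values())
    def score(a):                                        # Q(s,a) + u(s,a)
        e = node.edges[a]
        return e.Q + c * e.P * math.sqrt(total + 1) / (1 + e.N)  # +1: priors break ties
    a = max(node.edges, key=score)
    v = -simulate(node.children[a], state.play(a), policy_net, value_net,
                  rollout, lam, c)                       # sign flip: zero-sum game
    node.edges[a].N += 1
    node.edges[a].W += v                                 # Q is tracked as W / N
    return v
```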

  • It is worth noting that the SL policy network $p_\sigma$ performed better in AlphaGo than the stronger RL policy network $p_\rho$, presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move. However, the value function $v_\theta(s) \approx v^{p_\rho}(s)$ derived from the stronger RL policy network performed better in AlphaGo than a value function $v_\theta(s) \approx v^{p_\sigma}(s)$ derived from the SL policy network.

  • Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs and computes the policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.

Evaluating the playing strength of AlphaGo

  • To evaluate AlphaGo, we ran an internal tournament among variants of AlphaGo and several other Go programs, including the strongest commercial programs Crazy Stone and Zen, and the strongest open-source programs Pachi and Fuego. All of these programs are based on high-performance MCTS algorithms. In addition, we included the open-source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 s of computation time per move.
  • The results of the tournament (see Figure 4a) suggest that single-machine AlphaGo is much stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with four handicap stones (that is, free moves for the opponent); AlphaGo won 77%, 86% and 99% of handicap games against Crazy Stone, Zen and Pachi, respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs.
  • We also assessed variants of AlphaGo that evaluated positions using just the value network ($\lambda = 0$) or just rollouts ($\lambda = 1$) (see Figure 4b). Even without rollouts, AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. However, the mixed evaluation ($\lambda = 0.5$) performed best, winning ≥95% of games against the other variants. This suggests that the two position-evaluation mechanisms are complementary: the value network approximates the outcome of games played by the strong but impractically slow $p_\rho$, while the rollouts can precisely score and evaluate the outcome of games played by the weaker but faster rollout policy $p_\pi$. Figure 5 visualizes AlphaGo's evaluation of a real game position.
  • Figure 4 | Tournament evaluation of AlphaGo.
    a, Results of a tournament between different Go programs (see Extended Data Tables 6-11). Each program used approximately 5 s of computation time per move. To provide a greater challenge to AlphaGo, some programs (pale upper bars) were given four handicap stones (that is, free moves at the start of every game) against all opponents. Programs were evaluated on an Elo scale: a gap of 230 points corresponds to a 79% probability of winning, which roughly corresponds to one amateur dan rank advantage on KGS; an approximate correspondence to human ranks is also shown, with the horizontal lines indicating the KGS ranks achieved online by each program. Games against the human European champion Fan Hui were also included; these games used longer time controls. 95% confidence intervals are shown.
    b, Performance of AlphaGo, on a single machine, for different combinations of components. The version solely using the policy network does not perform any search.
    c, Scalability study of MCTS in AlphaGo with search threads and GPUs, using asynchronous search (light blue) or distributed search (dark blue), for 2 s per move.
  • Figure 5 | How AlphaGo (playing black, to move) selected its move in an informal game against Fan Hui. For each of the following statistics, the location of the maximum value is indicated by an orange circle.
    a, Evaluation of all successors $s'$ of the root position $s$, using the value network $v_\theta(s')$; estimated winning percentages are shown for the top evaluations.
    b, Action values $Q(s, a)$ for each edge $(s, a)$ in the tree from the root position $s$, averaged over value network evaluations only ($\lambda = 0$).
    c, Action values $Q(s, a)$, averaged over rollout evaluations only ($\lambda = 1$).
    d, Move probabilities directly from the SL policy network, $p_\sigma(a|s)$; reported as a percentage (if above 0.1%).
    e, Percentage frequency with which actions were selected from the root during simulations.
    f, The principal variation (path with maximum visit count) from AlphaGo's search tree. The moves are presented in a numbered sequence. AlphaGo selected the move indicated by the red circle; Fan Hui responded with the move indicated by the white square; in his post-game commentary, he preferred the move (labelled 1) predicted by AlphaGo.
  • Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan player and the winner of the 2013, 2014 and 2015 European Go championships. Over 5-9 October 2015, AlphaGo and Fan Hui competed in a formal five-game match. AlphaGo won the match 5 games to 0 (Figure 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go, a feat that was previously believed to be at least a decade away.
  • Figure 6 | Games from the match between AlphaGo and the European champion, Fan Hui.
    Moves are shown in a numbered sequence corresponding to the order in which they were played. Repeated moves on the same intersection are shown in pairs below the board. The first move number in each pair indicates when the repeat move was played, at an intersection identified by the second move number (see Supplementary Information).

Discussion

  • In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence's "grand challenges". We have developed, for the first time, effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination of supervised and reinforcement learning. We have introduced a new search algorithm that successfully combines neural network evaluations with Monte Carlo rollouts. Our program AlphaGo integrates these components together, at scale, in a high-performance tree search engine.
  • During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov, compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network, an approach that is perhaps closer to how humans play. Furthermore, while Deep Blue relied on a handcrafted evaluation function, AlphaGo's neural networks are trained directly from game play, purely through general-purpose supervised and reinforcement learning methods.
  • Go is exemplary in many ways of the difficulties faced by artificial intelligence: a challenging decision-making task, an intractable search space, and an optimal solution so complex that it appears infeasible to approximate directly with a policy or value function. The previous major breakthrough in computer Go, the introduction of MCTS, led to corresponding advances in many other domains, for example general game playing, classical planning, partially observed planning, scheduling and constraint satisfaction. By combining tree search with policy and value networks, AlphaGo has finally reached a professional level in Go, providing hope that human-level performance can now be achieved in other seemingly intractable artificial intelligence domains.

Methods

  • Problem setting. Many games of perfect information, such as chess, checkers, Othello, backgammon and Go, may be defined as alternating Markov games. In these games there is a state space $S$ (where the state includes an indication of the current player to play); an action space $A(s)$ defining the legal actions in any given state $s \in S$; a state transition function $f(s, a, \xi)$ defining the successor state after selecting action $a$ in state $s$ with random input $\xi$ (for example, dice); and finally a reward function $r^i(s)$ describing the reward received by player $i$ in state $s$. We restrict our attention to two-player zero-sum games, $r^1(s) = -r^2(s) = r(s)$, with deterministic state transitions, $f(s, a, \xi) = f(s, a)$, and zero rewards except at a terminal time step $T$. The outcome of the game, $z_t = \pm r(s_T)$, is the terminal reward at the end of the game from the perspective of the current player at time step $t$. A policy $p(a|s)$ is a probability distribution over legal actions $a \in A(s)$. A value function is the expected outcome if all actions for both players are selected according to policy $p$, that is, $v^p(s) = \mathbb{E}[z_t \mid s_t = s,\, a_{t\ldots T} \sim p]$. Zero-sum games have a unique optimal value function $v^*(s)$ that determines the outcome from state $s$ following perfect play by both players,
    $$v^*(s) = \begin{cases} z_T & \text{if } s = s_T, \\ \max_a\, -v^*(f(s, a)) & \text{otherwise.} \end{cases}$$
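This recursion can be written directly as a negamax search; the sketch below assumes a game object exposing `legal_moves`, `play`, `is_terminal` and `terminal_reward`, and is only practical for games small enough to search exhaustively.

```python
def optimal_value(state):
    """Exact v*(s) by exhaustive negamax: the value of a state is the best
    achievable negated value of any successor, from the perspective of the
    player to move; terminal states return their reward z_T directly."""
    if state.is_terminal():
        return state.terminal_reward()
    return max(-optimal_value(state.play(a)) for a in state.legal_moves())
```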
  • Prior work. The optimal value function can be computed recursively by minimax (or equivalently negamax) search. Most games are too large for exhaustive minimax tree search; instead, the game can be truncated by using an approximate value function $v(s) \approx v^*(s)$ in place of terminal rewards. Depth-first minimax search with alpha-beta pruning has achieved superhuman performance in chess, checkers and Othello, but it has not been effective in Go.
  • Reinforcement learning can learn to approximate the optimal value function directly from games of self-play. The majority of prior work has focused on a linear combination $v_\theta(s) = \phi(s) \cdot \theta$ of features $\phi(s)$ with weights $\theta$. Weights were trained using temporal-difference learning in chess, checkers and Go, or using linear regression in Othello and Scrabble. Temporal-difference learning has also been used to train neural networks to approximate the optimal value function, achieving superhuman performance in backgammon, and weak kyu-level performance in Go using convolutional networks.
  • An alternative to minimax search is Monte Carlo tree search (MCTS), which estimates the optimal value of interior nodes by a double approximation, $V^n(s) \approx v^{p^n}(s) \approx v^*(s)$. The first approximation, $V^n(s) \approx v^{p^n}(s)$, uses $n$ Monte Carlo simulations to estimate the value function of a simulation policy $p^n$. The second approximation, $v^{p^n}(s) \approx v^*(s)$, uses a simulation policy $p^n$ in place of minimax-optimal actions. The simulation policy selects actions according to a search control function $Q^n(s, a) + u(s, a)$, such as UCT, that selects children with higher action values, $Q^n(s, a) = -V^n(f(s, a))$, plus a bonus $u(s, a)$ that encourages exploration; or, in the absence of a search tree at state $s$, it samples actions from a fast rollout policy $p_\pi(a|s)$. As more simulations are executed and the search tree grows deeper, the simulation policy becomes informed by increasingly accurate statistics. In the limit, both approximations become exact and MCTS (for example, with UCT) converges to the optimal value function, $\lim_{n \to \infty} V^n(s) = \lim_{n \to \infty} v^{p^n}(s) = v^*(s)$. The strongest current Go programs are based on MCTS.
  • MCTS has previously been combined with a policy used to narrow the beam of the search tree to high-probability moves, or to bias the bonus term towards high-probability moves. MCTS has also been combined with a value function used to initialize action values in newly expanded nodes, or to mix Monte Carlo evaluation with minimax evaluation. By contrast, AlphaGo's use of value functions is based on truncated Monte Carlo search algorithms, which terminate rollouts before the end of the game and use a value function in place of the terminal reward. AlphaGo's position evaluation mixes full rollouts with truncated rollouts, resembling in some respects the well-known temporal-difference learning algorithm TD(λ). AlphaGo also differs from prior work by using slower but more powerful representations of the policy and value function; evaluating deep neural networks is several orders of magnitude slower than linear representations and must therefore occur asynchronously.
  • The performance of MCTS is to a large degree determined by the quality of the rollout policy. Prior work has focused on handcrafted patterns, or on learning rollout policies by supervised learning, reinforcement learning, simulation balancing or online adaptation; however, it is known that rollout-based position evaluation is frequently inaccurate. AlphaGo uses relatively simple rollouts and instead addresses the challenging problem of position evaluation more directly using value networks.
  • Search algorithm. To efficiently integrate large neural networks into AlphaGo, we implemented an asynchronous policy and value MCTS algorithm (APV-MCTS). Each node $s$ in the search tree contains edges $(s, a)$ for all legal actions $a \in A(s)$. Each edge stores a set of statistics,
    $$\{P(s, a),\; N_v(s, a),\; N_r(s, a),\; W_v(s, a),\; W_r(s, a),\; Q(s, a)\}$$
  • where $P(s, a)$ is the prior probability, $W_v(s, a)$ and $W_r(s, a)$ are Monte Carlo estimates of total action value, accumulated over $N_v(s, a)$ and $N_r(s, a)$ leaf evaluations and rollout rewards respectively, and $Q(s, a)$ is the combined mean action value for that edge. Multiple simulations are executed in parallel on separate search threads. The APV-MCTS algorithm proceeds as shown in Figure 3.
  • Selection (Figure 3a). The first, in-tree phase of each simulation begins at the root of the search tree and finishes when the simulation reaches a leaf node at time step $L$. At each of these time steps, $t < L$, an action is selected according to the statistics in the search tree, $a_t = \arg\max_a \bigl(Q(s_t, a) + u(s_t, a)\bigr)$, using a variant of the PUCT algorithm, $u(s, a) = c_{\text{puct}}\, P(s, a) \frac{\sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}$, where $c_{\text{puct}}$ is a constant determining the level of exploration; this search control strategy initially prefers actions with high prior probability and low visit count, but asymptotically prefers actions with high action value.
  • Evaluation (Figure 3c). The leaf position $s_L$ is added to a queue for evaluation $v_\theta(s_L)$ by the value network, unless it has previously been evaluated. The second, rollout phase of each simulation begins at leaf node $s_L$ and continues until the end of the game. At each of these time steps, $t \geq L$, actions are selected by both players according to the rollout policy, $a_t \sim p_\pi(\cdot|s_t)$. When the game reaches a terminal state, the outcome $z_t = \pm r(s_T)$ is computed from the final score.
  • Backup (Figure 3d). At each in-tree step $t \leq L$ of the simulation, the rollout statistics are updated as if the thread had lost $n_{vl}$ games, $N_r(s_t, a_t) \leftarrow N_r(s_t, a_t) + n_{vl}$; $W_r(s_t, a_t) \leftarrow W_r(s_t, a_t) - n_{vl}$; this virtual loss discourages other threads from simultaneously exploring the same variation. At the end of the simulation, the rollout statistics are updated in a backward pass through each step $t \leq L$, replacing the virtual losses by the outcome, $N_r(s_t, a_t) \leftarrow N_r(s_t, a_t) - n_{vl} + 1$; $W_r(s_t, a_t) \leftarrow W_r(s_t, a_t) + n_{vl} + z_t$. Asynchronously, a separate backward pass is initiated when the evaluation of the leaf position $s_L$ completes. The output of the value network, $v_\theta(s_L)$, is used to update the value statistics in a second backward pass through each step $t \leq L$, $N_v(s_t, a_t) \leftarrow N_v(s_t, a_t) + 1$, $W_v(s_t, a_t) \leftarrow W_v(s_t, a_t) + v_\theta(s_L)$. The overall evaluation of each state-action pair is a weighted average of the Monte Carlo estimates, $Q(s, a) = (1 - \lambda)\frac{W_v(s, a)}{N_v(s, a)} + \lambda \frac{W_r(s, a)}{N_r(s, a)}$, which mixes the value network and rollout evaluations with weighting parameter $\lambda$. All updates are performed lock-free.
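The virtual-loss bookkeeping can be sketched as follows. This is a simplified, single-threaded rendering with hypothetical edge objects; per-player sign handling and the lock-free machinery of the actual implementation are omitted.

```python
N_VL = 3  # virtual loss count (illustrative value)

def apply_virtual_loss(path):
    """Discourage other threads from following `path` (a list of edges)
    while this simulation is still in flight."""
    for edge in path:
        edge.N_r += N_VL
        edge.W_r -= N_VL

def backup_rollout(path, z):
    """Undo the virtual loss and credit the rollout outcome z (+1/-1)."""
    for edge in path:
        edge.N_r += 1 - N_VL
        edge.W_r += z + N_VL

def backup_value(path, v):
    """Separate, asynchronous backup of the value-network evaluation v."""
    for edge in path:
        edge.N_v += 1
        edge.W_v += v

def combined_q(edge, lam=0.5):
    """Mixed action value: (1 - lam) * Wv/Nv + lam * Wr/Nr."""
    qv = edge.W_v / edge.N_v if edge.N_v else 0.0
    qr = edge.W_r / edge.N_r if edge.N_r else 0.0
    return (1 - lam) * qv + lam * qr
```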
  • Expansion (Figure 3b). When an edge's visit count exceeds a threshold, $N_r(s, a) > n_{thr}$, the successor state $s' = f(s, a)$ is added to the search tree. The new node is initialized to $\{N(s', a) = N_r(s', a) = 0,\ W(s', a) = W_r(s', a) = 0,\ P(s', a) = p_\sigma(a|s')\}$, using a tree policy $p_\tau(a|s')$ (similar to the rollout policy but with more features, see Extended Data Table 4) to provide placeholder prior probabilities for action selection. The position $s'$ is also inserted into a queue for asynchronous GPU evaluation by the policy network. Prior probabilities are computed by the SL policy network $p_\sigma^\beta(\cdot|s')$ with a softmax temperature set to $\beta$; these replace the placeholder prior probabilities with an atomic update, $P(s', a) \leftarrow p_\sigma^\beta(a|s')$. The threshold $n_{thr}$ is adjusted dynamically to ensure that the rate at which positions are added to the policy queue matches the rate at which the GPUs evaluate the policy network. Positions are evaluated by both the policy network and the value network using a mini-batch size of 1, to minimize end-to-end evaluation time.
  • We also implemented a distributed APV-MCTS algorithm. This architecture consists of a single master machine that executes the main search, many remote worker CPUs that execute asynchronous rollouts, and many remote worker GPUs that execute asynchronous policy and value network evaluations. The entire search tree is stored on the master, which executes only the in-tree phase of each simulation. The leaf positions are communicated to the worker CPUs, which execute the rollout phase of the simulation, and to the worker GPUs, which compute network features and evaluate the policy and value networks. The prior probabilities of the policy network are returned to the master, where they replace the placeholder prior probabilities at the newly expanded node. The rewards from rollouts and the value network outputs are each returned to the master and backed up along the originating search path.
  • At the end of the search, AlphaGo selects the action with the maximum visit count; this is less sensitive to outliers than maximizing the action value. The search tree is reused at subsequent time steps: the child node corresponding to the played action becomes the new root node, the subtree below this child is retained along with all its statistics, and the remainder of the tree is discarded. The match version of AlphaGo continues searching during the opponent's move, and extends the search if the action maximizing visit count and the action maximizing action value disagree. Time controls were shaped to use most of the time in the middle game. AlphaGo resigns when its overall evaluation drops below an estimated 10% probability of winning the game, that is, $\max_a Q(s, a) < -0.8$.
  • AlphaGo does not employ the all-moves-as-first or rapid action value estimation heuristics used in most Monte Carlo Go programs; when using policy networks as prior knowledge, these biased heuristics do not appear to give any additional benefit. In addition, AlphaGo does not use progressive widening, dynamic komi or an opening book. The parameters used by AlphaGo in the match against Fan Hui are listed in Extended Data Table 5.
  • Rollout policy. The rollout policy $p_\pi(a|s)$ is a linear softmax policy based on fast, incrementally computed local pattern features, consisting of both "response" patterns around the previous move that led to state $s$, and "non-response" patterns around the candidate move $a$ in state $s$. Each non-response pattern is a binary feature matching a specific 3×3 pattern centred on $a$, defined by the colour (black, white, empty) and liberty count (1, 2, ≥3) of each adjacent intersection. Each response pattern is a binary feature matching the colour and liberty count in a 12-point diamond-shaped pattern centred on the previous move. In addition, a small number of handcrafted local features encode common-sense Go rules (see Extended Data Table 4). Similar to the policy network, the weights $\pi$ of the rollout policy were trained from 8 million positions from human games on the Tygem server, maximizing log likelihood by stochastic gradient descent. Rollouts execute at approximately 1,000 simulations per second per CPU thread on an empty board.
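A linear softmax over sparse binary pattern features can be sketched as below; the `active_features` extractor and `weights` vector are placeholders, since the actual pattern sets are those listed in Extended Data Table 4.

```python
import math

def rollout_policy(state, legal_moves, weights, active_features):
    """Linear softmax p_pi(a|s): each move's score is the sum of weights of
    the binary pattern features that fire for it; probabilities are the
    softmax of these scores. `active_features(state, a)` is assumed to
    return the indices of the active features for move `a`."""
    scores = [sum(weights[i] for i in active_features(state, a))
              for a in legal_moves]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for stability
    total = sum(exps)
    return {a: e / total for a, e in zip(legal_moves, exps)}
```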
  • Our rollout policy $p_\pi(a|s)$ contains less handcrafted knowledge than state-of-the-art Go programs. Instead, we exploit the higher-quality action selection within MCTS, which is informed both by the search tree and the policy network. We introduce a new technique that caches all moves from the search tree and then plays similar moves during rollouts, a generalization of the "last good reply" heuristic. At every step of the tree traversal, the most probable action is inserted into a hash table, along with the 3×3 pattern context (colour, liberty and stone counts) around both the previous move and the current move. At each step of a rollout, the pattern context is matched against the hash table; if a match is found, the stored move is played with high probability.
  • Symmetries. Prior work has exploited the symmetries of Go by using rotationally and reflectionally invariant filters in the convolutional layers. Although this may be effective in small neural networks, it actually hurts performance in larger networks, because it prevents the intermediate filters from identifying specific asymmetric patterns. Instead, we exploit symmetries at run-time by dynamically transforming each position $s$ using the dihedral group of eight reflections and rotations, $d_1(s), \ldots, d_8(s)$. In an explicit symmetry ensemble, a mini-batch of all 8 positions is passed to the policy or value network and computed in parallel. For the value network, the output values are simply averaged,
    $$\bar{v}_\theta(s) = \frac{1}{8}\sum_{j=1}^{8} v_\theta\bigl(d_j(s)\bigr)$$
  • For the policy network, the planes of output probabilities are rotated/reflected back into the original orientation and averaged together to provide an ensemble prediction, $\bar{p}_\sigma(\cdot|s)$; this approach was used in our raw network evaluations (see Extended Data Table 3). Instead, APV-MCTS uses an implicit symmetry ensemble that randomly selects a single rotation/reflection $j \in [1, 8]$ for each evaluation. We compute exactly one evaluation for that orientation only; in each simulation we compute the value of leaf node $s_L$ by $v_\theta(d_j(s_L))$ and allow the search procedure to average over these evaluations. Similarly, we compute the policy network for a single, randomly selected rotation/reflection,
    $$d_j^{-1}\bigl(p_\sigma(\cdot \mid d_j(s))\bigr)$$
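An explicit symmetry ensemble for the value network can be sketched with NumPy as follows; the `value_net` callable and the [C, 19, 19] plane layout are assumptions for illustration.

```python
import numpy as np

def dihedral_transforms(planes):
    """Yield the 8 rotations/reflections of a [C, 19, 19] feature stack."""
    for k in range(4):
        rotated = np.rot90(planes, k=k, axes=(1, 2))
        yield rotated
        yield rotated[:, :, ::-1]    # reflection of each rotation

def ensemble_value(value_net, planes):
    """Average v_theta over all 8 dihedral transforms of the position."""
    return float(np.mean([value_net(d) for d in dihedral_transforms(planes)]))
```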
  • Policy network: classification. We trained the policy network $p_\sigma$ to classify positions according to expert moves in the KGS data set. This data set contains 29.4 million positions from 160,000 games played by KGS 6 to 9 dan human players; 35.4% of the games are handicap games. The data set was split into a test set (the first million positions) and a training set (the remaining 28.4 million positions). Pass moves were excluded from the data set. Each position consisted of the raw board description $s$ and the move $a$ selected by the human. We augmented the data set to include all eight reflections and rotations of each position. Symmetry augmentation and input features were precomputed for each position. For each training step, we sampled a randomly selected mini-batch of $m$ samples from the augmented KGS data set, $\{s^k, a^k\}_{k=1}^m$, and applied an asynchronous stochastic gradient descent update to maximize the log likelihood of the action, $\Delta\sigma = \frac{\alpha}{m}\sum_{k=1}^m \frac{\partial \log p_\sigma(a^k|s^k)}{\partial \sigma}$. The step size $\alpha$ was initialized to 0.003 and was halved every 80 million training steps, without momentum terms, and the mini-batch size was $m = 16$. Updates were applied asynchronously on 50 GPUs using DistBelief; gradients older than 100 steps were discarded. Training took around 3 weeks for 340 million training steps.
  • Policy network: reinforcement learning. We further trained the policy network by policy gradient reinforcement learning. Each iteration consisted of a mini-batch of $n$ games played in parallel between the current policy network $p_\rho$ being trained and an opponent $p_{\rho^-}$ that uses parameters $\rho^-$ from a previous iteration, randomly sampled from a pool of opponents so as to increase the stability of training. Weights were initialized to $\rho = \rho^- = \sigma$. Every 500 iterations, we added the current parameters $\rho$ to the opponent pool. Each game $i$ in the mini-batch was played out until termination at step $T^i$, and then scored to determine the outcome $z_t^i = \pm r(s_{T^i})$ from each player's perspective. The games were then replayed to determine the policy gradient update, $\Delta\rho = \frac{\alpha}{n}\sum_{i=1}^n \sum_{t=1}^{T^i} \frac{\partial \log p_\rho(a_t^i|s_t^i)}{\partial \rho}\bigl(z_t^i - v(s_t^i)\bigr)$, using the REINFORCE algorithm with baseline $v(s_t^i)$ for variance reduction. On the first pass through the training pipeline, the baseline was set to zero; on the second pass, we used the value network $v_\theta(s)$ as a baseline, which provided a small performance boost. The policy network was trained in this way for one day, using 50 GPUs, on 10,000 mini-batches of 128 games.
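The opponent-pool mechanism can be sketched as below; `play_games` and `update` are stand-ins (for example, an update along the lines of the `rl_policy_step` sketch above), and the loop constants mirror those reported in the text.

```python
import copy
import random

def train_rl_policy(policy_net, play_games, update,
                    iterations=10_000, pool_refresh=500):
    """Self-play training with an opponent pool: every `pool_refresh`
    iterations a frozen copy of the current network joins the pool, and each
    mini-batch of games is played against a randomly sampled pool member."""
    opponent_pool = [copy.deepcopy(policy_net)]
    for it in range(iterations):
        opponent = random.choice(opponent_pool)
        games = play_games(policy_net, opponent)   # mini-batch of finished games
        update(policy_net, games)                  # REINFORCE-style update
        if (it + 1) % pool_refresh == 0:
            opponent_pool.append(copy.deepcopy(policy_net))
```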
  • Value network: regression. We trained a value network $v_\theta(s) \approx v^{p_\rho}(s)$ to approximate the value function of the RL policy network $p_\rho$. To avoid overfitting to the strongly correlated positions within games, we constructed a new data set of uncorrelated self-play positions. This data set consisted of over 30 million positions, each drawn from a unique game of self-play. Each game was generated in three phases: by randomly sampling a time step $U \sim \mathrm{unif}\{1, 450\}$ and sampling the first $t = 1, \ldots, U - 1$ moves from the SL policy network, $a_t \sim p_\sigma(\cdot|s_t)$; then sampling one move uniformly at random from the available moves, $a_U \sim \mathrm{unif}\{1, 361\}$ (repeatedly until $a_U$ is legal); then sampling the remaining sequence of moves, $t = U + 1, \ldots, T$, from the RL policy network, $a_t \sim p_\rho(\cdot|s_t)$, until the game terminates. Finally, the game is scored to determine the outcome $z_t = \pm r(s_T)$. Only a single training example $(s_{U+1}, z_{U+1})$ is added to the data set from each game. This data provides unbiased samples of the value function.
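A sketch of this three-phase position-generation procedure; the game object, the policy samplers and the scoring helper are hypothetical interfaces standing in for the paper's infrastructure.

```python
import random

def generate_value_example(new_game, sl_policy_sample, rl_policy_sample):
    """Produce one (state, outcome) pair for value-network training: play
    U-1 moves with the SL policy, one uniformly random legal move at step U,
    then the RL policy to the end; record s_{U+1} and the final result."""
    game = new_game()
    U = random.randint(1, 450)
    for _ in range(U - 1):
        if game.is_terminal():
            return None                           # rare early finish; discard
        game.play(sl_policy_sample(game))
    game.play(random.choice(game.legal_moves()))  # the random U-th move
    state_after_U = game.snapshot()               # s_{U+1}, the training state
    while not game.is_terminal():
        game.play(rl_policy_sample(game))
    z = game.outcome_for(state_after_U)           # +-1 from that player's view
    return state_after_U, z
```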
  • During the first two phases of generation, we sample from noisier distributions so as to increase the diversity of the data set. The training method was identical to SL policy network training, except that the parameter update was based on the mean squared error between the predicted values and the observed rewards,
    $$\Delta\theta = \frac{\alpha}{m}\sum_{k=1}^m \bigl(z^k - v_\theta(s^k)\bigr)\frac{\partial v_\theta(s^k)}{\partial \theta}$$
  • The value network was trained for one week on 50 GPUs, using 50 million mini-batches of 32 positions.
  • Features for policy/value networks. Each position $s$ was preprocessed into a set of 19×19 feature planes. The features we use come directly from the raw representation of the game rules, indicating the status of each intersection of the Go board: stone colour, liberties (adjacent empty points of the stone's chain), captures, legality, turns since the stone was played, and (for the value network only) the current colour to play. In addition, we use one simple tactical feature that computes the outcome of a ladder search. All features were computed relative to the current colour to play; for example, the stone colour at each intersection was represented as either player or opponent rather than black or white. Each integer feature value is split into multiple 19×19 planes of binary values (one-hot encoding). For example, separate binary feature planes are used to represent whether an intersection has 1 liberty, 2 liberties, ..., ≥8 liberties. The full set of feature planes is listed in Extended Data Table 2.
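A minimal sketch of the one-hot plane encoding for a single integer feature, assuming a 19×19 integer array of liberty counts (0 at empty intersections); the exact plane layout in AlphaGo is the one given in Extended Data Table 2.

```python
import numpy as np

def liberties_to_planes(liberty_counts, max_bucket=8):
    """Split an integer feature into binary 19x19 planes: plane i is 1 where
    the intersection has exactly i+1 liberties, and the last plane marks
    '>= max_bucket' liberties."""
    planes = np.zeros((max_bucket, 19, 19), dtype=np.float32)
    for i in range(max_bucket - 1):
        planes[i] = (liberty_counts == i + 1)
    planes[max_bucket - 1] = (liberty_counts >= max_bucket)
    return planes

# Example: an empty board has zero liberties everywhere, so every plane is 0.
planes = liberties_to_planes(np.zeros((19, 19), dtype=np.int32))
```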
  • Neural network architecture. The input to the policy network is a 19×19×48 image stack consisting of 48 feature planes. The first hidden layer zero-pads the input into a 23×23 image, then convolves $k$ filters of kernel size 5×5 with stride 1 with the input image and applies a rectifier nonlinearity. Each of the subsequent hidden layers 2 to 12 zero-pads the previous hidden layer into a 21×21 image, then convolves $k$ filters of kernel size 3×3 with stride 1, again followed by a rectifier nonlinearity. The final layer convolves 1 filter of kernel size 1×1 with stride 1, with a different bias for each position, and applies a softmax function. The match version of AlphaGo used $k = 192$ filters; Figure 2b and Extended Data Table 3 additionally show the results of training with $k$ = 128, 256 and 384 filters.
  • The input to the value network is also a 19×19×48 image stack, with an additional binary feature plane describing the current colour to play. Hidden layers 2 to 11 are identical to the policy network, hidden layer 12 is an additional convolutional layer, hidden layer 13 convolves 1 filter of kernel size 1×1 with stride 1, and hidden layer 14 is a fully connected linear layer with 256 rectifier units. The output layer is a fully connected linear layer with a single tanh unit.
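A rough PyTorch rendering of the two network bodies as described above. Padding is expressed through `padding=` arguments rather than explicit zero-padded images, the per-position bias of the final policy layer is omitted, and the softmax is left to the caller, so this should be read as an approximation rather than the exact AlphaGo architecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, planes=48, k=192):
        super().__init__()
        layers = [nn.Conv2d(planes, k, 5, padding=2), nn.ReLU()]   # layer 1: 5x5
        for _ in range(11):                                        # layers 2-12: 3x3
            layers += [nn.Conv2d(k, k, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, 1)]                             # final 1x1 conv
        self.body = nn.Sequential(*layers)

    def forward(self, x):                        # x: [batch, 48, 19, 19]
        return self.body(x).flatten(1)           # logits over the 361 points

class ValueNet(nn.Module):
    def __init__(self, planes=49, k=192):        # 48 planes + colour-to-play
        super().__init__()
        layers = [nn.Conv2d(planes, k, 5, padding=2), nn.ReLU()]   # layer 1: 5x5
        for _ in range(11):                                        # layers 2-12: 3x3
            layers += [nn.Conv2d(k, k, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, 1), nn.ReLU()]                  # 1x1 conv layer
        self.body = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 256),
                                  nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):                        # x: [batch, 49, 19, 19]
        return self.head(self.body(x))           # scalar prediction in [-1, 1]
```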
  • Evaluation. We evaluated the relative strength of computer Go programs by running an internal tournament and measuring the Elo rating of each program. We estimate the probability that program $a$ will beat program $b$ by a logistic function
    $$p(a \text{ beats } b) = \frac{1}{1 + \exp\bigl(c_{\text{elo}}\,(e(b) - e(a))\bigr)}$$
  • and estimate the ratings $e(\cdot)$ by Bayesian logistic regression, computed by the BayesElo program using the standard constant $c_{\text{elo}} = 1/400$. The scale was anchored to the BayesElo rating of the professional Go player Fan Hui (2,908 at the date of submission). All programs received a maximum of 5 s of computation time per move; games were scored using Chinese rules with a komi of 7.5 (extra points to compensate white for playing second). We also played handicap games with AlphaGo playing white against existing Go programs; for these games we used a non-standard handicap system in which komi was retained but black was given additional stones on the usual handicap points. Using these rules, a handicap of $K$ stones is equivalent to giving $K - 1$ free moves to black, rather than $K - 1/2$ free moves under standard no-komi handicap rules. We used these rules because AlphaGo's value network was trained specifically on self-play games with a komi of 7.5.
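For reference, a direct rendering of the rating model exactly as stated above (natural exponential with $c_{\text{elo}} = 1/400$); the ratings in the usage line are hypothetical and only show how the function is called.

```python
import math

def win_probability(elo_a, elo_b, c_elo=1 / 400):
    """Probability that program a beats program b under the logistic model above."""
    return 1.0 / (1.0 + math.exp(c_elo * (elo_b - elo_a)))

# Example call: win probability for a program rated 230 points above its opponent.
p = win_probability(2908, 2908 - 230)
```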
  • With the exception of distributed AlphaGo, each computer Go program was executed on its own single machine, with identical specifications, using the latest available version and the best hardware configuration supported by that program (see Extended Data Table 6). In Figure 4, the approximate ranking of computer programs is based on the highest KGS rank achieved by that program; however, the KGS version may differ from the publicly available version.
  • The match against Fan Hui was arbitrated by an impartial referee. Five formal games and five informal games were played with 7.5 komi, no handicap, and Chinese rules. AlphaGo won these games 5-0 and 3-2 respectively (Figure 6 and Extended Data Table 1). Time controls for the formal games were 1 h of main time plus three periods of 30 s byoyomi; time controls for the informal games were three periods of 30 s byoyomi. Time controls and playing conditions were chosen by Fan Hui in advance of the match; it was also agreed that the overall match outcome would be determined solely by the formal games. To approximately assess the relative rating of Fan Hui to computer Go programs, we appended the results of all ten games to our internal tournament results, ignoring differences in time controls.