[Paper reading] Mastering the game of Go with deep neural networks and tree search
2022-07-31 03:36:00 【meet the devil】
1 What problem does this paper solve?
In any game of perfect information, there exists an optimal value function $v^*(s)$ that determines the outcome of the game from state $s$. In principle, such games can be solved by recursively computing this optimal value function over a search tree. However, the search tree contains roughly $b^d$ possible sequences of moves, where $b$ is the breadth of the game (the number of legal moves per position) and $d$ is its depth (the length of the game). In large games the search space explodes (for Go, $b \approx 250$ and $d \approx 150$), so exhaustive search is infeasible.
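To make the $b^d$ figure concrete, a quick back-of-the-envelope calculation (an illustrative sketch, not from the paper) shows the order of magnitude involved:

```python
# Illustrative calculation of the b^d search-space size quoted above.
import math

b, d = 250, 150                   # approximate breadth and depth for Go
log10_go = d * math.log10(b)      # log10(250^150)
print(f"Go: roughly 10^{log10_go:.0f} possible game sequences")
```

This comes out to roughly $10^{360}$ sequences, which makes clear why exhaustive search is hopeless.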
The strongest Go programs at the time were based on MCTS, enhanced with policies trained to predict the moves of human experts. These policies narrow the search to a beam of high-probability moves and are used to sample actions during rollouts. However, prior work was limited to shallow policy or value functions based on linear combinations of input features. Convolutional neural networks have achieved great success in computer vision thanks to their powerful feature extraction, so this paper applies a similar architecture to Go: the board is represented as a $19 \times 19$ image, and convolutional layers are used to build a representation of the position.
In summary, the paper uses a value network to evaluate positions and a policy network to sample moves. These deep neural networks are trained by combining supervised learning from human expert games with reinforcement learning from games of self-play. Without any lookahead search, the neural networks already play at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games. The paper also introduces a new search algorithm that combines Monte Carlo simulation with the value and policy networks. Using this algorithm, the program AlphaGo achieved a 99.8% winning rate against other Go programs and defeated the human European Go champion 5 to 0.
2 What is the proposed method?
The neural network training pipeline, shown in the figure below, proceeds as follows:
First, a supervised learning (SL) policy network $p_\sigma$ is trained on moves from human expert games;
In parallel, a fast rollout policy $p_\pi$ is trained so that moves can be sampled quickly during simulations;
Next, a reinforcement learning (RL) policy network $p_\rho$ is trained, improving on the SL policy network $p_\sigma$ by optimizing the final outcome of games of self-play
(this shifts the policy network toward winning games rather than maximizing predictive accuracy);
Finally, a value network $v_\theta$ is trained to predict the winner of games played by the RL policy network against itself.
2.1 Supervised learning of the policy network
The first stage of the training pipeline initializes the policy network with supervised learning. The SL policy network $p_\sigma(a \mid s)$ alternates convolutional layers with nonlinear activations, with parameters $\sigma$, followed by a final softmax layer that outputs a probability distribution over all legal moves $a$. The input $s$ is a simple representation of the board state. The network is trained on randomly sampled state-action pairs $(s, a)$, using stochastic gradient ascent to maximize the likelihood of the human move $a$ selected in state $s$:
$$\Delta \sigma \propto \frac{\partial \log p_{\sigma}(a \mid s)}{\partial \sigma}$$
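As a concrete illustration of this update (a minimal sketch, not the paper's 13-layer convolutional network), the snippet below performs one gradient-ascent step on $\log p_\sigma(a \mid s)$ for a toy linear-softmax policy over the $19 \times 19$ move grid; the feature vector and expert-move index are made up for the example:

```python
import numpy as np

N_MOVES = 19 * 19
rng = np.random.default_rng(0)
sigma = rng.normal(scale=0.01, size=(N_MOVES, N_MOVES))   # toy policy parameters

def log_policy_grad(sigma, s, a):
    """Return log p_sigma(a|s) and its gradient w.r.t. sigma for a linear-softmax policy."""
    logits = sigma @ s
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()    # softmax over moves
    grad = -np.outer(p, s)                       # d log p(a|s) / d sigma ...
    grad[a] += s                                 # ... plus the indicator term for move a
    return np.log(p[a]), grad

s = rng.normal(size=N_MOVES)      # stand-in for the board feature representation
a = 72                            # index of the expert move in this state
lr = 0.1
_, grad = log_policy_grad(sigma, s, a)
sigma += lr * grad                # gradient ascent: Delta sigma ∝ d log p_sigma(a|s) / d sigma
```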
The authors trained a 13-layer SL policy network. Using all input features, it predicts human expert moves on a held-out test set with 57.0% accuracy; using only the raw board position and move history as input, accuracy is 55.7%. For comparison, the state of the art at the time was 44.4%.
Even small improvements in the prediction accuracy of the SL policy network translate into stronger play: the more accurate the network, the higher its winning rate, as shown in the figure below.
However, while larger networks achieve better accuracy, they are slower to evaluate during search. The authors therefore also trained a faster but less accurate rollout policy $p_\pi(a \mid s)$ with parameters $\pi$. It reaches 24.2% accuracy but takes only 2 μs to select a move, compared with 3 ms for the SL policy network.
2.2 Reinforcement learning of the policy network
The second stage of training improves the policy network by policy-gradient reinforcement learning. The RL policy network $p_\rho$ has the same architecture as the SL policy network, and its weights $\rho$ are initialized to the SL weights, i.e. $\rho = \sigma$. The authors play games between the current policy network $p_\rho$ and a randomly selected previous iteration of the policy network. Sampling opponents at random from this pool of past iterations prevents overfitting to the current policy and stabilizes training.
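A minimal sketch of the opponent-pool idea (the class and its methods are hypothetical illustrations, not AlphaGo's actual code): snapshots of earlier policy iterations are stored, and each self-play game draws its opponent at random from the pool:

```python
import copy
import random

class OpponentPool:
    """Keeps snapshots of past policy parameters and samples a random opponent."""

    def __init__(self):
        self.snapshots = []

    def add(self, policy_params):
        # Store a frozen copy of the current policy's parameters.
        self.snapshots.append(copy.deepcopy(policy_params))

    def sample(self):
        # Pick a random past iteration to play against, which helps
        # stabilize training by not always playing against the latest policy.
        return random.choice(self.snapshots)
```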
The reward function $r(s)$ is zero for all non-terminal time steps $t < T$. The outcome $z_t = \pm r(s_T)$ is the terminal reward at the end of the game from the perspective of the current player at time step $t$: +1 for a win and -1 for a loss. At each time step $t$, the weights are then updated by stochastic gradient ascent in the direction that maximizes the expected outcome:
$$\Delta \rho \propto \frac{\partial \log p_{\rho}\left(a_{t} \mid s_{t}\right)}{\partial \rho}\, z_{t}$$
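Continuing the toy linear-softmax sketch from section 2.1 (this reuses the hypothetical `log_policy_grad` helper defined there), the update amounts to scaling the log-likelihood gradient of every move the player made by the final outcome $z$ of that game:

```python
def reinforce_update(rho, trajectory, z, lr=0.01):
    """One REINFORCE-style update.

    trajectory: list of (state, action) pairs played by the current player
    in one self-play game; z is the final outcome (+1 win, -1 loss) from
    that player's perspective.
    """
    for s_t, a_t in trajectory:
        _, grad = log_policy_grad(rho, s_t, a_t)   # d log p_rho(a_t|s_t) / d rho
        rho = rho + lr * z * grad                  # Delta rho ∝ z_t * gradient
    return rho
```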
The performance of the RL policy network is evaluated by playing games in which each move is sampled from its output distribution, $a_t \sim p_\rho(\cdot \mid s_t)$. In head-to-head play, the RL policy network won more than 80% of games against the SL policy network.
2.3 Reinforcement learning of the value network
The final stage of the training pipeline focuses on position evaluation: estimating a value function $v^p(s)$ that predicts the outcome from position $s$ when both players follow policy $p$:
$$v^{p}(s)=\mathbb{E}\left[z_{t} \mid s_{t}=s,\ a_{t \ldots T} \sim p\right]$$
The authors approximate this value function with a value network $v_\theta(s)$ with parameters $\theta$. The value network has the same architecture as the policy network, but it outputs a single scalar prediction rather than a probability distribution. It is trained by regression on state-outcome pairs $(s, z)$, using stochastic gradient descent to minimize the mean squared error (MSE) between the prediction $v_\theta(s)$ and the corresponding outcome $z$:
$$\Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}\left(z-v_{\theta}(s)\right)$$
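This update rule is just gradient descent on the squared error $\tfrac{1}{2}(z - v_\theta(s))^2$. A minimal sketch with a toy tanh-squashed linear model standing in for the convolutional value network (the features and learning rate are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(scale=0.01, size=19 * 19)    # toy value-model parameters

def value_update(theta, s, z, lr=0.01):
    """One SGD step minimizing (z - v_theta(s))^2 / 2 for a tanh-linear value model."""
    v = np.tanh(theta @ s)                      # prediction in (-1, 1), like a game outcome
    dv_dtheta = (1.0 - v ** 2) * s              # chain rule through tanh
    theta = theta + lr * (z - v) * dv_dtheta    # Delta theta ∝ (z - v) * dv/dtheta
    return theta, v

s = rng.normal(size=19 * 19)                    # stand-in for board features
theta, v = value_update(theta, s, z=+1.0)
```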
Naively predicting game outcomes from a dataset of complete games leads to overfitting. The reason is that successive board positions within a game are strongly correlated, differing by just a single stone, while the regression target is the same for the entire game. Trained this way, the value network memorizes game outcomes rather than generalizing to new positions, reaching a minimum MSE of 0.37 on the test set versus 0.19 on the training set. To mitigate this, the authors generated a new self-play dataset of 30 million distinct positions, each sampled from a separate game played between the RL policy network and itself until the game terminated. Trained on this dataset, the MSEs on the training and test sets are 0.226 and 0.234 respectively, with much less overfitting than before.
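A sketch of this de-correlation trick; `play_self_play_game` is a hypothetical function that plays one RL-policy self-play game and returns its list of positions together with the final outcome:

```python
import random

def build_value_dataset(play_self_play_game, n_games):
    """Collect one (position, outcome) pair per self-play game to reduce correlation."""
    dataset = []
    for _ in range(n_games):
        positions, outcome = play_self_play_game()   # positions: list of board states
        t = random.randrange(len(positions))         # sample a single time step
        dataset.append((positions[t], outcome))      # exactly one (s, z) pair per game
    return dataset
```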
The figure below shows the accuracy of the value network's position evaluations. Compared with Monte Carlo rollouts using the fast rollout policy $p_\pi$, the value function is consistently more accurate. A single evaluation of $v_\theta(s)$ also approaches the accuracy of Monte Carlo rollouts using the RL policy network $p_\rho$, while requiring 15,000 times less computation.
2.4 Searching with the policy and value networks
AlphaGo combines the policy and value networks in an MCTS algorithm that selects moves by lookahead search.
Each edge $(s, a)$ of the search tree stores an action value $Q(s, a)$, a visit count $N(s, a)$, and a prior probability $P(s, a)$. The tree is traversed by simulation (i.e., descending the tree without backup until a leaf is reached), starting from the root state. At each time step $t$ of a simulation, an action $a_t$ is selected from state $s_t$ to maximize the sum of the action value and an exploration bonus:
$$a_{t}=\underset{a}{\operatorname{argmax}}\left(Q\left(s_{t}, a\right)+u\left(s_{t}, a\right)\right)$$
where
$$u(s, a) \propto \frac{P(s, a)}{1+N(s, a)}$$
This bonus is proportional to the prior probability but decays as the visit count increases, which encourages exploration. When the traversal reaches a leaf node $s_L$ at step $L$, the leaf node may be expanded. The leaf position is processed just once by the SL policy network $p_\sigma$, and the output probability for each legal move $a$ is stored as the prior probability $P(s, a)=p_{\sigma}(a \mid s)$.
A leaf node is evaluated in two different ways: first, by the value network $v_\theta(s_L)$; second, by the outcome $z_L$ of a random rollout played out to the terminal step $T$ with the fast rollout policy $p_\pi$. These two evaluations are combined into the leaf evaluation $V(s_L)$ using a mixing parameter $\lambda$:
$$V\left(s_{L}\right)=(1-\lambda)\, v_{\theta}\left(s_{L}\right)+\lambda\, z_{L}$$
At the end of each simulation, the action values and visit counts of all traversed edges are updated. Each edge accumulates the number of visits and the mean evaluation over all simulations that passed through it:
$$N(s, a)=\sum_{i=1}^{n} 1(s, a, i), \qquad Q(s, a)=\frac{1}{N(s, a)} \sum_{i=1}^{n} 1(s, a, i)\, V\!\left(s_{L}^{i}\right)$$
where $s_L^i$ is the leaf node of the $i$-th simulation and $1(s, a, i)$ indicates whether edge $(s, a)$ was traversed during the $i$-th simulation. Once the search is complete, the algorithm plays the most visited move from the root position.
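Putting the pieces of this section together, the sketch below shows one possible shape of the search loop: selection with $Q + u$, expansion with SL-policy priors, mixed leaf evaluation, and backup of visit counts and mean values. It is an illustrative simplification built on made-up interfaces (`next_state`, `policy_prior`, `value_net`, `rollout`), not AlphaGo's actual implementation, and it ignores terminal positions:

```python
C_PUCT = 5.0    # proportionality constant in u(s, a) ∝ P(s, a) / (1 + N(s, a))
LAMBDA = 0.5    # mixing parameter between value network and rollout outcome

class Node:
    def __init__(self, state, priors):
        self.state = state
        self.priors = priors                    # P(s, a) from the SL policy network
        self.children = {}                      # a -> Node
        self.N = {a: 0 for a in priors}         # visit counts N(s, a)
        self.W = {a: 0.0 for a in priors}       # accumulated leaf evaluations

    def Q(self, a):
        return self.W[a] / self.N[a] if self.N[a] else 0.0

    def u(self, a):
        return C_PUCT * self.priors[a] / (1 + self.N[a])

def simulate(root, next_state, policy_prior, value_net, rollout):
    """Run one simulation: select down the tree, expand a leaf, evaluate, back up."""
    node, path = root, []
    while True:
        a = max(node.priors, key=lambda m: node.Q(m) + node.u(m))   # selection
        path.append((node, a))
        if a not in node.children:                                  # unexpanded edge
            leaf_state = next_state(node.state, a)
            node.children[a] = Node(leaf_state, policy_prior(leaf_state))  # expansion
            v = (1 - LAMBDA) * value_net(leaf_state) + LAMBDA * rollout(leaf_state)
            break
        node = node.children[a]
    for node, a in path:                                            # backup
        node.N[a] += 1
        node.W[a] += v

def best_move(root):
    """After all simulations, play the most visited move at the root."""
    return max(root.N, key=root.N.get)
```

A full search would call `simulate` many times (until a simulation or time budget is exhausted) before choosing `best_move(root)`.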