Reinforcement learning - the incomplete observation problem and MCTS
2022-07-28 06:10:00 【Food to doubt life】
Preface
The material in this article is summarized from Deep Reinforcement Learning. If there are any mistakes, corrections are welcome.
Incomplete observation problem
In the reinforcement learning algorithms introduced in previous posts, the agent can observe the complete state of the environment, similar to the spectator view in a MOBA game, where the movements of both sides are visible on the whole map. In some settings, however, the agent can only observe part of the state. For example, in Honor of Kings, the fog-of-war (vision) mechanism means each player sees only a small portion of the map.
Let $o_t$ be the observation the agent receives at time $t$. If the policy network makes decisions based only on $o_t$, the results are usually poor. A better approach is to let the policy network base its decision on all observations up to time $t$, i.e. to feed the sequence $[o_1, o_2, \dots, o_t]$ into the policy network. Models such as RNNs, LSTMs, and Transformers can be used to process this kind of sequential data.

As shown in the figure above, take an RNN as an example. After observing the state $o_n$ at time $n$, the sequence $[o_1, o_2, \dots, o_n]$ is fed into $n$ CNNs; the outputs of the $n$ CNNs are processed by an RNN; the $n$-th output of the RNN is fed into a fully connected network, which outputs the probability of each action the agent can take.
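As a concrete illustration, here is a minimal PyTorch sketch of such a policy network, assuming image observations of shape 3x84x84 and a single CNN shared across time steps; the layer sizes and module names are illustrative, not the architecture from the book:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Policy network for partial observations: per-frame CNN -> RNN -> FC -> action probs."""
    def __init__(self, n_actions, feat_dim=128, hidden_dim=256):
        super().__init__()
        # A small CNN applied to every observation o_1 ... o_t (assumed shape: 3x84x84).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, feat_dim), nn.ReLU(),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # GRU as the RNN
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq):
        # obs_seq: (batch, T, 3, 84, 84) -- the sequence [o_1, ..., o_t]
        b, t = obs_seq.shape[:2]
        feats = self.cnn(obs_seq.flatten(0, 1)).view(b, t, -1)  # per-frame CNN features
        out, _ = self.rnn(feats)            # process the feature sequence
        logits = self.head(out[:, -1])      # use only the last RNN output
        return torch.softmax(logits, dim=-1)

probs = RecurrentPolicy(n_actions=4)(torch.randn(1, 5, 3, 84, 84))  # action probabilities
```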
MCTS
MCTS (Monte Carlo Tree Search) is a very important method in reinforcement learning, used in many algorithms including AlphaGo and AlphaGo Zero. The basic idea of MCTS is to simulate what might happen in the future so as to find the best action. It can be regarded as an intelligent form of enumeration, which also means that MCTS consumes a great deal of computing power.
In each simulation, MCTS selects an action, plays the game out to the end after executing it, and evaluates the action according to the outcome. MCTS consists of four steps: selection, expansion, simulation, and backup. It is worth mentioning that the policy network and the value network assist the MCTS computation, so both need to be trained in advance, before MCTS is run.
Selection
Suppose there are $n$ actions. The selection step of MCTS chooses an action with a higher chance of winning, using both the action value computed from past simulations and the score (probability) given by the policy network $\pi(a|s;\theta)$ to assess how good an action $a$ is. Specifically, the quality of an action is evaluated by the following formula:
$$\mathrm{score}(a)=Q(a)+\frac{\alpha}{1+N(a)}\pi(a|s;\theta)\tag{1.0}$$
where

- $\alpha$ is a hyperparameter;
- $N(a)$ is the number of times action $a$ has been selected; initially $N(a)=0$, and each time action $a$ is selected, $N(a)$ is incremented by 1;
- $Q(a)$ is the action value computed from the previous $N(a)$ simulations, determined mainly by the win rate and the value function; each time action $a$ is selected, $Q(a)$ is updated.
When action $a$ has not yet been selected, $Q(a)$ and $N(a)$ are both 0, and $\mathrm{score}(a)$ is determined entirely by the policy network. Once action $a$ has been selected many times, $Q(a)$ has an increasingly large influence on $\mathrm{score}(a)$; that is, the score depends more and more on the action value computed from past simulations (the actual outcomes, rather than the policy network's prediction). One role of the coefficient $\frac{\alpha}{1+N(a)}$ is to encourage MCTS to select actions that have been executed fewer times: if two actions have similar $Q(a)$ and $\pi(a|s;\theta)$, the one with the smaller $N(a)$ will be selected.
In the selection stage, MCTS picks the action with the largest value of formula (1.0) and has the agent execute this action in the simulator (not in the real environment).
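A minimal sketch of this selection rule, assuming $Q$, $N$, and the policy probabilities are kept as plain NumPy arrays and using an arbitrary value for $\alpha$:

```python
import numpy as np

def select_action(Q, N, pi, alpha=2.5):
    """Pick the action maximizing score(a) = Q(a) + alpha / (1 + N(a)) * pi(a|s)."""
    score = Q + alpha / (1.0 + N) * pi
    return int(np.argmax(score))

# Example: an unvisited action with a high prior beats a visited one with mediocre value.
Q  = np.array([0.3, 0.0, 0.1])   # value estimates from past simulations
N  = np.array([10,  0,   4  ])   # visit counts
pi = np.array([0.2, 0.6, 0.2])   # policy network probabilities
print(select_action(Q, N, pi))   # -> 1
```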
Expansion
After the agent has executed, in the simulator, the action chosen in the selection step, the expansion stage simulates the opponent's next action. MCTS uses the pre-trained policy network $\pi(a|s_t;\theta)$ to simulate the opponent, sampling an action according to the probabilities output by the network. It is worth noting that after an action is executed in the simulator, the simulator must return a new state, so the simulator needs to approximate the real state transition function. For most problems, such as autonomous driving, simulating the real state transition function is quite difficult. But for relatively simple problems such as MOBA games or Go, the board (or match) after the opponent's move simply is the next state in the simulator, so it suffices to train a policy network that imitates the real opponent in order to simulate the state transition.
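In code, simulating the opponent amounts to sampling from the policy network's output distribution; a small sketch, where `opponent_policy` is an assumed stand-in for the pre-trained network:

```python
import numpy as np

def expand(state, opponent_policy, rng=np.random.default_rng()):
    """Sample the opponent's action from pi(a|s; theta) rather than taking the argmax."""
    probs = opponent_policy(state)            # assumed to return a probability vector
    return int(rng.choice(len(probs), p=probs))

# e.g. a dummy opponent that plays uniformly over 3 actions
print(expand(state=None, opponent_policy=lambda s: np.array([1/3, 1/3, 1/3])))
```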
Simulation
In the selection and expansion stages, the agent and the opponent each choose an action and execute it in the simulator, producing a new state $s_{t+1}$. In the simulation stage, the two sides play out the rest of the match using two pre-trained policy networks, and the value of state $s_{t+1}$ is computed from this playout. Taking Go as an example, AlphaGo simulates the game with two policy networks at this stage until the game ends. If our side wins, the reward $r$ is 1; otherwise it is -1. In addition, AlphaGo introduces a pre-trained value network $V(s;\theta)$ to evaluate how good the state $s_{t+1}$ is. AlphaGo's evaluation $v(s_{t+1})$ of state $s_{t+1}$ is:
$$v(s_{t+1})=\frac{r+V(s_{t+1};\theta)}{2}$$
Because actions are sampled according to the probabilities output by the policy networks, the actions performed in each simulation from state $s_{t+1}$ may differ. MCTS runs the simulation from state $s_{t+1}$ many times, so, as shown in the figure below, one state $s_{t+1}$ accumulates multiple evaluations.
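A minimal sketch of one such evaluation, assuming a `rollout` helper that plays the game to the end with the two policy networks and returns $r=\pm 1$, and a value network `V`; both names are placeholders:

```python
def evaluate_state(s_next, rollout, V):
    """One evaluation of s_{t+1}: average the rollout outcome r and the value network output."""
    r = rollout(s_next)            # +1 if our side wins the simulated playout, -1 otherwise
    return (r + V(s_next)) / 2.0   # v(s_{t+1}) = (r + V(s_{t+1}; theta)) / 2

# Each call may give a different value because the rollout samples actions stochastically.
evaluations = [evaluate_state("s_t+1", rollout=lambda s: 1.0, V=lambda s: 0.4) for _ in range(3)]
print(evaluations)  # -> [0.7, 0.7, 0.7] with this deterministic dummy rollout
```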
Backup
As described in the previous section, there are many recorded evaluations of the states reached under the action $a_t$ at step $t$; the mean of these evaluations is the $Q(a)$ in formula (1.0). This value is used when formula (1.0) is computed in the next iteration (the four steps above make up one MCTS iteration, and MCTS repeats them many times).
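A sketch of how $Q(a)$ and $N(a)$ might be maintained during backup, keeping $Q(a)$ as the mean of all recorded evaluations; real implementations back values up along the whole search path, so this is only an illustration:

```python
def backup(stats, action, values):
    """Record the new evaluations and recompute Q(a) as the mean of all values seen for action a."""
    s = stats.setdefault(action, {"sum": 0.0, "n_values": 0, "N": 0})
    s["N"] += 1                        # one more simulation passed through this action
    s["sum"] += sum(values)            # accumulate every evaluation v(s_{t+1}) just produced
    s["n_values"] += len(values)
    s["Q"] = s["sum"] / s["n_values"]  # Q(a): mean of all recorded evaluations
    return s["Q"], s["N"]

stats = {}
print(backup(stats, action=2, values=[1.0, 0.0, -1.0]))  # -> (0.0, 1)
print(backup(stats, action=2, values=[1.0]))             # -> (0.25, 2)
```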
Decision making
After thousands of iterations, MCTS selects the action that was executed most often in state $s_t$, i.e. $a=\arg\max_a N(a)$. Having seen how MCTS works, it is not hard to understand why AlphaGo and AlphaGo Zero can beat humans: they essentially win by sheer computational force.
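The final decision is then a single argmax over the visit counts (the counts below are made up):

```python
import numpy as np

N = np.array([812, 1304, 96, 288])   # visit counts after thousands of MCTS iterations
best_action = int(np.argmax(N))      # a = argmax_a N(a) -> 1
```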
Training the policy network and the value network
As mentioned earlier, MCTS relies on a pre-trained value network and policy network. This section summarizes how AlphaGo and AlphaGo Zero train these two networks.
AlphaGo
AlphaGo trains its policy network as follows:

- Train the policy network $\pi(a|s;\theta)$ on supervised data.
- Denote the policy network fully trained in the first step by $\pi(a|s;\theta_{now})$, and randomly pick a previously trained policy network $\pi(a|s;\theta_{old})$ (with its parameters fixed). Let the two policy networks play a match against each other; if $\pi(a|s;\theta_{now})$ wins, the return $u$ is 1, otherwise -1. Then use REINFORCE to update the parameters of $\pi(a|s;\theta_{now})$. Repeat this process until convergence.
The above process yields a series of pairs $(s_t,u_t)$. Using this data, the value network $v(s;\theta)$ is trained with supervised learning, where the loss function is the MSE:
$$L(\theta)=\frac{1}{N}\sum_{t=1}^N\big[v(s_t;\theta)-u_t\big]^2$$
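A minimal PyTorch sketch of this regression step, assuming the collected $(s_t,u_t)$ pairs are already stacked into tensors and `value_net` maps a state to a scalar; the network shape and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def train_value_net(value_net, states, returns, lr=1e-3, epochs=10):
    """Supervised regression of v(s; theta) onto the game outcomes u with an MSE loss."""
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)   # v(s_t; theta)
        loss = loss_fn(pred, returns)          # (1/N) * sum_t [v(s_t) - u_t]^2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return value_net

# usage with dummy data: 16 states of dimension 8, outcomes u in {+1, -1}
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
train_value_net(net, torch.randn(16, 8), torch.randint(0, 2, (16,)).float() * 2 - 1)
```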
AlphaGo Zero
AlphaGo Zero uses a ready-made MCTS (for example, the one described above for AlphaGo) to simulate a match between two players. Starting from the current state $s_t$, after many MCTS simulations we obtain the execution counts of the $n$ actions, $N(1), N(2), \dots, N(n)$. Normalizing them gives $p_t$, i.e. $p_t=\mathrm{normalize}([N(1),N(2),\dots,N(n)])$.
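The normalization itself is straightforward; a small sketch (the published AlphaGo Zero additionally applies a temperature to the counts, which is omitted here):

```python
import numpy as np

def visit_counts_to_policy(counts):
    """p_t = normalize([N(1), ..., N(n)]): divide each count by the total."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts / counts.sum()

print(visit_counts_to_policy([812, 1304, 96, 288]))  # a probability vector summing to 1
```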
Suppose the two players simulated by MCTS finish a match after $m$ steps. We then obtain a trajectory for each of the two players:
$$(s_1,p_1,u_1),\ (s_2,p_2,u_2),\ \dots,\ (s_m,p_m,u_m)$$
For the winner, the returns are $u_1=u_2=\dots=u_m=1$; for the loser, $u_1=u_2=\dots=u_m=-1$. This data is used to update the policy network and the value network (note that the $s$ and $p$ in the winner's and loser's trajectories are different).
The policy network $\pi(a|s_t;\theta)$ is trained with supervised learning. Let $H(x,y)$ denote the cross-entropy loss; the loss function is then
$$L(\theta)=\frac{1}{N}\sum_{i=1}^N H\big(p_i,\pi(a|s_i;\theta)\big)$$
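A sketch of this loss in PyTorch; because the targets $p_i$ are soft distributions rather than hard labels, the cross entropy is written out explicitly. The `policy_net` here is an assumed stand-in that outputs action probabilities:

```python
import torch
import torch.nn as nn

def policy_loss(policy_net, states, p, eps=1e-8):
    """L(theta) = (1/N) * sum_i H(p_i, pi(.|s_i; theta)) with soft MCTS targets p_i."""
    pi = policy_net(states)                               # (N, n_actions) probabilities
    return -(p * torch.log(pi + eps)).sum(dim=-1).mean()  # cross entropy H(p, pi), averaged

# dummy usage: 4 states of dim 8, 3 actions, targets already normalized
net = nn.Sequential(nn.Linear(8, 3), nn.Softmax(dim=-1))
p = torch.tensor([[0.2, 0.5, 0.3]]).repeat(4, 1)
loss = policy_loss(net, torch.randn(4, 8), p)
loss.backward()   # gradients flow into the policy network as in ordinary supervised learning
```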
The value network is updated in the same way as in AlphaGo. The full training procedure repeats the following three steps until the algorithm converges (a compact sketch of the loop follows the list):
- Let MCTS play a full game against itself, collecting $m$ triples $(s_1,p_1,u_1), (s_2,p_2,u_2), \dots, (s_m,p_m,u_m)$.
- Use these triples to update the parameters of the policy network.
- Use these triples to update the parameters of the value network.
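A compact sketch of this loop, where `self_play`, `update_policy`, and `update_value` are assumed stubs for the MCTS self-play and the two update steps described above:

```python
def train_alphago_zero(self_play, update_policy, update_value, n_iters=1000):
    """Repeat: self-play one game with MCTS, then fit the policy and value networks to it."""
    for _ in range(n_iters):
        triples = self_play()                    # [(s_1, p_1, u_1), ..., (s_m, p_m, u_m)]
        states, targets_p, targets_u = zip(*triples)
        update_policy(states, targets_p)         # cross-entropy against the MCTS distributions p_i
        update_value(states, targets_u)          # MSE against the game outcomes u_i

# dummy usage with stub callables
train_alphago_zero(lambda: [("s1", [0.5, 0.5], 1.0)],
                   lambda s, p: None, lambda s, u: None, n_iters=1)
```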
From this it is easy to see that DeepMind has very deep pockets when it comes to compute.