This article is not a paper-reading note but a study note; it focuses on understanding and may be somewhat less rigorous.
AlphaGo paper guide:
- Mastering the game of Go with deep neural networks and tree search. Nature, 2016. https://www.nature.com/articles/nature16961
- Mastering the game of Go without human knowledge. Nature, 2017. https://www.nature.com/articles/nature24270
5. AlphaGo
5.1 Go introduction
How can Go be expressed in the language of reinforcement learning?
The standard Go board is a 19 × 19 grid with 361 intersections in total. The state can be represented as a 19 × 19 × 2 tensor.
The 2 planes encode the status of each intersection: black stone / white stone / empty (two binary bits can encode four states, which is more than enough).
In fact, AlphaGo's state is a 19 × 19 × 48 tensor, which records more information.
An action is placing a stone on an empty intersection. The action space is \(\mathbb{A}=\{1,2,3,...,361\}\); as the game goes on, the set of legal actions shrinks.
Go is extremely complex: every move has more than 100 possible actions and a game lasts hundreds of moves, so the number of possible games is on the order of \(10^{170}\). The brute-force search that Deep Blue used for chess does not work for Go, which is far more complex than chess.
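As a rough illustration, here is a minimal Python sketch of the simplified 19 × 19 × 2 encoding described above; the function name and data layout are illustrative, and the real AlphaGo input stacks 48 feature planes rather than 2.

```python
import numpy as np

BOARD_SIZE = 19

def encode_state(black_stones, white_stones):
    """Encode a Go position as the simplified 19x19x2 tensor described above.

    black_stones / white_stones: sets of (row, col) intersections that are occupied.
    Plane 0 marks black stones, plane 1 marks white stones; an empty point is 0 in both.
    (The real AlphaGo state uses 48 feature planes; this is only the toy 2-plane view.)
    """
    state = np.zeros((BOARD_SIZE, BOARD_SIZE, 2), dtype=np.float32)
    for r, c in black_stones:
        state[r, c, 0] = 1.0
    for r, c in white_stones:
        state[r, c, 1] = 1.0
    return state

# Example: one black stone on the center point, one white stone on a star point
s = encode_state({(9, 9)}, {(3, 3)})
print(s.shape)  # (19, 19, 2)
```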
5.2 Main idea
Training
- Imitation learning: initialize the policy network by imitating human moves from a large dataset of recorded games. This is supervised learning, i.e. multi-class classification.
- Policy network: train the policy network with the policy gradient algorithm, letting two policy networks play against each other and using the game outcome as the training signal.
- Value network: unlike Actor-Critic, the policy network from step 2 is trained first, and only then is it used to train the value network.
Execution (playing against a human player)
- Use Monte Carlo tree search (MCTS) rather than brute-force search; the policy network and the value network are used to guide the search.
5.3 Policy network
For simplicity, we use the newer AlphaGo Zero paper to understand the policy network. That network uses a 19 × 19 × 17 tensor to represent the state.
Explanation:
19 × 19 is the board size and 17 is the number of planes. One plane marks the current positions of the black stones: 1 where a black stone sits, 0 otherwise. Another 7 planes store the black positions of the previous 7 steps, giving 8 planes in total.
Similarly, the white stones need 8 planes.
As for why 8 planes, it can be regarded as a hyperparameter found by trial and error.
The last plane indicates whose turn it is: an all-1 matrix if it is black's turn, and an all-0 matrix if it is white's turn.
How should the policy network be designed?
A state tensor is fed to the policy network as input; a ResNet extracts features, followed by one or more fully connected layers, and the output is a 361-dimensional vector. To make the outputs valid probabilities that sum to 1, the activation of the output layer must be softmax.
This is the idea behind the AlphaGo Zero policy network.
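A minimal PyTorch sketch of such a policy network, with a small convolutional stack standing in for the ResNet trunk; the channel counts and depth here are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy policy network: 19x19x17 input -> probabilities over 361 moves.

    A small conv stack stands in for the ResNet trunk used in AlphaGo Zero;
    channel counts and depth are illustrative, not the published values.
    """
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(channels * 19 * 19, 361)

    def forward(self, x):                 # x: (batch, 17, 19, 19), channels-first
        h = self.trunk(x).flatten(1)
        return torch.softmax(self.head(h), dim=1)   # rows sum to 1

policy = PolicyNet()
probs = policy(torch.zeros(1, 17, 19, 19))
print(probs.shape, probs.sum().item())   # torch.Size([1, 361]) 1.0
```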
5.4 Imitation learning / Initialize policy network
This step appears in the 2016 version; the newer AlphaGo Zero does not use it.
At the beginning, the parameters of the policy network are initialized randomly. If imitation learning were skipped and self-play started directly, the agent would make purely random moves, and it would take a very long time for its behavior to become even slightly reasonable. That is the point of imitation learning.
a. Behavior Cloning
Behavior cloning is not reinforcement learning but a kind of imitation learning. The agent receives no reward and needs none; the policy network simply imitates human moves. Put more plainly, it is a classification / regression problem.
The difference from reinforcement learning is the absence of rewards.
b. Training network
- Observe the board state \(s_t\), which is a tensor.
- Feed the state to the policy network \(\pi\), which outputs a 361-dimensional vector \(p_t=[\pi(1|s_t;\theta),...,\pi(361|s_t;\theta)]\in(0,1)^{361}\); this vector gives the probability of playing at each of the 361 positions.
- Suppose the actual move of the human player is \(a_t^*=281\), i.e. play at position 281. After one-hot encoding it becomes a 361-dimensional vector \(y_t\) that is 1 at position 281 and 0 everywhere else.
- Use Loss = CrossEntropy(\(y_t,p_t\)) to measure the difference between the human move \(y_t\) and the prediction \(p_t\).
- Compute the gradient of the loss with respect to the network parameters \(\theta\) and update the policy network with gradient descent on \(\theta\).
From this process we can see that this is multi-class classification with 361 classes, very similar to image classification.
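A minimal sketch of one behavior-cloning update, assuming a `policy` module like the one sketched in 5.3; the function name and argument layout are illustrative.

```python
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy, optimizer, s_t, a_star):
    """One supervised update: push pi(.|s_t) toward the human move a_star.

    s_t    : (batch, 17, 19, 19) board tensors
    a_star : (batch,) integer move indices in [0, 360] played by humans
    `policy` is assumed to output move probabilities as in the sketch above.
    """
    p_t = policy(s_t)                                  # (batch, 361) probabilities
    loss = F.nll_loss(torch.log(p_t + 1e-8), a_star)   # cross-entropy vs. one-hot y_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch:
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# behavior_cloning_step(policy, optimizer, states_batch, human_moves_batch)
```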
A policy network trained with behavior cloning can in fact defeat amateur players. For example, if the current state \(s_t\) has appeared in the training data, the policy network imitates what human players did and predicts a good \(a_t\). As long as the training data are of high enough quality and quantity, the network can play reasonably well.
However, although behavior cloning is easy to do, its performance falls short of reinforcement learning:
When the current state \(s_t\) does not appear in the training data, the action \(a_t\) produced by the policy network will not be very good. In Go the effect of every move accumulates: after one strange move, the subsequent positions are even less likely to appear in the human game records, and this positive feedback makes things worse and worse.
Go has an enormous number of possible positions, and the training data used for behavior cloning cover only a tiny fraction of them, so behavior cloning easily fails.
5.5 Continuing to train the policy network with RL
Following the analysis at the end of 5.4, reinforcement learning is used to continue training the policy network so that the agent makes more reasonable actions even in states not covered by the training data.
Following the main idea in 5.2, AlphaGo lets two policy networks play against each other: one is the Player / Agent and the other is the Opponent / Environment. The former uses the latest model parameters, and after every game the rewards are used to update the Player's parameters. The latter acts as the environment; its parameters are not learned but are randomly sampled from an older set of parameters of the policy network.
a. Rewards
AlphaGo's reward is defined as follows:
Suppose a game ends after T steps. Before the game ends, the reward at every step is 0. When the game ends, the reward at the last step is +1 (win) or -1 (lose).
Recall the concept of the return:
\(u_t=\sum_{i=t}^T r_i\) (no discount)
If the agent wins, then \(u_1=u_2=...=u_T=+1\); conversely, if the agent loses, \(u_1=u_2=...=u_T=-1\).
In other words: if you win, every move is counted as good; if you lose, every move is counted as bad.
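A tiny sketch of this return computation (no discount, every step shares the final outcome); the function name is just for illustration.

```python
def undiscounted_returns(num_steps, agent_won):
    """u_1..u_T for AlphaGo's reward scheme: every step shares the final +1/-1 outcome."""
    outcome = 1.0 if agent_won else -1.0
    return [outcome] * num_steps

print(undiscounted_returns(3, agent_won=False))  # [-1.0, -1.0, -1.0]
```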
b. Policy Gradient
The algorithm used in the RL step to train the policy network is the policy gradient algorithm.
First, recall the policy gradient:
It is the gradient of the state-value function \(V(s;\theta)\) with respect to the policy network parameters \(\theta\); the detailed derivation is in Chapter 3 of these reinforcement learning notes: https://www.cnblogs.com/Roboduster/p/16445811.html
Policy gradient: \(\frac{\partial V(s;\theta)}{\partial\theta}=\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)\)
The action-value function is defined as:
\(Q_\pi(s_t,a_t)=\mathbb{E}[U_t|s_t,a_t]\), i.e. the expectation of the random variable \(U_t\).
If the value \(u_t\) of the random variable \(U_t\) is observed, \(u_t\) can be used to approximate the expectation of \(U_t\):
Approximate policy gradient: \(\frac{\partial V(s;\theta)}{\partial\theta}\approx\frac{\partial\log\pi(a_t|s_t;\theta)}{\partial\theta}\cdot u_t\)
After a game finishes, the value of \(u_t\) is known, and the policy gradient can then be computed.
c. Summary
Let the two policy networks play a game; the model parameters can be updated after every game.
Note that the Player's parameters are updated with policy gradients after each game, while the Opponent's parameters are randomly sampled from older versions of the parameters.
Each game yields one trajectory: \(s_1,a_1,s_2,a_2,...,s_T,a_T\).
With the trajectory and the rewards, the Player's policy network can be updated:
Approximate policy gradient: \(g_\theta=\sum_{t=1}^T\frac{\partial\log\pi(a_t|s_t;\theta)}{\partial\theta}\cdot u_t\)
Gradient ascent: \(\theta_{new}\leftarrow \theta_{old}+\beta\cdot g_\theta\)
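A minimal REINFORCE-style sketch of this update, assuming a `policy` module as in 5.3; the learning rate \(\beta\) lives inside the optimizer, and all names are illustrative.

```python
import torch

def policy_gradient_update(policy, optimizer, states, actions, u):
    """One policy-gradient update after a finished self-play game (sketch).

    states  : (T, 17, 19, 19) tensor of the Player's states s_1..s_T
    actions : (T,) LongTensor of moves a_1..a_T actually taken
    u       : scalar return, +1.0 for a win and -1.0 for a loss (same for every t)
    Gradient ascent on sum_t log pi(a_t|s_t; theta) * u_t, implemented by
    minimizing the negated objective with the optimizer.
    """
    log_p = torch.log(policy(states) + 1e-8)                   # (T, 361)
    log_p_taken = log_p[torch.arange(len(actions)), actions]    # log pi(a_t|s_t)
    loss = -(log_p_taken * u).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```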
To perform Monte Carlo tree search (MCTS), a value network is also needed.
5.6 Training the value network
a. Motivation
Unlike the value networks discussed earlier, the value network here approximates the state-value function \(V\), not \(Q\).
Recall the state-value function \(V_\pi(s) = \mathbb{E}[U_t|S_t=s]\); it is the conditional expectation of the return \(U_t\). As stated before, the return only takes the values +1 and -1.
The expectation is taken over the future actions \(A\) and future states \(S\), which eliminates all random variables except \(s_t\).
The closer \(V_\pi\) is to +1, the more likely we are about to win; conversely, the closer it is to -1, the more likely we are about to lose.
MCTS needs the help of the value network (the approximation of \(V_\pi\)).
In the original version of AlphaGo, the value network and the policy network are two separate neural networks, while the newer AlphaGo Zero lets the two share some convolutional layers.
The output of the policy network is a 361-dimensional vector, each dimension giving the probability of one action; the output of the value network is a scalar that reflects the current chance of winning.
The policy network and the value network are trained separately: first train the policy network \(\pi\), then use it to help train the value network \(V\).
This is another difference from Actor-Critic, which trains both networks at the same time.
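For reference, a toy PyTorch sketch of such a value network; the layer sizes are illustrative assumptions, and tanh is used only to keep the scalar output in (-1, 1) to match the ±1 targets.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Toy value network: 19x19x17 input -> one scalar in (-1, 1).

    In the 2016 AlphaGo this is a separate network; AlphaGo Zero instead
    shares the convolutional trunk with the policy head.
    """
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(channels * 19 * 19, 1)

    def forward(self, x):                  # x: (batch, 17, 19, 19)
        h = self.trunk(x).flatten(1)
        return torch.tanh(self.head(h))    # near +1: about to win, near -1: about to lose
```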
b. Training process
Let the two policy networks play against each other and update the network after each game: if the agent wins, every \(u_t\) equals +1; if it loses, every \(u_t\) equals -1.
This is where the policy network is used.
Next, similar to a regression problem, the loss is \(L =\sum_{t=1}^T\frac{1}{2}[v(s_t;w)-u_t]^2\).
\(L\) reflects how close the prediction is to the actual outcome; the smaller it is, the closer they are.
Then compute the gradient of \(L\) with respect to the model parameters \(w\) and perform gradient descent: \(w\leftarrow w-\alpha\cdot\frac{\partial L}{\partial w}\)
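A minimal sketch of this regression step, assuming a `value_net` like the toy one above and targets filled with the game outcome; the function name is illustrative.

```python
import torch

def value_regression_step(value_net, optimizer, states, u):
    """One gradient-descent step on L = sum_t 0.5 * (v(s_t; w) - u_t)^2 (sketch).

    states : (T, 17, 19, 19) positions from one finished self-play game
    u      : (T,) float targets, all +1.0 if that side won, all -1.0 if it lost
    """
    v = value_net(states).squeeze(-1)      # (T,) predicted win expectations
    loss = 0.5 * ((v - u) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```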
5.7 Monte Carlo tree search (MCTS)
When actually playing against professional players, AlphaGo does not play moves directly from the value network or the policy network; it uses Monte Carlo tree search, and the two neural networks only serve to assist the search.
a. How humans and machines choose moves
For human players, playing Go means making a reasonable decision by looking several moves ahead; the stronger the player, the more moves they can read. Back then, Deep Blue relied on brute-force search over all possible situations to defeat humans at chess.
AlphaGo also looks ahead, searching over possible future states to find the move with the best chance of winning.
b. Search idea
1. Randomly choose an action \(a\), not by uniform sampling but with probabilities that reflect how good each action is. The policy function is used to rule out most actions (those with low probability), since they are not good enough.
2. AlphaGo then looks ahead: the policy network plays against itself until the game ends, and the action is scored based on the outcome and the value function.
3. Repeat step 2 many times, so that each action accumulates many scores, which are summed or averaged.
4. Choose the action \(a\) with the highest score to actually play.
c. MCTS steps
There are four steps:
- Selection: choose an action according to its score (it is not actually played yet, only used for the next simulation);
- Expansion: simulate the opponent's move with the policy network, sampling a high-probability action from \(\pi\);
- Evaluation: give the action selected in step 1 a score. One part is the score \(v\) that the value network assigns to the resulting state; the other part is the reward \(r\) from a policy-network self-play rollout to the end of the game. The average \(\frac{r+v}{2}\) is assigned to the action;
- Backup: use \(\frac{r+v}{2}\) to update the action's score.
(1).Selection
In the current state \(s_t\), every action is given a score \(score(a)\) that reflects its quality:
- \(score(a) = Q(a)+\eta\frac{\pi(a|s;\theta)}{1+N(a)}\)
Explanation of the terms:
- \(Q(a)\) is the action value computed by the search; it is like a table recording the score of each action;
- \(\eta\) is a hyperparameter, tuned by hand;
- \(\pi\) is the score given by the policy network;
- \(N(a)\) is the number of times action \(a\) has been selected. If the action is good, its \(\pi\) value is large and it is likely to be chosen; but once \(a\) has been explored many times, \(N(a)\) is large, which keeps it from being explored over and over.
At the beginning \(Q(a)\) is 0, so the policy function \(\pi\) alone decides which action to explore; once the number of explorations grows large, the second term becomes small and the first term \(Q(a)\) dominates.
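A small vectorized sketch of this score, with `eta` standing in for \(\eta\); the uniform `pi_probs` in the usage lines is only a placeholder.

```python
import numpy as np

def selection_score(Q, N, pi_probs, eta=1.0):
    """score(a) = Q(a) + eta * pi(a|s;theta) / (1 + N(a)), over all 361 actions.

    Q, N     : (361,) arrays of accumulated action values and visit counts
    pi_probs : (361,) policy-network probabilities for the current state
    eta      : the hand-tuned hyperparameter from the formula above
    """
    return Q + eta * pi_probs / (1.0 + N)

# At the root Q = N = 0, so the score reduces to eta * pi and the policy
# network alone decides; as N(a) grows, the accumulated Q(a) takes over.
Q = np.zeros(361); N = np.zeros(361)
pi_probs = np.full(361, 1.0 / 361)
a_t = int(np.argmax(selection_score(Q, N, pi_probs)))  # action to simulate next
```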
As a running example: suppose there are three candidate actions; we choose the one with the highest score, \(a_t\), and continue downward. (It is not actually played, only simulated.)
(2).Expansion
Simulate the opponent with the policy network: supposing \(a_t\) has been played, sample an action from the policy function as the opponent's response. Here the highest-probability action is not chosen deterministically; every possible action is drawn with its own probability.
A small mathematical concept:
The state transition function \(p(s_{t+1}|s_t,a_t)\):
The opponent's move (the opponent plays the role of the environment) produces a new state \(s_{t+1}\). But we do not know how the opponent will respond, so the state transition function is unknown; the policy function \(\pi\) is used in its place.
For example, suppose the action with probability 0.4 is sampled, producing a state \(s_{t+1}\):
(3).Evaluation
Starting from state \(s_{t+1}\), let the policy network play against itself until the winner is decided (a fast rollout). A reward is obtained at the end of the game: \(r_T=+1\) for a win, \(r_T=-1\) for a loss.
The reward \(r_T\) is used to evaluate how good \(s_{t+1}\) is: a win raises its score, a loss lowers it.
At the same time, the value network (trained earlier) also scores \(s_{t+1}\): given the input \(s_{t+1}\), it outputs \(v(s_{t+1};w)\), which likewise reflects the odds of winning from that position.
Combining the two gives \(V(s_{t+1})=\frac{1}{2}v(s_{t+1};w)+\frac{1}{2}r_T\), which reflects how good \(s_{t+1}\) is.
(4).Backup
The simulation is repeated many times, so each state accumulates many records:
Each action \(a_t\) has many such child nodes; averaging all of its records gives a new value \(Q(a_t)\), which serves as the evaluation of action \(a_t\).
This \(Q\) value is what the first step (Selection) uses to pick the best \(a\).
\(Q\) is initialized to 0 and is updated after every simulation.
At this point you can revisit the first step.
After the four steps above have been repeated a great many times, how good each action is becomes obvious, and a real move can be decided.
The larger an action's \(Q\) value and \(\pi\) value, the more times it is selected, so \(N(a)\) itself reflects how good the action is. AlphaGo simply plays the action with the largest \(N\).
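A heavily compressed sketch of the four steps combined into one search, with the opponent sampling, fast rollout, and value evaluation abstracted into hypothetical callables (not AlphaGo's actual implementation).

```python
import numpy as np

def mcts_decide(s_t, policy_probs_fn, rollout_fn, value_fn, num_sims=1600, eta=1.0):
    """Compressed sketch of one full MCTS decision; all helper names are hypothetical.

    policy_probs_fn(s) -> pi(.|s; theta), a (361,) probability vector
    rollout_fn(s, a)   -> (s_next, r_T): sample the opponent's reply with the
                          policy network, self-play to the end, return +1/-1
    value_fn(s_next)   -> v(s_next; w) from the value network
    """
    Q = np.zeros(361)    # running average simulation value per action
    W = np.zeros(361)    # summed simulation values, averaged into Q
    N = np.zeros(361)    # visit counts
    pi = policy_probs_fn(s_t)

    for _ in range(num_sims):
        score = Q + eta * pi / (1.0 + N)          # 1. Selection
        a = int(np.argmax(score))
        s_next, r_T = rollout_fn(s_t, a)          # 2. Expansion + fast rollout
        v = value_fn(s_next)                      # 3. Evaluation
        W[a] += 0.5 * v + 0.5 * r_T               # 4. Backup: refresh the average
        N[a] += 1
        Q[a] = W[a] / N[a]

    return int(np.argmax(N))   # the real move: the most-visited action
```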
(5). Playing a real game
Every time the human player makes a move, AlphaGo runs a fresh Monte Carlo tree search: \(Q\) and \(N\) are reinitialized to 0, and the four steps above are repeated to simulate and choose the next move.
5.8 AlphaGo Zero vs. AlphaGo
A follow-up on this part: AlphaGo Zero is stronger than AlphaGo, and the main differences are:
- AlphaGo Zero does not use behavior cloning and never learns from human experience, yet it is stronger.
- When training the policy network, it uses MCTS and lets the policy network imitate the actions produced by MCTS.
Details of how this works will be added later.
x. Reference tutorials
- Video course: Deep Reinforcement Learning (complete) on Bilibili
- Original videos: https://www.youtube.com/user/wsszju
- Slides: https://github.com/wangshusen/DeepLearning