
Reinforcement Learning - Learning Notes 5 | AlphaGo

2022-07-06 20:44:00 climerecho

This article is a set of study notes rather than a paper-reading note. It focuses on intuition, so some parts may be slightly less rigorous.

AlphaGo paper references:

  1. Mastering the game of Go with deep neural networks and tree search. Nature, 2016. https://www.nature.com/articles/nature16961
  2. Mastering the game of Go without human knowledge. Nature, 2017. https://www.nature.com/articles/nature24270

5. AlphaGo

5.1 Introduction to Go

How is Go expressed in the language of reinforcement learning?

The standard Go board is a 19 × 19 grid with 361 intersections in total. The state is a 19 × 19 × 2 tensor.

The 2 means that each intersection's status is black / white / empty; as is well known, 2 binary bits can encode 4 states.

In fact, the original AlphaGo represents the state as a 19 × 19 × 48 tensor, which records more information.

An action places a stone on an empty intersection; the action space is \(\mathbb{A}=\{1,2,3,...,361\}\). As the game goes on, the set of available actions keeps shrinking.

Go is very complex: each move has more than a hundred candidate actions and a game lasts hundreds of moves, so the total number of possibilities is on the order of \(10^{170}\). The brute-force search that Deep Blue used for chess does not work for Go, because Go is far more complex than chess.

5.2 Main idea

Training

  1. Imitation learning: initialize the policy network by imitating human moves from a large collection of recorded games. This is supervised learning, i.e., multi-class classification.
  2. Policy network: train the policy network with the policy gradient algorithm, letting two policy networks play against each other and using the game outcome to train the network.
  3. Value network: unlike Actor-Critic, the policy network from step 2 is trained first, and then the trained policy network is used to train the value network.

Execution (playing against a human player)

  • Use Monte Carlo tree search (MCTS) instead of brute-force search; the policy network and the value network are used to guide the search.

5.3 Policy network

For simplicity, the policy network is explained using the relatively new AlphaGo Zero paper. That network uses a 19 × 19 × 17 tensor to represent the state.

Explanation:

19 × 19 is the board size and 17 is the number of matrices (feature planes). One matrix records the current positions of the black stones: an entry is 1 if a black stone sits on that intersection and 0 otherwise. Another 7 matrices store the black-stone positions of the previous 7 steps, for a total of 8 matrices.

Likewise, white also needs 8 matrices.

As for why 8, it can be regarded as a hyperparameter found by trial and error.

The last matrix indicates whose turn it is: an all-ones matrix if it is black's turn, and an all-zeros matrix if it is white's turn.
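
As a concrete illustration, here is a minimal sketch (NumPy) of stacking the 17 feature planes described above; the function name, argument layout, and plane ordering are illustrative assumptions, not the paper's exact encoding:

```python
import numpy as np

def encode_state(black_history, white_history, black_to_move):
    """Stack 17 feature planes for a 19 x 19 board.

    black_history / white_history: lists of the 8 most recent binary 19x19 arrays
    (index 0 = current position), with 1 where that color has a stone.
    """
    planes = list(black_history[:8])            # 8 planes: black, current + previous 7 steps
    planes += list(white_history[:8])           # 8 planes: white, current + previous 7 steps
    turn = np.ones((19, 19)) if black_to_move else np.zeros((19, 19))
    planes.append(turn)                         # 1 plane: side to move
    return np.stack(planes).astype(np.float32)  # shape (17, 19, 19)
```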

How should the policy network be designed?

A tensor representing the state is fed into the policy network as input. A ResNet extracts features, followed by one or more fully connected layers that output a 361-dimensional vector. For the outputs to be probabilities summing to 1, the activation function of the output layer must be softmax.

The AlphaGo Zero policy network follows this idea.
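
A minimal sketch of this architecture in PyTorch; the layer sizes and the two-layer convolutional feature extractor are small stand-ins for the real deep residual network:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        # Small stand-in for the ResNet feature extractor.
        self.features = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(channels * 19 * 19, 361)  # fully connected output layer

    def forward(self, x):                               # x: (batch, 17, 19, 19)
        h = self.features(x).flatten(1)
        return torch.softmax(self.head(h), dim=1)       # 361 move probabilities, summing to 1
```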

5.4 Imitation learning / Initialize policy network

This step appears in the 2016 version; the newer AlphaGo Zero does not use it.

At the beginning, the parameters of the policy network are initialized randomly. If imitation learning is skipped and self-play starts directly, the agent makes purely random moves, so it takes a very long time before its behavior becomes even slightly reasonable. That is the point of imitation learning.

a. Behavior Cloning

Behavior cloning is not reinforcement learning but a kind of imitation learning. The agent receives no reward and needs none; the policy network simply imitates human moves. Put more plainly, it is a classification/regression problem.

The difference from reinforcement learning is precisely the absence of rewards.

b. Training the network

  1. Observe the board state \(s_t\), which is a tensor.
  2. Feed the state into the policy network \(\pi\), which outputs a 361-dimensional vector \(p_t=[\pi(1|s_t,\theta),...,\pi(361|s_t,\theta)]\in(0,1)^{361}\), giving the probability of placing a stone at each of the 361 positions.
  3. Suppose the human player's actual move is \(a_t^*=281\), i.e., the stone is placed at position 281. After one-hot encoding it becomes a 361-dimensional vector \(y_t\) that is 1 at position 281 and 0 everywhere else.
  4. Use Loss = CrossEntropy(\(y_t,p_t\)) to measure the difference between the human move \(y_t\) and the prediction \(p_t\).
  5. Compute the gradient of the loss with respect to the network parameters θ and update the policy network with gradient descent.

Seen this way, this is multi-class classification with 361 classes, very similar to image classification.
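
A minimal sketch of one behavior-cloning update (PyTorch; `policy_net` is the sketch from 5.3, and the batching of human games into `states` / `human_moves` is assumed to happen elsewhere):

```python
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy_net, optimizer, states, human_moves):
    """One supervised update on a batch of human positions.

    states: (batch, 17, 19, 19) float tensor; human_moves: (batch,) move indices in [0, 360].
    """
    probs = policy_net(states)                               # p_t: 361 probabilities per sample
    loss = F.nll_loss(torch.log(probs + 1e-9), human_moves)  # cross-entropy against one-hot y_t
    optimizer.zero_grad()
    loss.backward()                                          # gradient of the loss w.r.t. theta
    optimizer.step()                                         # gradient-descent update of theta
    return loss.item()
```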

A policy network trained by behavior cloning can in fact beat amateur players. For example, if the current state \(s_t\) has appeared in the training data, the policy network will imitate human players when predicting \(a_t\). As long as the training data are of high enough quality and quantity, the network can play quite well.

However, although behavior cloning is easy to do, its results are not as good as reinforcement learning:

When the current state \(s_t\) does not appear in the training data, the action \(a_t\) made by the policy network will not be very good. In Go, the effect of every move accumulates: after one strange move, the subsequent positions are even less likely to appear in the training data collected from human games. This positive feedback makes the play worse and worse.

And Go has an enormous number of possible positions, so the training data used for behavior cloning cover only a tiny fraction of them, which makes failure easy.

5.5 Continuing to train the policy network with RL

Following the analysis at the end of 5.4, reinforcement learning is used to continue training the policy network, so that the agent makes more reasonable moves in states that never appeared in the training data.

Following the main idea in 5.2, AlphaGo uses two policy networks to play against each other. One is the Player / Agent, the other is the Opponent / Environment. The former uses the latest model parameters, and after every game its parameters are updated using the reward; the latter acts as the environment, its parameters are not learned, and an older set of parameters is randomly chosen as its policy-network parameters.

a. Rewards

AlphaGo's reward is defined as follows:

Suppose a game ends after T steps. Before the game ends, the reward of every step is 0. When the game ends, the reward of the last step is +1 (win) or -1 (lose).

Recall the concept of the return:

\(u_t=\sum_{i=t}^T r_i\) (no discount)

If the agent wins, then \(u_1=u_2=...=u_T=+1\); conversely, if the agent loses, \(u_1=u_2=...=u_T=-1\).

In other words: if you win, every move is treated as a good move; if you lose, every move is treated as a bad move.

b. Policy Gradient

The RL algorithm used to train the policy network in this step is the policy gradient algorithm.

First, recall the policy gradient:

It is the gradient of the state-value function \(V(s;\theta)\) with respect to the policy-network parameters θ. The detailed derivation is in Chapter 3 of these notes: https://www.cnblogs.com/Roboduster/p/16445811.html

Policy gradient: \(\frac{\partial V(s;\theta)}{\partial\theta}=\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)\)

The action-value function is defined as:

\(Q_\pi(s_t,a_t)=\mathbb{E}[U_t|s_t,a_t]\), the expectation of the random variable \(U_t\).

If the observed value \(u_t\) of the random variable \(U_t\) is available, \(u_t\) can be used to approximate the expectation of \(U_t\):

Policy gradient: \(\frac{\partial V(s;\theta)}{\partial\theta}\approx\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot u_t\)

After a game of chess finishes, the values of \(u_t\) are known, and the policy gradient can then be computed.

c. Summary

  • Let the two policy networks play a game; the model parameters can be updated after each game.

    Note that the Player's parameters are updated with the policy gradient after every game, while the Opponent's parameters are randomly sampled from older parameters.

  • A trajectory is obtained: \(s_1,a_1,s_2,a_2,...,s_T,a_T\)

  • With the trajectory of the game and the rewards, the Player's policy network can be updated:

    Approximate policy gradient: \(g_\theta=\sum_{t=1}^T\frac{\partial\log\pi(a_t|s_t;\theta)}{\partial\theta}\cdot u_t\)

  • Policy gradient ascent: \(\theta_{new}\leftarrow \theta_{old}+\beta\cdot g_\theta\)
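
A minimal sketch of this update for one finished self-play game (PyTorch; `policy_net` is the sketch above, trajectory collection is assumed to happen elsewhere, and all names are illustrative):

```python
import torch

def policy_gradient_update(policy_net, optimizer, states, actions, outcome):
    """REINFORCE-style update after one finished self-play game.

    states: (T, 17, 19, 19); actions: (T,) long tensor of move indices;
    outcome: +1 if the Player won, -1 if it lost, so u_t = outcome for every t.
    """
    probs = policy_net(states)                                                    # (T, 361)
    log_pi = torch.log(probs.gather(1, actions.unsqueeze(1)) + 1e-9).squeeze(1)   # log pi(a_t|s_t)
    returns = torch.full_like(log_pi, float(outcome))                             # u_1 = ... = u_T = ±1
    loss = -(log_pi * returns).sum()  # minus sign: gradient descent on this is gradient ascent on V
    optimizer.zero_grad()
    loss.backward()                   # gradient approximates g_theta
    optimizer.step()                  # theta_new <- theta_old + beta * g_theta (with an SGD optimizer)
```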

To do Monte Carlo tree search (MCTS), a value network is also needed.

5.6 Training the value network

a. Motivation

Unlike the value networks discussed before, the value network here approximates the state-value function V, not Q.

Recall the state-value function \(V_\pi(s) = \mathbb{E}[U_t|S_t=s]\): it is the conditional expectation of the return \(U_t\). As noted above, the return only takes the values +1 and -1.

The expectation is taken over the future actions A and future states S, which removes all random variables other than \(s_t\).

The closer \(V_\pi\) is to +1, the more likely the player is about to win; conversely, the closer it is to -1, the more likely the player is about to lose.

MCTS needs the assistance of the value network (which approximates \(V_\pi\)).

In the original version of AlphaGo, the value network and the policy network are two separate neural networks, while the newer AlphaGo Zero lets the two share some convolutional (Conv) layers.

The output of the policy network is a 361-dimensional vector, each entry being the probability of the corresponding action; the output of the value network is a scalar reflecting the current probability of winning.

The policy network and the value network are trained separately: first train the policy network \(\pi\), then use it to help train the value network V.

This is another difference from Actor-Critic, where the two networks are trained at the same time.

b. Training process

  • Let the two policy networks play against each other and update the value network after each game: if the game is won, every \(u_t\) equals +1; if it is lost, every \(u_t\) equals -1.

    This step is where the policy network is used.

  • Next, similar to a regression problem, the loss is \(L =\sum_{t=1}^T\frac{1}{2}[v(s_t;w)-u_t]^2\)

    L reflects how close the prediction is to the actual outcome: the smaller L is, the closer they are.

  • Then compute the gradient of L with respect to the model parameters w and do gradient descent: \(w\leftarrow w-\alpha\cdot\frac{\partial{L}}{\partial{w}}\)
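
A minimal sketch of this regression update (PyTorch; `ValueNet` is assumed to map a (T, 17, 19, 19) batch of states to a (T, 1) tensor of scores, an assumption not spelled out above):

```python
import torch

def value_net_update(value_net, optimizer, states, outcome):
    """Regress v(s_t; w) toward u_t = ±1 for all states of one finished game."""
    v = value_net(states).squeeze(1)           # (T,) predicted winning odds
    u = torch.full_like(v, float(outcome))     # u_t = +1 (win) or -1 (lose)
    loss = 0.5 * ((v - u) ** 2).sum()          # L = sum_t 1/2 [v(s_t; w) - u_t]^2
    optimizer.zero_grad()
    loss.backward()                            # dL/dw
    optimizer.step()                           # w <- w - alpha * dL/dw
```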

5.7 Monte Carlo tree search (MCTS)

When it actually comes to playing against professional players, AlphaGo uses neither the value network nor the policy network directly: it plays with Monte Carlo tree search, and the two neural networks only serve to assist the search.

a. The strategy for playing against a human

Human players choose moves by looking several steps ahead and then making a reasonable decision; the stronger the player, the more steps they can see. Back then, Deep Blue relied on brute-force search over all situations to make decisions and defeat humans.

AlphaGo also looks ahead: it searches possible future states to find out which action has the best chance of winning.

  1. Randomly pick an action a, not by uniform sampling but with probabilities that depend on how good or bad the action looks.

    The policy function is used to rule out most actions (those with low probability), which are not good enough.

  2. Next, AlphaGo looks ahead: the policy network plays against itself until the game ends, and the action is scored according to the outcome and the value function.

  3. Repeat step 2 many times, so each action accumulates many scores; sum them up or take an average.

  4. Pick the action a with the highest score and execute it.

c. The steps of MCTS

There are four steps:

  1. Selection: choose an action according to its score (it is not played yet, only used for the next simulation);
  2. Expansion: use the policy network to simulate the opponent's move, sampling an action according to \(\pi\) (high-probability moves are more likely to be drawn);
  3. Evaluation: give the action selected in step 1 a score. One part is the score v that the value network gives to the resulting state; the other part is the reward r obtained when the policy network finishes the game by self-play. The average \(\frac{v+r}{2}\) is recorded as a score for a.
  4. Backup: use \(\frac{v+r}{2}\) to update the action's score.

(1). Selection

In the current state \(s_t\), give every action a score score(a) that reflects how good the action is:

  • \(score(a) = Q(a)+\eta\frac{\pi(a|s;\theta)}{1+N(a)}\)

Explanation of the terms:

  • Q(a) is the score computed by the search so far, an action value; it is like a table that records the score of each action;
  • η is a hyperparameter, tuned by hand;
  • \(\pi\) is the score given by the policy network;
  • N(a) is the number of times action a has been selected. If the action is good, it is very likely to be chosen and the corresponding \(\pi\) value is large; but if a has already been explored many times, N is large, which prevents exploring it over and over.

At the beginning, all Q values are 0, so the policy function \(\pi\) alone decides which action to explore. After the number of explorations grows large, the second term becomes small and the first term Q(a) takes over.

Let us continue with an example: suppose there are three actions, and the one with the highest score, \(a_t\), is chosen to go down the tree. (It is not actually executed; this is only a simulation.)
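
A minimal sketch of this selection rule (Python; keeping Q, N, and the policy's probabilities in plain dictionaries is an illustrative choice, not the papers' data structure):

```python
def select_action(prior, Q, N, eta=1.0):
    """Pick the action with the highest score(a) = Q(a) + eta * pi(a|s) / (1 + N(a)).

    prior: dict action -> pi(a|s; theta); Q, N: dicts keyed by the same actions.
    """
    def score(a):
        return Q.get(a, 0.0) + eta * prior[a] / (1.0 + N.get(a, 0))
    return max(prior, key=score)
```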

(2). Expansion

Simulate the opponent with the policy network: assuming \(a_t\) has been played, sample an action from the policy function as the opponent's reply. Here it is not the action with the highest probability that is chosen; every possible action is drawn with its own probability.

A small mathematical concept:

State transition function \(p(s_{t+1}|s_t,a_t)\)

The opponent's move (the opponent is equivalent to the environment) produces a new state \(s_{t+1}\). We do not know how the opponent will respond, so the state transition function is unknown; the policy function \(\pi\) is used in its place.

For example, suppose the action with probability 0.4 is drawn, producing a state \(s_{t+1}\).
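
A minimal sketch of replacing the unknown transition with a sample from \(\pi\) (PyTorch; `policy_net` is the illustrative network from 5.3):

```python
import torch

def sample_opponent_move(policy_net, state):
    """Sample the opponent's reply from pi(.|s) instead of the unknown p(s'|s, a)."""
    probs = policy_net(state.unsqueeze(0)).squeeze(0)      # (361,) move probabilities
    return torch.multinomial(probs, num_samples=1).item()  # low-probability moves can still be drawn
```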

(3). Evaluation

Starting from state \(s_{t+1}\), let the policy network play against itself until the winner is decided (a fast rollout). A reward is obtained at the end of the game: \(r_T=+1\) for a win and \(r_T=-1\) for a loss.

The reward \(r_T\) is used to evaluate how good \(s_{t+1}\) is: a win raises the score, a loss lowers it.

At the same time, the value network (trained earlier) also scores \(s_{t+1}\): given \(s_{t+1}\) as input, it outputs a score \(v(s_{t+1};w)\), which likewise reflects the odds of winning from \(s_{t+1}\).

Combining the two evaluations gives \(V(s_{t+1})=\frac{1}{2}v(s_{t+1};w)+\frac{1}{2}r_T\), which reflects how good \(s_{t+1}\) is.

(4). Backup

The simulation is repeated many times, so each state accumulates many such records:

Every action \(a_t\) has many such child nodes. All the records under \(a_t\) are averaged, and the average becomes the new value \(Q(a_t)\), the evaluation of action \(a_t\).

This Q value is used in the first step (Selection) to choose the best a.

Q is initialized to 0 and is updated after every simulation.

At this point it helps to look back at the first step.

After the four steps above have been repeated many, many times, the quality of each action is already obvious, and a real move can be decided.

The larger an action a's Q value and \(\pi\) value are, the more often it is selected, so N(a) reflects how good the action is. AlphaGo simply plays the action with the largest N.
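
Putting Evaluation and Backup together, a minimal sketch (Python, continuing the dictionary bookkeeping from the Selection sketch; `v` and `r` are the value-network score and the rollout reward computed elsewhere):

```python
def backup(a, v, r, records, Q, N):
    """Record (v + r) / 2 under action a and refresh Q(a) as the running average."""
    records.setdefault(a, []).append(0.5 * v + 0.5 * r)  # V(s_{t+1}) = 1/2 v + 1/2 r_T
    Q[a] = sum(records[a]) / len(records[a])             # new Q(a_t): mean of all its records
    N[a] = N.get(a, 0) + 1                               # one more visit

def choose_real_move(N):
    """After many simulations, actually play the most-visited action."""
    return max(N, key=N.get)
```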

(5). Moves in the real game

After the human player makes a move, AlphaGo performs a fresh Monte Carlo tree search: Q and N are re-initialized to 0, and the four steps above are repeated to simulate and decide the next move.

5.8 AlphaGo Zero vs AlphaGo

To follow up on this part: AlphaGo Zero is stronger than AlphaGo, and the main differences are:

  1. AlphaGo Zero does not use behavior cloning and never learns from human experience, yet it is stronger.

  2. When training the policy network, MCTS is used, and the policy network is trained to imitate the actions chosen by MCTS.

    How exactly this is done will be covered in a later note.
