This article is not a paper-reading note, just a study note. It focuses on understanding and may therefore be a little less rigorous.
AlphaGo paper guide:
- Mastering the game of Go with deep neural networks and tree search. Nature, 2016. https://www.nature.com/articles/nature16961
- Mastering the game of Go without human knowledge. Nature, 2017. https://www.nature.com/articles/nature24270
5. AlphaGo
5.1 Introduction to Go
How do we describe Go in the language of reinforcement learning?
The standard Go board is a 19 × 19 grid with 361 intersection points. The state can be viewed as a 19 × 19 × 2 tensor.
The 2 channels encode the status of each position: black / white / empty (as is well known, 2 binary bits can encode 4 statuses).
In fact, AlphaGo's state is a 19 × 19 × 48 tensor that records additional information.
An action is to place a stone on an empty intersection, so the action space is \(\mathcal{A}=\{1,2,3,...,361\}\); as the game goes on, the action space keeps shrinking.
Go is very complex: each step has more than 100 possible actions and a game lasts hundreds of steps, so the total number of possible game sequences is on the order of \(10^{170}\). The brute-force search used by the chess program Deep Blue is not applicable to Go, because Go is far more complex than chess.
5.2 Main idea
Training
- Imitation learning: initialize the policy network by imitating human behavior learned from a large collection of recorded games. This is supervised learning, i.e., multi-class classification.
- Policy network: train the policy network with the policy gradient algorithm. Two policy networks play against each other, and the game outcomes are used to train the policy network.
- Value network: unlike Actor-Critic, the policy network from step 2 is trained first, and then the policy network is used to train the value network.
Execution (playing against a human player)
- Use Monte Carlo tree search (MCTS) rather than brute-force search; the policy network and value network are used to guide the search.
5.3 Policy network
For simplicity, we use the newer AlphaGo Zero paper to explain the policy network. This network uses a 19 × 19 × 17 tensor to represent the state.
Explanation:
19 × 19 is the board size, and 17 is the number of matrices (planes). One matrix represents the positions of the current black stones: 1 where a black stone is placed, 0 otherwise. Another 7 matrices store the black-stone configurations of the previous 7 steps, giving 8 matrices in total.
Similarly, the white stones also need 8 matrices.
As for why 8, it can be regarded as a hyperparameter found by trial and error.
The last matrix indicates whose turn it is: an all-ones matrix if it is black's turn, and an all-zeros matrix if it is white's turn.
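To make the encoding concrete, here is a minimal sketch in Python/NumPy of how such a 19 × 19 × 17 state tensor could be assembled; the helper name `build_state` and the argument layout are illustrative assumptions, and the exact plane ordering in the real AlphaGo Zero may differ.

```python
import numpy as np

def build_state(black_history, white_history, black_to_move):
    """Assemble a 19 x 19 x 17 state tensor.

    black_history / white_history: lists of the 8 most recent 19x19 binary
    boards (index 0 = current position); these names are hypothetical.
    """
    planes = []
    planes.extend(black_history[:8])   # 8 planes: black stones, current step + previous 7
    planes.extend(white_history[:8])   # 8 planes: white stones, current step + previous 7
    turn = np.ones((19, 19)) if black_to_move else np.zeros((19, 19))
    planes.append(turn)                # 1 plane: all ones if black to move, all zeros otherwise
    return np.stack(planes, axis=-1)   # shape (19, 19, 17)

# Example: empty board, black to move
empty = [np.zeros((19, 19)) for _ in range(8)]
print(build_state(empty, empty, black_to_move=True).shape)   # (19, 19, 17)
```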
How should the policy network be designed?
The state tensor is the input to the policy network. A ResNet extracts features, then one or more fully connected layers output a 361-dimensional vector. To make the outputs non-negative probabilities that sum to 1, the activation function of the output layer must be softmax.
This is the idea behind AlphaGo Zero's policy network.
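As a rough illustration of this design, here is a toy PyTorch sketch in which a shallow convolutional trunk stands in for the ResNet; the layer sizes are made up for readability and are not the paper's.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy policy network: conv trunk -> fully connected layer -> softmax over 361 moves."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(     # shallow stand-in for the ResNet feature extractor
            nn.Conv2d(17, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * 19 * 19, 361)

    def forward(self, x):               # x: (batch, 17, 19, 19)
        h = self.trunk(x).flatten(1)
        return torch.softmax(self.head(h), dim=1)   # probabilities over the 361 positions

net = PolicyNet()
probs = net(torch.zeros(1, 17, 19, 19))
print(probs.shape)                      # torch.Size([1, 361]), each row sums to 1
```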
5.4 Imitation learning / initializing the policy network
This step appears in the 2016 version; the newer AlphaGo Zero does not use it.
Initially, the parameters of the policy network are random. If we skipped imitation learning and went straight to self-play, the agent would make purely random moves, and it would take a very long time before its behavior became even remotely reasonable. That is the point of imitation learning.
a. Behavior Cloning
Behavior cloning is not reinforcement learning but a form of imitation learning. The agent receives no reward and needs none; the policy network simply imitates human moves. Put plainly, it is a classification problem.
The difference from reinforcement learning is precisely the absence of a reward.
b. Training the network
- Observe the state of the board \(s_t\); this is a tensor.
- Feed the state into the policy network \(\pi\). The network outputs a 361-dimensional vector \(p_t=[\pi(1|s_t;\theta),...,\pi(361|s_t;\theta)]\in(0,1)^{361}\), the probabilities of playing each of the 361 positions.
- Suppose the human player's actual move is \(a_t^*=281\), i.e., a stone is played at position 281. After one-hot encoding it becomes a 361-dimensional vector \(y_t\) that is 1 at position 281 and 0 everywhere else.
- Use Loss = CrossEntropy(\(y_t,p_t\)) to measure the difference between the human move \(y_t\) and the prediction \(p_t\).
- Compute the gradient of the loss with respect to the network parameters θ and update the policy network by gradient descent on θ.
Seen this way, the whole procedure is a multi-class classification with 361 classes, very similar to image classification; a minimal sketch of one training step follows.
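Assuming the toy `PolicyNet` / `net` from the sketch above and a hypothetical mini-batch of (state, human move) pairs, one behavior-cloning step could look like this; the cross-entropy is computed as the negative log-likelihood of the human move under the predicted probabilities.

```python
import torch
import torch.nn.functional as F

# Hypothetical mini-batch: board states s_t and the human moves a_t* (integers 0..360)
states = torch.zeros(32, 17, 19, 19)
human_moves = torch.randint(0, 361, (32,))

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)   # `net` is the PolicyNet above

probs = net(states)                                 # p_t: (32, 361) softmax probabilities
# Cross-entropy between the one-hot human move y_t and the prediction p_t
loss = F.nll_loss(torch.log(probs + 1e-8), human_moves)

optimizer.zero_grad()
loss.backward()                                     # gradient of the loss w.r.t. theta
optimizer.step()                                    # gradient descent update of theta
```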
A policy network trained by behavior cloning can in fact beat amateur players. For example, if the current state \(s_t\) has appeared in the training data, the policy network will imitate human players when predicting \(a_t\). As long as the training data is of high enough quality and quantity, the network plays reasonably well.
However, although behavior cloning is easy to do, its results are not as good as reinforcement learning:
When the current state \(s_t\) does not appear in the training data, the action \(a_t\) produced by the policy network will not be very good. Moreover, Go accumulates the effect of every move: after one strange move, the subsequent positions become even less likely to appear in the records of human games. This positive feedback makes things worse and worse.
And the number of possible Go positions is enormous; the training data used by behavior cloning covers only a tiny fraction of them, so it fails easily.
5.5 Continuing to train the policy network with RL
Following the analysis at the end of 5.4, reinforcement learning is used to continue training the policy network, so that the agent also makes reasonable moves in states that do not exist in the training data.
Following the main idea in 5.2, AlphaGo lets two policy networks play against each other: one is the Player / Agent, the other is the Opponent / Environment. The former uses the latest model parameters, which are updated from the rewards after every game; the latter plays the role of the environment, its parameters are not learned, and an older set of parameters is randomly chosen as its policy network parameters.
a. Rewards
AlphaGo's reward is defined as follows:
Suppose a game ends after T steps. Before the game ends, the reward at every step is 0. When the game ends, the reward at the last step is +1 (win) or -1 (lose).
Recall the concept of the return:
\(u_t=\sum_{i=t}^T r_i\) (no discount)
If the agent wins, then \(u_1=u_2=...=u_T=+1\); conversely, if the agent loses, \(u_1=u_2=...=u_T=-1\).
In other words: if you win, every move is treated as a good move; if you lose, every move is treated as a bad one.
b. Policy Gradient
The algorithm used to train the policy network in the RL step is the policy gradient algorithm.
First, recall the policy gradient:
It is the gradient of the state-value function \(V(s;\theta)\) with respect to the policy network parameters θ. The detailed derivation is in Chapter 3 of these reinforcement learning notes: https://www.cnblogs.com/Roboduster/p/16445811.html
Policy gradient: \(\frac{\partial V(s;\theta)}{\partial\theta}=\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)\)
The action-value function is defined as:
\(Q_\pi(s_t,a_t)=\mathbb{E}[U_t|s_t,a_t]\), the expectation of the random variable \(U_t\).
If we observe the value \(u_t\) of the random variable \(U_t\), we can use \(u_t\) to approximate the expectation of \(U_t\):
Policy gradient ≈ \(\frac{\partial\log\pi(a_t|s_t;\theta)}{\partial\theta}\cdot u_t\)
After a game finishes, we know the values \(u_t\) and can therefore compute the policy gradient.
c. Summary
Let the two policy networks play a game. The model parameters can be updated after every game.
Note that the Player's parameters are updated with the policy gradient after each game, while the Opponent's parameters are sampled randomly from old parameters.
We obtain one trajectory: \(s_1,a_1,s_2,a_2,...,s_T,a_T\).
With the game trajectory and the rewards, we can update the Player's policy network:
Approximate policy gradient: \(g_\theta=\sum_{t=1}^T\frac{\partial\log\pi(a_t|s_t;\theta)}{\partial\theta}\cdot u_t\)
Policy gradient ascent: \(\theta_{new}\leftarrow \theta_{old}+\beta\cdot g_\theta\)
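A minimal sketch of one such update, assuming the outcome of a finished game is already known; `play_one_game` is a hypothetical self-play helper, not something defined in this note.

```python
import torch

def policy_gradient_update(player_net, optimizer, states, actions, outcome):
    """One update after a finished game: u_t = outcome (+1 win / -1 lose) for all t.

    states:  (T, 17, 19, 19) tensor of the Player's observations s_t
    actions: (T,) long tensor of the moves a_t that were played
    outcome: +1.0 or -1.0, the undiscounted return shared by every step
    """
    probs = player_net(states)                                        # (T, 361)
    log_pi = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)
    # Gradient ascent on sum_t log pi(a_t|s_t; theta) * u_t == descent on its negative
    loss = -(log_pi * outcome).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage (play_one_game is a hypothetical self-play helper returning the trajectory):
# states, actions, outcome = play_one_game(player_net, opponent_net)
# policy_gradient_update(player_net, optimizer, states, actions, outcome)
```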
To perform Monte Carlo tree search (MCTS), a value network is also needed.
5.6 Training the value network
a. Motivation
Unlike the value networks seen earlier, the value network here approximates the state-value function V, not Q.
Recall the state-value function \(V_\pi(s) = \mathbb{E}[U_t|S_t=s]\), the conditional expectation of the return \(U_t\). As noted above, the return is only +1 or -1.
The expectation is taken over the future actions A and future states S, which integrates out all random variables except \(s_t\).
The closer \(V_\pi\) is to +1, the more likely we are to win; conversely, the closer it is to -1, the more likely we are to lose.
MCTS needs the assistance of the value network (which approximates \(V_\pi\)).
In the original version of AlphaGo, the value network and the policy network are two separate neural networks. The newer AlphaGo Zero lets the two share some convolutional layers.
The output of the policy network is a 361-dimensional vector, each dimension giving the probability of an action; the output of the value network is a scalar reflecting the current chance of winning.
The policy network and value network are trained separately: first train the policy network \(\pi\), then use it to train the value network V.
This is another difference from Actor-Critic, which trains both networks at the same time.
b. Training process
Let the two policy networks play against each other and update the network after each game. If the agent wins, every \(u_t\) equals +1; if it loses, every \(u_t\) equals -1.
This is where the policy network is used.
Next, similar to a regression problem, the loss is: \(L =\sum_{t=1}^T\frac{1}{2}[v(s_t;w)-u_t]^2\)
L reflects how close the prediction is to the actual outcome; the smaller the loss, the closer they are.
Then compute the gradient of L with respect to the model parameters w and do gradient descent: \(w\leftarrow w-\alpha\cdot\frac{\partial L}{\partial w}\)
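A minimal sketch of this regression step; `ValueNet` is a hypothetical scalar-output counterpart of the toy policy network above, and the trajectory data is a placeholder.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Toy value network: conv trunk -> fully connected layer -> scalar in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(17, 64, kernel_size=3, padding=1), nn.ReLU())
        self.head = nn.Linear(64 * 19 * 19, 1)

    def forward(self, x):
        return torch.tanh(self.head(self.trunk(x).flatten(1))).squeeze(1)

value_net = ValueNet()
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# One game's data: every state gets the same target u (+1 if the game was won, -1 if lost)
states = torch.zeros(40, 17, 19, 19)      # placeholder trajectory s_1 .. s_T
u = torch.full((40,), -1.0)               # placeholder outcome: a lost game

loss = 0.5 * ((value_net(states) - u) ** 2).sum()   # L = sum_t 1/2 (v(s_t;w) - u_t)^2
opt.zero_grad()
loss.backward()
opt.step()                                # gradient descent on w
```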
5.7 Monte Carlo tree search (MCTS)
When it actually plays against professional players, AlphaGo does not pick moves directly from the value network or the policy network; it uses Monte Carlo tree search, and the two neural networks are only used to assist the search.
a. How humans and machines choose moves
A human player makes a reasonable decision by looking several moves ahead, and stronger players look further ahead. Back then, Deep Blue relied on brute-force search over all positions to make its decisions and defeat humans.
AlphaGo also looks ahead, searching possible future states to find the action with the best chance of winning.
b. Search idea
Randomly choose an action a, not by uniform sampling but with probabilities determined by how good the action is.
Use the policy function to rule out most actions (those with low probability); these actions are not good enough.
Next AlphaGo looks ahead: the policy network plays against itself until the game ends, and the action is scored according to the outcome and the value function.
Repeat the lookahead many times, so that each action accumulates many scores; add them up or take their average.
Choose the action a with the highest score and execute it.
c. MCTS steps
There are four steps:
- Selection: choose an action according to its score (it is not actually played yet, only used for the next simulation);
- Expansion: simulate the opponent's move with the policy network, choosing a high-probability action according to \(\pi\);
- Evaluation: give the action selected in step 1 a score. One part is the score v that the value network assigns to the current state; the other part is the reward r from the policy network's self-play to the end of the game. The average \(\frac{v+r}{2}\) is assigned to a as its score;
- Backup: use \(\frac{v+r}{2}\) to update the action's score.
(1).Selection
In the current state \(s_t\), give every action a a score score(a) that reflects the quality of the action:
- \(score(a) = Q(a)+\eta\cdot\frac{\pi(a|s;\theta)}{1+N(a)}\)
Explanation of the terms:
- Q(a) is the action value computed by the search; think of it as a table recording the score of each action;
- η is a hyperparameter, tuned manually;
- \(\pi\) is the score given by the policy network;
- N(a) is the number of times action a has been selected. If the action is good it is likely to be chosen and the corresponding \(\pi\) value is large, but once a has been explored many times, N grows large, which discourages exploring it over and over again.
At the beginning Q is 0, so the policy term \(\pi\) alone decides which action to explore; after the number of explorations grows large, the second term becomes small and the first term Q(a) dominates.
To keep the example going, suppose there are three candidate actions; we choose the one with the highest score, \(a_t\), and go one level deeper (it is not executed, only simulated). A small sketch of this selection rule follows.
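A minimal sketch of the selection rule, where Q, N, and the prior \(\pi\) are the per-action arrays maintained by the search and `eta` is the hyperparameter η above:

```python
import numpy as np

def select_action(Q, N, pi, eta=1.0):
    """Pick the action maximising score(a) = Q(a) + eta * pi(a) / (1 + N(a)).

    Q:  361 action values accumulated by the search (all 0 at the start)
    N:  361 visit counts (all 0 at the start)
    pi: 361 prior probabilities given by the policy network
    """
    score = Q + eta * pi / (1.0 + N)
    return int(np.argmax(score))

# At the start Q == N == 0, so pi alone drives exploration; as visits accumulate,
# the second term shrinks and Q(a) takes over.
a_t = select_action(np.zeros(361), np.zeros(361), np.full(361, 1.0 / 361))
```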
(2).Expansion
Simulate the opponent with the policy network: assuming \(a_t\) has been played, sample an action from the policy function as the opponent's reply. Here we do not simply pick the highest-probability action; every possible action is drawn with its own probability.
A small mathematical concept:
The state transition function \(p(s_{t+1}|s_t,a_t)\):
The opponent's move (the opponent is equivalent to the environment) produces a new state \(s_{t+1}\). But we do not know how the opponent will respond, so we do not know the state transition function. We therefore use the policy function \(\pi\) in place of the state transition function.
For example, suppose we sampled the action with probability 0.4, producing a state \(s_{t+1}\):
(3).Evaluation
Starting from the state \(s_{t+1}\), let the policy network play against itself until the winner is decided (fast rollout). A reward is obtained at the end of the game: \(r_T=+1\) for a win, \(r_T=-1\) for a loss.
The reward \(r_T\) is used to evaluate how good \(s_{t+1}\) is: a win raises the score, a loss lowers it.
At the same time, the value network trained earlier is also used to score \(s_{t+1}\): feed in \(s_{t+1}\) and it outputs the score \(v(s_{t+1};w)\), which likewise reflects the chance of winning from \(s_{t+1}\).
Combining the two evaluations gives \(V(s_{t+1})=\frac{1}{2}v(s_{t+1};w)+\frac{1}{2}r_T\), which reflects how good \(s_{t+1}\) is.
(4).Backup
The simulation is repeated many times, so each state accumulates many records:
Each action \(a_t\) has many such child nodes. Average all the records under \(a_t\) and use the result as the new value \(Q(a_t)\), the evaluation of action \(a_t\).
This Q value is used in the first step (Selection) to choose the best a.
Q is initialized to 0 and is updated after every simulation.
At this point you can review the first step again.
After the above four steps have been repeated many, many times, the quality of each action becomes obvious, and the real game decision can be made.
The larger an action's Q value and \(\pi\) value, the more often it is selected, so the number of selections N(a) reflects how good the action is. AlphaGo simply plays the action with the largest N.
(5). Playing a move
After the human player makes a move, AlphaGo performs a fresh Monte Carlo tree search: Q and N are reinitialized to 0, and the four steps above are repeated to simulate and decide. A compact sketch of the whole search for one move is given below.
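Putting the four steps together, here is a compact single-level sketch of the search for one move. The helpers `legal_moves`, `play`, `sample_from`, and `rollout_to_end` are hypothetical placeholders for the game logic, and `policy_net` / `value_net` are assumed to return NumPy arrays / scalars; the real AlphaGo tree search is deeper and considerably more refined than this.

```python
import numpy as np

def mcts_choose_move(state, policy_net, value_net, n_sim=1600, eta=1.0):
    """Run n_sim simulations from `state`, then play the most-visited action.

    legal_moves, play, sample_from and rollout_to_end are hypothetical game-logic
    helpers (not defined here); policy_net returns 361 probabilities, value_net a scalar.
    """
    pi = policy_net(state)                 # priors for our 361 candidate moves
    Q = np.zeros(361)                      # running average score per action
    N = np.zeros(361)                      # visit count per action
    total = np.zeros(361)                  # sum of scores, averaged into Q

    for _ in range(n_sim):
        # 1. Selection: score(a) = Q(a) + eta * pi(a) / (1 + N(a))
        scores = Q + eta * pi / (1.0 + N)
        scores[~legal_moves(state)] = -np.inf        # mask occupied points
        a = int(np.argmax(scores))

        # 2. Expansion: sample the opponent's reply from the policy network
        s_after = play(state, a)
        s_next = play(s_after, sample_from(policy_net(s_after)))

        # 3. Evaluation: half value network, half fast-rollout outcome r_T
        score = 0.5 * value_net(s_next) + 0.5 * rollout_to_end(s_next, policy_net)

        # 4. Backup: fold the new score into the running average Q(a)
        total[a] += score
        N[a] += 1
        Q[a] = total[a] / N[a]

    return int(np.argmax(N))               # actually play the most-visited action
```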
5.8 AlphaGo Zero VS AlphaGo
A brief note on this part: AlphaGo Zero is stronger than AlphaGo, and the main differences are:
AlphaGo Zero does not use behavior cloning; it never learns from human experience, yet it is stronger.
When training the policy network, it uses MCTS and lets the policy network imitate the actions chosen by MCTS.
The details of how this works will be added later.
x. Reference tutorials
- Video course: Deep Reinforcement Learning (complete), Bilibili
- Original videos: https://www.youtube.com/user/wsszju
- Slides: https://github.com/wangshusen/DeepLearning