Reinforcement learning - the incomplete observation problem and MCTS
2022-07-28 06:10:00 【Food to doubt life】
Preface
The material in this article is summarized from Deep Reinforcement Learning. If there are any mistakes, corrections are welcome.
Incomplete observation problem
In the reinforcement learning algorithms introduced in previous posts, the agent can observe the complete state of the environment, similar to the spectator view in a MOBA game, where the movements of both sides are visible on the whole map. In some settings, however, the agent can only observe part of the state. For example, in Honor of Kings, the fog-of-war (vision) mechanism means each player sees only a small portion of the map.
Let $o_t$ be the observation the agent receives at time $t$. If the policy network makes decisions based only on $o_t$, the results are usually poor. A better approach is to let the policy network base its decision on all observations up to time $t$, i.e. to feed the sequence $[o_1, o_2, \dots, o_t]$ into the policy network. Models such as RNNs, LSTMs, and Transformers can be used to process this kind of sequential data.

As shown in the figure above, take an RNN as an example. After observing the state $o_n$ at time $n$, the sequence $[o_1, o_2, \dots, o_n]$ is fed into $n$ CNNs; the outputs of the $n$ CNNs are processed by an RNN; the $n$-th output of the RNN is fed into a fully connected network, which outputs the probability of each action the agent can take.
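As a concrete illustration, here is a minimal PyTorch sketch of such a policy network, assuming image observations of shape 3x84x84 and a single CNN shared across time steps; the layer sizes and module names are illustrative, not the architecture from the book:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Policy network for partial observations: per-frame CNN -> RNN -> FC -> action probs."""
    def __init__(self, n_actions, feat_dim=128, hidden_dim=256):
        super().__init__()
        # A small CNN applied to every observation o_1 ... o_t (assumed shape: 3x84x84).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, feat_dim), nn.ReLU(),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # GRU as the RNN
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq):
        # obs_seq: (batch, T, 3, 84, 84) -- the sequence [o_1, ..., o_t]
        b, t = obs_seq.shape[:2]
        feats = self.cnn(obs_seq.flatten(0, 1)).view(b, t, -1)  # per-frame CNN features
        out, _ = self.rnn(feats)            # process the feature sequence
        logits = self.head(out[:, -1])      # use only the last RNN output
        return torch.softmax(logits, dim=-1)

probs = RecurrentPolicy(n_actions=4)(torch.randn(1, 5, 3, 84, 84))  # action probabilities
```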
MCTS
MCTS (Monte Carlo Tree Search) is a very important method in reinforcement learning, used in many algorithms including AlphaGo and AlphaGo Zero. The basic idea of MCTS is to simulate what might happen in the future so as to find the best action. It can be regarded as an intelligent form of enumeration, which also means that MCTS consumes a great deal of computing power.
In each simulation, MCTS selects an action, plays the game out to the end after executing it, and evaluates the action according to the outcome. MCTS consists of four steps: selection, expansion, simulation, and backup. It is worth mentioning that the policy network and the value network assist the MCTS computation, so both need to be trained in advance, before MCTS is run.
Selection
Suppose there are $n$ actions. The selection step of MCTS chooses an action with a higher chance of winning, using both the action value computed from past simulations and the score (probability) given by the policy network $\pi(a|s;\theta)$ to assess how good an action $a$ is. Specifically, the quality of an action is evaluated by the following formula:
$$\mathrm{score}(a)=Q(a)+\frac{\alpha}{1+N(a)}\pi(a|s;\theta)\tag{1.0}$$
where

- $\alpha$ is a hyperparameter;
- $N(a)$ is the number of times action $a$ has been selected; initially $N(a)=0$, and each time action $a$ is selected, $N(a)$ is incremented by 1;
- $Q(a)$ is the action value computed from the previous $N(a)$ simulations, determined mainly by the win rate and the value function; each time action $a$ is selected, $Q(a)$ is updated.
When action $a$ has not yet been selected, $Q(a)$ and $N(a)$ are both 0, and $\mathrm{score}(a)$ is determined entirely by the policy network. Once action $a$ has been selected many times, $Q(a)$ has an increasingly large influence on $\mathrm{score}(a)$; that is, the score depends more and more on the action value computed from past simulations (the actual outcomes, rather than the policy network's prediction). One role of the coefficient $\frac{\alpha}{1+N(a)}$ is to encourage MCTS to select actions that have been executed fewer times: if two actions have similar $Q(a)$ and $\pi(a|s;\theta)$, the one with the smaller $N(a)$ will be selected.
In the selection stage, MCTS picks the action with the largest value of formula (1.0) and has the agent execute this action in the simulator (not in the real environment).
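A minimal sketch of this selection rule, assuming $Q$, $N$, and the policy probabilities are kept as plain NumPy arrays and using an arbitrary value for $\alpha$:

```python
import numpy as np

def select_action(Q, N, pi, alpha=2.5):
    """Pick the action maximizing score(a) = Q(a) + alpha / (1 + N(a)) * pi(a|s)."""
    score = Q + alpha / (1.0 + N) * pi
    return int(np.argmax(score))

# Example: an unvisited action with a high prior beats a visited one with mediocre value.
Q  = np.array([0.3, 0.0, 0.1])   # value estimates from past simulations
N  = np.array([10,  0,   4  ])   # visit counts
pi = np.array([0.2, 0.6, 0.2])   # policy network probabilities
print(select_action(Q, N, pi))   # -> 1
```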
Expansion
After the agent has executed, in the simulator, the action chosen in the selection step, the expansion stage simulates the opponent's next action. MCTS uses the pre-trained policy network $\pi(a|s_t;\theta)$ to simulate the opponent, sampling an action according to the probabilities output by the network. It is worth noting that after an action is executed in the simulator, the simulator must return a new state, so the simulator needs to approximate the real state transition function. For most problems, such as autonomous driving, simulating the real state transition function is quite difficult. But for relatively simple problems such as MOBA games or Go, the board (or match) after the opponent's move simply is the next state in the simulator, so it suffices to train a policy network that imitates the real opponent in order to simulate the state transition.
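In code, simulating the opponent amounts to sampling from the policy network's output distribution; a small sketch, where `opponent_policy` is an assumed stand-in for the pre-trained network:

```python
import numpy as np

def expand(state, opponent_policy, rng=np.random.default_rng()):
    """Sample the opponent's action from pi(a|s; theta) rather than taking the argmax."""
    probs = opponent_policy(state)            # assumed to return a probability vector
    return int(rng.choice(len(probs), p=probs))

# e.g. a dummy opponent that plays uniformly over 3 actions
print(expand(state=None, opponent_policy=lambda s: np.array([1/3, 1/3, 1/3])))
```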
Simulation
In the selection and expansion stages, the agent and the opponent each choose an action and execute it in the simulator, producing a new state $s_{t+1}$. In the simulation stage, the two sides play out the rest of the match using two pre-trained policy networks, and the value of state $s_{t+1}$ is computed from this playout. Taking Go as an example, AlphaGo simulates the game with two policy networks at this stage until the game ends. If our side wins, the reward $r$ is 1; otherwise it is -1. In addition, AlphaGo introduces a pre-trained value network $V(s;\theta)$ to evaluate how good the state $s_{t+1}$ is. AlphaGo's evaluation $v(s_{t+1})$ of state $s_{t+1}$ is:
$$v(s_{t+1})=\frac{r+V(s_{t+1};\theta)}{2}$$
Because actions are sampled according to the probabilities output by the policy networks, the actions performed in each simulation from state $s_{t+1}$ may differ. MCTS runs the simulation from state $s_{t+1}$ many times, so, as shown in the figure below, one state $s_{t+1}$ accumulates multiple evaluations.
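A minimal sketch of one such evaluation, assuming a `rollout` helper that plays the game to the end with the two policy networks and returns $r=\pm 1$, and a value network `V`; both names are placeholders:

```python
def evaluate_state(s_next, rollout, V):
    """One evaluation of s_{t+1}: average the rollout outcome r and the value network output."""
    r = rollout(s_next)            # +1 if our side wins the simulated playout, -1 otherwise
    return (r + V(s_next)) / 2.0   # v(s_{t+1}) = (r + V(s_{t+1}; theta)) / 2

# Each call may give a different value because the rollout samples actions stochastically.
evaluations = [evaluate_state("s_t+1", rollout=lambda s: 1.0, V=lambda s: 0.4) for _ in range(3)]
print(evaluations)  # -> [0.7, 0.7, 0.7] with this deterministic dummy rollout
```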
Backup
As described in the previous section, there are many recorded evaluations of the states reached under the action $a_t$ at step $t$; the mean of these evaluations is the $Q(a)$ in formula (1.0). This value is used when formula (1.0) is computed in the next iteration (the four steps above make up one MCTS iteration, and MCTS repeats them many times).
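A sketch of how $Q(a)$ and $N(a)$ might be maintained during backup, keeping $Q(a)$ as the mean of all recorded evaluations; real implementations back values up along the whole search path, so this is only an illustration:

```python
def backup(stats, action, values):
    """Record the new evaluations and recompute Q(a) as the mean of all values seen for action a."""
    s = stats.setdefault(action, {"sum": 0.0, "n_values": 0, "N": 0})
    s["N"] += 1                        # one more simulation passed through this action
    s["sum"] += sum(values)            # accumulate every evaluation v(s_{t+1}) just produced
    s["n_values"] += len(values)
    s["Q"] = s["sum"] / s["n_values"]  # Q(a): mean of all recorded evaluations
    return s["Q"], s["N"]

stats = {}
print(backup(stats, action=2, values=[1.0, 0.0, -1.0]))  # -> (0.0, 1)
print(backup(stats, action=2, values=[1.0]))             # -> (0.25, 2)
```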
Decision making
After thousands of iterations, MCTS selects the action that was executed most often in state $s_t$, i.e. $a=\arg\max_a N(a)$. Having seen how MCTS works, it is not hard to understand why AlphaGo and AlphaGo Zero can beat humans: they essentially win by sheer computational force.
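The final decision is then a single argmax over the visit counts (the counts below are made up):

```python
import numpy as np

N = np.array([812, 1304, 96, 288])   # visit counts after thousands of MCTS iterations
best_action = int(np.argmax(N))      # a = argmax_a N(a) -> 1
```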
Training the policy network and the value network
As mentioned earlier, MCTS relies on a pre-trained value network and policy network. This section summarizes how AlphaGo and AlphaGo Zero train these two networks.
AlphaGo
AlphaGo trains its policy network as follows:

- Train the policy network $\pi(a|s;\theta)$ on supervised data.
- Denote the policy network fully trained in the first step by $\pi(a|s;\theta_{now})$, and randomly pick a previously trained policy network $\pi(a|s;\theta_{old})$ (with its parameters fixed). Let the two policy networks play a match against each other; if $\pi(a|s;\theta_{now})$ wins, the return $u$ is 1, otherwise -1. Then use REINFORCE to update the parameters of $\pi(a|s;\theta_{now})$. Repeat this process until convergence.
The above process yields a series of pairs $(s_t,u_t)$. Using this data, the value network $v(s;\theta)$ is trained with supervised learning, where the loss function is the MSE:
$$L(\theta)=\frac{1}{N}\sum_{t=1}^N\big[v(s_t;\theta)-u_t\big]^2$$
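A minimal PyTorch sketch of this regression step, assuming the collected $(s_t,u_t)$ pairs are already stacked into tensors and `value_net` maps a state to a scalar; the network shape and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def train_value_net(value_net, states, returns, lr=1e-3, epochs=10):
    """Supervised regression of v(s; theta) onto the game outcomes u with an MSE loss."""
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)   # v(s_t; theta)
        loss = loss_fn(pred, returns)          # (1/N) * sum_t [v(s_t) - u_t]^2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return value_net

# usage with dummy data: 16 states of dimension 8, outcomes u in {+1, -1}
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
train_value_net(net, torch.randn(16, 8), torch.randint(0, 2, (16,)).float() * 2 - 1)
```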
AlphaGo Zero
AlphaGo Zero uses a ready-made MCTS (for example, the one described above for AlphaGo) to simulate a match between two players. Starting from the current state $s_t$, after many MCTS simulations we obtain the execution counts of the $n$ actions, $N(1), N(2), \dots, N(n)$. Normalizing them gives $p_t$, i.e. $p_t=\mathrm{normalize}([N(1),N(2),\dots,N(n)])$.
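The normalization itself is straightforward; a small sketch (the published AlphaGo Zero additionally applies a temperature to the counts, which is omitted here):

```python
import numpy as np

def visit_counts_to_policy(counts):
    """p_t = normalize([N(1), ..., N(n)]): divide each count by the total."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts / counts.sum()

print(visit_counts_to_policy([812, 1304, 96, 288]))  # a probability vector summing to 1
```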
Suppose the two players simulated by MCTS finish a match after $m$ steps. We then obtain a trajectory for each of the two players:
$$(s_1,p_1,u_1),\ (s_2,p_2,u_2),\ \dots,\ (s_m,p_m,u_m)$$
For the winner, the returns are $u_1=u_2=\dots=u_m=1$; for the loser, $u_1=u_2=\dots=u_m=-1$. This data is used to update the policy network and the value network (note that the $s$ and $p$ in the winner's and loser's trajectories are different).
The policy network $\pi(a|s_t;\theta)$ is trained with supervised learning. Let $H(x,y)$ denote the cross-entropy loss; the loss function is then
$$L(\theta)=\frac{1}{N}\sum_{i=1}^N H\big(p_i,\pi(a|s_i;\theta)\big)$$
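A sketch of this loss in PyTorch; because the targets $p_i$ are soft distributions rather than hard labels, the cross entropy is written out explicitly. The `policy_net` here is an assumed stand-in that outputs action probabilities:

```python
import torch
import torch.nn as nn

def policy_loss(policy_net, states, p, eps=1e-8):
    """L(theta) = (1/N) * sum_i H(p_i, pi(.|s_i; theta)) with soft MCTS targets p_i."""
    pi = policy_net(states)                               # (N, n_actions) probabilities
    return -(p * torch.log(pi + eps)).sum(dim=-1).mean()  # cross entropy H(p, pi), averaged

# dummy usage: 4 states of dim 8, 3 actions, targets already normalized
net = nn.Sequential(nn.Linear(8, 3), nn.Softmax(dim=-1))
p = torch.tensor([[0.2, 0.5, 0.3]]).repeat(4, 1)
loss = policy_loss(net, torch.randn(4, 8), p)
loss.backward()   # gradients flow into the policy network as in ordinary supervised learning
```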
The value network is updated in the same way as in AlphaGo. The full training procedure repeats the following three steps until the algorithm converges (a compact sketch of the loop follows the list):
- Let MCTS play a full game against itself, collecting $m$ triples $(s_1,p_1,u_1), (s_2,p_2,u_2), \dots, (s_m,p_m,u_m)$.
- Use these triples to update the parameters of the policy network.
- Use these triples to update the parameters of the value network.
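A compact sketch of this loop, where `self_play`, `update_policy`, and `update_value` are assumed stubs for the MCTS self-play and the two update steps described above:

```python
def train_alphago_zero(self_play, update_policy, update_value, n_iters=1000):
    """Repeat: self-play one game with MCTS, then fit the policy and value networks to it."""
    for _ in range(n_iters):
        triples = self_play()                    # [(s_1, p_1, u_1), ..., (s_m, p_m, u_m)]
        states, targets_p, targets_u = zip(*triples)
        update_policy(states, targets_p)         # cross-entropy against the MCTS distributions p_i
        update_value(states, targets_u)          # MSE against the game outcomes u_i

# dummy usage with stub callables
train_alphago_zero(lambda: [("s1", [0.5, 0.5], 1.0)],
                   lambda s, p: None, lambda s, u: None, n_iters=1)
```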
From this it is easy to see that DeepMind has very deep pockets when it comes to compute.