Reinforcement Learning (3): DQN, Nature DQN, Double DQN, with Source Code Interpretation
This is the reinforcement learning course from 莫烦Python (MoFan Python) that I have been studying recently. One advantage of his lectures is that, although they only skim the surface, you can get a much clearer picture by reading blogs online and working through his code alongside them.
This article presents my understanding of DQN and its improved algorithms, together with a partial interpretation of the 莫烦Python code.
1 DQN
Traditional tabular reinforcement learning runs into an explosion problem when the state has too many dimensions: storing every state in a table would exhaust the computer's memory, and looking up the corresponding state in such a huge table at every step is also time-consuming. Replacing the table with a neural network, as elsewhere in machine learning, solves this problem nicely. This is the origin of DQN: a neural network represents the Q function (the network weights parameterize the value function), with a state as input and the Q value of every action as output.
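To make the contrast concrete, here is a toy sketch (purely illustrative, not the tutorial's code): a tabular agent stores one entry per (state, action) pair, while DQN replaces the table with a parameterized function that takes the state vector and returns one Q value per action:

import numpy as np

# Tabular Q-learning: one table entry per (state, action) pair,
# which becomes infeasible for large or continuous state spaces.
q_table = {}
q_table[((0, 1), 2)] = 0.5          # Q(state=(0, 1), action=2)

# DQN idea: a parameterized function Q(s; theta); a single linear layer here just for illustration.
theta = np.random.default_rng(0).normal(scale=0.1, size=(3, 11))

def q_values(state):
    # state: vector of length 3 -> one Q value for each of 11 discrete actions
    return state @ theta

print(q_values(np.array([0.5, -0.2, 0.1])).shape)   # (11,)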
DQN makes two main modifications to Q-learning:
One is to use a deep (convolutional) neural network to approximate the value function. Put plainly, network layers replace the table-based Q computation: ordinary Q-learning takes a state and looks its Q values up, while DQN takes a state and outputs the Q values through the network;
The other is that DQN trains with experience replay: the data collected by interacting with the environment are correlated with one another, and training a neural network on them in order can be unstable, so experience replay is used to break the correlation between samples. The code interpretation below shows how this is done; a generic sketch follows right after this list.
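As a rough illustration of experience replay (a generic sketch, not the tutorial's exact implementation, which is walked through in the source-code section below): transitions are written into a fixed-size buffer and random mini-batches are drawn from it for training, which breaks the temporal correlation between consecutive samples.

import numpy as np

class ReplayBuffer:
    # Generic experience replay sketch: fixed capacity, overwrite oldest rows, sample at random.
    def __init__(self, capacity, transition_dim):
        self.capacity = capacity
        self.memory = np.zeros((capacity, transition_dim))
        self.counter = 0

    def store(self, s, a, r, s_):
        row = np.hstack((s, [a, r], s_))                   # flatten (s, a, r, s') into one row
        self.memory[self.counter % self.capacity] = row    # overwrite the oldest row when full
        self.counter += 1

    def sample(self, batch_size):
        filled = min(self.counter, self.capacity)
        idx = np.random.choice(filled, size=batch_size)    # random rows break the correlation
        return self.memory[idx]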
The DQN algorithm flow:

2 Nature DQN
In the DQN algorithm, the same Q network is used to compute both the target value and the current value (I am not sure whether this is my misunderstanding or whether MoFan simply did not spell it out; in the video DQN already uses two Q networks, but strictly speaking it should be Nature DQN that first introduced the two Q networks). Nature DQN uses two Q networks. The reason is that if the output of the Q network is used to update that same Q network's parameters, the two are too strongly coupled, which is not conducive to convergence; hence the proposal to use two Q networks.
One Q network is used to select actions and has its parameters updated; the other Q network (called Q') is used to compute the target value. Q' is not updated iteratively; instead, it copies the parameters from the Q network every fixed number of steps, so the two Q networks have exactly the same structure.
The improvement of Nature DQN over DQN lies mainly in how the final target value is computed:
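In the standard Nature DQN form the target is y = r + gamma * max over a' of Q'(s', a'), computed with the target network Q'. A minimal numpy sketch of the batch computation (illustrative numbers; gamma and the q_next array are assumptions, not values from the tutorial):

import numpy as np

gamma = 0.9                                  # assumed discount factor
q_next = np.array([[1.0, 2.0, 0.5],          # Q'(s'_j, a) from the target network, one row per sample
                   [0.3, 0.1, 0.8]])
rewards = np.array([1.0, -1.0])

# Nature DQN target: bootstrap with the max over the *target* network's outputs
q_target = rewards + gamma * np.max(q_next, axis=1)
print(q_target)                              # [ 2.8  -0.28]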

3 Double DQN
The Double DQN algorithm addresses the over-estimation problem of DQN and Nature DQN. Over-estimation means that the estimated value function is higher than the true value function, and the cause is the maximization in Q-learning: that max operator gives the final model a large upward bias. DDQN (Double DQN) removes the over-estimation by decoupling two steps that are otherwise performed together, namely selecting the action for the target Q value and computing that action's target Q value.
The difference between Double DQN and Nature DQN:
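Here is a minimal numpy sketch of that difference (illustrative values; q_eval_next stands for the eval/online network's output at s' and q_next for the target network's output at s'):

import numpy as np

gamma = 0.9
q_eval_next = np.array([[0.2, 1.5, 0.4]])    # eval (online) network at s'
q_next      = np.array([[1.0, 0.7, 2.0]])    # target network Q' at s'
reward      = np.array([1.0])
batch_index = np.arange(1)

# Nature DQN: the target network both selects and evaluates the action
y_nature = reward + gamma * np.max(q_next, axis=1)            # 1 + 0.9 * 2.0 = 2.8

# Double DQN: the eval network selects the action, the target network evaluates it
a_star   = np.argmax(q_eval_next, axis=1)                      # action index 1
y_double = reward + gamma * q_next[batch_index, a_star]        # 1 + 0.9 * 0.7 = 1.63
print(y_nature, y_double)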

Source code interpretation
For the source-code interpretation I use MoFan's Double DQN; the three algorithms are very similar, so I went straight for the most complex one.
First, run_Pendulum.py:
env = gym.make('Pendulum-v0')
env = env.unwrapped
The first line gets the Pendulum model from gym, which is nothing special. The second line is the important one: the env returned by gym.make is wrapped, so its internal parameters cannot be accessed directly; running env.unwrapped exposes the parameters of the underlying environment;
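A quick illustration of what unwrapping does (assuming a classic gym version in which gym.make wraps the environment, e.g. in a TimeLimit wrapper):

import gym

env = gym.make('Pendulum-v0')
print(type(env))             # a wrapper around the real environment
print(type(env.unwrapped))   # the underlying Pendulum environment, whose attributes are accessible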
f_action = (action-(ACTION_SPACE-1)/2)/((ACTION_SPACE-1)/4)
This discretizes the continuous action. After discretization there are 11 actions, spread over the range -2 to 2, namely -2, -1.6, -1.2, ..., 1.2, 1.6, 2.0, 11 actions in total; their corresponding Q values are computed later;
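A quick check of the mapping (ACTION_SPACE = 11, as used in the tutorial's Pendulum setup):

ACTION_SPACE = 11
f_actions = [(a - (ACTION_SPACE - 1) / 2) / ((ACTION_SPACE - 1) / 4) for a in range(ACTION_SPACE)]
print(f_actions)   # [-2.0, -1.6, -1.2, -0.8, -0.4, 0.0, 0.4, 0.8, 1.2, 1.6, 2.0]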
The main function itself is nothing special; the core is RL_brain.py:
First, the initialization. There is an n_features, which is 3 and describes the state, expressed here as a vector. In the previous two posts of this series the state was effectively a coordinate; the model here is different, so its state representation is also different. The state vector has three components: the cosine of theta, the sine of theta, and the derivative of theta, theta_dot. You can verify this by reading the model code Pendulum.py;
memory_size is the capacity of the memory buffer, 3000. The memory is initialized to all zeros and later overwritten as data is stored; its size is 3000×8. The 8 comes from s, a, r, s_, where s and s_ are each a state vector of length n_features (3), so it adds up to 3 + 1 + 1 + 3 = 8;
Each Q network has two fully connected layers: the weight matrix of the first layer is 3×20 with a 1×20 bias, and the weight matrix of the second layer is 20×11 with a 1×11 bias. You can work out why the dimensions are what they are yourself; explaining them is a little tedious;
The input to the network is the state vector of length n_features, which needs no further explanation;
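A shape check of the layers just described (a numpy sketch that only mirrors the dimensions; the tutorial's actual network is built in TensorFlow, and the ReLU here is an assumption about its hidden activation):

import numpy as np

n_features, n_l1, n_actions = 3, 20, 11
w1, b1 = np.zeros((n_features, n_l1)), np.zeros((1, n_l1))      # first layer: 3x20 weights, 1x20 bias
w2, b2 = np.zeros((n_l1, n_actions)), np.zeros((1, n_actions))  # second layer: 20x11 weights, 1x11 bias

s = np.zeros((1, n_features))        # one state vector as a row
l1 = np.maximum(0.0, s @ w1 + b1)    # hidden layer (ReLU assumed)
q = l1 @ w2 + b2
print(q.shape)                       # (1, 11): one Q value per discretized action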
Next, look at the store_transition function:
def store_transition(self, s, a, r, s_):
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0
    transition = np.hstack((s, [a, r], s_))          # flatten into a one-dimensional array for easy storage
    index = self.memory_counter % self.memory_size   # the clever part: overwrite storage
    self.memory[index, :] = transition
    self.memory_counter += 1
The subtlety of this function is the overwrite storage. np.hstack first flattens s, a, r, s_ into a single row of 8 numbers, which is then written into the memory (this is why the memory dimensions match, as mentioned earlier). The index calculation at the end takes a remainder: once self.memory_counter exceeds 3000, indexing wraps back to the beginning of the memory and overwrites the existing data (for example, with memory_counter = 3001, index = 3001 % 3000 = 1, so the second row is overwritten); before it reaches 3000, the all-zero rows created at initialization are simply overwritten. This code is easy to follow and very elegant (I certainly could not have come up with this step myself);
def choose_action(self, observation):
    observation = observation[np.newaxis, :]    # add a batch dimension
    actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})    # network output has shape (1, 11)
    action = np.argmax(actions_value)           # index of the largest Q value
    if not hasattr(self, 'q'):                  # record the action values it gets
        self.q = []
        self.running_q = 0
    self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value)    # smoothed for plotting; the raw version is the commented line below
    # self.running_q = np.max(actions_value)
    self.q.append(self.running_q)
    if np.random.uniform() > self.epsilon:      # exploration branch
        action = np.random.randint(0, self.n_actions)    # pick one of the 11 actions at random; otherwise keep the argmax of the Q-eval output
    return action
Action selection here is random at the beginning, which is characteristic of reinforcement learning; after a certain number of steps the agent starts making informed choices. That is the final if statement: self.epsilon is 0 at the start and slowly increases, so later on the if branch is rarely entered and actions are no longer chosen at random. As for the self.running_q line further up, its only purpose is to make the plot drawn later smoother, nothing more; compare the following two figures:
With self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value) (smoothed curve):
With self.running_q = np.max(actions_value) (raw curve):

Next, the learn function:
if self.learn_step_counter % self.replace_target_iter == 0:   # from step 3000 onward, update once every 200 steps
    self.sess.run(self.replace_target_op)
    print('\ntarget_params_replaced\n')    # prints whenever the target_net parameters are replaced, i.e. every 200 steps
So the Q' network parameters are not updated during the first 3000 steps; from step 3000 onward they are updated, once every 200 steps;
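replace_target_op simply copies every parameter of the eval network into the target network (a hard update). A framework-neutral sketch of the idea (the tutorial builds TensorFlow assign operations for this; the dictionaries below are just an illustration):

import numpy as np

eval_params   = {'w1': np.array([[0.1, 0.2]]), 'b1': np.array([0.3])}
target_params = {'w1': np.zeros((1, 2)),       'b1': np.zeros(1)}

# Hard update: overwrite each target-network parameter with the eval network's current values
for name in eval_params:
    target_params[name] = eval_params[name].copy()
print(target_params)   # target now equals eval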
if self.memory_counter > self.memory_size:
    sample_index = np.random.choice(self.memory_size, size=self.batch_size)     # randomly pick batch_size indices from the full memory
else:
    sample_index = np.random.choice(self.memory_counter, size=self.batch_size)  # memory not yet full: sample only from the stored rows
batch_memory = self.memory[sample_index, :]    # read the rows at those indices
A batch of data is chosen at random from the memory as the subsequent network input; the random selection breaks the correlation between samples, as mentioned in the theory section above;
The code that follows is the Double DQN algorithm itself and is not analyzed in detail here, but its core target computation can be sketched as below.
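A hedged numpy sketch of the core Double DQN target computation for a batch (the variable names and the overall structure follow the usual pattern of such code, but this is an illustration, not the tutorial's exact implementation):

import numpy as np

gamma, batch_size, n_actions = 0.9, 2, 3
rng = np.random.default_rng(0)

q_next      = rng.random((batch_size, n_actions))    # target network Q'(s'_j, .)
q_eval4next = rng.random((batch_size, n_actions))    # eval network Q(s'_j, .), used only to pick actions
q_eval      = rng.random((batch_size, n_actions))    # eval network Q(s_j, .) at the current states
rewards     = np.array([1.0, 0.5])
actions     = np.array([2, 0])                       # actions actually taken at s_j

batch_index     = np.arange(batch_size)
max_act4next    = np.argmax(q_eval4next, axis=1)     # Double DQN: the eval network selects the action
selected_q_next = q_next[batch_index, max_act4next]  # the target network evaluates that action

# Only the entries of the actions actually taken are changed; the rest keep the eval network's values,
# so those entries contribute zero error in the squared loss.
q_target = q_eval.copy()
q_target[batch_index, actions] = rewards + gamma * selected_q_next
print(q_target)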