Reinforcement Learning (III): DQN, Nature DQN, Double DQN, with Source Code Interpretation
2022-07-29 01:42:00 【wweweiweiweiwei】
This is the 莫烦Python (MorvanZhou) reinforcement learning course I have been studying recently. One advantage of his lectures is that, although they only scratch the surface, you can get a much clearer picture by looking up blog posts online and combining them with his code.
This article presents my understanding of DQN and its improved variants, together with a partial interpretation of the MorvanZhou code.
1 DQN
Traditional reinforcement learning runs into trouble when the state space is high-dimensional: storing every value in a table would exhaust the computer's memory, and looking up a state in such a huge table at every step is also time-consuming. Replacing the table with a neural network, as is common in machine learning, solves this problem well. This is the origin of DQN: a neural network represents the Q function, with the network weights parameterizing the value function; the input is a state and the output is the Q value of each action.
DQN makes two main modifications to Q-learning:
First, it uses a deep (convolutional) neural network to approximate the value function. Put plainly, the table-based Q-value computation is replaced by network layers: regular Q-learning looks up a Q value for the input state, while DQN feeds the state through the network and outputs the Q values.
Second, DQN trains with experience replay: the data collected during reinforcement learning are correlated, and training a neural network on them in sequence can be unstable, so experience replay is used to break the correlation between samples. The code interpretation below shows how this is done.
Here is the flow of the DQN algorithm:

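To make the flow concrete, here is a minimal sketch of the interaction loop in Python (classic gym API, as in the tutorial). The RandomAgent placeholder and the loop constants are my own, invented purely for illustration; only the act → store → learn structure matters, and the method names mirror those used in RL_brain.py:
import gym
import numpy as np

MEMORY_SIZE = 3000
ACTION_SPACE = 11

class RandomAgent:
    """Stand-in with the same interface as the agent class in RL_brain.py."""
    def choose_action(self, s):
        return np.random.randint(ACTION_SPACE)
    def store_transition(self, s, a, r, s_):
        pass
    def learn(self):
        pass

env = gym.make('Pendulum-v0').unwrapped
agent = RandomAgent()
observation = env.reset()
for step in range(5000):
    action = agent.choose_action(observation)                                # epsilon-greedy in the real agent
    f_action = (action - (ACTION_SPACE - 1) / 2) / ((ACTION_SPACE - 1) / 4)  # map index 0..10 to torque -2..2
    observation_, reward, done, info = env.step(np.array([f_action]))
    agent.store_transition(observation, action, reward, observation_)        # experience replay memory
    if step > MEMORY_SIZE:                                                   # learn only once the memory has filled
        agent.learn()
    observation = observation_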
2 Nature DQN
In the original DQN algorithm, the same Q network is used to compute both the target value and the current value (I am not sure whether I misunderstood this part or whether MorvanZhou did not make it clear; in his video DQN already uses two Q networks, but in fact two Q networks were only introduced with Nature DQN). Nature DQN uses two Q networks. The reason is that if the output of the Q network is also used to update that same network's parameters, the dependence is too strong and the algorithm struggles to converge, hence the two networks.
One Q network is used to select actions and has its parameters updated; the other (called Q') is used to compute the target value. Q' is not updated iteratively; instead, it copies the parameters from the Q network every once in a while, so the two networks have identical structure.
The main improvement of Nature DQN over DQN lies in how the final target value is computed:

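Written out (the standard form, for reference), the Nature DQN target for a sampled transition (s, a, r, s_) is

    y = r + γ · max_a' Q'(s_, a')

where Q' is the target network whose parameters are only copied from Q every so often.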
3 Double DQN
The Double DQN algorithm addresses the overestimation problem of DQN and Nature DQN. Overestimation means the estimated value function is higher than the true value function, and the cause is the maximization in Q-learning: yes, that max, which biases the final model. DDQN (Double DQN) solves the overestimation problem by decoupling the selection of the action for the target Q value from the computation of the target Q value.
The next figure shows the algorithmic difference between Double DQN and Nature DQN:

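Written out (again the standard formulation, for reference), the difference is only in how the target is built:

    Nature DQN: y = r + γ · Q'(s_, argmax_a Q'(s_, a))
    Double DQN: y = r + γ · Q'(s_, argmax_a Q(s_, a))

That is, Double DQN selects the action with the current Q network but evaluates it with the target network Q', which removes the upward bias of taking a max over noisy estimates.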
Source code interpretation
The source code interpreted here is MorvanZhou's Double DQN; the three algorithms are so similar that I went straight for the most complicated one.
First, run_Pendulum.py:
env = gym.make('Pendulum-v0')
env = env.unwrapped
The first line simply gets the Pendulum environment from gym; nothing special there. The important one is the second line: the env we get back is wrapped, so its internal parameters cannot be accessed directly, and running the second line gives us the unwrapped model with access to those parameters;
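For instance, attributes like these live on the underlying pendulum model (a quick sketch based on the classic gym implementation; depending on the gym version some of them may only be reachable after unwrapping):
import gym
env = gym.make('Pendulum-v0').unwrapped
print(env.max_torque, env.max_speed, env.dt)   # parameters of the underlying pendulum model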
f_action = (action-(ACTION_SPACE-1)/2)/((ACTION_SPACE-1)/4)
This discretizes the continuous action space. There are 11 discrete actions, spread over the range -2 to 2, i.e. -2, -1.6, -1.2, …, 1.2, 1.6, 2.0, 11 actions in total; later we compute a Q value for each of them;
The main script itself is unremarkable; the core is RL_brain.py:
First comes initialization. One of the arguments is n_features, which is 3 and represents the state, expressed as a vector. In the previous two posts the state was essentially a coordinate; this environment is different, so its state representation is different too. The state vector here has three components: the cosine of theta, the sine of theta, and the derivative of theta, theta_dot. You can confirm this by looking at the environment code, Pendulum.py;
memory_size is the capacity of the replay memory, 3000. The memory is initialized to all zeros and data are later written over it; its size is 3000×8. The 8 stands for s, a, r, s_, where s and s_ are state vectors of length n_features, so everything adds up to 8;
Each Q network has two layers: the weight matrix of the first layer is 3×20 with a 1×20 bias, and the weight matrix of the second layer is 20×11 with a 1×11 bias. You can work the dimensions out for yourself; spelling them out here would be a bit tedious;
The input to the network is the state vector of length n_features; nothing more to explain;
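As a rough shape check in plain numpy (my own sketch, not the tutorial's TensorFlow code; the hidden size 20 and the ReLU are assumptions that match the dimensions above):
import numpy as np

n_features, n_hidden, n_actions = 3, 20, 11
s = np.zeros((1, n_features))                   # one state: [cos(theta), sin(theta), theta_dot]
w1, b1 = np.random.randn(n_features, n_hidden), np.zeros((1, n_hidden))
w2, b2 = np.random.randn(n_hidden, n_actions), np.zeros((1, n_actions))
l1 = np.maximum(0, s @ w1 + b1)                  # first layer + ReLU, shape (1, 20)
q = l1 @ w2 + b2                                 # one Q value per action, shape (1, 11)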
Next, look at the store_transition function:
def store_transition(self, s, a, r, s_):
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0
    transition = np.hstack((s, [a, r], s_))          # flatten into a 1-D array for easy storage
    index = self.memory_counter % self.memory_size   # the clever part: overwrite old entries
    self.memory[index, :] = transition
    self.memory_counter += 1
The subtlety of this function is the overwrite storage. np.hstack first turns s, a, r, s_ into a flat array of 8 numbers, which is then stored into the memory; as mentioned above, the memory dimensions match. The index computation at the end takes a remainder: once self.memory_counter exceeds 3000, writing starts again from the beginning of the memory and overwrites the existing data; if it has not exceeded 3000 yet, the memory defined at the start is all zeros and is simply overwritten too. This code is easy to follow and quite elegant (I certainly would not have come up with this step myself);
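The wrap-around index is easy to see with a few counter values (numbers chosen only for illustration):
memory_size = 3000
for memory_counter in (0, 1, 2999, 3000, 3001, 6005):
    print(memory_counter, memory_counter % memory_size)   # index goes 0, 1, 2999, 0, 1, 5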
def choose_action(self, observation):
    observation = observation[np.newaxis, :]     # add a batch dimension
    actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})  # network output has shape (1, 11)
    action = np.argmax(actions_value)            # index of the largest Q value
    if not hasattr(self, 'q'):                   # record the action values for plotting later
        self.q = []
        self.running_q = 0
    self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value)  # smooths the plotted curve; strictly it should be the line below
    # self.running_q = np.max(actions_value)
    self.q.append(self.running_q)
    if np.random.uniform() > self.epsilon:       # choose the action
        action = np.random.randint(0, self.n_actions)  # pick one of the 11 actions at random; otherwise keep the action with the largest Q-eval output
    return action
Action selection here is random at the beginning, which is characteristic of reinforcement learning; after a certain number of steps it starts making informed choices. That is the final if statement: self.epsilon starts at 0 and slowly increases, so later on the condition rarely holds and actions are rarely chosen at random. As for the earlier self.running_q line, its only purpose is to make the curve plotted later smoother, nothing more; compare the following two figures:
self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value):
self.running_q = np.max(actions_value):

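A tiny numerical illustration of why the 0.99/0.01 moving average looks smoother (made-up data, not values from the actual run):
import numpy as np

np.random.seed(0)
raw = np.random.randn(1000)                    # noisy per-step max Q values (made up)
running_q, smooth = 0.0, []
for q in raw:
    running_q = running_q * 0.99 + 0.01 * q
    smooth.append(running_q)
print(round(float(np.std(raw)), 3), round(float(np.std(smooth)), 3))  # the smoothed series fluctuates far less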
Then the learn function:
if self.learn_step_counter % self.replace_target_iter == 0:   # starting from step 3000, update once every 200 steps
    self.sess.run(self.replace_target_op)
    print('\ntarget_params_replaced\n')   # confirms that the target_net parameters have been replaced, once every 200 steps
So the Q' network parameters are not updated during the first 3000 steps; updating starts at step 3000 and then happens once every 200 steps;
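Conceptually the copy is just an assignment from the eval network to the target network every replace_target_iter learning steps; a minimal numpy sketch of that schedule (my own illustration, not the TensorFlow assign ops in the tutorial):
import numpy as np

eval_params = {'w1': np.random.randn(3, 20), 'w2': np.random.randn(20, 11)}
target_params = {k: np.zeros_like(v) for k, v in eval_params.items()}

REPLACE_TARGET_ITER = 200
for learn_step_counter in range(1000):
    if learn_step_counter % REPLACE_TARGET_ITER == 0:
        target_params = {k: v.copy() for k, v in eval_params.items()}   # hard update of Q'
    # ... gradient steps on eval_params would go here ...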
if self.memory_counter > self.memory_size:
    sample_index = np.random.choice(self.memory_size, size=self.batch_size)   # randomly pick batch_size indices
else:
    sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
batch_memory = self.memory[sample_index, :]   # read the rows at those indices
A random batch is drawn from the memory and used as the subsequent network input; sampling at random breaks the correlation between data points, as mentioned in the theory section above;
The code that follows is the Double DQN target computation itself; I will not analyze it line by line, but the key idea is sketched below.
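As a minimal numpy sketch of that decoupling (my own simplification with made-up batch data, not the tutorial's exact code): the eval network picks the best action at s_, and the target network Q' supplies the value of that action:
import numpy as np

batch_size, n_actions, gamma = 32, 11, 0.9
reward = np.random.randn(batch_size)                  # r from the sampled batch (made-up values)
q_next = np.random.randn(batch_size, n_actions)       # Q'(s_, ·) from the target network
q_eval4next = np.random.randn(batch_size, n_actions)  # Q(s_, ·) from the eval network

batch_index = np.arange(batch_size)
max_act4next = np.argmax(q_eval4next, axis=1)          # Double DQN: select the action with the eval network
selected_q_next = q_next[batch_index, max_act4next]    # ...but evaluate it with the target network Q'
# Nature DQN would instead use: selected_q_next = np.max(q_next, axis=1)
q_target = reward + gamma * selected_q_next
print(q_target.shape)                                  # (32,)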