Reinforcement Learning (III): DQN, Nature DQN, Double DQN, with Source Code Interpretation
2022-07-29 01:42:00 【wweweiweiweiwei】
This is the 莫烦Python (MorvanZhou) reinforcement learning course I have been studying recently. One advantage of his lectures is that, although they only scratch the surface, you can get a much clearer picture by looking up blog posts online and combining them with his code.
This article presents my understanding of DQN and its improved variants, together with a partial interpretation of the MorvanZhou code.
1 DQN
Traditional reinforcement learning runs into trouble when the state space is high-dimensional: storing every value in a table would exhaust the computer's memory, and looking up a state in such a huge table at every step is also time-consuming. Replacing the table with a neural network, as is common in machine learning, solves this problem well. This is the origin of DQN: a neural network represents the Q function, with the network weights parameterizing the value function; the input is a state and the output is the Q value of each action.
DQN makes two main modifications to Q-learning:
First, it uses a deep (convolutional) neural network to approximate the value function. Put plainly, the table-based Q-value computation is replaced by network layers: regular Q-learning looks up a Q value for the input state, while DQN feeds the state through the network and outputs the Q values.
Second, DQN trains with experience replay: the data collected during reinforcement learning are correlated, and training a neural network on them in sequence can be unstable, so experience replay is used to break the correlation between samples. The code interpretation below shows how this is done.
Here is the flow of the DQN algorithm:

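To make the flow concrete, here is a minimal sketch of the interaction loop in Python (classic gym API, as in the tutorial). The RandomAgent placeholder and the loop constants are my own, invented purely for illustration; only the act → store → learn structure matters, and the method names mirror those used in RL_brain.py:
import gym
import numpy as np

MEMORY_SIZE = 3000
ACTION_SPACE = 11

class RandomAgent:
    """Stand-in with the same interface as the agent class in RL_brain.py."""
    def choose_action(self, s):
        return np.random.randint(ACTION_SPACE)
    def store_transition(self, s, a, r, s_):
        pass
    def learn(self):
        pass

env = gym.make('Pendulum-v0').unwrapped
agent = RandomAgent()
observation = env.reset()
for step in range(5000):
    action = agent.choose_action(observation)                                # epsilon-greedy in the real agent
    f_action = (action - (ACTION_SPACE - 1) / 2) / ((ACTION_SPACE - 1) / 4)  # map index 0..10 to torque -2..2
    observation_, reward, done, info = env.step(np.array([f_action]))
    agent.store_transition(observation, action, reward, observation_)        # experience replay memory
    if step > MEMORY_SIZE:                                                   # learn only once the memory has filled
        agent.learn()
    observation = observation_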
2 Nature DQN
In the original DQN algorithm, the same Q network is used to compute both the target value and the current value (I am not sure whether I misunderstood this part or whether MorvanZhou did not make it clear; in his video DQN already uses two Q networks, but in fact two Q networks were only introduced with Nature DQN). Nature DQN uses two Q networks. The reason is that if the output of the Q network is also used to update that same network's parameters, the dependence is too strong and the algorithm struggles to converge, hence the two networks.
One Q network is used to select actions and has its parameters updated; the other (called Q') is used to compute the target value. Q' is not updated iteratively; instead, it copies the parameters from the Q network every once in a while, so the two networks have identical structure.
The main improvement of Nature DQN over DQN lies in how the final target value is computed:

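Written out (the standard form, for reference), the Nature DQN target for a sampled transition (s, a, r, s_) is

    y = r + γ · max_a' Q'(s_, a')

where Q' is the target network whose parameters are only copied from Q every so often.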
3 Double DQN
The Double DQN algorithm addresses the overestimation problem of DQN and Nature DQN. Overestimation means the estimated value function is higher than the true value function, and the cause is the maximization in Q-learning: yes, that max, which biases the final model. DDQN (Double DQN) solves the overestimation problem by decoupling the selection of the action for the target Q value from the computation of the target Q value.
The next figure shows the algorithmic difference between Double DQN and Nature DQN:

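Written out (again the standard formulation, for reference), the difference is only in how the target is built:

    Nature DQN: y = r + γ · Q'(s_, argmax_a Q'(s_, a))
    Double DQN: y = r + γ · Q'(s_, argmax_a Q(s_, a))

That is, Double DQN selects the action with the current Q network but evaluates it with the target network Q', which removes the upward bias of taking a max over noisy estimates.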
Source code interpretation
The source code interpreted here is MorvanZhou's Double DQN; the three algorithms are so similar that I went straight for the most complicated one.
First, run_Pendulum.py:
env = gym.make('Pendulum-v0')
env = env.unwrapped
The first line simply gets the Pendulum environment from gym; nothing special there. The important one is the second line: the env we get back is wrapped, so its internal parameters cannot be accessed directly, and running the second line gives us the unwrapped model with access to those parameters;
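For instance, attributes like these live on the underlying pendulum model (a quick sketch based on the classic gym implementation; depending on the gym version some of them may only be reachable after unwrapping):
import gym
env = gym.make('Pendulum-v0').unwrapped
print(env.max_torque, env.max_speed, env.dt)   # parameters of the underlying pendulum model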
f_action = (action-(ACTION_SPACE-1)/2)/((ACTION_SPACE-1)/4)
This discretizes the continuous action space. There are 11 discrete actions, spread over the range -2 to 2, i.e. -2, -1.6, -1.2, …, 1.2, 1.6, 2.0, 11 actions in total; later we compute a Q value for each of them;
The main script itself is unremarkable; the core is RL_brain.py:
First comes initialization. One of the arguments is n_features, which is 3 and represents the state, expressed as a vector. In the previous two posts the state was essentially a coordinate; this environment is different, so its state representation is different too. The state vector here has three components: the cosine of theta, the sine of theta, and the derivative of theta, theta_dot. You can confirm this by looking at the environment code, Pendulum.py;
memory_size is the capacity of the replay memory, 3000. The memory is initialized to all zeros and data are later written over it; its size is 3000×8. The 8 stands for s, a, r, s_, where s and s_ are state vectors of length n_features, so everything adds up to 8;
Each Q network has two layers: the weight matrix of the first layer is 3×20 with a 1×20 bias, and the weight matrix of the second layer is 20×11 with a 1×11 bias. You can work the dimensions out for yourself; spelling them out here would be a bit tedious;
The input to the network is the state vector of length n_features; nothing more to explain;
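As a rough shape check in plain numpy (my own sketch, not the tutorial's TensorFlow code; the hidden size 20 and the ReLU are assumptions that match the dimensions above):
import numpy as np

n_features, n_hidden, n_actions = 3, 20, 11
s = np.zeros((1, n_features))                   # one state: [cos(theta), sin(theta), theta_dot]
w1, b1 = np.random.randn(n_features, n_hidden), np.zeros((1, n_hidden))
w2, b2 = np.random.randn(n_hidden, n_actions), np.zeros((1, n_actions))
l1 = np.maximum(0, s @ w1 + b1)                  # first layer + ReLU, shape (1, 20)
q = l1 @ w2 + b2                                 # one Q value per action, shape (1, 11)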
Next, look at the store_transition function:
def store_transition(self, s, a, r, s_):
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0
    transition = np.hstack((s, [a, r], s_))          # flatten into a 1-D array for easy storage
    index = self.memory_counter % self.memory_size   # the clever part: overwrite old entries
    self.memory[index, :] = transition
    self.memory_counter += 1
The subtlety of this function is the overwrite storage. np.hstack first turns s, a, r, s_ into a flat array of 8 numbers, which is then stored into the memory; as mentioned above, the memory dimensions match. The index computation at the end takes a remainder: once self.memory_counter exceeds 3000, writing starts again from the beginning of the memory and overwrites the existing data; if it has not exceeded 3000 yet, the memory defined at the start is all zeros and is simply overwritten too. This code is easy to follow and quite elegant (I certainly would not have come up with this step myself);
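The wrap-around index is easy to see with a few counter values (numbers chosen only for illustration):
memory_size = 3000
for memory_counter in (0, 1, 2999, 3000, 3001, 6005):
    print(memory_counter, memory_counter % memory_size)   # index goes 0, 1, 2999, 0, 1, 5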
def choose_action(self, observation):
    observation = observation[np.newaxis, :]     # add a batch dimension
    actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})  # network output has shape (1, 11)
    action = np.argmax(actions_value)            # index of the largest Q value
    if not hasattr(self, 'q'):                   # record the action values for plotting later
        self.q = []
        self.running_q = 0
    self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value)  # smooths the plotted curve; strictly it should be the line below
    # self.running_q = np.max(actions_value)
    self.q.append(self.running_q)
    if np.random.uniform() > self.epsilon:       # choose the action
        action = np.random.randint(0, self.n_actions)  # pick one of the 11 actions at random; otherwise keep the action with the largest Q-eval output
    return action
Action selection here is random at the beginning, which is characteristic of reinforcement learning; after a certain number of steps it starts making informed choices. That is the final if statement: self.epsilon starts at 0 and slowly increases, so later on the condition rarely holds and actions are rarely chosen at random. As for the earlier self.running_q line, its only purpose is to make the curve plotted later smoother, nothing more; compare the following two figures:
self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value):
self.running_q = np.max(actions_value):

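A tiny numerical illustration of why the 0.99/0.01 moving average looks smoother (made-up data, not values from the actual run):
import numpy as np

np.random.seed(0)
raw = np.random.randn(1000)                    # noisy per-step max Q values (made up)
running_q, smooth = 0.0, []
for q in raw:
    running_q = running_q * 0.99 + 0.01 * q
    smooth.append(running_q)
print(round(float(np.std(raw)), 3), round(float(np.std(smooth)), 3))  # the smoothed series fluctuates far less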
Then the learn function:
if self.learn_step_counter % self.replace_target_iter == 0:   # starting from step 3000, update once every 200 steps
    self.sess.run(self.replace_target_op)
    print('\ntarget_params_replaced\n')   # confirms that the target_net parameters have been replaced, once every 200 steps
So the Q' network parameters are not updated during the first 3000 steps; updating starts at step 3000 and then happens once every 200 steps;
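Conceptually the copy is just an assignment from the eval network to the target network every replace_target_iter learning steps; a minimal numpy sketch of that schedule (my own illustration, not the TensorFlow assign ops in the tutorial):
import numpy as np

eval_params = {'w1': np.random.randn(3, 20), 'w2': np.random.randn(20, 11)}
target_params = {k: np.zeros_like(v) for k, v in eval_params.items()}

REPLACE_TARGET_ITER = 200
for learn_step_counter in range(1000):
    if learn_step_counter % REPLACE_TARGET_ITER == 0:
        target_params = {k: v.copy() for k, v in eval_params.items()}   # hard update of Q'
    # ... gradient steps on eval_params would go here ...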
if self.memory_counter > self.memory_size:
    sample_index = np.random.choice(self.memory_size, size=self.batch_size)   # randomly pick batch_size indices
else:
    sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
batch_memory = self.memory[sample_index, :]   # read the rows at those indices
A random batch is drawn from the memory and used as the subsequent network input; sampling at random breaks the correlation between data points, as mentioned in the theory section above;
The code that follows is the Double DQN target computation itself; I will not analyze it line by line, but the key idea is sketched below.
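As a minimal numpy sketch of that decoupling (my own simplification with made-up batch data, not the tutorial's exact code): the eval network picks the best action at s_, and the target network Q' supplies the value of that action:
import numpy as np

batch_size, n_actions, gamma = 32, 11, 0.9
reward = np.random.randn(batch_size)                  # r from the sampled batch (made-up values)
q_next = np.random.randn(batch_size, n_actions)       # Q'(s_, ·) from the target network
q_eval4next = np.random.randn(batch_size, n_actions)  # Q(s_, ·) from the eval network

batch_index = np.arange(batch_size)
max_act4next = np.argmax(q_eval4next, axis=1)          # Double DQN: select the action with the eval network
selected_q_next = q_next[batch_index, max_act4next]    # ...but evaluate it with the target network Q'
# Nature DQN would instead use: selected_q_next = np.max(q_next, axis=1)
q_target = reward + gamma * selected_q_next
print(q_target.shape)                                  # (32,)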