Reinforcement Learning Basics: Study Notes
2022-07-06 13:52:00 【I like the strengthened Xiaobai in Curie】
DQN reinforcement learning notes
The DQN (Deep Q-Network) algorithm combines Q-learning with a deep neural network and is used to handle problems whose state space is too high-dimensional for a lookup table.
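For reference (standard textbook form, my own addition rather than something quoted from this post), the tabular Q-learning update that DQN builds on is

$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]$$

DQN replaces the table entry $Q(s,a)$ with a neural network $Q(s,a;\theta)$ and trains it toward the same bootstrapped target, which is what makes high-dimensional states tractable.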
1. Introduction to the environment
The environment used here is gym's 'CartPole-v0'. A brief introduction follows; see the link for details.
Link: Introduction to the OpenAI Gym classic control environment CartPole (inverted pendulum)
(1) Rules of the game: the environment contains a cart with a pole standing on it, and the initial state differs slightly after each reset. The episode continues only as long as:
<1> the tilt angle θ of the pole does not exceed 12°;
<2> the cart position x stays within a fixed range (2.4 length units from the centre to either side).
(2) State space: the state space is continuous, with 4 state variables (cart position, cart velocity, pole angle, pole angular velocity).
(3) Action space: the action space is discrete; action 0 pushes the cart to the left and action 1 pushes it to the right.
(4) Reward: in gym's CartPole environment (env), every left/right action returns a reward of +1 from env.
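To make the environment description concrete, here is a minimal random-policy rollout. It assumes the older gym API that the code later in this post also uses (env.reset() returns the state directly and env.step() returns four values); newer gym/gymnasium versions differ.

import gym

env = gym.make('CartPole-v0')
s = env.reset()                    # 4-dimensional state (old gym API)
done, total = False, 0.0
while not done:
    a = env.action_space.sample()  # random action: 0 = push left, 1 = push right
    s, r, done, info = env.step(a) # each step returns reward +1 until the episode terminates
    total += r
print('random-policy episode return:', total)
env.close()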
2. A brief introduction to the algorithm
- Value-based
- Off-Policy
- Key features of DQN:
<1> Use of a neural network:
When the state/action space is large or continuous, the state-action values Q(s, a) can no longer be stored in a Q table. Instead, a neural network is used to fit the mapping from state to value: the input is the state, and the output is the Q value of every action.
DQN is generally used for discrete action spaces: in a continuous action space the actions cannot be enumerated one by one to look up their Q values. Solving continuous-action problems requires an actor-critic (AC) framework instead.
<2> Use of experience replay:
Experience replay is a technique that keeps the distribution of training experience stable and thus improves training stability. It has two key steps, storage and replay:
Storage: at every step the agent stores a (s, a, r, s_, done) record, also called a transition, into the experience pool (replay buffer).
Replay: once the number of stored transitions exceeds a set threshold, BATCH_SIZE transitions are sampled from the pool with equal probability for training. This breaks the correlation between consecutive samples, and reusing experience also improves data efficiency. In practice, transitions can instead be replayed with priorities weighted by their importance (prioritized experience replay).
<3> Use of a target network:
Link: Target network
In short, DQN uses two networks: a behaviour (evaluation) network and a target network. They share the same structure, and the target network's parameters simply lag behind: every fixed number of steps they are refreshed from the evaluation network. The refresh can be a hard update (copy the parameters directly) or a soft update (blend them with a weight). This lagged update stabilizes Q-learning with a neural network.
The target network is never back-propagated through; it only serves as a reference. The behaviour network is trained with back-propagation, and the closer its output gets to the target, the more accurate the Q-value estimate becomes (the exact loss is written out below).
- Pseudo code
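The pseudo-code figure from the original post is not reproduced here. As a compact stand-in (standard DQN notation, my own addition), the update implemented by the code below uses evaluation-network parameters $\theta$ and lagged target-network parameters $\theta^-$:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-), \qquad L(\theta) = \big( y - Q(s, a; \theta) \big)^2$$

The implementation below keeps this form for every sampled transition and, for simplicity, does not switch to $y = r$ at terminal states.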
- Implementation
Link: Mofan (Morvan Zhou) Python
The implementation follows Mofan's tutorial; the parts I did not fully understand are marked in the comments.
import matplotlib.pyplot as plt
import torch                              # PyTorch
import torch.nn as nn                     # neural-network layers
import torch.nn.functional as F           # functional interface (activations, etc.)
import numpy as np
import gym

# Hyperparameters
BATCH_SIZE = 32                           # mini-batch size
LR = 0.01                                 # learning rate
EPSILON = 0.9                             # epsilon-greedy: probability of taking the greedy action
GAMMA = 0.9                               # reward discount factor
TARGET_REPLACE_ITER = 100                 # target-network update frequency (in learning steps)
MEMORY_CAPACITY = 2000                    # replay-buffer capacity

env = gym.make('CartPole-v0').unwrapped   # CartPole from the gym library, unwrapped to remove the step limit
N_ACTIONS = env.action_space.n            # number of actions (2)
N_STATES = env.observation_space.shape[0] # dimension of the state (4)

"""
torch.nn is the module interface designed for neural networks. nn is built on top of Autograd and is
used to define and run networks. nn.Module is a key class in nn: it holds the layer definitions and the
forward method. To define a network, inherit from nn.Module and implement forward(); layers with
learnable parameters usually go into the constructor __init__(). As long as forward is defined in the
subclass, backward is provided automatically (via Autograd).
"""
# Define the Net class (the Q network)
class Net(nn.Module):
    def __init__(self):
        # a subclass of nn.Module must call the parent constructor first
        super(Net, self).__init__()              # equivalent to nn.Module.__init__(self)
        self.fc1 = nn.Linear(N_STATES, 50)       # first fully connected layer (input -> hidden): state dim -> 50 units
        self.fc1.weight.data.normal_(0, 0.1)     # weight init: normal distribution with mean 0 and std 0.1
        self.out = nn.Linear(50, N_ACTIONS)      # second fully connected layer (hidden -> output): 50 units -> number of actions
        self.out.weight.data.normal_(0, 0.1)     # weight init: normal distribution with mean 0 and std 0.1

    def forward(self, x):                        # x is the state
        x = F.relu(self.fc1(x))                  # input layer -> hidden layer, then ReLU
        actions_value = self.out(x)              # hidden layer -> output layer: one Q value per action
        return actions_value
# Define the DQN class (holds the two networks)
class DQN(object):
    def __init__(self):
        self.eval_net, self.target_net = Net(), Net()   # build the evaluation network and the target network
        self.learn_step_counter = 0                     # counts learning steps, used for target updating
        self.memory_counter = 0                         # counts stored transitions
        self.memory = np.zeros((MEMORY_CAPACITY, N_STATES * 2 + 2))  # replay buffer; one row = one transition (s, a, r, s_)
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=LR)  # Adam on the evaluation network's parameters
        self.loss_func = nn.MSELoss()                   # mean-squared-error loss: loss(x, y) = (x - y)^2

    def choose_action(self, x):                         # x is the current state
        x = torch.unsqueeze(torch.FloatTensor(x), 0)    # to a float32 tensor with a batch dimension added at dim=0
        if np.random.uniform() < EPSILON:               # with probability EPSILON take the greedy action
            actions_value = self.eval_net.forward(x)    # forward pass: Q values of all actions for state x
            # torch.max(a, 1) returns (values, indices) over each row; [1] selects the indices
            action = torch.max(actions_value, 1)[1].data.numpy()  # index of the largest Q value, as a numpy array
            action = action[0]                          # take the scalar out of the 1-element array
        else:                                           # otherwise explore
            action = np.random.randint(0, N_ACTIONS)    # random 0 or 1 (N_ACTIONS = 2)
        return action                                   # the chosen action (0 or 1)

    def store_transition(self, s, a, r, s_):            # store one transition
        transition = np.hstack((s, [a, r], s_))         # concatenate horizontally into a single row
        # if the buffer is full, overwrite the oldest data
        index = self.memory_counter % MEMORY_CAPACITY   # row index to write this transition to
        self.memory[index, :] = transition
        self.memory_counter += 1
    def learn(self):                                    # called once the replay buffer is full
        # target-network parameter update
        if self.learn_step_counter % TARGET_REPLACE_ITER == 0:           # triggered at the start, then every 100 learning steps
            self.target_net.load_state_dict(self.eval_net.state_dict())  # copy the evaluation network's parameters into the target network
        self.learn_step_counter += 1

        # sample a mini-batch from the replay buffer
        sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)     # 32 indices in [0, 2000), possibly repeated; shape (32,)
        b_memory = self.memory[sample_index, :]                          # the 32 sampled transitions; shape (32, 10)
        b_s = torch.FloatTensor(b_memory[:, :N_STATES])                  # states, float32; shape (32, 4)
        b_a = torch.LongTensor(b_memory[:, N_STATES:N_STATES + 1].astype(int))
        # actions as int64 (LongTensor) so they can be used as indices by torch.gather below; shape (32, 1)
        b_r = torch.FloatTensor(b_memory[:, N_STATES + 1:N_STATES + 2])  # rewards, float32; shape (32, 1)
        b_s_ = torch.FloatTensor(b_memory[:, -N_STATES:])                # next states, float32; shape (32, 4)

        # compute the evaluation and target values for the 32 transitions and update the evaluation network
        q_eval1 = self.eval_net(b_s)                    # Q values of all actions for each sampled state
        q_eval = q_eval1.gather(1, b_a)                 # pick, per row, the Q value of the action that was actually taken
        q_next = self.target_net(b_s_).detach()         # target-network Q values for the next states; detach() so no gradient flows back
        q = q_next.max(1)[0]                            # maximum over actions per row: a 1-D tensor of length 32 (values only, no indices)
        q_target = b_r + GAMMA * q.view(BATCH_SIZE, 1)  # TD target r + gamma * max_a' Q_target(s', a'), reshaped to (32, 1)
        loss = self.loss_func(q_eval, q_target)         # MSE between the 32 evaluation values and the 32 target values
        self.optimizer.zero_grad()                      # clear gradients left over from the previous step
        loss.backward()                                 # back-propagate the error
        self.optimizer.step()                           # update all parameters of the evaluation network
dqn = DQN()                                             # instantiate the DQN agent
rewards = []

for i in range(400):                                    # loop over 400 episodes
    print('<<<<<<<<<Episode: %s' % i)
    s = env.reset()                                     # reset the environment
    episode_reward_sum = 0                              # total (shaped) reward of this episode

    while True:                                         # one iteration per step
        env.render()                                    # render the animation
        a = dqn.choose_action(s)                        # choose an action for the current state s
        s_, r, done, info = env.step(a)                 # execute the action and get the feedback

        # reward shaping (optional; it only makes the pole balance faster during training)
        x, x_dot, theta, theta_dot = s_
        r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.8
        r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
        new_r = r1 + r2

        dqn.store_transition(s, a, new_r, s_)           # store the transition
        episode_reward_sum += new_r                     # accumulate the shaped reward of each step in this episode
        s = s_                                          # move to the next state

        if dqn.memory_counter > MEMORY_CAPACITY:        # once more than 2000 transitions have been stored
            # start learning: sample 32 transitions, update the evaluation network,
            # and copy its parameters to the target network every 100 learning steps
            dqn.learn()

        if done:                                        # episode finished
            # round() keeps 2 digits after the decimal point
            print('episode%s---reward_sum: %s' % (i, round(episode_reward_sum, 2)))
            rewards.append(episode_reward_sum)
            break
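The script imports matplotlib and collects the per-episode returns in rewards, but never plots them. A minimal sketch to visualize the learning curve (my own addition; only rewards, plt and env come from the code above):

# plot the shaped per-episode return collected in `rewards` -- addition, not part of the original post
plt.plot(rewards)
plt.xlabel('episode')
plt.ylabel('episode reward sum (shaped)')
plt.title('DQN on CartPole-v0')
plt.show()
env.close()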