Practical Tips for Applying Deep Reinforcement Learning
2022-07-29 10:54:00 【CHH3213】
Reference material
- Chapter 18 of the book Deep Reinforcement Learning

This blog post is a summary of Chapter 18 of Deep Reinforcement Learning.
1. How to apply deep reinforcement learning
In reinforcement learning, the agent must learn from reward signals rather than labels while interacting with the environment in a dynamic process, which makes it different from supervised learning.
The reward function in reinforcement learning may contain only incomplete or partial information, and an agent that learns with bootstrapping is often chasing a moving target.
In addition, deep reinforcement learning often uses more than one deep neural network, especially in the more advanced or recently proposed methods. This can make deep reinforcement learning algorithms unstable and sensitive to hyperparameters.
Reinforcement learning is suited to sequential decision-making problems, which can usually be described or approximated by Markov processes. A prediction task with labeled data usually does not need a reinforcement learning algorithm; a supervised learning method is often more direct and effective.
Reinforcement learning tasks usually include at least two key elements:
- the environment, which provides the dynamics and the reward signal;
- the agent, which is controlled by a policy obtained through reinforcement learning training.
Applying a deep reinforcement learning algorithm typically goes through the following stages.
1.1 Simple test phase
Start from a model whose correctness and accuracy you have high confidence in, including the reinforcement learning algorithm itself.
If the task is new, use this model to explore the environment (even with a random policy) or to gradually verify the extensions you plan to make to the final model, instead of starting directly with a complex model.
Experiment quickly to detect possible problems in the basic setup of the environment and the model, or at least to familiarize yourself with the task to be solved. This will provide inspiration later in the process and sometimes exposes edge cases that need to be considered.
1.2 Quick configuration phase
- Test the model configuration quickly to assess its chances of success. If something goes wrong, visualize the learning process as much as possible, and use summary statistics (variance, mean, mean absolute deviation, minimum, etc.) when the underlying relationships cannot be read directly from the raw numbers.
1.3 Deployment and training phase
- After carefully confirming the correctness of the model, you can start large-scale deployment and training.
- Because deep reinforcement learning often needs a large number of samples and long training times, it is recommended to use parallel training, cloud servers (if you do not have your own), and similar means to speed up the large-scale training of the final model.
2. Implementation phase
Implement some basic reinforcement learning algorithms from scratch.
Implement the details of papers properly. When implementing these methods, do not overfit to the details of the paper; instead, understand why the authors chose those techniques in those specific situations.
If you are solving a specific task, explore the environment first.
Check the details of the environment, including the nature of the observations and actions, such as their dimensions, ranges, and whether values are continuous or discrete. If the environment observations take values in a large or unknown range, they should be normalized. For example, if you use Tanh or Sigmoid as an activation function, large input values may saturate the nodes of the first hidden layer, leading to small gradients and slow learning once training starts.
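As an illustration (a minimal sketch, not taken from the book), one common way to normalize observations whose range is large or unknown is to keep running statistics and standardize each incoming observation:

```python
import numpy as np

class RunningObsNormalizer:
    """Online observation normalizer.

    Keeps running estimates of the per-dimension mean and variance and
    rescales incoming observations to roughly zero mean and unit variance,
    so large raw values do not saturate Tanh/Sigmoid units in the first
    hidden layer.
    """

    def __init__(self, obs_dim, eps=1e-8):
        self.mean = np.zeros(obs_dim)
        self.var = np.ones(obs_dim)
        self.count = eps
        self.eps = eps

    def update(self, obs):
        # Incremental (Welford-style) update of mean and variance.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count

    def normalize(self, obs, clip=5.0):
        z = (obs - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(z, -clip, clip)
```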
Select an appropriate output activation function for each network.
An appropriate output activation function should be chosen for the action network according to the environment. For example, common activations like ReLU may work well for hidden layers in terms of computation time and convergence, but they may not be appropriate when the action output range includes negative values. It is best to match the range of the policy output with the action range of the environment; for example, for an action range of (−1, 1), use a Tanh activation function in the output layer.
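As a concrete (hypothetical) sketch, the policy below squashes its output with Tanh and then rescales it to the environment's action bounds; the layer sizes and bounds are placeholders, not a specific recipe from the book:

```python
import torch
import torch.nn as nn

class TanhPolicy(nn.Module):
    """Deterministic policy head whose output matches the action range."""

    def __init__(self, obs_dim, act_dim, action_low, action_high, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),   # ReLU is fine for hidden layers
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # output squashed into (-1, 1)
        )
        self.register_buffer("low", torch.as_tensor(action_low, dtype=torch.float32))
        self.register_buffer("high", torch.as_tensor(action_high, dtype=torch.float32))

    def forward(self, obs):
        squashed = self.net(obs)
        # Rescale (-1, 1) to [low, high] so actions are always valid.
        return self.low + (squashed + 1.0) * 0.5 * (self.high - self.low)
```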
Start with simple examples and gradually increase the complexity.
Start with a dense reward function.
The design of the reward function affects the convexity of the optimization problem during learning, so start with a smooth, dense reward function. A tiny example follows.
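For instance, in a hypothetical goal-reaching task (an illustrative sketch, not from the book), a dense reward based on distance gives the agent a signal at every step, while the sparse version only rewards actually reaching the goal:

```python
import numpy as np

def sparse_reward(pos, goal, tol=0.05):
    # Signal only when the goal is reached -- hard to learn from early on.
    return 1.0 if np.linalg.norm(pos - goal) < tol else 0.0

def dense_reward(pos, goal):
    # Smooth negative distance -- improves a little with every step toward
    # the goal, which usually makes early training much easier.
    return -float(np.linalg.norm(pos - goal))
```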
Choose the right network structure.
- For deep reinforcement learning, the neural networks are usually not very deep; networks with more than 5 layers are not particularly common in reinforcement learning applications. This is due to the computational complexity of the reinforcement learning algorithms themselves.
- In supervised learning, a network that is large relative to the data can overfit the dataset; in deep reinforcement learning, such a network may instead converge slowly or even diverge, because of the strong interplay between exploration and exploitation.
- The choice of network size is often based on the state space and action space of the environment. A discrete environment with only dozens of state-action combinations may be handled with a tabular method, or with a one- or two-layer neural network.
- As for network architectures, multilayer perceptrons (Multi-Layer Perceptrons, MLPs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are all common in the literature.
- A low-dimensional vector input can be processed by a multilayer perceptron. Vision-based policies usually require a convolutional neural network backbone to extract features first, trained either jointly with the reinforcement learning algorithm or pre-trained with other computer vision methods. In other situations, for example when a low-dimensional vector input is used together with a high-dimensional image input, the usual practice is to extract features from the high-dimensional input with a backbone and then concatenate them with the other low-dimensional inputs (see the sketch after this list). Recurrent neural networks can be used in environments that are not fully observable or are non-Markovian, where the best action depends not only on the current state but also on previous states.
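A minimal sketch of such a combined image-plus-vector encoder (assumed input shapes of 84x84 frames and a small vector; not from the book):

```python
import torch
import torch.nn as nn

class ImageVectorEncoder(nn.Module):
    """CNN backbone for images concatenated with a low-dimensional vector input."""

    def __init__(self, in_channels=1, vec_dim=8, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                      # backbone for 84x84 frames
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                          # infer the flattened CNN output size
            cnn_out = self.cnn(torch.zeros(1, in_channels, 84, 84)).shape[1]
        self.head = nn.Sequential(                     # fuse image features and vector input
            nn.Linear(cnn_out + vec_dim, feat_dim), nn.ReLU(),
        )

    def forward(self, image, vector):
        feat = self.cnn(image)
        return self.head(torch.cat([feat, vector], dim=-1))
```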
Be familiar with the nature of the reinforcement learning algorithm you use.
- For instance, trust-region-based methods such as PPO or TRPO may require a large batch size to guarantee safe policy improvement. With these trust-region methods we usually expect the policy to make steady progress, rather than suddenly dropping at some point on the learning curve. The reason trust-region methods like TRPO need a larger batch size is that they approximate the Fisher information matrix with conjugate gradients, and this approximation is computed from the currently sampled batch. If the batch size is too small or the batch is biased, the approximation can go wrong, and the resulting inaccuracy in the Fisher information matrix (or in the inverse-Hessian-vector product) degrades learning performance. Therefore, in practice, the batch size in TRPO and PPO needs to be increased until the agent shows stable, steadily improving learning performance. TRPO also sometimes does not scale well to large networks such as deep convolutional or recurrent neural networks.
- The DDPG algorithm is usually considered sensitive to hyperparameters, although it has proven effective on many continuous-action tasks. This sensitivity becomes more significant when it is applied to large-scale or real-world tasks. For example, in a simple simulated test environment a thorough hyperparameter search can eventually find an optimal configuration, but time and resource constraints may not allow such a search when learning in the real world, so DDPG may not work as well as TRPO or SAC in that setting. On the other hand, although DDPG was originally designed for tasks with continuous-valued actions, that does not mean it cannot work with discrete-valued actions. If you try to apply it to a task with discrete actions, you need some extra tricks: for example, use a Sigmoid(tx) output activation with a large t and clip it to a binary output, while making sure the truncation error stays small, or directly use the Gumbel-Softmax trick to turn the deterministic output into a categorical distribution (see the sketch below). Other algorithms can be treated similarly.
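A minimal sketch of the Gumbel-Softmax option, using PyTorch's built-in F.gumbel_softmax (the surrounding actor/critic code is assumed and not shown):

```python
import torch
import torch.nn.functional as F

def discrete_action_from_logits(logits, tau=1.0):
    """Turn actor logits into a discrete action while staying differentiable."""
    # hard=True returns a one-hot sample in the forward pass while the
    # backward pass uses the soft sample (straight-through estimator), so
    # the critic can still backpropagate into the actor.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    action_index = one_hot.argmax(dim=-1)   # integer action for the environment
    return one_hot, action_index

# Example: logits produced by an actor network for 4 discrete actions.
logits = torch.randn(1, 4, requires_grad=True)
one_hot, action = discrete_action_from_logits(logits)
```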
Normalize values appropriately.
- Normalize the reward values by scaling rather than shifting the mean, and standardize the prediction targets of the value function in the same way.
- The scaling of the rewards is computed from the batch of samples drawn during training. The reason to scale only (i.e., divide by the standard deviation) without a mean shift (subtracting the statistical mean to obtain zero mean) is that a mean shift can change the agent's willingness to survive. This is related to the sign of the overall reward, and the concern only applies when you use a "Done" signal. In fact, if a "Done" signal is not used to terminate episodes early, the mean shift can be applied.
- Consider the following case: an agent experiences an episode in which the "Done=True" signal occurs before the maximum episode length, and the agent is regarded as still alive after it; the rewards after this "Done" signal are then effectively 0. If these 0 rewards are generally higher than the earlier rewards (i.e., the earlier rewards were mostly negative), the agent will tend to end the episode as early as possible in order to maximize the return over the whole episode. Conversely, if the earlier rewards were mostly positive, the agent will choose to "stay alive" longer. Applying a mean shift to the rewards breaks this survival incentive: the agent may no longer choose to survive longer even when the rewards are mostly positive, which hurts training performance.
- The motivation for normalizing the value-function targets is similar. For example, in some DQN-based algorithms the average Q value grows unexpectedly during learning, which is caused by the maximization in the update formula overestimating the Q values. Normalizing the target Q values can alleviate this problem, as can other techniques such as Double Q-Learning. A sketch of reward scaling without a mean shift is given below.
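A minimal sketch of batch-wise reward scaling without a mean shift (assumed batch statistics; not the book's exact implementation):

```python
import numpy as np

def scale_rewards(rewards, eps=1e-8):
    """Scale rewards by their standard deviation without shifting the mean.

    Dividing by the standard deviation keeps value targets in a reasonable
    range, while skipping the mean shift preserves the sign of the rewards
    and therefore the agent's incentive to survive (or to terminate early).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return rewards / (rewards.std() + eps)
```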
Notice the divergence between the reward function and the final goal.
- Reinforcement learning is often applied to a specific task with an ultimate goal, and it is usually necessary to design a reward function consistent with that goal to facilitate the agent's learning.
- In this sense, the reward function is a quantified form of the goal, which also means the two can be different things, and in some cases they diverge. A reinforcement learning agent can overfit to the reward function you designed for the task, and you may find that the final trained policy differs from what you expected in terms of achieving the ultimate goal. One of the most likely causes is exactly this divergence between the reward function and the ultimate goal.
- In most cases it is easy to make the reward function roughly align with the task goal, but designing a reward function that stays consistent with the ultimate goal in every extreme case is not trivial. What you should do is minimize such differences, so that the reward function you design smoothly leads the agent to the true final goal.
Non-Markovian cases.
- The difference between a non-Markovian decision process and a partially observable Markov decision process (POMDP) is sometimes subtle. For example, if the state in Atari Pong is defined to contain both the position and the velocity of the ball (assuming the ball moves without acceleration) but the observation contains only the position, then the environment is a POMDP rather than a non-Markovian process. However, the state of the Pong game is usually taken to be the static frame at each time step; the current state then contains only the ball's position, without all the information the agent needs to choose the optimal action, since the speed and direction of the ball also matter. In that case the environment is non-Markovian. One way to provide speed and direction information is to use a history of states, which departs from the usual treatment under the Markov assumption.
- In the original DQN paper, stacked frames are used to approximate an MDP and solve the Pong task (see the sketch after this list). If we treat all stacked frames as a single state, and assume that the stacked frames contain all the information needed to choose the optimal action, then the task actually still satisfies the Markov assumption.
- Recurrent neural networks, or more advanced variants such as LSTMs, can also be used when decisions must be based on historical memory, to handle non-Markovian problems.
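A minimal sketch of frame stacking (a simplified stand-in for the usual environment wrapper, not the original DQN code):

```python
from collections import deque

import numpy as np

class FrameStack:
    """Stack the last k frames so the combined observation is (closer to) Markovian."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame at episode start.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self._stacked()

    def step(self, frame):
        self.frames.append(frame)
        return self._stacked()

    def _stacked(self):
        # Channel-first stack: shape (k, H, W), ready for a CNN input.
        return np.stack(self.frames, axis=0)
```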
3. Training and debugging stage
Initialization is important.
- Deep reinforcement learning methods usually either update the policy with samples from each episode (on-policy methods) or learn from a replay buffer (off-policy methods) that contains diverse samples collected over time.
- This makes deep reinforcement learning different from supervised learning, which learns from a fixed dataset where the order of the samples is not particularly important. In reinforcement learning, however, the initialization of the policy affects the range of possible subsequent exploration, and thus determines the samples stored in the buffer or used directly for updates, so it affects the overall learning performance.
- Starting from a random policy gives a higher probability of collecting diverse samples, which is helpful at the beginning of training. As the policy converges and improves, the range of exploration gradually narrows toward trajectories generated by the current policy. For the initialization of the weights, it is generally better to use methods such as Xavier initialization or orthogonal initialization; they help avoid vanishing or exploding gradients and give relatively stable learning in most deep learning settings (a short sketch follows).
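A minimal sketch of orthogonal initialization for a small policy network (the gains and layer sizes are common choices in policy-gradient code, not the book's specific recipe):

```python
import torch.nn as nn

def init_weights_orthogonal(module, gain=1.0):
    """Orthogonally initialize linear layers and zero their biases."""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)

policy = nn.Sequential(
    nn.Linear(8, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 2),
)
policy.apply(init_weights_orthogonal)               # initialize every linear layer
nn.init.orthogonal_(policy[-1].weight, gain=0.01)   # smaller gain for the output layer
```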
Use multiple random seeds and average the results to reduce randomness.
- A fixed random seed can be used to reproduce a learning run. Running with several random seeds and averaging the learning curves reduces the chance that the inherent randomness of deep reinforcement learning leads to wrong conclusions in experimental comparisons. In general, the more random seeds you use, the more reliable the results, but the longer the experiments take. In our experience, 3 to 5 different seeds per experiment give reasonably reliable results, though more is better (a sketch follows).
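A minimal sketch of running a hypothetical train_fn under several seeds and averaging the resulting curves:

```python
import random

import numpy as np
import torch

def set_global_seed(seed):
    """Fix the seeds of the common random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def run_with_seeds(train_fn, seeds=(0, 1, 2, 3, 4)):
    """Run train_fn(seed) once per seed and average the learning curves.

    train_fn is assumed to return a 1-D array of per-episode returns of
    equal length; the mean and standard deviation across seeds are what
    should be plotted and reported.
    """
    curves = []
    for seed in seeds:
        set_global_seed(seed)
        curves.append(np.asarray(train_fn(seed)))
    curves = np.stack(curves)                 # shape: (num_seeds, num_episodes)
    return curves.mean(axis=0), curves.std(axis=0)
```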
Balance CPU and GPU resources to speed up training.
- This tip is really about finding and fixing the bottleneck in training speed.
- Making good use of the computing resources of a limited machine is more complicated in reinforcement learning than in supervised learning.
- In supervised learning, the CPU is typically used for reading, writing and preprocessing data, while the GPU handles forward inference and backpropagation.
- In reinforcement learning, however, the inference process always involves interaction with the environment, so the device that computes gradients needs to keep pace with the device that handles environment interaction; otherwise capacity for either exploration or exploitation is wasted.
- In reinforcement learning, the CPU is usually used for the sampling process of interacting with the environment, which may involve heavy computation for complex simulators, while the GPU is used for forward inference and backpropagation to update the networks. During large-scale deployment and training, CPU and GPU utilization should be checked to avoid threads or processes sitting idle.
- If the GPU side is over-used, more sampling threads or processes can be added to interact with the environment.
- If the CPU side is over-used, you can reduce the number of distributed sampling threads, increase the number of distributed update threads, increase the number of update iterations in the algorithm, increase the batch size for off-policy updates, and so on. A simple way to locate the bottleneck is sketched below.
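A trivial sketch (with hypothetical collect/update functions) that times the two phases of a training iteration to see which side is the bottleneck:

```python
import time

def profile_iteration(collect_batch, update_networks):
    """Time one sampling phase and one update phase of a training iteration.

    If sampling dominates, add CPU sampling workers; if updating dominates,
    add update workers, raise the batch size, or run more update iterations
    per collected batch, as described above.
    """
    t0 = time.perf_counter()
    batch = collect_batch()          # CPU-bound: environment interaction
    t1 = time.perf_counter()
    update_networks(batch)           # GPU-bound: forward/backward passes
    t2 = time.perf_counter()
    print(f"sampling: {t1 - t0:.3f}s  updating: {t2 - t1:.3f}s")
```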
Visualization.
- If the underlying relationships cannot be seen directly from the numbers, visualize as much as possible. For example, because of the unstable nature of the reinforcement learning process, the reward curve can be very noisy; in that case it may be necessary to plot a moving-average curve of the reward to see whether the agent is actually making progress during training.
Smooth the learning curve.
- The reinforcement learning process can be very unstable, and drawing conclusions directly from an unprocessed learning curve is often unreliable. We usually smooth the learning curve with a moving average, a convolution kernel, or a similar filter, choosing an appropriate window length (a sketch follows).
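A minimal sketch of moving-average smoothing for a learning curve:

```python
import numpy as np

def moving_average(curve, window=20):
    """Smooth a learning curve with a simple moving average.

    The window length trades responsiveness for smoothness; 'valid' mode
    avoids the edge artifacts that a 'same'-mode convolution would add at
    the ends of the curve.
    """
    curve = np.asarray(curve, dtype=np.float64)
    kernel = np.ones(window) / window
    return np.convolve(curve, kernel, mode="valid")
```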
Question your algorithm implementation.
- It is very common for the code not to work right after it has been implemented; when that happens, debugging the code patiently is important.
- The correctness of the implementation always comes before fine-tuning toward a good result, so consider tuning the hyperparameters only once the implementation is known to be correct.