Paper Notes [RL - Experience Replay]: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
2022-06-09 09:20:00 【Cloud fff】
- Title: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
- Paper link: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
- Published at: NeurIPS 2020
- Field: Reinforcement learning - Replay Buffer
- Abstract: Prioritized experience replay (PER) is a deep reinforcement learning technique in which the agent learns from transitions sampled non-uniformly, with sampling probability proportional to their TD error. We show that any loss function evaluated with non-uniformly sampled transitions can be transformed into another, uniformly sampled loss function with the same expected gradient. Surprisingly, we find that in some environments PER can be replaced entirely by this new loss function without affecting empirical performance. Furthermore, this relationship suggests a new branch of improvements to PER, obtained by modifying its equivalent uniform-sampling loss function. We demonstrate our proposed modifications to PER and the equivalent loss function in several MuJoCo and Atari environments
Table of contents
- 1. PER background
- 2. Methods of this paper
- 2.1 Theoretical analysis
- 2.1.1 Problem formulation
- 2.1.2 Theorem 1 (condition for an equal-expected-gradient transformation)
- 2.1.3 Theorem 2 (converting to an equivalent prioritized replay scheme reduces gradient variance)
- 2.1.4 Theorem 3 (the uniform-replay loss equivalent to a generalized Huber loss + PER)
- 2.2 Proposed methods
- 3. Experimental results
- 4. Summary
1. PER background
1.1 Experience Replay
- For off-policy reinforcement learning methods, since the target policy and the behavior policy need not be identical, past transitions can be stored and reused, which brings three benefits:
  - It breaks the circular dependence between value estimation and the direction of exploration, making agent training more stable
  - It breaks the correlation between temporally adjacent data, reducing oscillation during optimization
  - It increases data utilization
- To implement experience replay, an experience buffer is usually added to store past transitions; the agent's interaction with the environment is illustrated in the figure below

1.2 Prioritized Experience Replay
- In vanilla experience replay, the transitions used to train the agent are sampled uniformly from the experience buffer. This carries an implicit assumption that the agent learns an equal amount from every transition. In practice some transitions are clearly more informative for the agent than others, much like hard versus easy samples in supervised learning; if those key transitions could be emphasized during learning, sample efficiency and convergence speed might improve substantially
- Prioritized Experience Replay (PER) is a method built on this idea. It treats the TD error as a measure of how "surprising" a transition is to the agent, and therefore as an estimate of how much can be learned from it. The method replays transitions with large TD error with higher probability, uses a heavy-tailed distribution to preserve transition diversity, and applies importance-sampling ratios to remove the bias in the expected gradient direction (a minimal sketch of the sampling rule follows this list)
- For a detailed introduction to PER, see: Paper Notes [RL - Exp Replay] - [PER] Prioritized Experience Replay
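To make the mechanism concrete, here is a minimal sketch (my own, using NumPy; not code from the paper) of PER's priorities, sampling probabilities, and importance-sampling weights:

```python
import numpy as np

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """PER-style priorities pr(i) = (|delta(i)| + eps)^alpha, sampling
    probabilities p(i) = pr(i) / sum_j pr(j), and IS weights
    w(i) = (1 / (N * p(i)))^beta, normalized by their maximum."""
    pr = (np.abs(td_errors) + eps) ** alpha
    p = pr / pr.sum()
    N = len(td_errors)
    w = (1.0 / (N * p)) ** beta
    return p, w / w.max()

td = np.array([0.1, 2.0, 0.5, 3.0])                  # toy TD errors of a tiny buffer
p, w = per_probs_and_weights(td)
batch_idx = np.random.choice(len(td), size=2, p=p)   # prioritized mini-batch
```

The hyperparameters alpha and beta here are the same ones that reappear in the analysis of Theorem 3 below.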
2. Methods of this paper
2.1 Theoretical analysis
- The PER paper is largely heuristic, with no rigorous theoretical derivation. In practice the PER trick is sometimes found to degrade an algorithm's performance, and the lack of theoretical grounding makes it hard to analyze why quantitatively. Starting from PER, the authors carry out a detailed theoretical analysis of non-uniform prioritized experience replay (with PER as the representative example), which is the main contribution of this paper
- The general result of the paper is that the expected gradient (direction) of "a non-uniform prioritized replay strategy using a loss function $\mathcal{L}_1$" equals that of "a uniform replay strategy using another loss function $\mathcal{L}_2$". Using this result, any non-uniform replay method can be converted into a uniform replay method with a specific loss function, and the soundness of the original method can then be analyzed through the new loss function, as illustrated in the figure below

The authors use this approach to analyze the problems of PER and propose an improved method, which is the other contribution of the paper
2.1.1 Problem formulation
- First, define the object of analysis and its mathematical form. The paper analyzes off-policy value-prediction methods based on iterating the Bellman equation (similar to the critic in an actor-critic framework). In this setting we learn a Q-value network to estimate the value of a given policy $\pi$.
Let the network parameters be $\theta$; the value of any state-action pair $(s,a)$ is estimated as
$$\begin{aligned} Q_\theta^\pi(s,a) &= \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\,\Big|\,s_0=s,a_0=a\Big] \\ &= \mathbb{E}_{r,s'\sim p;\,a'\sim \pi}\big[r+\gamma Q_\theta^\pi(s',a')\big] \end{aligned}$$
For a transition $i = (s,a,r,s')$ sampled from the replay buffer $\mathcal{B}$, the TD error is
$$\delta(i) = Q_\theta(i) - y(i) = Q_\theta(i)-\big(r+\gamma Q_{\theta'}(s',a')\big)$$
where the learning target $y(i)= r+\gamma Q_{\theta'}(s',a')$ is computed by a separate target network with parameters $\theta'$, updated every fixed number of steps via $\theta' \leftarrow \theta$. A loss function $\mathcal{L}(\delta(i))$ is defined on the TD error, and the mean loss over a mini-batch of size $M$ is $\frac{1}{M}\sum_i \mathcal{L}(\delta(i))$.
The constant $M$ does not affect the expected loss or its gradient and is unimportant in the analysis.
Now examine the gradient used to update the value network:
$$\nabla_\theta \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \delta} \frac{\partial \delta}{\partial Q} \frac{\partial Q}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial Q} \frac{\partial Q}{\partial \theta} = \nabla_Q\mathcal{L}\cdot\nabla_\theta Q$$
The factor $\nabla_\theta Q$ depends only on the $Q$-network architecture and is not our concern; we only consider the loss-dependent term $\nabla_Q\mathcal{L}$ (note that $\delta(i)$ is a function of $Q(i)$). The paper focuses on three kinds of losses:
- $L_1$ loss: $\mathcal{L}_{L_1}(\delta(i)) = |\delta(i)|$, with gradient $\nabla_Q\mathcal{L}_{L_1}(\delta(i)) = \text{sign}(\delta(i))$
- MSE loss: $\mathcal{L}_{\text{MSE}}(\delta(i)) = 0.5\,\delta(i)^2$, with gradient $\nabla_Q\mathcal{L}_{\text{MSE}}(\delta(i)) = \delta(i)$
- Huber loss:
$$\mathcal{L}_{\text{Huber}}(\delta(i)) = \begin{cases} 0.5\,\delta(i)^2 & \text{if }|\delta(i)|\leq k,\\ k(|\delta(i)|-0.5k) & \text{otherwise} \end{cases}$$
Usually $k=1$ is used. Depending on the value of $|\delta(i)|$, the gradient of the Huber loss equals that of the MSE or $L_1$ loss, so it can be viewed as an $L_1$ loss smoothed near $0$ (a small code sketch of these losses follows below)
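Here is that small sketch (my own, NumPy-based) of the three losses on the TD error $\delta$ and their gradients with respect to $Q$ (equivalently with respect to $\delta$, since $\partial\delta/\partial Q = 1$):

```python
import numpy as np

def l1_loss(delta):   return np.abs(delta)
def l1_grad(delta):   return np.sign(delta)           # +-1 regardless of |delta|

def mse_loss(delta):  return 0.5 * delta ** 2
def mse_grad(delta):  return delta                    # proportional to the error

def huber_loss(delta, k=1.0):
    return np.where(np.abs(delta) <= k,
                    0.5 * delta ** 2,                 # MSE branch near 0
                    k * (np.abs(delta) - 0.5 * k))    # L1 branch for large errors

def huber_grad(delta, k=1.0):
    return np.where(np.abs(delta) <= k, delta, k * np.sign(delta))
```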
2.1.2 Theorem 1 (condition for an equal-expected-gradient transformation)
Here we spell out the transformation idea sketched at the beginning of Section 2.1. Given transition distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ and loss functions $\mathcal{L}_1$ and $\mathcal{L}_2$: for sample $i$, the expected loss gradient obtained by sampling from $\mathcal{D}_1$ and using $\mathcal{L}_1$ can equivalently be computed under the other distribution $\mathcal{D}_2$ by reweighting with the importance-sampling ratio $\frac{p_{\mathcal{D}_1}(i)}{p_{\mathcal{D}_2}(i)}$

Suppose the gradient of $\mathcal{L}_2$ absorbs the reweighting term inside the expectation on the right, i.e.
$$\nabla_Q\mathcal{L}_2(\delta(i)) = \frac{p_{\mathcal{D}_1}(i)}{p_{\mathcal{D}_2}(i)}\nabla_Q\mathcal{L}_1(\delta(i)) \tag{1}$$
An $\mathcal{L}_2$ satisfying this condition guarantees that the expected gradients under the two distributions are equal, i.e. $\mathbb{E}_{\mathcal{D}_1}[\nabla_Q\mathcal{L}_1(\delta(i))] = \mathbb{E}_{\mathcal{D}_2}[\nabla_Q\mathcal{L}_2(\delta(i))]$.
For example, define $\mathcal{D}_1$ as the uniform distribution $\mathcal{U}$ over a finite buffer $\mathcal{B}$ and $\mathcal{D}_2$ as the prioritized sampling distribution $p(i) = \frac{|\delta(i)|}{\sum_{j\in\mathcal{B}}|\delta(j)|}$. Then MSE and $L_1$ are related by
$$\begin{aligned} \mathbb{E}_\mathcal{U}\big[\nabla_Q\mathcal{L}_{\text{MSE}}(\delta(i)) \big] &= \mathbb{E}_{\mathcal{D}_2}\Big[\frac{p_{\mathcal{D}_1}(i)}{p_{\mathcal{D}_2}(i)}\nabla_Q\mathcal{L}_{\text{MSE}}(\delta(i))\Big] \\ &=\mathbb{E}_{\mathcal{D}_2}\Big[ \frac{1}{N}\frac{\sum_j|\delta(j)|}{|\delta(i)|}\,\delta(i) \Big] \\ &= \mathbb{E}_{\mathcal{D}_2}\Big[\frac{\sum_j|\delta(j)|}{N} \frac{\delta(i)}{|\delta(i)|}\Big] \\ &=\mathbb{E}_{\mathcal{D}_2}\Big[\frac{\sum_j|\delta(j)|}{N}\,\text{sign}(\delta(i))\Big] \\ &\propto\mathbb{E}_{\mathcal{D}_2}\big[\text{sign}(\delta(i))\big] \\ &= \mathbb{E}_{\mathcal{D}_2}\big[\nabla_Q\mathcal{L}_{L_1}(\delta(i)) \big] \end{aligned}$$
This means that "the $L_1$ loss replayed with prioritized sampling" and "the MSE loss replayed with uniform sampling" have the same expected gradient direction. Also note the coefficient appearing at the fourth equality, $\frac{\sum_j|\delta(j)|}{N} = \frac{\sum_j pr(j)}{N}$ (where $pr(i)$ denotes the priority of sample $i$); this coefficient is the bridge between the uniform distribution and the priority distribution and will reappear throughout the analysis.

Theorem 1: Given a dataset $\mathcal{B}$ of size $N$, losses $\mathcal{L}_1,\mathcal{L}_2$, and a priority distribution with priorities $pr$ (sample $i$ is selected with probability $\frac{pr(i)}{\sum_j pr(j)}$), if
$$\nabla_Q\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}pr(i)\nabla_Q\mathcal{L}_2(\delta(i))$$
where $\lambda = \frac{\sum_j pr(j)}{N}$, then the expected gradient of $\mathcal{L}_1(\delta(i))$ for samples $i$ drawn uniformly from $\mathcal{B}$ equals the expected gradient of $\mathcal{L}_2(\delta(i))$ for samples $i$ drawn from $\mathcal{B}$ with priorities $pr$
Proof:
$$\begin{aligned} \mathbb{E}_{i\sim\mathcal{B}}[\nabla_Q\mathcal{L}_1(\delta(i))] &= \frac{1}{N}\sum_i\nabla_Q\mathcal{L}_1(\delta(i))\\ &= \frac{1}{N}\sum_i\frac{N}{\sum_jpr(j)}pr(i)\nabla_Q\mathcal{L}_2(\delta(i)) \quad\text{(by the assumed condition)}\\ &=\sum_i\frac{pr(i)}{\sum_jpr(j)}\nabla_Q\mathcal{L}_2(\delta(i)) \\ &=\mathbb{E}_{i\sim pr}[\nabla_Q\mathcal{L}_2(\delta(i))] \end{aligned}$$
Another way to see this is to plug in $\lambda$, which gives $\nabla_Q\mathcal{L}_1(\delta(i)) = \frac{pr(i)}{\sum_jpr(j)}N\,\nabla_Q\mathcal{L}_2(\delta(i)) = \frac{p_{\mathcal{D}_2}(i)}{p_{\mathcal{D}_1}(i)}\nabla_Q\mathcal{L}_2(\delta(i))$, which is exactly equation (1).

Theorem 1 is the main result of the paper. It states that the condition for "the uniformly sampled loss $\mathcal{L}_1$" and "the loss $\mathcal{L}_2$ sampled under a priority scheme $pr$" to have the same expected gradient is $\nabla_Q\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}pr(i)\nabla_Q\mathcal{L}_2(\delta(i))$
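The MSE-vs-$L_1$ example above can be checked numerically; the following sketch (my own) verifies that uniform sampling with the MSE loss and prioritized sampling with $pr(i)=|\delta(i)|$ and the scaled loss $\lambda\mathcal{L}_{L_1}$ give the same expected gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.normal(size=1000)                  # TD errors of a toy buffer
N = len(delta)

grad_uniform_mse = np.mean(delta)              # E_U[grad MSE] = E_U[delta]

pr = np.abs(delta)                             # priority pr(i) = |delta(i)|
p = pr / pr.sum()                              # prioritized sampling distribution
lam = pr.sum() / N                             # lambda = sum_j pr(j) / N
grad_prior_l1 = lam * np.sum(p * np.sign(delta))   # E_pr[grad of lambda * L1]

assert np.isclose(grad_uniform_mse, grad_prior_l1)
```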
2.1.2.1 Corollary 1 (how to construct a uniform-replay loss $\mathcal{L}_1$ with equal expected gradient)
Corollary 1: If $\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_2(\delta(i))$ for all $i$, where $\lambda = \frac{\sum_j pr(j)}{N}$ and $|\cdot|_\times$ is the stop-gradient operator, then Theorem 1 holds for any uniformly sampled loss $\mathcal{L}_1$ and any loss $\mathcal{L}_2$ sampled non-uniformly with priorities $pr$
proof:
$$\begin{aligned} \nabla_Q\mathcal{L}_1(\delta(i)) &= \nabla_Q\frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_2(\delta(i)) \\ &= \frac{1}{\lambda}pr(i)\nabla_Q\mathcal{L}_2(\delta(i)) \end{aligned}$$
This satisfies the condition of Theorem 1, completing the proof.

This corollary tells us how, given a non-uniform replay scheme with priorities $pr$ and its loss $\mathcal{L}_2$, to construct the uniform-replay loss function $\mathcal{L}_1$ with equal expected gradient
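A minimal sketch (assumed PyTorch-style code, not the authors' implementation) of how Corollary 1 can be used in practice: the priority is treated as a constant weight via a stop-gradient (`.detach()`), so the uniform-replay loss matches the prioritized scheme's expected gradient:

```python
import torch

def uniform_equivalent_loss(delta, pr, loss2_fn):
    """L1(delta) = (1 / lambda) * stop_grad(pr) * L2(delta); lambda is
    approximated here by the mean priority of the batch (an assumption; in the
    analysis it is the mean over the whole buffer)."""
    lam = pr.mean().detach()
    return (pr.detach() / lam) * loss2_fn(delta)

delta = torch.randn(64, requires_grad=True)        # toy TD errors
pr = delta.abs().detach() ** 0.4                   # e.g. pr(i) = |delta(i)|^alpha
loss = uniform_equivalent_loss(delta, pr, lambda d: d.abs()).mean()
loss.backward()        # expected gradient matches prioritized L1 replay
```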
2.1.2.2 Corollary 2 (how to construct a priority scheme $pr$ and corresponding loss $\lambda\mathcal{L}_2$ with equal expected gradient)
Corollary 2: If $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$ and $pr(i) = \frac{\nabla_Q \mathcal{L}_1(\delta(i))}{\nabla_Q \mathcal{L}_2(\delta(i))}$ for all $i$, then Theorem 1 holds for any uniformly sampled loss $\mathcal{L}_1$ and the loss $\lambda\mathcal{L}_2$ sampled non-uniformly with priorities $pr$, where $\lambda = \frac{\sum_j pr(j)}{N}$
proof: Given $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$,
$$\begin{aligned} \frac{1}{\lambda}pr(i)\nabla_Q\lambda\mathcal{L}_2(\delta(i)) &= \frac{\lambda}{\lambda}\frac{\nabla_Q\mathcal{L}_1(\delta(i))}{\nabla_Q\mathcal{L}_2(\delta(i))}\nabla_Q\mathcal{L}_2(\delta(i)) \\ &= \nabla_Q\mathcal{L}_1(\delta(i)) \end{aligned}$$
This satisfies the condition of Theorem 1, completing the proof.

Since a sampling priority cannot be negative or zero, we must have $\text{sign}(pr(i)) = \text{sign}\big(\frac{\nabla_Q \mathcal{L}_1(\delta(i))}{\nabla_Q \mathcal{L}_2(\delta(i))}\big)=1$, hence the condition $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$ is required. In practice, any pair of loss functions designed to minimize the distance between the $Q$ output and a given target satisfies this condition
Since both losses pull toward the same target, their optimization directions agree, so their gradient directions (signs) should also match.
In this case non-uniform sampling plays a role similar to importance sampling: it reweights $\mathcal{L}_2$ so that its expected gradient matches that of $\mathcal{L}_1$.
This corollary tells us how, given the loss $\mathcal{L}_1$ used with uniform replay and another loss function $\mathcal{L}_2$, to construct a non-uniform replay priority $pr$ and corresponding loss $\lambda\mathcal{L}_2$ that guarantee equal expected gradients.
For example, when $\mathcal{L}_2$ is the $L_1$ loss, its gradient is $\nabla_Q\mathcal{L}_{L_1}(\delta(i)) = \pm 1$, so setting $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$ converts any uniformly sampled loss $\mathcal{L}_1$ into a prioritized sampling scheme. It turns out this reduces the variance of the gradient (Observation 1, Section 2.1.3.1)
2.1.3 Theorem 2 (converting to an equivalent prioritized replay scheme reduces gradient variance)
2.1.3.1 Observation 1 (the variance decreases)
Observation 1: Given a dataset $\mathcal{B}$ of size $N$ and a loss function $\mathcal{L}_1$, sample $i$ with priority $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$ and let $\lambda = \frac{\sum_j pr(j)}{N}$. Then the variance of the gradient of $\lambda\mathcal{L}_{L_1}(\delta(i))$ under this prioritized sampling is less than or equal to the variance of the gradient of $\mathcal{L}_1(\delta(i))$ under uniform sampling
This observation means that, setting $\mathcal{L}_2 = \mathcal{L}_{L_1}$, for any uniform-replay loss $\mathcal{L}_1$ we can use Corollary 2 to construct a replay priority $pr$ and corresponding prioritized loss $\lambda\mathcal{L}_2$ that reduce the variance of the mini-batch gradient
This observation mainly serves as a stepping stone to Theorem 2, so no separate proof is given (once Theorem 2 below is proved, it can be used directly to prove this observation)
2.1.3.2 Theorem 2 (converting to an equivalent prioritized replay scheme reduces gradient variance)
Theorem 2: Given a dataset $\mathcal{B}$ of size $N$ and loss functions $\mathcal{L}_1,\mathcal{L}_2$, Corollary 2 can be used to design a priority scheme $pr$ and corresponding loss $\lambda \mathcal{L}_2$ (where $\lambda = \frac{\sum_j pr(j)}{N}$) such that Theorem 1 holds. When $\mathcal{L}_2 =\mathcal{L}_{L_1}$ and, following Corollary 2, $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$, the variance of $\nabla_Q\lambda\mathcal{L}_2(\delta(i))$ is the smallest among all loss functions (with their corresponding priority schemes) that share the same expected gradient
Proof:
- Consider the variance of the prioritized-sampling gradient (recall the variance formula $\operatorname{Var}(x) = \mathbb{E}[x^2]-\mathbb{E}[x]^2$)
$$\begin{aligned} \operatorname{Var}\left(\nabla_{Q} \lambda \mathcal{L}_{2}(\delta(i))\right) &=\mathbb{E}_{i \sim p r}\left[\left(\nabla_{Q} \lambda \mathcal{L}_{2}(\delta(i))\right)^{2}\right]-\mathbb{E}_{i \sim p r}\left[\nabla_{Q} \lambda \mathcal{L}_{2}(\delta(i))\right]^{2} \\ &=\sum_{i} \frac{p r(i)}{\sum_{j} p r(j)} \frac{\left(\sum_{j} p r(j)\right)^{2}}{N^{2}}\left(\nabla_{Q} \mathcal{L}_{2}(\delta(i))\right)^{2}-X \\ &=\frac{\sum_{j} p r(j)}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) \nabla_{Q} \mathcal{L}_{2}(\delta(i))-X \end{aligned} \tag{2}$$
where $X = \mathbb{E}_{i\sim\mathcal{B}}[\nabla_Q\mathcal{L}_1(\delta(i))]^2 = \mathbb{E}_{i\sim pr}[\nabla_Q\lambda\mathcal{L}_2(\delta(i))]^2$ is the square of the shared (unbiased) expected gradient.
- Note that $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$. Taking $\mathcal{L}_2 = \mathcal{L}_{L_1}$ gives $\nabla_Q\mathcal{L}_2(\delta(i)) = \pm 1$, so $\nabla_Q\mathcal{L}_1(\delta(i))\nabla_Q\mathcal{L}_2(\delta(i)) = |\nabla_Q\mathcal{L}_1(\delta(i))|$. Setting $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$, the expression above simplifies further:
$$\begin{aligned} &= \frac{\sum_j|\nabla_Q\mathcal{L}_1(\delta(j))|}{N^2}\sum_i|\nabla_Q\mathcal{L}_1(\delta(i))|-X \\ &= \Big(\frac{\sum_j|\nabla_Q\mathcal{L}_1(\delta(j))|}{N}\Big)^2 - X \end{aligned} \tag{3}$$
- Now consider the general case $\nabla_Q\mathcal{L}_2(\delta(i)) = f(\delta(i))$. To keep the expected gradient unchanged, Theorem 1 requires $pr(i) = \frac{\nabla_Q\mathcal{L}_1(\delta(i))}{f(\delta(i))}$. Substituting this into equation (2), the variance becomes
$$\begin{aligned} &=\frac{\sum_{j} p r(j)}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) \nabla_{Q} \mathcal{L}_{2}(\delta(i))-X \\ &=\frac{\sum_{j} \nabla_{Q} \mathcal{L}_{1}(\delta(j)) / f(\delta(j))}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) f(\delta(i))-X . \end{aligned} \tag{4}$$
Then choose (it is not obvious how one would come up with this construction)
$$u_j = \frac{\sqrt{\frac{\nabla_Q\mathcal{L}_1(\delta(j))}{f(\delta(j))}}}{\sqrt{N}}, \qquad v_j = \frac{\sqrt{\nabla_Q\mathcal{L}_1(\delta(j))\,f(\delta(j))}}{\sqrt{N}}$$
By the Cauchy-Schwarz inequality $(\pmb{x},\pmb{y})^2\leq (\pmb{x},\pmb{x})(\pmb{y},\pmb{y})$, i.e. $(\sum_i a_ib_i)^2\leq \sum_i a_i^2\sum_i b_i^2$, we have
$$\left(\frac{\sum_{j}\left|\nabla_{Q} \mathcal{L}_{1}(\delta(j))\right|}{N}\right)^{2} \leq \frac{\sum_{j} \nabla_{Q} \mathcal{L}_{1}(\delta(j)) / f(\delta(j))}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) f(\delta(i)) \tag{5}$$
with equality when $f(\delta(j)) = \pm c$ for some constant $c$. Therefore, when $\mathcal{L}_2$ is the $L_1$ loss, $\lambda\mathcal{L}_2$ minimizes the variance
Theorem 2 says that any loss function $\mathcal{L}_1$ used with uniform replay can, while keeping the expected gradient unchanged, be converted to the loss $\lambda \mathcal{L}_{L_1}$ under the priority scheme $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$, thereby reducing the variance of the gradient
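A quick numerical sanity check (my own sketch) of Theorem 2 / Observation 1 with $\mathcal{L}_1 = \mathcal{L}_{\text{MSE}}$: prioritized replay with $pr(i) = |\delta(i)|$ and the loss $\lambda\mathcal{L}_{L_1}$ keeps the expected gradient of uniform MSE but lowers its variance:

```python
import numpy as np

rng = np.random.default_rng(1)
delta = rng.standard_cauchy(size=10_000)       # heavy-tailed toy TD errors

# uniform sampling, MSE loss: per-sample gradient is delta(i)
mean_u, var_u = delta.mean(), delta.var()

# prioritized sampling with pr(i) = |delta(i)|, loss lambda * L1:
# per-sample gradient is lambda * sign(delta(i))
pr = np.abs(delta)
p = pr / pr.sum()
lam = pr.mean()
g = lam * np.sign(delta)
mean_p = np.sum(p * g)
var_p = np.sum(p * g ** 2) - mean_p ** 2

assert np.isclose(mean_u, mean_p)              # identical expected gradient
assert var_p <= var_u                          # reduced gradient variance
```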
2.1.4 Theorem 3 (the uniform-replay loss equivalent to a generalized Huber loss + PER)
- To analyze the properties of PER, we can use Theorem 1 to derive and study its equivalent loss function under uniform sampling. The original PER paper uses the same Huber loss as DQN; we first give the general result for PER combined with the generalized loss $\frac{1}{\tau}|\delta(i)|^\tau$ ($\tau>0$)
2.1.4.1 Theorem 3 (the uniform-replay loss equivalent to a generalized Huber loss + PER)
- Theorem 3: When used together with PER, the expected gradient of the loss $\frac{1}{\tau}|\delta(i)|^\tau$ ($\tau>0$) equals the expected gradient, under uniform-sampling replay, of the following loss
$$\mathcal{L}_{\text{PER}}^\tau(\delta(i)) = \frac{\eta N}{\tau+\alpha-\alpha\beta}|\delta(i)|^{\tau+\alpha-\alpha\beta},\qquad \eta = \frac{\min_j|\delta(j)|^{\alpha\beta}}{\sum_j|\delta(j)|^\alpha} \tag{6}$$
proof: By the definition of PER,
$$p(i) = \frac{|\delta(i)|^\alpha +\epsilon}{\sum_j(|\delta(j)|^\alpha +\epsilon)},\qquad w(i) = \frac{\big(\frac{1}{N}\cdot\frac{1}{p(i)}\big)^\beta}{\max_j\big(\frac{1}{N}\cdot\frac{1}{p(j)}\big)^\beta}$$
where $\alpha\in(0,1]$ and $\beta\in[0,1]$ (the small constant $\epsilon$ is dropped in the derivation below). Examine the expected gradient of the loss $\frac{1}{\tau}|\delta(i)|^\tau$ when using PER:
$$\begin{aligned} \mathbb{E}_{i \sim \mathrm{PER}}\left[\nabla_{Q}\, w(i) \tfrac{1}{\tau}|\delta(i)|^{\tau}\right]&=\sum_{i \in \mathcal{B}} w(i)\, p(i)\, \nabla_{Q} \tfrac{1}{\tau}|\delta(i)|^{\tau}\\ &=\sum_{i \in \mathcal{B}} \frac{\left(\frac{1}{N} \cdot \frac{1}{p(i)}\right)^{\beta}}{\max_{j \in \mathcal{B}}\left(\frac{1}{N} \cdot \frac{1}{p(j)}\right)^{\beta}} \frac{|\delta(i)|^{\alpha}}{\sum_{j \in \mathcal{B}}|\delta(j)|^{\alpha}} \operatorname{sign}(\delta(i))\,|\delta(i)|^{\tau-1}\\ &=\frac{1}{\max_{j \in \mathcal{B}} \frac{1}{|\delta(j)|^{\alpha \beta}} \sum_{j \in \mathcal{B}}|\delta(j)|^{\alpha}} \sum_{i \in \mathcal{B}} \frac{|\delta(i)|^{\tau+\alpha-1} \operatorname{sign}(\delta(i))}{|\delta(i)|^{\alpha \beta}}\\ &=\eta \sum_{i \in \mathcal{B}} \operatorname{sign}(\delta(i))\,|\delta(i)|^{\tau+\alpha-\alpha \beta-1} \end{aligned} \tag{7}$$
Now consider the expected gradient of $\mathcal{L}_{\text{PER}}^\tau(\delta(i))$ under uniform sampling:
$$\begin{aligned} \mathbb{E}_{i \sim \mathcal{B}}\left[\nabla_{Q} \mathcal{L}_{\mathrm{PER}}^{\tau}(\delta(i))\right] &=\frac{1}{N} \sum_{i \in \mathcal{B}} \frac{\eta N}{\tau+\alpha-\alpha \beta} \nabla_{Q}|\delta(i)|^{\tau+\alpha-\alpha \beta} \\ &=\eta \sum_{i \in \mathcal{B}} \operatorname{sign}(\delta(i))\,|\delta(i)|^{\tau+\alpha-\alpha \beta-1} \end{aligned} \tag{8}$$
The two expressions are equal, which completes the proof.
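The identity between equations (7) and (8) is easy to check numerically; the sketch below (my own, with the small constant $\epsilon$ set to 0) compares the two expected gradients on random TD errors:

```python
import numpy as np

rng = np.random.default_rng(2)
delta = rng.normal(size=500)
N, tau, alpha, beta = len(delta), 2.0, 0.6, 0.4

# PER sampling probabilities and (max-normalized) IS weights, epsilon = 0
p = np.abs(delta) ** alpha / np.sum(np.abs(delta) ** alpha)
w = (1.0 / (N * p)) ** beta
w /= w.max()

# eq. (7): E_{i~PER}[ w(i) * grad (1/tau)|delta|^tau ]
grad_per = np.sum(p * w * np.sign(delta) * np.abs(delta) ** (tau - 1))

# eq. (8): E_{i~uniform}[ grad L_PER^tau ]
eta = np.min(np.abs(delta) ** (alpha * beta)) / np.sum(np.abs(delta) ** alpha)
grad_uniform = eta * np.sum(np.sign(delta) * np.abs(delta) ** (tau + alpha - alpha * beta - 1))

assert np.isclose(grad_per, grad_uniform)
```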
2.1.4.2 Corollary 3 (the uniform-replay loss equivalent to the Huber loss + PER)
Following the original PER setup: when used together with PER, the Huber loss used by standard DQN has an expected gradient equal to that of the following loss under uniform-sampling replay
$$\mathcal{L}_{\mathrm{PER}}^{\text{Huber}}(\delta(i))=\frac{\eta N}{\tau+\alpha-\alpha \beta}|\delta(i)|^{\tau+\alpha-\alpha \beta}, \quad \tau=\begin{cases} 2 & \text{if }|\delta(i)| \leq 1, \\ 1 & \text{otherwise}, \end{cases} \quad \eta=\frac{\min_{j}|\delta(j)|^{\alpha \beta}}{\sum_{j}|\delta(j)|^{\alpha}} \tag{9}$$
This simply substitutes $\tau = 2$ and $\tau=1$ into equation (6) for the two branches.

To understand Corollary 3 and what it says about PER's optimization target, first consider the following two observations about MSE and $L_1$.
Observation 2 (minimizing the MSE loss estimates the mean of the observed targets, which is unbiased): Let $\mathcal{B}(s,a)\subset \mathcal{B}$ be the subset of transitions containing $(s,a)$ and $\delta(i) = Q(i)-y(i)$. If $\nabla_Q\mathbb{E}_{i\sim\mathcal{B}(s,a)}[0.5\,\delta(i)^2] = 0$, then $Q(s,a) = \operatorname{mean}_{i\in\mathcal{B}(s,a)} y(i)$
proof:
$$\begin{aligned} & \mathbb{E}_{i \sim \mathcal{B}(s, a)}\left[\nabla_{Q}\, 0.5\,\delta(i)^{2}\right]=0 \\ \Rightarrow\ & \mathbb{E}_{i \sim \mathcal{B}(s, a)}[\delta(i)]=0 \\ \Rightarrow\ & \frac{1}{N} \sum_{i \in \mathcal{B}(s, a)} \big(Q(s, a)-y(i)\big)=0 \\ \Rightarrow\ & Q(s, a)-\frac{1}{N} \sum_{i \in \mathcal{B}(s, a)} y(i)=0 \\ \Rightarrow\ & Q(s, a)=\frac{1}{N} \sum_{i \in \mathcal{B}(s, a)} y(i) \end{aligned}$$
Observation 3 (minimizing the $L_1$ loss estimates the median of the observed targets, which is biased but more robust): Let $\mathcal{B}(s,a)\subset \mathcal{B}$ be the subset of transitions containing $(s,a)$ and $\delta(i) = Q(i)-y(i)$. If $\nabla_Q\mathbb{E}_{i\sim\mathcal{B}(s,a)}[|\delta(i)|] = 0$, then $Q(s,a) = \operatorname{median}_{i\in\mathcal{B}(s,a)} y(i)$
proof:
$$\begin{aligned} & \mathbb{E}_{i \sim \mathcal{B}(s, a)}\left[\nabla_{Q}|\delta(i)|\right]=0 \\ \Rightarrow\ & \mathbb{E}_{i \sim \mathcal{B}(s, a)}[\operatorname{sign}(\delta(i))]=0 \\ \Rightarrow\ & \sum_{i \in \mathcal{B}(s, a)} \mathbb{1}\{Q(s, a) \leq y(i)\}=\sum_{i \in \mathcal{B}(s, a)} \mathbb{1}\{Q(s, a) \geq y(i)\} \\ \Rightarrow\ & Q(s, a)=\operatorname{median}_{i \in \mathcal{B}(s, a)} y(i) \end{aligned}$$
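A tiny numerical illustration (my own) of Observations 2 and 3: for a set of targets containing an outlier, the MSE minimizer is the mean while the $L_1$ minimizer is the median, which is far less affected by the outlier:

```python
import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.05, 10.0])      # targets with one outlier
q_grid = np.linspace(-5, 15, 20001)            # candidate Q(s, a) values

mse = ((q_grid[:, None] - y) ** 2).mean(axis=1)
l1 = np.abs(q_grid[:, None] - y).mean(axis=1)

q_mse = q_grid[mse.argmin()]                   # ~ mean(y)   = 2.81
q_l1 = q_grid[l1.argmin()]                     # ~ median(y) = 1.05

assert np.isclose(q_mse, y.mean(), atol=1e-2)
assert np.isclose(q_l1, np.median(y), atol=1e-2)
```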
Based on Corollary 3 and the observations above, the following statements can be made:
- If $\tau+\alpha-\alpha\beta \neq 2$ (i.e., the effective loss is not MSE), then PER's target is biased: the expectation of the estimate differs from the expectation of the target. Observation 2 shows that minimizing MSE yields an estimate of the desired target, the expected TD target $y(i) = r+\gamma\,\mathbb{E}_{s',a'}[Q(s',a')]$. Therefore, optimizing the loss $|\delta(i)|^\tau$ with PER when $\tau+\alpha-\alpha\beta \neq 2$ produces a biased estimate of the target
- Not all estimation biases are equally harmful. Observation 3 shows that minimizing the $L_1$ loss yields the median of the targets rather than their expectation; considering the effects of function approximation and bootstrapping, the median may be seen as a reasonable estimate because it is more robust
- This suggests a possibility: a loss function lying between MSE and $L_1$ can trade off "robustness" against "correctness". From equation (6) we can see:
  - When PER is combined with the $L_1$ loss, since $\alpha\in(0,1]$ and $\beta\in[0,1]$, the exponent of $|\delta(i)|$ in the loss is $1+\alpha-\alpha\beta\in[1,2]$, which realizes exactly this trade-off
  - When PER is combined with the MSE loss, if $\beta<1$ then for $\alpha\in(0,1]$ the exponent of $|\delta(i)|$ in the loss is $2+\alpha-\alpha\beta>2$. This means that although MSE alone is minimized at the mean (least squares), when combined with PER the loss is minimized by some statistic that over-weights outliers. This bias explains the poor performance of PER in standard MSE-based algorithms for continuous-control tasks
- The importance-sampling ratio can be omitted. The original PER uses importance-sampling (IS) weights to reweight the loss and reduce the bias in the expected gradient introduced by prioritization; indeed, with $\beta=1$, MSE combined with PER is unbiased (Observation 2), as the PER paper itself notes. More importantly, the derivation above already accounts for IS, so even without IS ($\beta=0$) the bias caused by non-uniform replay is already eliminated in the resulting equivalent loss
2.2 Proposed methods
First, several conclusions can be drawn from the theoretical analysis above:
- By Theorem 2, with the expected gradient held equal, prioritized replay combined with the $\lambda \mathcal{L}_{L_1}$ loss reduces the gradient variance
- The problem with PER is that it combines prioritized replay with the MSE loss, so in the equivalent uniform-replay loss the exponent of $|\delta(i)|$ is greater than 2 and optimization is overly biased toward outliers (the same reason that least squares is strongly affected by outliers); the original DQN combines uniform replay with the MSE loss, so its estimate is unbiased
- By the analysis of Corollary 3, after converting to the equivalent uniform-replay loss, the exponent of $|\delta(i)|$ should lie between 1 (the $L_1$ loss) and 2 (the MSE loss), so as to balance "robustness" against "correctness"

Based on these conclusions, the authors first design a prioritized sampling scheme called Loss-Adjusted Prioritized Experience Replay (LAP), and then use Corollary 1 to derive its equivalent uniform-replay loss, the Prioritized Approximation Loss (PAL)
2.2.1 Prioritized replay method: LAP
To combine the conclusions of Section 2.2, the authors' idea is: design the loss function as in Theorem 2, i.e. $\mathcal{L}_2 = \mathcal{L}_{L_1}$, and set the replay priority to $|\delta(i)|^\alpha$ with $\alpha\in(0,1]$. By Corollary 1, the uniform-replay loss with equal expected gradient is then
$$\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_2(\delta(i)) = \frac{1}{\lambda}\big||\delta(i)|^\alpha\big|_\times|\delta(i)|$$
where $|\cdot|_\times$ is the stop-gradient operator; the exponent of $|\delta(i)|$ in this loss lies between 1 and 2, as desired.

Note: in the earlier analysis the prioritized-replay side used the loss $\lambda\mathcal{L}_2$, whereas $\lambda$ does not appear here, which is a bit confusing. My guess is that rescaling the pair $\mathcal{L}_1$ and $\lambda\mathcal{L}_2$ to $\frac{1}{\lambda}\mathcal{L}_1$ and $\mathcal{L}_2$ does not affect the conclusions, since only that keeps the logic consistent. The replacement is clearly fine for Corollary 2, but the key result, Theorem 2, is not re-proved under it
In practice the plain $L_1$ loss is not ideal, because every update takes a fixed-size step (its gradient is $\pm1$); if the learning rate is too large, the update may repeatedly overstep the target, causing oscillation during optimization. The authors therefore adopt the common Huber loss with $k=1$: when the error falls below the threshold 1, the Huber loss switches from $L_1$ to MSE, which appropriately scales down the gradient near $\delta(i)=0$. To eliminate the bias caused by combining MSE with prioritized replay, uniform-sampling replay is used within this interval, which is achieved by the priority scheme $pr(i) = \max(|\delta(i)|^\alpha,1)$ (low-priority samples are clipped to a priority of at least 1, which makes sampling among them uniform). With this modification we obtain the LAP algorithm, described by the following non-uniform sampling scheme plus Huber loss (a code sketch follows the list below)
$$p(i)=\frac{\max \left(|\delta(i)|^{\alpha}, 1\right)}{\sum_{j} \max \left(|\delta(j)|^{\alpha}, 1\right)}, \qquad \mathcal{L}_{\text{Huber}}(\delta(i))= \begin{cases}0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1 \\ |\delta(i)| & \text{otherwise}\end{cases}$$
- When $|\delta(i)|>1$: $L_1$ loss, with priority $pr(i) = |\delta(i)|^\alpha$
- When $|\delta(i)|\leq1$: MSE loss, replayed with uniform sampling

Besides correcting the outlier bias, the clipping $\max(|\delta(i)|^\alpha,1)$ also reduces the chance of "dead" transitions with $p(i)\approx 0$, so PER's hyperparameter $\epsilon$ is no longer needed. Moreover, LAP retains the variance-reduction property of Theorem 2, because it uses the $L_1$ loss on all samples with large errors
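A minimal sketch (my own, PyTorch-style; the buffer and networks are hypothetical placeholders) of LAP's two ingredients, the clipped priorities and the $k=1$ Huber loss used without IS weights:

```python
import torch
import torch.nn.functional as F

def lap_priorities(td_errors, alpha=0.4):
    # pr(i) = max(|delta(i)|^alpha, 1); used to update the sampler's priorities
    return td_errors.abs().pow(alpha).clamp(min=1.0)

def lap_loss(current_q, target_q):
    # Huber loss with threshold 1: MSE branch below 1 (where sampling is
    # effectively uniform), L1 branch above 1 (where sampling is prioritized)
    return F.smooth_l1_loss(current_q, target_q, beta=1.0)

# Illustrative training step (buffer, critic, and compute_target are assumed helpers):
# batch, idx = buffer.sample(batch_size)           # sampled proportionally to pr
# q, y = critic(batch), compute_target(batch)
# lap_loss(q, y).backward()
# buffer.update_priorities(idx, lap_priorities((q - y).detach()))
```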
2.2.2 Uniform replay method: PAL
Following the idea of the theoretical analysis, the authors transform LAP into a uniform-replay loss function with the same expected gradient, called PAL. The construction follows Corollary 1, i.e. $\mathcal{L}_{\text{PAL}}(\delta(i)) = \frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_{\text{Huber}}(\delta(i))$, which gives
$$\mathcal{L}_{\mathrm{PAL}}(\delta(i))=\frac{1}{\lambda}\begin{cases} 0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1, \\ \frac{|\delta(i)|^{1+\alpha}}{1+\alpha} & \text{otherwise,} \end{cases} \qquad \lambda=\frac{\sum_{j} \max\left(|\delta(j)|^{\alpha}, 1\right)}{N}$$
- When $|\delta(i)|\leq 1$: $pr(i)=1$, so $\mathcal{L}_{\text{PAL}}(\delta(i)) = \frac{1}{\lambda}\cdot 1\cdot 0.5\,\delta(i)^2 = \frac{1}{\lambda}\cdot 0.5\,\delta(i)^2$
- When $|\delta(i)|> 1$: $pr(i) = |\delta(i)|^\alpha$, so $\mathcal{L}_{\text{PAL}}(\delta(i)) = \frac{1}{\lambda}\big||\delta(i)|^\alpha\big|_\times |\delta(i)|$, whose gradient is $\nabla_Q\mathcal{L}_{\text{PAL}} = \frac{1}{\lambda}|\delta(i)|^\alpha\nabla_Q |\delta(i)|$. For simplicity of expression and computation, the stop-gradient is replaced by an equivalent expression that allows gradients, i.e. $\nabla_Q\frac{1}{\lambda}\frac{|\delta(i)|^{1+\alpha}}{1+\alpha} = \frac{1}{\lambda}|\delta(i)|^\alpha\nabla_Q |\delta(i)|$
Observation 4 (LAP and PAL have the same expected gradient)
Proof. From Corollary 1 we have:
$$\begin{aligned} \mathcal{L}_{\text{PAL}}(\delta(i)) &=\frac{1}{\lambda}|p r(i)|_{\times} \mathcal{L}_{\text{Huber}}(\delta(i)) \\ &=\frac{1}{\lambda}\left|\max \left(|\delta(i)|^{\alpha}, 1\right)\right|_{\times} \mathcal{L}_{\text{Huber}}(\delta(i)) \\ &=\frac{1}{\lambda}\left|\max \left(|\delta(i)|^{\alpha}, 1\right)\right|_{\times} \begin{cases}0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1, \\ |\delta(i)| & \text{otherwise,}\end{cases} \\ &=\frac{1}{\lambda} \begin{cases}0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1, \\ \frac{|\delta(i)|^{1+\alpha}}{1+\alpha} & \text{otherwise,}\end{cases} \end{aligned}$$
where
$$\lambda=\frac{\sum_{j} p r(j)}{N}=\frac{\sum_{j} \max \left(|\delta(j)|^{\alpha}, 1\right)}{N}$$
Then by Corollary 1, LAP and PAL have the same expected gradient.

Note that the exponent in the PAL loss never exceeds 2, which means PER's outlier bias has been eliminated. In settings where the variance reduction brought by prioritized replay is not important, PAL should perform similarly to LAP while being simpler to implement
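A minimal sketch (my own, PyTorch-style) of the PAL loss; in the analysis $\lambda$ is computed over the whole buffer, and approximating it from the current mini-batch, as done here, is an assumption of this sketch:

```python
import torch

def pal_loss(delta, alpha=0.4):
    abs_d = delta.abs()
    lam = abs_d.pow(alpha).clamp(min=1.0).mean().detach()       # lambda ~ mean_j max(|d_j|^a, 1)
    loss = torch.where(abs_d <= 1.0,
                       0.5 * delta ** 2,                        # MSE branch
                       abs_d.pow(1.0 + alpha) / (1.0 + alpha))  # |d|^(1+a)/(1+a) branch
    return (loss / lam).mean()

delta = torch.randn(64, requires_grad=True)   # toy TD errors, sampled uniformly
pal_loss(delta).backward()                    # no IS weights or priority updates needed
```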
3. Experimental results
The authors combine LAP/PAL with the TD3 and SAC methods and test them on MuJoCo continuous-control tasks; the results are shown below

Adding LAP/PAL improves TD3's performance, while adding PER degrades it due to the bias introduced by PER + MSE. LAP and PAL show little difference, indicating that non-uniform replay brings almost no benefit here (its only advantage is variance reduction, which reinforcement learning usually does not care much about).

The authors also combine LAP/PAL with DDQN and test on Atari tasks; the results are shown below

Adding LAP/PAL improves DDQN's performance; LAP outperforms PER, but PAL falls behind PER. This may be because the Atari environments involve long trajectories and sparse rewards, so prioritized replay is still needed there to make the learning signal denser.

In addition, the authors run ablation experiments on PAL with TD3 in the MuJoCo environments. Note that when $\alpha=0$, PAL reduces to a Huber loss scaled by $\frac{1}{\lambda}$. The results show that the full PAL performs best; PAL without the $\frac{1}{\lambda}$ factor comes second; and merely adding $\frac{1}{\lambda}$ to the Huber loss already improves performance significantly, which suggests that the performance gain mainly comes from adjusting the expected gradient
4. Summary
- The main contributions of this paper are:
  - Establishing the equivalence between non-uniform and uniform replay and giving an explicit conversion method. This matters because designing new experience replay methods is more intuitive and simpler when starting from the converted uniform-replay loss function; many later papers on experience replay take this perspective
  - Establishing a theoretical foundation for analyzing non-uniform experience replay; many earlier replay methods such as PER were heuristic and can now be analyzed and optimized within this framework
  - Explaining why PER degrades performance on continuous-control tasks: combining PER with the MSE loss biases the gradient toward outliers; two improved methods are designed accordingly
- In a good non-uniform replay method, the equivalent uniform-replay loss (with equal expected gradient) should raise $|\delta(i)|$ to a power between 1 and 2, balancing "robustness" against "correctness" (this also applies when a weighted uniform-replay loss is designed directly)
- The main advantage of actually doing non-uniform replay is reduced gradient variance, but this advantage is often unimportant. Non-uniform replay is probably most beneficial in sparse-reward environments: the transformed equivalent loss relies on importance-sampling-style reweighting to match the expected gradient direction (similar in spirit to maximum likelihood estimation), which becomes inaccurate when the effective sample size is small; since the effective sample size in sparse-reward environments is very small, non-uniform sampling is still needed there to improve performance
- The analysis in this paper is based on the Bellman (expectation) equation; whether a similar analytical framework can be built for the Bellman optimality equation remains an open question