Paper Notes [RL - Experience Replay]: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
2022-06-09 09:20:00 【Cloud fff】
- Title: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
- Paper link: An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
- Published at: NeurIPS 2020
- Field: Reinforcement learning - Replay Buffer
- Abstract: Prioritized experience replay (PER) is a deep reinforcement learning technique in which the agent learns from transitions sampled non-uniformly, with sampling probability proportional to their TD error. We show that any loss function evaluated with non-uniformly sampled transitions can be transformed into another, uniformly sampled loss function with the same expected gradient. Surprisingly, we find that in some environments PER can be replaced entirely by this new loss function without affecting empirical performance. Furthermore, this relationship suggests a new branch of improvements to PER, obtained by modifying its equivalent uniform-sampling loss function. We demonstrate our proposed modifications to PER and the equivalent loss function in several MuJoCo and Atari environments
Table of contents
- 1. PER background
- 2. Methods of this paper
- 2.1 Theoretical analysis
- 2.1.1 Problem formulation
- 2.1.2 Theorem 1 (condition for an equal-expected-gradient transformation)
- 2.1.3 Theorem 2 (converting to an equivalent prioritized replay scheme reduces gradient variance)
- 2.1.4 Theorem 3 (the uniform-replay loss equivalent to a generalized Huber loss + PER)
- 2.2 Proposed methods
- 3. Experimental results
- 4. Summary
1. PER background
1.1 Experience Replay
- For off-policy reinforcement learning methods, since the target policy and the behavior policy need not be identical, past transitions can be stored and reused, which brings three benefits:
  - It breaks the circular dependence between value estimation and the direction of exploration, making agent training more stable
  - It breaks the correlation between temporally adjacent data, reducing oscillation during optimization
  - It increases data utilization
- To implement experience replay, an experience buffer is usually added to store past transitions; the agent's interaction with the environment is illustrated in the figure below

1.2 Prioritized Experience Replay
- In vanilla experience replay, the transitions used to train the agent are sampled uniformly from the experience buffer. This carries an implicit assumption that the agent learns an equal amount from every transition. In practice some transitions are clearly more informative for the agent than others, much like hard versus easy samples in supervised learning; if those key transitions could be emphasized during learning, sample efficiency and convergence speed might improve substantially
- Prioritized Experience Replay (PER) is a method built on this idea. It treats the TD error as a measure of how "surprising" a transition is to the agent, and therefore as an estimate of how much can be learned from it. The method replays transitions with large TD error with higher probability, uses a heavy-tailed distribution to preserve transition diversity, and applies importance-sampling ratios to remove the bias in the expected gradient direction (a minimal sketch of the sampling rule follows this list)
- For a detailed introduction to PER, see: Paper Notes [RL - Exp Replay] - [PER] Prioritized Experience Replay
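To make the mechanism concrete, here is a minimal sketch (my own, using NumPy; not code from the paper) of PER's priorities, sampling probabilities, and importance-sampling weights:

```python
import numpy as np

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """PER-style priorities pr(i) = (|delta(i)| + eps)^alpha, sampling
    probabilities p(i) = pr(i) / sum_j pr(j), and IS weights
    w(i) = (1 / (N * p(i)))^beta, normalized by their maximum."""
    pr = (np.abs(td_errors) + eps) ** alpha
    p = pr / pr.sum()
    N = len(td_errors)
    w = (1.0 / (N * p)) ** beta
    return p, w / w.max()

td = np.array([0.1, 2.0, 0.5, 3.0])                  # toy TD errors of a tiny buffer
p, w = per_probs_and_weights(td)
batch_idx = np.random.choice(len(td), size=2, p=p)   # prioritized mini-batch
```

The hyperparameters alpha and beta here are the same ones that reappear in the analysis of Theorem 3 below.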
2. Methods of this paper
2.1 Theoretical analysis
- The PER paper is largely heuristic, with no rigorous theoretical derivation. In practice the PER trick is sometimes found to degrade an algorithm's performance, and the lack of theoretical grounding makes it hard to analyze why quantitatively. Starting from PER, the authors carry out a detailed theoretical analysis of non-uniform prioritized experience replay (with PER as the representative example), which is the main contribution of this paper
- The general result of the paper is that the expected gradient (direction) of "a non-uniform prioritized replay strategy using a loss function $\mathcal{L}_1$" equals that of "a uniform replay strategy using another loss function $\mathcal{L}_2$". Using this result, any non-uniform replay method can be converted into a uniform replay method with a specific loss function, and the soundness of the original method can then be analyzed through the new loss function, as illustrated in the figure below

The authors use this approach to analyze the problems of PER and propose an improved method, which is the other contribution of the paper
2.1.1 Problem formulation
- First, define the object of analysis and its mathematical form. The paper analyzes off-policy value-prediction methods based on iterating the Bellman equation (similar to the critic in an actor-critic framework). In this setting we learn a Q-value network to estimate the value of a given policy $\pi$.
Let the network parameters be $\theta$; the value of any state-action pair $(s,a)$ is estimated as
$$\begin{aligned} Q_\theta^\pi(s,a) &= \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\,\Big|\,s_0=s,a_0=a\Big] \\ &= \mathbb{E}_{r,s'\sim p;\,a'\sim \pi}\big[r+\gamma Q_\theta^\pi(s',a')\big] \end{aligned}$$
For a transition $i = (s,a,r,s')$ sampled from the replay buffer $\mathcal{B}$, the TD error is
$$\delta(i) = Q_\theta(i) - y(i) = Q_\theta(i)-\big(r+\gamma Q_{\theta'}(s',a')\big)$$
where the learning target $y(i)= r+\gamma Q_{\theta'}(s',a')$ is computed by a separate target network with parameters $\theta'$, updated every fixed number of steps via $\theta' \leftarrow \theta$. A loss function $\mathcal{L}(\delta(i))$ is defined on the TD error, and the mean loss over a mini-batch of size $M$ is $\frac{1}{M}\sum_i \mathcal{L}(\delta(i))$.
The constant $M$ does not affect the expected loss or its gradient and is unimportant in the analysis.
Now examine the gradient used to update the value network:
$$\nabla_\theta \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \delta} \frac{\partial \delta}{\partial Q} \frac{\partial Q}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial Q} \frac{\partial Q}{\partial \theta} = \nabla_Q\mathcal{L}\cdot\nabla_\theta Q$$
The factor $\nabla_\theta Q$ depends only on the $Q$-network architecture and is not our concern; we only consider the loss-dependent term $\nabla_Q\mathcal{L}$ (note that $\delta(i)$ is a function of $Q(i)$). The paper focuses on three kinds of losses:
- $L_1$ loss: $\mathcal{L}_{L_1}(\delta(i)) = |\delta(i)|$, with gradient $\nabla_Q\mathcal{L}_{L_1}(\delta(i)) = \text{sign}(\delta(i))$
- MSE loss: $\mathcal{L}_{\text{MSE}}(\delta(i)) = 0.5\,\delta(i)^2$, with gradient $\nabla_Q\mathcal{L}_{\text{MSE}}(\delta(i)) = \delta(i)$
- Huber loss:
$$\mathcal{L}_{\text{Huber}}(\delta(i)) = \begin{cases} 0.5\,\delta(i)^2 & \text{if }|\delta(i)|\leq k,\\ k(|\delta(i)|-0.5k) & \text{otherwise} \end{cases}$$
Usually $k=1$ is used. Depending on the value of $|\delta(i)|$, the gradient of the Huber loss equals that of the MSE or $L_1$ loss, so it can be viewed as an $L_1$ loss smoothed near $0$ (a small code sketch of these losses follows below)
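Here is that small sketch (my own, NumPy-based) of the three losses on the TD error $\delta$ and their gradients with respect to $Q$ (equivalently with respect to $\delta$, since $\partial\delta/\partial Q = 1$):

```python
import numpy as np

def l1_loss(delta):   return np.abs(delta)
def l1_grad(delta):   return np.sign(delta)           # +-1 regardless of |delta|

def mse_loss(delta):  return 0.5 * delta ** 2
def mse_grad(delta):  return delta                    # proportional to the error

def huber_loss(delta, k=1.0):
    return np.where(np.abs(delta) <= k,
                    0.5 * delta ** 2,                 # MSE branch near 0
                    k * (np.abs(delta) - 0.5 * k))    # L1 branch for large errors

def huber_grad(delta, k=1.0):
    return np.where(np.abs(delta) <= k, delta, k * np.sign(delta))
```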
2.1.2 Theorem 1 (condition for an equal-expected-gradient transformation)
Here we spell out the transformation idea sketched at the beginning of Section 2.1. Given transition distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ and loss functions $\mathcal{L}_1$ and $\mathcal{L}_2$: for sample $i$, the expected loss gradient obtained by sampling from $\mathcal{D}_1$ and using $\mathcal{L}_1$ can equivalently be computed under the other distribution $\mathcal{D}_2$ by reweighting with the importance-sampling ratio $\frac{p_{\mathcal{D}_1}(i)}{p_{\mathcal{D}_2}(i)}$

Suppose the gradient of $\mathcal{L}_2$ absorbs the reweighting term inside the expectation on the right, i.e.
$$\nabla_Q\mathcal{L}_2(\delta(i)) = \frac{p_{\mathcal{D}_1}(i)}{p_{\mathcal{D}_2}(i)}\nabla_Q\mathcal{L}_1(\delta(i)) \tag{1}$$
An $\mathcal{L}_2$ satisfying this condition guarantees that the expected gradients under the two distributions are equal, i.e. $\mathbb{E}_{\mathcal{D}_1}[\nabla_Q\mathcal{L}_1(\delta(i))] = \mathbb{E}_{\mathcal{D}_2}[\nabla_Q\mathcal{L}_2(\delta(i))]$.
For example, define $\mathcal{D}_1$ as the uniform distribution $\mathcal{U}$ over a finite buffer $\mathcal{B}$ and $\mathcal{D}_2$ as the prioritized sampling distribution $p(i) = \frac{|\delta(i)|}{\sum_{j\in\mathcal{B}}|\delta(j)|}$. Then MSE and $L_1$ are related by
$$\begin{aligned} \mathbb{E}_\mathcal{U}\big[\nabla_Q\mathcal{L}_{\text{MSE}}(\delta(i)) \big] &= \mathbb{E}_{\mathcal{D}_2}\Big[\frac{p_{\mathcal{D}_1}(i)}{p_{\mathcal{D}_2}(i)}\nabla_Q\mathcal{L}_{\text{MSE}}(\delta(i))\Big] \\ &=\mathbb{E}_{\mathcal{D}_2}\Big[ \frac{1}{N}\frac{\sum_j|\delta(j)|}{|\delta(i)|}\,\delta(i) \Big] \\ &= \mathbb{E}_{\mathcal{D}_2}\Big[\frac{\sum_j|\delta(j)|}{N} \frac{\delta(i)}{|\delta(i)|}\Big] \\ &=\mathbb{E}_{\mathcal{D}_2}\Big[\frac{\sum_j|\delta(j)|}{N}\,\text{sign}(\delta(i))\Big] \\ &\propto\mathbb{E}_{\mathcal{D}_2}\big[\text{sign}(\delta(i))\big] \\ &= \mathbb{E}_{\mathcal{D}_2}\big[\nabla_Q\mathcal{L}_{L_1}(\delta(i)) \big] \end{aligned}$$
This means that "the $L_1$ loss replayed with prioritized sampling" and "the MSE loss replayed with uniform sampling" have the same expected gradient direction. Also note the coefficient appearing at the fourth equality, $\frac{\sum_j|\delta(j)|}{N} = \frac{\sum_j pr(j)}{N}$ (where $pr(i)$ denotes the priority of sample $i$); this coefficient is the bridge between the uniform distribution and the priority distribution and will reappear throughout the analysis.

Theorem 1: Given a dataset $\mathcal{B}$ of size $N$, losses $\mathcal{L}_1,\mathcal{L}_2$, and a priority distribution with priorities $pr$ (sample $i$ is selected with probability $\frac{pr(i)}{\sum_j pr(j)}$), if
$$\nabla_Q\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}pr(i)\nabla_Q\mathcal{L}_2(\delta(i))$$
where $\lambda = \frac{\sum_j pr(j)}{N}$, then the expected gradient of $\mathcal{L}_1(\delta(i))$ for samples $i$ drawn uniformly from $\mathcal{B}$ equals the expected gradient of $\mathcal{L}_2(\delta(i))$ for samples $i$ drawn from $\mathcal{B}$ with priorities $pr$
Proof:
$$\begin{aligned} \mathbb{E}_{i\sim\mathcal{B}}[\nabla_Q\mathcal{L}_1(\delta(i))] &= \frac{1}{N}\sum_i\nabla_Q\mathcal{L}_1(\delta(i))\\ &= \frac{1}{N}\sum_i\frac{N}{\sum_jpr(j)}pr(i)\nabla_Q\mathcal{L}_2(\delta(i)) \quad\text{(by the assumed condition)}\\ &=\sum_i\frac{pr(i)}{\sum_jpr(j)}\nabla_Q\mathcal{L}_2(\delta(i)) \\ &=\mathbb{E}_{i\sim pr}[\nabla_Q\mathcal{L}_2(\delta(i))] \end{aligned}$$
Another way to see this is to plug in $\lambda$, which gives $\nabla_Q\mathcal{L}_1(\delta(i)) = \frac{pr(i)}{\sum_jpr(j)}N\,\nabla_Q\mathcal{L}_2(\delta(i)) = \frac{p_{\mathcal{D}_2}(i)}{p_{\mathcal{D}_1}(i)}\nabla_Q\mathcal{L}_2(\delta(i))$, which is exactly equation (1).

Theorem 1 is the main result of the paper. It states that the condition for "the uniformly sampled loss $\mathcal{L}_1$" and "the loss $\mathcal{L}_2$ sampled under a priority scheme $pr$" to have the same expected gradient is $\nabla_Q\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}pr(i)\nabla_Q\mathcal{L}_2(\delta(i))$
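The MSE-vs-$L_1$ example above can be checked numerically; the following sketch (my own) verifies that uniform sampling with the MSE loss and prioritized sampling with $pr(i)=|\delta(i)|$ and the scaled loss $\lambda\mathcal{L}_{L_1}$ give the same expected gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.normal(size=1000)                  # TD errors of a toy buffer
N = len(delta)

grad_uniform_mse = np.mean(delta)              # E_U[grad MSE] = E_U[delta]

pr = np.abs(delta)                             # priority pr(i) = |delta(i)|
p = pr / pr.sum()                              # prioritized sampling distribution
lam = pr.sum() / N                             # lambda = sum_j pr(j) / N
grad_prior_l1 = lam * np.sum(p * np.sign(delta))   # E_pr[grad of lambda * L1]

assert np.isclose(grad_uniform_mse, grad_prior_l1)
```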
2.1.2.1 Corollary 1 (how to construct a uniform-replay loss $\mathcal{L}_1$ with equal expected gradient)
Corollary 1: If $\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_2(\delta(i))$ for all $i$, where $\lambda = \frac{\sum_j pr(j)}{N}$ and $|\cdot|_\times$ is the stop-gradient operator, then Theorem 1 holds for any uniformly sampled loss $\mathcal{L}_1$ and any loss $\mathcal{L}_2$ sampled non-uniformly with priorities $pr$
proof:
$$\begin{aligned} \nabla_Q\mathcal{L}_1(\delta(i)) &= \nabla_Q\frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_2(\delta(i)) \\ &= \frac{1}{\lambda}pr(i)\nabla_Q\mathcal{L}_2(\delta(i)) \end{aligned}$$
This satisfies the condition of Theorem 1, completing the proof.

This corollary tells us how, given a non-uniform replay scheme with priorities $pr$ and its loss $\mathcal{L}_2$, to construct the uniform-replay loss function $\mathcal{L}_1$ with equal expected gradient
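A minimal sketch (assumed PyTorch-style code, not the authors' implementation) of how Corollary 1 can be used in practice: the priority is treated as a constant weight via a stop-gradient (`.detach()`), so the uniform-replay loss matches the prioritized scheme's expected gradient:

```python
import torch

def uniform_equivalent_loss(delta, pr, loss2_fn):
    """L1(delta) = (1 / lambda) * stop_grad(pr) * L2(delta); lambda is
    approximated here by the mean priority of the batch (an assumption; in the
    analysis it is the mean over the whole buffer)."""
    lam = pr.mean().detach()
    return (pr.detach() / lam) * loss2_fn(delta)

delta = torch.randn(64, requires_grad=True)        # toy TD errors
pr = delta.abs().detach() ** 0.4                   # e.g. pr(i) = |delta(i)|^alpha
loss = uniform_equivalent_loss(delta, pr, lambda d: d.abs()).mean()
loss.backward()        # expected gradient matches prioritized L1 replay
```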
2.1.2.2 Corollary 2 (how to construct a priority scheme $pr$ and corresponding loss $\lambda\mathcal{L}_2$ with equal expected gradient)
Corollary 2: If $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$ and $pr(i) = \frac{\nabla_Q \mathcal{L}_1(\delta(i))}{\nabla_Q \mathcal{L}_2(\delta(i))}$ for all $i$, then Theorem 1 holds for any uniformly sampled loss $\mathcal{L}_1$ and the loss $\lambda\mathcal{L}_2$ sampled non-uniformly with priorities $pr$, where $\lambda = \frac{\sum_j pr(j)}{N}$
proof: Given $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$,
$$\begin{aligned} \frac{1}{\lambda}pr(i)\nabla_Q\lambda\mathcal{L}_2(\delta(i)) &= \frac{\lambda}{\lambda}\frac{\nabla_Q\mathcal{L}_1(\delta(i))}{\nabla_Q\mathcal{L}_2(\delta(i))}\nabla_Q\mathcal{L}_2(\delta(i)) \\ &= \nabla_Q\mathcal{L}_1(\delta(i)) \end{aligned}$$
This satisfies the condition of Theorem 1, completing the proof.

Since a sampling priority cannot be negative or zero, we must have $\text{sign}(pr(i)) = \text{sign}\big(\frac{\nabla_Q \mathcal{L}_1(\delta(i))}{\nabla_Q \mathcal{L}_2(\delta(i))}\big)=1$, hence the condition $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$ is required. In practice, any pair of loss functions designed to minimize the distance between the $Q$ output and a given target satisfies this condition
Since both losses pull toward the same target, their optimization directions agree, so their gradient directions (signs) should also match.
In this case non-uniform sampling plays a role similar to importance sampling: it reweights $\mathcal{L}_2$ so that its expected gradient matches that of $\mathcal{L}_1$.
This corollary tells us how, given the loss $\mathcal{L}_1$ used with uniform replay and another loss function $\mathcal{L}_2$, to construct a non-uniform replay priority $pr$ and corresponding loss $\lambda\mathcal{L}_2$ that guarantee equal expected gradients.
For example, when $\mathcal{L}_2$ is the $L_1$ loss, its gradient is $\nabla_Q\mathcal{L}_{L_1}(\delta(i)) = \pm 1$, so setting $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$ converts any uniformly sampled loss $\mathcal{L}_1$ into a prioritized sampling scheme. It turns out this reduces the variance of the gradient (Observation 1, Section 2.1.3.1)
2.1.3 Theorem 2 (converting to an equivalent prioritized replay scheme reduces gradient variance)
2.1.3.1 Observation 1 (the variance decreases)
Observation 1: Given a dataset $\mathcal{B}$ of size $N$ and a loss function $\mathcal{L}_1$, sample $i$ with priority $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$ and let $\lambda = \frac{\sum_j pr(j)}{N}$. Then the variance of the gradient of $\lambda\mathcal{L}_{L_1}(\delta(i))$ under this prioritized sampling is less than or equal to the variance of the gradient of $\mathcal{L}_1(\delta(i))$ under uniform sampling
This observation means that, setting $\mathcal{L}_2 = \mathcal{L}_{L_1}$, for any uniform-replay loss $\mathcal{L}_1$ we can use Corollary 2 to construct a replay priority $pr$ and corresponding prioritized loss $\lambda\mathcal{L}_2$ that reduce the variance of the mini-batch gradient
This observation mainly serves as a stepping stone to Theorem 2, so no separate proof is given (once Theorem 2 below is proved, it can be used directly to prove this observation)
2.1.3.2 Theorem 2 (converting to an equivalent prioritized replay scheme reduces gradient variance)
Theorem 2: Given a dataset $\mathcal{B}$ of size $N$ and loss functions $\mathcal{L}_1,\mathcal{L}_2$, Corollary 2 can be used to design a priority scheme $pr$ and corresponding loss $\lambda \mathcal{L}_2$ (where $\lambda = \frac{\sum_j pr(j)}{N}$) such that Theorem 1 holds. When $\mathcal{L}_2 =\mathcal{L}_{L_1}$ and, following Corollary 2, $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$, the variance of $\nabla_Q\lambda\mathcal{L}_2(\delta(i))$ is the smallest among all loss functions (with their corresponding priority schemes) that share the same expected gradient
Proof:
- Consider the variance of the prioritized-sampling gradient (recall the variance formula $\operatorname{Var}(x) = \mathbb{E}[x^2]-\mathbb{E}[x]^2$)
$$\begin{aligned} \operatorname{Var}\left(\nabla_{Q} \lambda \mathcal{L}_{2}(\delta(i))\right) &=\mathbb{E}_{i \sim p r}\left[\left(\nabla_{Q} \lambda \mathcal{L}_{2}(\delta(i))\right)^{2}\right]-\mathbb{E}_{i \sim p r}\left[\nabla_{Q} \lambda \mathcal{L}_{2}(\delta(i))\right]^{2} \\ &=\sum_{i} \frac{p r(i)}{\sum_{j} p r(j)} \frac{\left(\sum_{j} p r(j)\right)^{2}}{N^{2}}\left(\nabla_{Q} \mathcal{L}_{2}(\delta(i))\right)^{2}-X \\ &=\frac{\sum_{j} p r(j)}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) \nabla_{Q} \mathcal{L}_{2}(\delta(i))-X \end{aligned} \tag{2}$$
where $X = \mathbb{E}_{i\sim\mathcal{B}}[\nabla_Q\mathcal{L}_1(\delta(i))]^2 = \mathbb{E}_{i\sim pr}[\nabla_Q\lambda\mathcal{L}_2(\delta(i))]^2$ is the square of the shared (unbiased) expected gradient.
- Note that $\text{sign}(\nabla_Q\mathcal{L}_1(\delta(i))) = \text{sign}(\nabla_Q\mathcal{L}_2(\delta(i)))$. Taking $\mathcal{L}_2 = \mathcal{L}_{L_1}$ gives $\nabla_Q\mathcal{L}_2(\delta(i)) = \pm 1$, so $\nabla_Q\mathcal{L}_1(\delta(i))\nabla_Q\mathcal{L}_2(\delta(i)) = |\nabla_Q\mathcal{L}_1(\delta(i))|$. Setting $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$, the expression above simplifies further:
$$\begin{aligned} &= \frac{\sum_j|\nabla_Q\mathcal{L}_1(\delta(j))|}{N^2}\sum_i|\nabla_Q\mathcal{L}_1(\delta(i))|-X \\ &= \Big(\frac{\sum_j|\nabla_Q\mathcal{L}_1(\delta(j))|}{N}\Big)^2 - X \end{aligned} \tag{3}$$
- Now consider the general case $\nabla_Q\mathcal{L}_2(\delta(i)) = f(\delta(i))$. To keep the expected gradient unchanged, Theorem 1 requires $pr(i) = \frac{\nabla_Q\mathcal{L}_1(\delta(i))}{f(\delta(i))}$. Substituting this into equation (2), the variance becomes
$$\begin{aligned} &=\frac{\sum_{j} p r(j)}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) \nabla_{Q} \mathcal{L}_{2}(\delta(i))-X \\ &=\frac{\sum_{j} \nabla_{Q} \mathcal{L}_{1}(\delta(j)) / f(\delta(j))}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) f(\delta(i))-X . \end{aligned} \tag{4}$$
Then choose (it is not obvious how one would come up with this construction)
$$u_j = \frac{\sqrt{\frac{\nabla_Q\mathcal{L}_1(\delta(j))}{f(\delta(j))}}}{\sqrt{N}}, \qquad v_j = \frac{\sqrt{\nabla_Q\mathcal{L}_1(\delta(j))\,f(\delta(j))}}{\sqrt{N}}$$
By the Cauchy-Schwarz inequality $(\pmb{x},\pmb{y})^2\leq (\pmb{x},\pmb{x})(\pmb{y},\pmb{y})$, i.e. $(\sum_i a_ib_i)^2\leq \sum_i a_i^2\sum_i b_i^2$, we have
$$\left(\frac{\sum_{j}\left|\nabla_{Q} \mathcal{L}_{1}(\delta(j))\right|}{N}\right)^{2} \leq \frac{\sum_{j} \nabla_{Q} \mathcal{L}_{1}(\delta(j)) / f(\delta(j))}{N^{2}} \sum_{i} \nabla_{Q} \mathcal{L}_{1}(\delta(i)) f(\delta(i)) \tag{5}$$
with equality when $f(\delta(j)) = \pm c$ for some constant $c$. Therefore, when $\mathcal{L}_2$ is the $L_1$ loss, $\lambda\mathcal{L}_2$ minimizes the variance
Theorem 2 says that any loss function $\mathcal{L}_1$ used with uniform replay can, while keeping the expected gradient unchanged, be converted to the loss $\lambda \mathcal{L}_{L_1}$ under the priority scheme $pr(i) = |\nabla_Q\mathcal{L}_1(\delta(i))|$, thereby reducing the variance of the gradient
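A quick numerical sanity check (my own sketch) of Theorem 2 / Observation 1 with $\mathcal{L}_1 = \mathcal{L}_{\text{MSE}}$: prioritized replay with $pr(i) = |\delta(i)|$ and the loss $\lambda\mathcal{L}_{L_1}$ keeps the expected gradient of uniform MSE but lowers its variance:

```python
import numpy as np

rng = np.random.default_rng(1)
delta = rng.standard_cauchy(size=10_000)       # heavy-tailed toy TD errors

# uniform sampling, MSE loss: per-sample gradient is delta(i)
mean_u, var_u = delta.mean(), delta.var()

# prioritized sampling with pr(i) = |delta(i)|, loss lambda * L1:
# per-sample gradient is lambda * sign(delta(i))
pr = np.abs(delta)
p = pr / pr.sum()
lam = pr.mean()
g = lam * np.sign(delta)
mean_p = np.sum(p * g)
var_p = np.sum(p * g ** 2) - mean_p ** 2

assert np.isclose(mean_u, mean_p)              # identical expected gradient
assert var_p <= var_u                          # reduced gradient variance
```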
2.1.4 Theorem 3 (the uniform-replay loss equivalent to a generalized Huber loss + PER)
- To analyze the properties of PER, we can use Theorem 1 to derive and study its equivalent loss function under uniform sampling. The original PER paper uses the same Huber loss as DQN; we first give the general result for PER combined with the generalized loss $\frac{1}{\tau}|\delta(i)|^\tau$ ($\tau>0$)
2.1.4.1 Theorem 3 (the uniform-replay loss equivalent to a generalized Huber loss + PER)
- Theorem 3: When used together with PER, the expected gradient of the loss $\frac{1}{\tau}|\delta(i)|^\tau$ ($\tau>0$) equals the expected gradient, under uniform-sampling replay, of the following loss
$$\mathcal{L}_{\text{PER}}^\tau(\delta(i)) = \frac{\eta N}{\tau+\alpha-\alpha\beta}|\delta(i)|^{\tau+\alpha-\alpha\beta},\qquad \eta = \frac{\min_j|\delta(j)|^{\alpha\beta}}{\sum_j|\delta(j)|^\alpha} \tag{6}$$
proof: By the definition of PER,
$$p(i) = \frac{|\delta(i)|^\alpha +\epsilon}{\sum_j(|\delta(j)|^\alpha +\epsilon)},\qquad w(i) = \frac{\big(\frac{1}{N}\cdot\frac{1}{p(i)}\big)^\beta}{\max_j\big(\frac{1}{N}\cdot\frac{1}{p(j)}\big)^\beta}$$
where $\alpha\in(0,1]$ and $\beta\in[0,1]$ (the small constant $\epsilon$ is dropped in the derivation below). Examine the expected gradient of the loss $\frac{1}{\tau}|\delta(i)|^\tau$ when using PER:
$$\begin{aligned} \mathbb{E}_{i \sim \mathrm{PER}}\left[\nabla_{Q}\, w(i) \tfrac{1}{\tau}|\delta(i)|^{\tau}\right]&=\sum_{i \in \mathcal{B}} w(i)\, p(i)\, \nabla_{Q} \tfrac{1}{\tau}|\delta(i)|^{\tau}\\ &=\sum_{i \in \mathcal{B}} \frac{\left(\frac{1}{N} \cdot \frac{1}{p(i)}\right)^{\beta}}{\max_{j \in \mathcal{B}}\left(\frac{1}{N} \cdot \frac{1}{p(j)}\right)^{\beta}} \frac{|\delta(i)|^{\alpha}}{\sum_{j \in \mathcal{B}}|\delta(j)|^{\alpha}} \operatorname{sign}(\delta(i))\,|\delta(i)|^{\tau-1}\\ &=\frac{1}{\max_{j \in \mathcal{B}} \frac{1}{|\delta(j)|^{\alpha \beta}} \sum_{j \in \mathcal{B}}|\delta(j)|^{\alpha}} \sum_{i \in \mathcal{B}} \frac{|\delta(i)|^{\tau+\alpha-1} \operatorname{sign}(\delta(i))}{|\delta(i)|^{\alpha \beta}}\\ &=\eta \sum_{i \in \mathcal{B}} \operatorname{sign}(\delta(i))\,|\delta(i)|^{\tau+\alpha-\alpha \beta-1} \end{aligned} \tag{7}$$
Now consider the expected gradient of $\mathcal{L}_{\text{PER}}^\tau(\delta(i))$ under uniform sampling:
$$\begin{aligned} \mathbb{E}_{i \sim \mathcal{B}}\left[\nabla_{Q} \mathcal{L}_{\mathrm{PER}}^{\tau}(\delta(i))\right] &=\frac{1}{N} \sum_{i \in \mathcal{B}} \frac{\eta N}{\tau+\alpha-\alpha \beta} \nabla_{Q}|\delta(i)|^{\tau+\alpha-\alpha \beta} \\ &=\eta \sum_{i \in \mathcal{B}} \operatorname{sign}(\delta(i))\,|\delta(i)|^{\tau+\alpha-\alpha \beta-1} \end{aligned} \tag{8}$$
The two expressions are equal, which completes the proof.
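The identity between equations (7) and (8) is easy to check numerically; the sketch below (my own, with the small constant $\epsilon$ set to 0) compares the two expected gradients on random TD errors:

```python
import numpy as np

rng = np.random.default_rng(2)
delta = rng.normal(size=500)
N, tau, alpha, beta = len(delta), 2.0, 0.6, 0.4

# PER sampling probabilities and (max-normalized) IS weights, epsilon = 0
p = np.abs(delta) ** alpha / np.sum(np.abs(delta) ** alpha)
w = (1.0 / (N * p)) ** beta
w /= w.max()

# eq. (7): E_{i~PER}[ w(i) * grad (1/tau)|delta|^tau ]
grad_per = np.sum(p * w * np.sign(delta) * np.abs(delta) ** (tau - 1))

# eq. (8): E_{i~uniform}[ grad L_PER^tau ]
eta = np.min(np.abs(delta) ** (alpha * beta)) / np.sum(np.abs(delta) ** alpha)
grad_uniform = eta * np.sum(np.sign(delta) * np.abs(delta) ** (tau + alpha - alpha * beta - 1))

assert np.isclose(grad_per, grad_uniform)
```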
2.1.4.2 Corollary 3 (the uniform-replay loss equivalent to the Huber loss + PER)
Following the original PER setup: when used together with PER, the Huber loss used by standard DQN has an expected gradient equal to that of the following loss under uniform-sampling replay
$$\mathcal{L}_{\mathrm{PER}}^{\text{Huber}}(\delta(i))=\frac{\eta N}{\tau+\alpha-\alpha \beta}|\delta(i)|^{\tau+\alpha-\alpha \beta}, \quad \tau=\begin{cases} 2 & \text{if }|\delta(i)| \leq 1, \\ 1 & \text{otherwise}, \end{cases} \quad \eta=\frac{\min_{j}|\delta(j)|^{\alpha \beta}}{\sum_{j}|\delta(j)|^{\alpha}} \tag{9}$$
This simply substitutes $\tau = 2$ and $\tau=1$ into equation (6) for the two branches.

To understand Corollary 3 and what it says about PER's optimization target, first consider the following two observations about MSE and $L_1$.
Observation 2 (minimizing the MSE loss estimates the mean of the observed targets, which is unbiased): Let $\mathcal{B}(s,a)\subset \mathcal{B}$ be the subset of transitions containing $(s,a)$ and $\delta(i) = Q(i)-y(i)$. If $\nabla_Q\mathbb{E}_{i\sim\mathcal{B}(s,a)}[0.5\,\delta(i)^2] = 0$, then $Q(s,a) = \operatorname{mean}_{i\in\mathcal{B}(s,a)} y(i)$
proof:
$$\begin{aligned} & \mathbb{E}_{i \sim \mathcal{B}(s, a)}\left[\nabla_{Q}\, 0.5\,\delta(i)^{2}\right]=0 \\ \Rightarrow\ & \mathbb{E}_{i \sim \mathcal{B}(s, a)}[\delta(i)]=0 \\ \Rightarrow\ & \frac{1}{N} \sum_{i \in \mathcal{B}(s, a)} \big(Q(s, a)-y(i)\big)=0 \\ \Rightarrow\ & Q(s, a)-\frac{1}{N} \sum_{i \in \mathcal{B}(s, a)} y(i)=0 \\ \Rightarrow\ & Q(s, a)=\frac{1}{N} \sum_{i \in \mathcal{B}(s, a)} y(i) \end{aligned}$$
Observation 3 (minimizing the $L_1$ loss estimates the median of the observed targets, which is biased but more robust): Let $\mathcal{B}(s,a)\subset \mathcal{B}$ be the subset of transitions containing $(s,a)$ and $\delta(i) = Q(i)-y(i)$. If $\nabla_Q\mathbb{E}_{i\sim\mathcal{B}(s,a)}[|\delta(i)|] = 0$, then $Q(s,a) = \operatorname{median}_{i\in\mathcal{B}(s,a)} y(i)$
proof:
$$\begin{aligned} & \mathbb{E}_{i \sim \mathcal{B}(s, a)}\left[\nabla_{Q}|\delta(i)|\right]=0 \\ \Rightarrow\ & \mathbb{E}_{i \sim \mathcal{B}(s, a)}[\operatorname{sign}(\delta(i))]=0 \\ \Rightarrow\ & \sum_{i \in \mathcal{B}(s, a)} \mathbb{1}\{Q(s, a) \leq y(i)\}=\sum_{i \in \mathcal{B}(s, a)} \mathbb{1}\{Q(s, a) \geq y(i)\} \\ \Rightarrow\ & Q(s, a)=\operatorname{median}_{i \in \mathcal{B}(s, a)} y(i) \end{aligned}$$
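A tiny numerical illustration (my own) of Observations 2 and 3: for a set of targets containing an outlier, the MSE minimizer is the mean while the $L_1$ minimizer is the median, which is far less affected by the outlier:

```python
import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.05, 10.0])      # targets with one outlier
q_grid = np.linspace(-5, 15, 20001)            # candidate Q(s, a) values

mse = ((q_grid[:, None] - y) ** 2).mean(axis=1)
l1 = np.abs(q_grid[:, None] - y).mean(axis=1)

q_mse = q_grid[mse.argmin()]                   # ~ mean(y)   = 2.81
q_l1 = q_grid[l1.argmin()]                     # ~ median(y) = 1.05

assert np.isclose(q_mse, y.mean(), atol=1e-2)
assert np.isclose(q_l1, np.median(y), atol=1e-2)
```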
Based on Corollary 3 and the observations above, the following statements can be made:
- If $\tau+\alpha-\alpha\beta \neq 2$ (i.e., the effective loss is not MSE), then PER's target is biased: the expectation of the estimate differs from the expectation of the target. Observation 2 shows that minimizing MSE yields an estimate of the desired target, the expected TD target $y(i) = r+\gamma\,\mathbb{E}_{s',a'}[Q(s',a')]$. Therefore, optimizing the loss $|\delta(i)|^\tau$ with PER when $\tau+\alpha-\alpha\beta \neq 2$ produces a biased estimate of the target
- Not all estimation biases are equally harmful. Observation 3 shows that minimizing the $L_1$ loss yields the median of the targets rather than their expectation; considering the effects of function approximation and bootstrapping, the median may be seen as a reasonable estimate because it is more robust
- This suggests a possibility: a loss function lying between MSE and $L_1$ can trade off "robustness" against "correctness". From equation (6) we can see:
  - When PER is combined with the $L_1$ loss, since $\alpha\in(0,1]$ and $\beta\in[0,1]$, the exponent of $|\delta(i)|$ in the loss is $1+\alpha-\alpha\beta\in[1,2]$, which realizes exactly this trade-off
  - When PER is combined with the MSE loss, if $\beta<1$ then for $\alpha\in(0,1]$ the exponent of $|\delta(i)|$ in the loss is $2+\alpha-\alpha\beta>2$. This means that although MSE alone is minimized at the mean (least squares), when combined with PER the loss is minimized by some statistic that over-weights outliers. This bias explains the poor performance of PER in standard MSE-based algorithms for continuous-control tasks
- The importance-sampling ratio can be omitted. The original PER uses importance-sampling (IS) weights to reweight the loss and reduce the bias in the expected gradient introduced by prioritization; indeed, with $\beta=1$, MSE combined with PER is unbiased (Observation 2), as the PER paper itself notes. More importantly, the derivation above already accounts for IS, so even without IS ($\beta=0$) the bias caused by non-uniform replay is already eliminated in the resulting equivalent loss
2.2 Proposed methods
First, several conclusions can be drawn from the theoretical analysis above:
- By Theorem 2, with the expected gradient held equal, prioritized replay combined with the $\lambda \mathcal{L}_{L_1}$ loss reduces the gradient variance
- The problem with PER is that it combines prioritized replay with the MSE loss, so in the equivalent uniform-replay loss the exponent of $|\delta(i)|$ is greater than 2 and optimization is overly biased toward outliers (the same reason that least squares is strongly affected by outliers); the original DQN combines uniform replay with the MSE loss, so its estimate is unbiased
- By the analysis of Corollary 3, after converting to the equivalent uniform-replay loss, the exponent of $|\delta(i)|$ should lie between 1 (the $L_1$ loss) and 2 (the MSE loss), so as to balance "robustness" against "correctness"

Based on these conclusions, the authors first design a prioritized sampling scheme called Loss-Adjusted Prioritized Experience Replay (LAP), and then use Corollary 1 to derive its equivalent uniform-replay loss, the Prioritized Approximation Loss (PAL)
2.2.1 Prioritized replay method: LAP
To combine the conclusions of Section 2.2, the authors' idea is: design the loss function as in Theorem 2, i.e. $\mathcal{L}_2 = \mathcal{L}_{L_1}$, and set the replay priority to $|\delta(i)|^\alpha$ with $\alpha\in(0,1]$. By Corollary 1, the uniform-replay loss with equal expected gradient is then
$$\mathcal{L}_1(\delta(i)) = \frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_2(\delta(i)) = \frac{1}{\lambda}\big||\delta(i)|^\alpha\big|_\times|\delta(i)|$$
where $|\cdot|_\times$ is the stop-gradient operator; the exponent of $|\delta(i)|$ in this loss lies between 1 and 2, as desired.

Note: in the earlier analysis the prioritized-replay side used the loss $\lambda\mathcal{L}_2$, whereas $\lambda$ does not appear here, which is a bit confusing. My guess is that rescaling the pair $\mathcal{L}_1$ and $\lambda\mathcal{L}_2$ to $\frac{1}{\lambda}\mathcal{L}_1$ and $\mathcal{L}_2$ does not affect the conclusions, since only that keeps the logic consistent. The replacement is clearly fine for Corollary 2, but the key result, Theorem 2, is not re-proved under it
In practice the plain $L_1$ loss is not ideal, because every update takes a fixed-size step (its gradient is $\pm1$); if the learning rate is too large, the update may repeatedly overstep the target, causing oscillation during optimization. The authors therefore adopt the common Huber loss with $k=1$: when the error falls below the threshold 1, the Huber loss switches from $L_1$ to MSE, which appropriately scales down the gradient near $\delta(i)=0$. To eliminate the bias caused by combining MSE with prioritized replay, uniform-sampling replay is used within this interval, which is achieved by the priority scheme $pr(i) = \max(|\delta(i)|^\alpha,1)$ (low-priority samples are clipped to a priority of at least 1, which makes sampling among them uniform). With this modification we obtain the LAP algorithm, described by the following non-uniform sampling scheme plus Huber loss (a code sketch follows the list below)
$$p(i)=\frac{\max \left(|\delta(i)|^{\alpha}, 1\right)}{\sum_{j} \max \left(|\delta(j)|^{\alpha}, 1\right)}, \qquad \mathcal{L}_{\text{Huber}}(\delta(i))= \begin{cases}0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1 \\ |\delta(i)| & \text{otherwise}\end{cases}$$
- When $|\delta(i)|>1$: $L_1$ loss, with priority $pr(i) = |\delta(i)|^\alpha$
- When $|\delta(i)|\leq1$: MSE loss, replayed with uniform sampling

Besides correcting the outlier bias, the clipping $\max(|\delta(i)|^\alpha,1)$ also reduces the chance of "dead" transitions with $p(i)\approx 0$, so PER's hyperparameter $\epsilon$ is no longer needed. Moreover, LAP retains the variance-reduction property of Theorem 2, because it uses the $L_1$ loss on all samples with large errors
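A minimal sketch (my own, PyTorch-style; the buffer and networks are hypothetical placeholders) of LAP's two ingredients, the clipped priorities and the $k=1$ Huber loss used without IS weights:

```python
import torch
import torch.nn.functional as F

def lap_priorities(td_errors, alpha=0.4):
    # pr(i) = max(|delta(i)|^alpha, 1); used to update the sampler's priorities
    return td_errors.abs().pow(alpha).clamp(min=1.0)

def lap_loss(current_q, target_q):
    # Huber loss with threshold 1: MSE branch below 1 (where sampling is
    # effectively uniform), L1 branch above 1 (where sampling is prioritized)
    return F.smooth_l1_loss(current_q, target_q, beta=1.0)

# Illustrative training step (buffer, critic, and compute_target are assumed helpers):
# batch, idx = buffer.sample(batch_size)           # sampled proportionally to pr
# q, y = critic(batch), compute_target(batch)
# lap_loss(q, y).backward()
# buffer.update_priorities(idx, lap_priorities((q - y).detach()))
```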
2.2.2 Uniform replay method: PAL
Following the idea of the theoretical analysis, the authors transform LAP into a uniform-replay loss function with the same expected gradient, called PAL. The construction follows Corollary 1, i.e. $\mathcal{L}_{\text{PAL}}(\delta(i)) = \frac{1}{\lambda}|pr(i)|_\times \mathcal{L}_{\text{Huber}}(\delta(i))$, which gives
$$\mathcal{L}_{\mathrm{PAL}}(\delta(i))=\frac{1}{\lambda}\begin{cases} 0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1, \\ \frac{|\delta(i)|^{1+\alpha}}{1+\alpha} & \text{otherwise,} \end{cases} \qquad \lambda=\frac{\sum_{j} \max\left(|\delta(j)|^{\alpha}, 1\right)}{N}$$
- When $|\delta(i)|\leq 1$: $pr(i)=1$, so $\mathcal{L}_{\text{PAL}}(\delta(i)) = \frac{1}{\lambda}\cdot 1\cdot 0.5\,\delta(i)^2 = \frac{1}{\lambda}\cdot 0.5\,\delta(i)^2$
- When $|\delta(i)|> 1$: $pr(i) = |\delta(i)|^\alpha$, so $\mathcal{L}_{\text{PAL}}(\delta(i)) = \frac{1}{\lambda}\big||\delta(i)|^\alpha\big|_\times |\delta(i)|$, whose gradient is $\nabla_Q\mathcal{L}_{\text{PAL}} = \frac{1}{\lambda}|\delta(i)|^\alpha\nabla_Q |\delta(i)|$. For simplicity of expression and computation, the stop-gradient is replaced by an equivalent expression that allows gradients, i.e. $\nabla_Q\frac{1}{\lambda}\frac{|\delta(i)|^{1+\alpha}}{1+\alpha} = \frac{1}{\lambda}|\delta(i)|^\alpha\nabla_Q |\delta(i)|$
Observation 4 (LAP and PAL have the same expected gradient)
Proof. From Corollary 1 we have:
$$\begin{aligned} \mathcal{L}_{\text{PAL}}(\delta(i)) &=\frac{1}{\lambda}|p r(i)|_{\times} \mathcal{L}_{\text{Huber}}(\delta(i)) \\ &=\frac{1}{\lambda}\left|\max \left(|\delta(i)|^{\alpha}, 1\right)\right|_{\times} \mathcal{L}_{\text{Huber}}(\delta(i)) \\ &=\frac{1}{\lambda}\left|\max \left(|\delta(i)|^{\alpha}, 1\right)\right|_{\times} \begin{cases}0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1, \\ |\delta(i)| & \text{otherwise,}\end{cases} \\ &=\frac{1}{\lambda} \begin{cases}0.5\, \delta(i)^{2} & \text{if }|\delta(i)| \leq 1, \\ \frac{|\delta(i)|^{1+\alpha}}{1+\alpha} & \text{otherwise,}\end{cases} \end{aligned}$$
where
$$\lambda=\frac{\sum_{j} p r(j)}{N}=\frac{\sum_{j} \max \left(|\delta(j)|^{\alpha}, 1\right)}{N}$$
Then by Corollary 1, LAP and PAL have the same expected gradient.

Note that the exponent in the PAL loss never exceeds 2, which means PER's outlier bias has been eliminated. In settings where the variance reduction brought by prioritized replay is not important, PAL should perform similarly to LAP while being simpler to implement
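A minimal sketch (my own, PyTorch-style) of the PAL loss; in the analysis $\lambda$ is computed over the whole buffer, and approximating it from the current mini-batch, as done here, is an assumption of this sketch:

```python
import torch

def pal_loss(delta, alpha=0.4):
    abs_d = delta.abs()
    lam = abs_d.pow(alpha).clamp(min=1.0).mean().detach()       # lambda ~ mean_j max(|d_j|^a, 1)
    loss = torch.where(abs_d <= 1.0,
                       0.5 * delta ** 2,                        # MSE branch
                       abs_d.pow(1.0 + alpha) / (1.0 + alpha))  # |d|^(1+a)/(1+a) branch
    return (loss / lam).mean()

delta = torch.randn(64, requires_grad=True)   # toy TD errors, sampled uniformly
pal_loss(delta).backward()                    # no IS weights or priority updates needed
```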
3. Experimental results
The authors combine LAP/PAL with the TD3 and SAC methods and test them on MuJoCo continuous-control tasks; the results are shown below

Adding LAP/PAL improves TD3's performance, while adding PER degrades it due to the bias introduced by PER + MSE. LAP and PAL show little difference, indicating that non-uniform replay brings almost no benefit here (its only advantage is variance reduction, which reinforcement learning usually does not care much about).

The authors also combine LAP/PAL with DDQN and test on Atari tasks; the results are shown below

Adding LAP/PAL improves DDQN's performance; LAP outperforms PER, but PAL falls behind PER. This may be because the Atari environments involve long trajectories and sparse rewards, so prioritized replay is still needed there to make the learning signal denser.

In addition, the authors run ablation experiments on PAL with TD3 in the MuJoCo environments. Note that when $\alpha=0$, PAL reduces to a Huber loss scaled by $\frac{1}{\lambda}$. The results show that the full PAL performs best; PAL without the $\frac{1}{\lambda}$ factor comes second; and merely adding $\frac{1}{\lambda}$ to the Huber loss already improves performance significantly, which suggests that the performance gain mainly comes from adjusting the expected gradient
4. Summary
- The main contributions of this paper are:
  - Establishing the equivalence between non-uniform and uniform replay and giving an explicit conversion method. This matters because designing new experience replay methods is more intuitive and simpler when starting from the converted uniform-replay loss function; many later papers on experience replay take this perspective
  - Establishing a theoretical foundation for analyzing non-uniform experience replay; many earlier replay methods such as PER were heuristic and can now be analyzed and optimized within this framework
  - Explaining why PER degrades performance on continuous-control tasks: combining PER with the MSE loss biases the gradient toward outliers; two improved methods are designed accordingly
- In a good non-uniform replay method, the equivalent uniform-replay loss (with equal expected gradient) should raise $|\delta(i)|$ to a power between 1 and 2, balancing "robustness" against "correctness" (this also applies when a weighted uniform-replay loss is designed directly)
- The main advantage of actually doing non-uniform replay is reduced gradient variance, but this advantage is often unimportant. Non-uniform replay is probably most beneficial in sparse-reward environments: the transformed equivalent loss relies on importance-sampling-style reweighting to match the expected gradient direction (similar in spirit to maximum likelihood estimation), which becomes inaccurate when the effective sample size is small; since the effective sample size in sparse-reward environments is very small, non-uniform sampling is still needed there to improve performance
- The analysis in this paper is based on the Bellman (expectation) equation; whether a similar analytical framework can be built for the Bellman optimality equation remains an open question