
Formalizing the Generalization-Forgetting Trade-off in Continual Learning

2022-06-12 07:17:00 Programmer long

The paper can be found here.

1. Introduction

In continual learning, a model learns a sequence of observed tasks by gradually adapting to each one. There are two main objectives: retaining long-term memory and continuing to learn new knowledge. This is the stability-plasticity dilemma, and continual learning tries to strike a balance between the two.
Conventional CL methods either minimize catastrophic forgetting or improve rapid generalization, but rarely consider both at once. They can roughly be divided into three categories: (1) dynamic network architectures, (2) regularization-based approaches, and (3) memory/experience replay. Approaches to the generalization problem include representation learning methods such as meta-learning.
These traditional CL methods only minimize a single loss function and do not explicitly model the trade-off in their optimization setup. The first work to consider this trade-off is the meta-experience replay (MER) algorithm, where the forgetting-generalization balance is posed as gradient alignment. In MER, however, the balance is controlled by several hyperparameters. Two challenges therefore remain: first, there is no general theoretical tool to study the existence and stability of an equilibrium point; second, there is no systematic method to attain this balance point. This paper aims to address both.
The author describes a framework in which the CL problem is first defined as a sequential decision problem that seeks to minimize the total cost over the entire lifetime of the model. At any point in time, future tasks are not yet available, which makes the cost function very difficult to compute. To solve this, Bellman's principle of optimality is used to re-model the CL problem so that it captures both the catastrophic-forgetting cost on previous tasks and the generalization cost on new tasks. Based on this formulation, a balanced continual learning algorithm (BCL) is proposed.

2. Problem Definition

In this paper, $\mathbb{R}$ and $\mathbb{N}$ denote the real numbers and the positive integers, and $\|\cdot\|$ denotes a generic matrix norm. There are $K+1$ tasks in total, indexed by $[0, K]$, $K \in \mathbb{N}$. The model is $g(\cdot)$, which takes parameters $\theta_k$ and makes predictions on the data. A task (say the $k$-th) is written as $\mathcal{Q}_k = \{\mathcal{X}_k, \ell_k\}$, where $\ell_k$ is the corresponding loss. A parameter sequence is written as $u_{k:K} = \{\theta_\tau \in \Omega_\theta, k \le \tau \le K\}$, where $\Omega_\theta$ is a compact set. The superscript $*$ denotes an optimal value (e.g. $\theta^*$).
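To make the notation concrete, here is a minimal Python sketch (my own illustration, not from the paper) of how a task $\mathcal{Q}_k = \{\mathcal{X}_k, \ell_k\}$, the model $g(\cdot)$, and a parameter sequence $u_{0:K}$ might be represented; the `Task` container and the linear model are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class Task:
    """A task Q_k = {X_k, l_k}: data plus its loss function."""
    X: np.ndarray                                     # input data X_k
    y: np.ndarray                                     # targets used inside the loss
    loss: Callable[[np.ndarray, np.ndarray], float]   # l_k(predictions, targets)

def g(X: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """The model g(.): here a linear predictor with parameters theta."""
    return X @ theta

# A sequence of K+1 tasks and the parameter sequence u_{0:K}
K = 2
rng = np.random.default_rng(0)
tasks: List[Task] = [
    Task(X=rng.normal(size=(8, 3)),
         y=rng.normal(size=8),
         loss=lambda p, t: float(np.mean((p - t) ** 2)))
    for _ in range(K + 1)
]
u = [np.zeros(3) for _ in range(K + 1)]   # theta_0, ..., theta_K
```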
The author gives a diagram to illustrate continual learning:
[Image: overview of the continual learning setup]
To make continual learning concrete, a cost is defined that captures catastrophic forgetting and generalization. For every task $k$ this cost is $J_k(\theta_k) = \gamma_k \ell_k + \sum_{\tau=0}^{k-1} \gamma_\tau \ell_\tau$. Here $\ell_\tau$ is evaluated on task $\mathcal{Q}_\tau$ and describes that task's contribution to the sum. To solve the problem at task $k$, we need to find the $\theta_k$ that minimizes $J_k(\theta_k)$. For the whole task sequence, we need to keep the accumulated cost $V_k(u_{k:K}) = \sum_{\tau=k}^{K} \beta_\tau J_\tau(\theta_\tau)$ as small as possible, so the overall continual learning problem is to minimize it:
$$V_k^{(*)} = \min_{u_{k:K}} V_k(u_{k:K}) \tag{1}$$
In this formulation there are two weights that determine a task's contribution: $\gamma_\tau$ weights the contribution of past tasks, and $\beta_\tau$ weights the contribution of future tasks. For this problem to be solvable, (1) must be continuously differentiable. The author then shows that differentiability cannot be guaranteed in the case of infinitely many tasks, and also that a CL problem with infinite memory is NP-hard: CL cannot provide perfect performance over an arbitrarily large number of tasks, so tasks must be prioritized. Naively minimizing this loss would therefore be wrong, and because the earlier data is no longer visible, previously processed tasks cannot be evaluated directly.
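As a worked illustration of the definitions above (a sketch under my own assumptions; the loss values and the uniform weights $\gamma_\tau = \beta_\tau = 1$ are made up), the per-task cost $J_k$ and the accumulated cost $V_k$ can be computed directly from the formulas:

```python
import numpy as np

def J(k, losses, gamma):
    """Per-task cost J_k = gamma_k * l_k + sum_{tau<k} gamma_tau * l_tau,
    where losses[t] is l_t evaluated with the current parameters theta_k."""
    return gamma[k] * losses[k] + sum(gamma[t] * losses[t] for t in range(k))

def V(k, K, per_task_losses, gamma, beta):
    """Accumulated cost V_k = sum_{tau=k}^{K} beta_tau * J_tau(theta_tau),
    where per_task_losses[tau][t] is l_t evaluated under theta_tau."""
    return sum(beta[tau] * J(tau, per_task_losses[tau], gamma)
               for tau in range(k, K + 1))

# Toy numbers: three tasks, uniform weights
K = 2
gamma = np.ones(K + 1)      # weight of past-task contributions
beta = np.ones(K + 1)       # weight of future tasks
per_task_losses = [[0.9], [0.5, 0.7], [0.3, 0.4, 0.6]]   # l_t under theta_tau
print(V(0, K, per_task_losses, gamma, beta))
```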

3. Dynamic Programming for Continual Learning

First, the dynamic-programming formulation is introduced (that is, how the optimal cost behaves at task $k$):
$$\nabla_k\big(V_k^{(*)}\big) = -\min_{\theta_k \in \Omega_\theta}\Big[\beta_k J_k(\theta_k) + \big(\langle \nabla_{\theta_k} V_k^{(*)}, \Delta\theta_k \rangle + \langle \nabla_{x_k} V_k^{(*)}, \Delta x_k \rangle\big)\Big] \tag{2}$$
The derivation of (2) is omitted here; the interested reader can consult the original paper. In the formula, $\nabla_k(V_k^{(*)})$ denotes the total perturbation of $V_k^{(*)}$. In other words, when it equals 0, the new task $k$ does not affect the current solution; put differently, the optimal solution for the previous tasks is also optimal for the new task. The smaller the perturbation, the better the model performs across all tasks, so the goal becomes: minimize the total perturbation. In (2) this perturbation has three parts: the cost of the previous tasks plus task $k$, $J_k(\theta_k)$; the change due to the parameter update, $\langle \nabla_{\theta_k} V_k^{(*)}, \Delta\theta_k \rangle$; and the change in the optimal cost caused by introducing the new task (data set), $\langle \nabla_{x_k} V_k^{(*)}, \Delta x_k \rangle$.
Optimizing (2) amounts to finding the solution of the continual learning problem. Write $H(\Delta x_k, \theta_k) = \beta_k J_k(\theta_k) + \big(\langle \nabla_{\theta_k} V_k^{(*)}, \Delta\theta_k \rangle + \langle \nabla_{x_k} V_k^{(*)}, \Delta x_k \rangle\big)$. Optimizing $H(\Delta x_k, \theta_k)$ then means minimizing the interference caused by new tasks.
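Below is a minimal PyTorch sketch (my own construction, not the authors' code) of how a quantity of the form $H(\Delta x_k, \theta_k)$ could be evaluated: a simple scalar cost stands in for $V_k^{(*)}$ (which is not available in closed form in practice), and the two inner products are taken between its gradients and candidate perturbations $\Delta\theta_k$, $\Delta x_k$.

```python
import torch

torch.manual_seed(0)

# Tiny stand-ins for the model parameters and the data of task k
theta = torch.randn(3, requires_grad=True)     # parameters theta_k
x = torch.randn(8, 3, requires_grad=True)      # task data x_k
y = torch.randn(8)

def cost(theta, x):
    """Stand-in scalar cost (plays the role of J_k and the surrogate for V_k^(*))."""
    return torch.mean((x @ theta - y) ** 2)

beta_k = 1.0
J_k = cost(theta, x)

# Gradients of the (surrogate) optimal cost w.r.t. parameters and data
grad_theta, grad_x = torch.autograd.grad(cost(theta, x), (theta, x))

# Candidate perturbations: a parameter update and a shift in the task data
delta_theta = 0.01 * torch.randn_like(theta)
delta_x = 0.01 * torch.randn_like(x)

# H(dx_k, theta_k) = beta_k * J_k + <grad_theta V, dtheta> + <grad_x V, dx>
H = beta_k * J_k + (grad_theta * delta_theta).sum() + (grad_x * delta_x).sum()
print(H.item())
```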
Next, generalization and forgetting must be balanced for each task $k$. When the model successfully adapts to a new task, it generalizes. The degree of generalization required depends on the difference between the previous task and the new one: the larger the difference between two successive tasks, the more the model must generalize, so the worst-case difference requires maximum generalization. However, a larger difference also increases forgetting, and the worst-case difference produces maximum forgetting. Making the third term of (2) as large as possible therefore yields maximum generalization, with $\Delta x_k$ quantifying the change in the subsequent task. Since $\Delta x_k = x_{k+1} - x_k$ and $x_{k+1}$ is unknown while we are on task $k$, gradient ascent on $\Delta x_k$ is used to estimate this worst-case difference. Then $\theta_k$ is updated iteratively by gradient descent to minimize the resulting maximum generalization-forgetting.
Next, the above problem is written formally. The superscript $i$ denotes the $i$-th iteration on task $k$.
[Image: equation (3), the max-min formulation over $(\Delta x_k, \theta_k)$]
In (3) we look for a pair $(\Delta x_k^{(*)}, \theta_k^{(*)})$ in which $\Delta x_k^{(*)}$ maximizes $H$ and $\theta_k^{(*)}$ minimizes $H$ (this is the two-player game mentioned above, whose balance we seek). The pair $(\Delta x_k^{(*)}, \theta_k^{(*)})$ is our solution and must satisfy the following condition:
$$H(\Delta x_k^{(*)}, \theta_k^{(i)}) \ge H(\Delta x_k^{(*)}, \theta_k^{(*)}) \ge H(\Delta x_k^{(i)}, \theta_k^{(*)}) \tag{4}$$
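To make condition (4) tangible, here is a toy numerical check on a hand-made function $H(\Delta x, \theta) = -\Delta x^2 + \theta^2 + \Delta x\,\theta$ (purely illustrative, not the paper's $H$), which is concave in $\Delta x$, convex in $\theta$, and has its saddle point at $(0, 0)$:

```python
def H(dx, theta):
    # Toy surrogate with a saddle point at (0, 0): concave in dx, convex in theta
    return -dx**2 + theta**2 + dx * theta

dx_star, theta_star = 0.0, 0.0
for dx_i, theta_i in [(0.5, -0.3), (-1.0, 2.0), (0.1, 0.1)]:
    left = H(dx_star, theta_i)      # perturb only the minimizing player
    mid = H(dx_star, theta_star)    # value at the equilibrium
    right = H(dx_i, theta_star)     # perturb only the maximizing player
    assert left >= mid >= right, (left, mid, right)
print("condition (4) holds around the saddle point of the toy H")
```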

3.1 Theoretical Analysis

The goal now is to find the point we want, i.e. to attain the balance.
[Image: sketch of the author's argument for finding the equilibrium point]
The figure above briefly describes the author's approach. First, $\theta$ is held fixed and a neighborhood $\mathcal{M}_k = \{\theta_k^{(.)}, \Omega_x\}$ is constructed; within this region, gradient ascent on $\Delta x_k^{(i)}$ finds a local maximum. Next, the $\Delta x_k^{(.)}$ just found is held fixed and $\theta$ is updated by gradient descent. The author proves that, over the region $\mathcal{N}_k = \{\Omega_\theta, \Delta x_k^{(.)}\}$, $H$ attains a minimum and the iteration converges into this region. Finally, within these two regions there is at least one equilibrium point.
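A minimal sketch of the two-phase idea just described, again on a toy saddle function rather than the paper's $H$: hold $\theta$ fixed and take gradient-ascent steps in $\Delta x_k$, then hold $\Delta x_k$ fixed and take gradient-descent steps in $\theta$, and repeat; for this toy function the iterates settle at the equilibrium $(0, 0)$.

```python
import torch

# Toy H with a saddle point at (0, 0): ascend in dx, descend in theta
def H(dx, theta):
    return -dx**2 + theta**2 + dx * theta

dx = torch.tensor(1.5, requires_grad=True)       # plays the maximizer (Delta x_k)
theta = torch.tensor(-2.0, requires_grad=True)   # plays the minimizer (theta_k)
lr = 0.1

for _ in range(200):
    # Phase 1: fix theta, gradient ascent on dx (worst-case task change)
    g_dx, = torch.autograd.grad(H(dx, theta), dx)
    with torch.no_grad():
        dx += lr * g_dx
    # Phase 2: fix dx, gradient descent on theta (minimize the interference)
    g_theta, = torch.autograd.grad(H(dx, theta), theta)
    with torch.no_grad():
        theta -= lr * g_theta

print(dx.item(), theta.item())   # both approach 0, the equilibrium point
```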

3.2 Balanced Continual Learning (BCL)

Based on this, $H$ is approximated as $H(\Delta x_k^{(i)}, \theta_k^{(i)}) \approx \beta_k J_k(\theta_k^{(i)}) + \big(J_k(\theta_k^{(i+\zeta)}) - J_k(\theta_k^{(i)})\big) + \big(J_{k+\zeta}(\theta_k^{(i)}) - J_k(\theta_k^{(i)})\big)$, where $J_{k+\zeta}$ corresponds to the update of player 1 and $\theta_k^{(i+\zeta)}$ to the update of player 2.
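Here is one way the approximation above could be evaluated (my own reading of the formula, with simplified stand-ins): $J_k(\theta_k^{(i+\zeta)}) - J_k(\theta_k^{(i)})$ compares the cost before and after $\zeta$ look-ahead parameter updates, while $J_{k+\zeta}(\theta_k^{(i)}) - J_k(\theta_k^{(i)})$ compares the cost on a perturbed, future-like batch with the cost on the current data; the linear model and the random data perturbation are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 4)                    # data for task k
y = torch.randn(16)
theta = torch.zeros(4, requires_grad=True)
beta_k, zeta, lr = 1.0, 3, 0.05

def J(theta, X, y):
    """Stand-in per-task cost J_k: mean squared error of a linear model."""
    return torch.mean((X @ theta - y) ** 2)

# J_k(theta^(i)) on the current task data
J_i = J(theta, X, y)

# Player 2: look ahead zeta gradient steps on theta -> J_k(theta^(i+zeta))
theta_ahead = theta.detach().clone().requires_grad_(True)
for _ in range(zeta):
    g, = torch.autograd.grad(J(theta_ahead, X, y), theta_ahead)
    with torch.no_grad():
        theta_ahead -= lr * g
J_theta_ahead = J(theta_ahead, X, y)

# Player 1: a perturbed batch standing in for the worst-case future task data
X_future = X + 0.1 * torch.randn_like(X)   # assumption: stand-in for x_{k+zeta}
J_data_ahead = J(theta, X_future, y)

H_approx = beta_k * J_i + (J_theta_ahead - J_i) + (J_data_ahead - J_i)
print(H_approx.item())
```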
[Image: equation (5), the update rules for the two players]
The two players are selected and updated according to (5) above. The full algorithm is shown in the figure below:
[Image: pseudocode of the BCL algorithm]
Here $\mathcal{D}_N(k)$ is the data of the new task and $\mathcal{D}_P(k)$ denotes samples drawn from past tasks. For each batch $b_N \in \mathcal{D}_N(k)$ we also sample a batch $b_P \in \mathcal{D}_P(k)$ and combine the two into $b_{PN}(k)$, which is then used for the update following the idea above.
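A minimal sketch of this sampling scheme (not the authors' code; the `memory` buffer, the linear model, and the plain SGD step are simplified stand-ins for the BCL update): each new-task batch $b_N$ is paired with a batch $b_P$ sampled from past tasks, and the combined batch $b_{PN}$ drives the parameter update.

```python
import random
import torch

torch.manual_seed(0)
random.seed(0)

memory = []                                    # D_P(k): samples kept from past tasks
theta = torch.zeros(4, requires_grad=True)
lr, batch_size = 0.05, 8

def loss(theta, X, y):
    return torch.mean((X @ theta - y) ** 2)

def train_task(X_new, y_new):
    """One pass over the new task D_N(k), replaying past samples alongside it."""
    global memory
    for start in range(0, len(X_new), batch_size):
        bN_X, bN_y = X_new[start:start + batch_size], y_new[start:start + batch_size]
        if memory:                              # sample b_P from past tasks, if any
            idx = random.sample(range(len(memory)), min(batch_size, len(memory)))
            bP_X = torch.stack([memory[i][0] for i in idx])
            bP_y = torch.stack([memory[i][1] for i in idx])
            bPN_X = torch.cat([bP_X, bN_X])     # combined batch b_PN(k)
            bPN_y = torch.cat([bP_y, bN_y])
        else:
            bPN_X, bPN_y = bN_X, bN_y
        g, = torch.autograd.grad(loss(theta, bPN_X, bPN_y), theta)
        with torch.no_grad():
            theta -= lr * g                     # simplification: plain SGD step
    memory += list(zip(X_new, y_new))           # store the new task's samples

# Two toy tasks
for _ in range(2):
    train_task(torch.randn(32, 4), torch.randn(32))
print(theta)
```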

4. Code Details

The authors did not release their own code this time (or I did not find it), so I plan to implement it myself; the code may be posted in an update tomorrow.


Copyright notice
This article was written by [Programmer long]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/03/202203010557274439.html