Mitigating Forgetting in Online Continual Learning with Neuron Calibration: paper analysis + code reading
2022-06-12 07:18:00 【Programmer long】
The paper can be found here.
It is a paper from Baidu Research.
1. Introduction
To address catastrophic forgetting, this paper focuses on replay-based approaches (for example, GEM and MEGA, which we discussed before). These methods allow the model limited access to data from past tasks so that it can rehearse past experience. However, replay-based methods easily suffer from data imbalance, i.e. the stability-plasticity dilemma: on the one hand, the model may be so constrained by past knowledge that it cannot learn new knowledge quickly; on the other hand, past knowledge may fade away as learning proceeds.
In this paper, the authors tackle the problem from a new angle, seeking the balance between stability and plasticity through neuron calibration. Concretely, neuron calibration mathematically adjusts the transformation function of each layer of a deep neural network. It regularizes parameter updates with a trainable soft mask to prevent catastrophic forgetting, and it affects both the forward inference path and the backward optimization path, i.e. both how the model predicts and how it is trained. In other words, the paper trains a shared calibration model that interleaves data from different task distributions so the model is optimized effectively, instead of preserving task knowledge by keeping a separate set of parameters for each task.
2. Related work
Existing methods for dealing with catastrophic forgetting mainly fall into three categories:
Episodic-memory replay: store part of the past data in an episodic memory for future rehearsal. Memory-based approaches handle catastrophic forgetting well, but with a limited memory budget they are prone to interference.
Regularization-based methods: extend the loss function used in continual learning to selectively consolidate past knowledge stored in the model parameters. These approaches use parameter-importance information to identify parameters that matter more for past tasks, so as to avoid forgetting.
Dynamic architectures: avoid catastrophic forgetting by, roughly speaking, training a separate network for each task.
3. NCCL (neuron calibration for online continual learning)
Symbol definition
$\{\mathcal{T}_1,...,\mathcal{T}_T\}$: an online continual learning task sequence. Each task is given a small amount of storage space to save past data.
$\mathcal{M}_t$: the episodic memory saved while training task $t$, containing part of the data of task $t$.
$\{\theta_i\}^L_{i=1}$: the parameters of each layer of an $L$-layer neural network.
3.1 Neuron calibration
By applying neuron calibration, the goal is to adapt the transformation functions of the deep network layers, so as to mitigate catastrophic changes of the model parameters and keep knowledge from different tasks within a stable range. Concretely, the paper calibrates the two most common layer types: fully connected layers and convolution layers. The authors give a diagram to illustrate how this works.
The figure shows two calibration modules. The first is the weight calibration module (WCM); the second is the feature calibration module (FCM). The WCM learns to scale the weights of the transformation function, while the FCM learns to scale the output feature maps that the transformation function predicts. For notation, $\theta_i$ denotes the parameters before the WCM and $\tilde{\theta}_i$ the parameters after the WCM; similarly, $h_i, \tilde{h}_i$ denote the output feature maps before and after the FCM.
WCM
Let $\Omega_{\psi_i}(\cdot)$ denote the weight calibration function deployed at the $i$-th layer, with parameters $\psi_i$. Weight calibration is designed to be modular: element-wise multiplication is applied between the base network parameters and the calibration parameters, as follows:
$$\Omega_{\psi_i}(\theta_i)=\begin{dcases} tile(\psi_i)\odot \theta_i & \psi_i\in\mathbb{R}^{O\times I} \ \ \ \text{(Convolution Layer)}\\ tile(\psi_i)\odot \theta_i & \psi_i\in\mathbb{R}^{O} \ \ \ \text{(Fully Connected Layer)} \end{dcases} \tag{1}$$
Here $O$ and $I$ denote the numbers of output and input channels. To keep the number of calibration parameters small, $\psi_i$ is much smaller than $\theta_i$ and is expanded with the $tile$ function (an operation that repeats a tensor along given dimensions). Weight calibration plays a crucial role in both directions: in forward propagation it scales the base network parameter values used for prediction, and in backward propagation it regularizes the update of important parameters by acting as a priority weight ($\nabla_{\theta_i}\mathcal{L}_b$ is derived as $\nabla_{\tilde{\theta}_i}\mathcal{L}_b\odot tile(\psi_i)$, i.e. the gradient is scaled by the calibrator parameters).
After weight calibration, the output of the $i$-th layer is:
$$h_i = \mathcal{F}_{\tilde{\theta}_i}(\tilde{h}_{i-1}) \quad s.t. \quad \tilde\theta_i =\Omega_{\psi_i}(\theta_i) \tag{2}$$
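To make Eq. (1)-(2) concrete, here is a minimal PyTorch sketch of weight calibration for a convolution layer; the shapes follow the convention used in the repository code later in this post, and the tensors themselves are made-up placeholders.

import torch

O, I, k = 8, 3, 3                                  # output channels, input channels, kernel size
theta = torch.randn(O, I, k, k)                    # base conv weight theta_i
psi = torch.ones(O, I, 1, 1)                       # calibration parameters psi_i in R^{O x I}

# WCM (Eq. 1): tile psi over the kernel window, then multiply element-wise
theta_tilde = torch.tile(psi, (1, 1, k, k)) * theta

# Eq. 2: the layer transformation now uses the calibrated weight
x = torch.randn(4, I, 32, 32)                      # input feature map h_{i-1}
h = torch.nn.functional.conv2d(x, theta_tilde, padding=1)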
FCM
After the WCM and the layer transformation we obtain a feature output, and the FCM is applied to it next. Let $\Omega_{\lambda_i}(\cdot)$ denote the FCM function. The FCM multiplies the calibration parameters with the output features, as follows:
$$\Omega_{\lambda_i}(h_i)=\begin{dcases} tile(\lambda_i)\odot h_i & \lambda_i\in\mathbb{R}^{O} \ \ \ \text{(Convolution Layer)}\\ \lambda_i\odot h_i & \lambda_i\in\mathbb{R}^{O} \ \ \ \text{(Fully Connected Layer)} \end{dcases} \tag{3}$$
After this, similar to a residual connection in ResNet, the calibrated feature output and the original feature output are added together.
Therefore, the complete processing from layer $i-1$ to layer $i$ is:
$$\tilde{h}_i = \sigma(\mathcal{BN}(\Omega_{\lambda_i}(\mathcal{F}_{\tilde{\theta}_i}(\tilde{h}_{i-1}))\oplus \mathcal{F}_{\tilde{\theta}_i}(\tilde{h}_{i-1}))) \quad s.t. \quad \tilde\theta_i =\Omega_{\psi_i}(\theta_i) \tag{4}$$
where $\mathcal{BN}$ is batch normalization and $\sigma$ is the activation function.
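The code-reading section below walks through the convolutional version of this block; as an illustrative counterpart, here is a minimal sketch of Eq. (4) for a fully connected layer. The class name, shapes, and structure are assumptions made for this example, not the repository's API.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CalibratedLinear(nn.Module):
    # Sketch of one calibrated fully connected layer following Eq. (1)-(4).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features, bias=False)
        self.bn = nn.BatchNorm1d(out_features)
        # psi (WCM) and lambda (FCM), both trainable soft masks of size O
        self.psi = nn.Parameter(torch.ones(out_features, 1))
        self.lam = nn.Parameter(torch.ones(out_features))

    def forward(self, h_prev):
        # WCM: theta_tilde = tile(psi) * theta (element-wise)   (Eq. 1)
        theta_tilde = torch.tile(self.psi, (1, self.fc.in_features)) * self.fc.weight
        h = F.linear(h_prev, theta_tilde)                       # F_{theta_tilde}(h_{i-1})  (Eq. 2)
        h = self.lam * h + h                                    # FCM plus residual add     (Eq. 3)
        return torch.relu(self.bn(h))                           # BN and activation         (Eq. 4)

layer = CalibratedLinear(16, 32)
out = layer(torch.randn(8, 16))                                 # -> shape (8, 32)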
3.2 Parameter learning
With the calibrated layers in place, the loss function also has to be adapted so that the parameters are updated properly, using the Fisher information from EWC as the basis. Consolidation happens while the base model parameters are trained to absorb new knowledge, and past knowledge is rehearsed by replaying the data in the episodic memory. The loss is computed as follows:
$$\mathcal{L}_c(\{\psi,\lambda,\theta\},(x,y,k)) = \underbrace{\frac{1}{2}vec(\tilde{\theta}-\tilde{\theta}^t)^T\Lambda_t\,vec(\tilde{\theta}-\tilde{\theta}^t)}_{term(a)}+\underbrace{\beta D_{KL}\left(S\left(\frac{\hat{z}}{\tau}\right) \parallel S\left(\frac{\hat{z}_k}{\tau}\right)\right)}_{term(b)} \tag{5}$$
In this loss, $\beta$ is a balancing coefficient, $S(\cdot)$ is the softmax function, $\tau$ is the softmax distillation temperature, $\hat{z}$ is the prediction for the current task, and $\hat{z}_k$ is the (stored) prediction for previous task $k$. $vec(\cdot)$ flattens its argument into a vector.
$\Lambda_t$ is the Fisher information, as in EWC, estimated from the knowledge-distillation loss on the stored data. Term (a) softly freezes the parts of the weights that matter for past tasks, which guards against catastrophic forgetting, while term (b) provides stability during training.
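As a small illustration of term (b), the sketch below computes the temperature-scaled distillation loss in PyTorch, following the calling convention used later in the repository code (log-probabilities of the current prediction as input, stored probabilities as target). All tensors here are placeholders.

import torch
import torch.nn.functional as F

tau, beta = 2.0, 1.0                        # distillation temperature and balance weight
z_hat = torch.randn(32, 10)                 # current model's logits on replayed samples
z_k_hat = torch.randn(32, 10)               # logits stored when the old task was learned

# term (b) of Eq. (5): beta-weighted KL between the temperature-softened distributions
term_b = beta * F.kl_div(
    F.log_softmax(z_hat / tau, dim=1),      # input must be log-probabilities
    F.softmax(z_k_hat / tau, dim=1),        # target given as probabilities
    reduction="batchmean",
)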
3.3 Optimization
The optimization of NCCL is similar to MAML: it is split into an inner optimization and an outer optimization. The inner loop updates $\theta$, while the outer loop updates $\psi,\lambda$. The optimization objectives are:
$$\text{Outer Loop: }(\psi^*,\lambda^*) = \arg\min_{(\psi,\lambda)}\mathcal{L}_c((\psi,\lambda),\theta^*,\mathcal{M}_{<t}) \tag{6}$$
$$\text{Inner Loop: }\theta^* = \arg\min_{\theta}\mathcal{L}_b((\psi,\lambda),\theta,\mathcal{M}_{\le t}) \tag{7}$$
Finally, each loop performs gradient updates with its own learning rate.
The complete algorithm process is shown in the figure :
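As a rough, first-order illustration of the bi-level update in Eq. (6)-(7), the toy sketch below uses a plain linear model and ignores second-order terms; the real training loop from the repository is shown in the code-reading section.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
theta = torch.randn(2, 10, requires_grad=True)        # base weight (classes x features), inner loop
psi = torch.ones(2, 1, requires_grad=True)            # calibration mask, one scale per output unit, outer loop
inner_lr, outer_lr = 0.1, 0.01

x_new, y_new = torch.randn(16, 10), torch.randint(0, 2, (16,))   # current-task batch
x_mem, y_mem = torch.randn(16, 10), torch.randint(0, 2, (16,))   # replayed memory batch

def predict(x, theta, psi):
    theta_tilde = torch.tile(psi, (1, theta.shape[1])) * theta    # WCM (Eq. 1)
    return F.linear(x, theta_tilde)

# inner loop (Eq. 7): update theta on current data plus memory, psi held fixed
inner_loss = F.cross_entropy(predict(torch.cat([x_new, x_mem]), theta, psi),
                             torch.cat([y_new, y_mem]))
g_theta, = torch.autograd.grad(inner_loss, theta)
theta_star = theta - inner_lr * g_theta

# outer loop (Eq. 6): update the calibration parameters on memory only, with the updated theta
outer_loss = F.cross_entropy(predict(x_mem, theta_star, psi), y_mem)
g_psi, = torch.autograd.grad(outer_loss, psi)
psi = (psi - outer_lr * g_psi).detach().requires_grad_()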
4. Code reading
The authors' GitHub code can be found here.
The main difficulty of this paper lies in how the network layers are built, i.e. the WCM and FCM modules and the per-layer processing, so let's look at that code first.
The structure of each layer is:
$$\tilde{h}_i = \sigma(\mathcal{BN}(\Omega_{\lambda_i}(\mathcal{F}_{\tilde{\theta}_i}(\tilde{h}_{i-1}))\oplus \mathcal{F}_{\tilde{\theta}_i}(\tilde{h}_{i-1}))) \quad s.t. \quad \tilde\theta_i =\Omega_{\psi_i}(\theta_i)$$
where the WCM is:
$$\Omega_{\psi_i}(\theta_i)=\begin{dcases} tile(\psi_i)\odot \theta_i & \psi_i\in\mathbb{R}^{O\times I} \ \ \ \text{(Convolution Layer)}\\ tile(\psi_i)\odot \theta_i & \psi_i\in\mathbb{R}^{O} \ \ \ \text{(Fully Connected Layer)} \end{dcases}$$
and the FCM is:
$$\Omega_{\lambda_i}(h_i)=\begin{dcases} tile(\lambda_i)\odot h_i & \lambda_i\in\mathbb{R}^{O} \ \ \ \text{(Convolution Layer)}\\ \lambda_i\odot h_i & \lambda_i\in\mathbb{R}^{O} \ \ \ \text{(Fully Connected Layer)} \end{dcases}$$
First, the parameters of each layer are defined. Here the author wraps two CNN layers (together with their calibration) into one block, as follows:
class CalibratedBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1, activation='relu', norm='batch_norm', downsample=None):
        super(CalibratedBlock, self).__init__()
        ## first CNN layer
        self.conv1 = conv3x3(in_planes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        ## second CNN layer
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.stride = stride
        self.sigma = 0.05
        self.downsample = downsample
        ## calibration parameters: cw (weight calibration of the WCM), cb (convolution bias), cf (output scaling of the FCM)
        self.calib_w_conv1 = torch.nn.Parameter(torch.ones(planes, in_planes, 1, 1), requires_grad=True)
        self.calib_b_conv1 = torch.nn.Parameter(torch.zeros([planes]), requires_grad=True)
        self.calib_f_conv1 = torch.nn.Parameter(torch.ones([1, planes, 1, 1]), requires_grad=True)
        self.calib_w_conv2 = torch.nn.Parameter(torch.ones(planes, planes, 1, 1), requires_grad=True)
        self.calib_b_conv2 = torch.nn.Parameter(torch.zeros([planes, 1, 1, 1]), requires_grad=True)
        self.calib_f_conv2 = torch.nn.Parameter(torch.ones([1, planes, 1, 1]), requires_grad=True)
        ## register them as parameters of the module
        self.register_parameter('calib_w_conv1', self.calib_w_conv1)
        self.register_parameter('calib_b_conv1', self.calib_b_conv1)
        self.register_parameter('calib_f_conv1', self.calib_f_conv1)
        self.register_parameter('calib_w_conv2', self.calib_w_conv2)
        self.register_parameter('calib_b_conv2', self.calib_b_conv2)
        self.register_parameter('calib_f_conv2', self.calib_f_conv2)
        # an ordinary shortcut conv (as in ResNet) when the input and output shapes differ
        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion * planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion * planes)
            )
        self.activation = activation
        self.norm = norm
Next, let's look at the forward pass:
def forward(self, x):
    # pick the activation function first
    if self.activation == 'relu':
        activation = nn.functional.relu
    elif self.activation == 'leaky_relu':
        activation = nn.functional.leaky_relu
    else:
        activation = None
    [dim0, dim1] = self.conv1.weight.shape[2:]
    calibrated_conv1 = self.calib_w_conv1
    ## tile operation: expand psi to the kernel size
    this_ss_weights = torch.tile(calibrated_conv1, (1, 1, dim0, dim1))
    ## element-wise multiply to get the calibrated weight (WCM)
    cw = self.conv1.weight * this_ss_weights
    ## convolve with the calibrated weight
    conv_output = torch.nn.functional.conv2d(x, cw, stride=self.stride,
                                             padding=1, bias=self.calib_b_conv1.squeeze())
    ## after computing h, apply the FCM
    [dim1, dim2] = conv_output.shape[2:]
    this_scale_weights = torch.tile(self.calib_f_conv1, (conv_output.shape[0], 1, dim1, dim2))
    conv_output = conv_output * this_scale_weights
    # normalize
    if self.norm == 'batch_norm':
        normed = self.bn1(conv_output)
    elif self.norm == 'layer_norm':
        normed = torch.nn.functional.layer_norm(conv_output, conv_output.shape[1:])
    else:
        normed = conv_output
    ## finally, apply the activation
    out = activation(normed)
    #### repeat the same full WCM + FCM processing on this output for the second conv layer
    # second conv layer
    [dim0, dim1] = self.conv2.weight.shape[2:]
    # epsilon_weight = torch.randn(self.masked_conv2.shape).to(self.masked_conv2.device) * self.sigma
    calibrated_conv2 = self.calib_w_conv2  # + epsilon_weight * self.masked_conv2_sigma
    this_ss_weights = torch.tile(calibrated_conv2, (1, 1, dim0, dim1))
    cw = self.conv2.weight * this_ss_weights
    # <resnet_conv_block_scale>
    conv_output = torch.nn.functional.conv2d(out, cw,
                                             stride=1, padding=1, bias=self.calib_b_conv2.squeeze())
    [dim1, dim2] = conv_output.shape[2:]
    this_scale_weights = torch.tile(self.calib_f_conv2, (conv_output.shape[0], 1, dim1, dim2))
    conv_output = conv_output * this_scale_weights
    # normalize
    if self.norm == 'batch_norm':
        normed = self.bn2(conv_output)
    elif self.norm == 'layer_norm':
        normed = torch.nn.functional.layer_norm(conv_output, conv_output.shape[1:])
    else:
        normed = conv_output
    out = activation(normed)
    # residual
    ## shortcut branch on the original input
    residual = self.shortcut(x)
    ## add the residual directly
    return out + residual
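A quick usage sketch for this block (the conv3x3 helper is not shown in the excerpt above, so a standard ResNet-style definition is assumed here):

import torch
import torch.nn as nn

# assumed definition of the conv3x3 helper used by CalibratedBlock (standard ResNet style)
def conv3x3(in_planes, planes, stride=1):
    return nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)

block = CalibratedBlock(in_planes=16, planes=32, stride=2)
out = block(torch.randn(4, 16, 32, 32))      # -> shape (4, 32, 16, 16)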
Then let's look at the corresponding inner update and outer update
First, the inner update. Suppose the current task is task t. loss1 is the cross-entropy loss on task t's data. Then part of the stored data from tasks [1, t) is sampled (the author samples randomly: a few examples from the first task, a few from the second, and so on). On this replayed data we compute loss2 (cross-entropy) and loss3 (the KL divergence). loss2 and loss3 represent learning from old tasks and are added to loss1. The process is as follows:
for step in range(self.inner_steps):
    self.zero_grad()
    self.opt.zero_grad()
    offset1, offset2 = self.compute_offsets(t)
    copy_net = copy.deepcopy(self.net)
    # select data from the current task and compute the loss
    if step == 0:
        pred = self.forward(x, t)
        pred = pred[:, offset1:offset2]
        yy = y - offset1
    elif self.count >= step * self.batch_size:
        xx, yy, _, mask, list_t = self.memory_sampling(t, self.batch_size, intra_class=True)
        pred = self.net(xx)
        pred = torch.gather(pred, 1, mask)
    else:
        pred = self.forward(x, t)
        pred = pred[:, offset1:offset2]
        yy = y - offset1
        # return 0.0
    loss1 = self.bce(pred, yy)
    ## sample data from old tasks and compute their losses
    if t > 0:
        xx, yy, feat, mask, list_t = self.memory_sampling(t, self.replay_batch_size)
        pred_ = self.net(xx)
        pred = torch.gather(pred_, 1, mask)
        ## cross-entropy loss on the old tasks
        loss2 = self.bce(pred, yy)
        ## KL divergence; feat holds the stored softmax outputs of the replayed data
        loss3 = self.reg * self.kl(F.log_softmax(pred / self.temp, dim=1), feat)
        loss = loss1 + (loss2 + loss3) * self.gamma
    else:
        loss = loss1
    ## gradient update
    grads = torch.autograd.grad(loss, self.net.base_param(), create_graph=True, allow_unused=True, retain_graph=True)
    # update only theta (the base parameters)
    num_none, num_grad = 0, 0
    for param, grad in zip(self.net.base_param(), grads):
        if grad is not None:
            new_param = param.data.clone()
            if self.inner_clip > 0:
                grad.data.clamp_(-self.inner_clip, self.inner_clip)
            new_param = new_param - self.inner_lr * grad
            param.data.copy_(new_param)
            num_grad += 1
        else:
            num_none += 1
After the inner update we handle the outer loss. The outer step likewise samples some old-task data and computes the KL distillation loss on it. The gradient of this loss is then used to estimate the Fisher information as in EWC. Finally, the Fisher information is used to compute the regularization loss (term (a)), and that loss is used to update $\psi$ and $\lambda$.
if t > 0:
    self.net.zero_grad()
    self.opt.zero_grad()
    xval, yval, feat, mask, list_t = self.memory_sampling(t, self.batch_size)
    pred_ = self.net(xval)
    pred_ = torch.gather(pred_, 1, mask)
    # 1st loss: distillation loss on the replayed data
    outer_loss = self.reg * self.kl(F.log_softmax(pred_ / self.temp, dim=1), feat)
    outer_grad = torch.autograd.grad(outer_loss, self.net.context_param() + self.net.base_weight_params(),
                                     retain_graph=True, allow_unused=True,)
    # 2nd loss: EWC-style regularization built from the gradients above
    old_masked_params, _, _ = copy_net.base_and_calibrated_params()
    cur_masked_params, cur_tiled_mask_params, cur_base_params = self.net.base_and_calibrated_params()
    reg = self.beta * self.reg  # * self.reg
    ewc_loss = 0.0
    num_meta_params = len(self.net.context_param())
    for ii, p in enumerate(cur_masked_params):
        ## compute the Fisher information: (p.grad / tile(psi))^2
        pg = (outer_grad[num_meta_params + ii].data / (cur_tiled_mask_params[ii].data + 1e-12)).pow(2)
        cur_loss = reg * pg.detach() * (p - old_masked_params[ii].data.clone()).pow(2)
        ewc_loss += cur_loss.sum()
    ewc_loss.backward()
    self.opt.step()