【CVPR2022 oral】Balanced Multimodal Learning via On-the-fly Gradient Modulation
2022-07-28 05:01:00 【AI frontier theory group @ouc】

Paper: https://arxiv.org/abs/2203.15332
Code: https://github.com/GeWu-Lab/OGM-GE_CVPR2022
This work comes from the GeWu-Lab at Renmin University of China. It was accepted to CVPR 2022 as an Oral presentation, and the code has been open-sourced.
1. Research motivation
Using multimodal data for classification helps improve performance. In practice, however, existing methods fail to fully exploit the potential of multimodal data. (As shown in the figure below, the modality-specific encoders inside a multimodal model perform worse than their unimodal counterparts, indicating that existing models under-exploit unimodal feature representations.)

The authors argue that the main cause of this phenomenon is that one of the two modalities plays a dominant role while the other plays only an auxiliary one, and the dominant modality suppresses the optimization of the other. To address this problem, the authors propose OGM-GE (On-the-fly Gradient Modulation with Generalization Enhancement).
The authors give a vivid analogy: in a team-pursuit speed-skating race, the team's result is recorded when its last member crosses the line. Ordinarily, the skater representing the dominant modality neither communicates with nor assists the slower skater of the weak modality; even after reaching the finish first (training convergence), it still has to stop and wait for the weaker teammate (upper half of Figure 2). Under the regulation of OGM-GE, the dominant skater slows down to pull the weak skater along; the weak skater, helped along (for example, by drafting behind the leader), speeds up, and the overall performance of the team improves (lower half of Figure 2).

2. Main method
To address the imbalanced-optimization problem in multimodal classification, the authors adaptively modulate the gradients according to the performance gap between modalities and combine this with Gaussian noise to enhance generalization, yielding the widely applicable OGM-GE optimization method. The overall framework is shown in the figure below.

First, the gradient of each modality is adjusted adaptively by monitoring the difference between their contributions to the learning objective. To effectively measure the gap in unimodal representation ability within a multimodal model, the authors design a modality discrepancy ratio. During training, different gradient scaling coefficients are dynamically assigned to the modalities according to this ratio, so the optimization process of the model can be adaptively controlled throughout training.
Second, the authors' analysis shows that gradient scaling coefficients smaller than 1 weaken the intensity of the stochastic gradient noise during optimization, which may in turn hurt the model's generalization ability. A noise-enhancement strategy is therefore added on top of the gradient modulation: Gaussian noise is added to the gradient to restore (or even enhance) the stochastic gradient noise intensity, thereby improving the generalization performance of the model.
(1) On-the-fly gradient modulation
There are two modality encoders, denoted $\varphi^{u}$ with $u \in \{a, v\}$, corresponding to the audio and video modalities; the parameters of encoder $\varphi^{u}$ are $\theta^{u}$. Optimizing with plain gradient descent gives:
$$\theta_{t+1}^{u} = \theta_{t}^{u} - \eta \nabla_{\theta^{u}} L(\theta_{t}^{u}).$$
The authors' idea is to adaptively adjust the optimization speed of each modality. To this end, they define a discrepancy ratio $\rho^{u}_{t}$ based on per-sample scores:
$$s_{i}^{a} = \sum_{k=1}^{M} 1_{k=y_{i}} \cdot \mathrm{softmax}\!\left(W^{a}_{t} \cdot \varphi^{a}_{t}(\theta^{a}, x_{i}^{a}) + \frac{b}{2}\right)_{k},$$
$$s_{i}^{v} = \sum_{k=1}^{M} 1_{k=y_{i}} \cdot \mathrm{softmax}\!\left(W^{v}_{t} \cdot \varphi^{v}_{t}(\theta^{v}, x_{i}^{v}) + \frac{b}{2}\right)_{k},$$
$$\rho^{v}_{t} = \frac{\sum_{i \in B_{t}} s_{i}^{v}}{\sum_{i \in B_{t}} s_{i}^{a}}.$$
Since there are exactly two modalities, audio and video, $\rho^{a}_{t}$ is defined as the reciprocal of $\rho^{v}_{t}$. The authors use $\rho^{u}_{t}$ to dynamically monitor the contribution gap between the modalities, and derive the adaptive gradient scaling coefficient as follows:
$$k^{u}_{t} = \begin{cases} 1 - \tanh(\alpha \cdot \rho^{u}_{t}), & \rho^{u}_{t} > 1 \\ 1, & \text{otherwise,} \end{cases}$$
where $\alpha$ is a hyperparameter. The authors apply $k^{u}_{t}$ in the SGD optimizer; in each iteration, the network parameters are updated as:
$$\theta^{u}_{t+1} = \theta_{t}^{u} - \eta \cdot k_{t}^{u} \, \tilde{g}(\theta_{t}^{u}).$$
As can be seen, $k^{u}_{t}$ slows down the optimization of the dominant modality while leaving the other modality unaffected, thereby alleviating the modality-imbalance problem.
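
As a minimal sketch of the two formulas above (illustrative code, not the authors' release, which appears below; the per-modality logits logits_a / logits_v, the label tensor label, and the hyperparameter alpha are assumed inputs):

import torch

def ogm_coefficients(logits_a, logits_v, label, alpha=0.1):
    # s_i^u: softmax probability assigned to the ground-truth class,
    # summed over the current mini-batch B_t.
    idx = torch.arange(label.size(0), device=label.device)
    score_a = torch.softmax(logits_a, dim=1)[idx, label].sum()
    score_v = torch.softmax(logits_v, dim=1)[idx, label].sum()

    rho_v = score_v / score_a  # discrepancy ratio rho^v_t
    rho_a = 1.0 / rho_v        # rho^a_t is defined as its reciprocal

    # k^u_t = 1 - tanh(alpha * rho^u_t) when rho^u_t > 1, else 1:
    # only the currently dominant modality is slowed down.
    coeff_a = 1 - torch.tanh(alpha * rho_a) if rho_a > 1 else 1.0
    coeff_v = 1 - torch.tanh(alpha * rho_v) if rho_v > 1 else 1.0
    return coeff_a, coeff_v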
(2) Generalization enhancement
Theorem (informal): the noise in SGD is closely tied to its generalization ability: the larger the SGD noise, the better the generalization. The covariance of the SGD noise is proportional to the ratio of the learning rate to the batch size.
According to this theorem, a larger gradient-noise covariance usually brings better generalization. However, the authors' derivation shows that the OGM modulation weakens the SGD noise and thus degrades generalization. It is therefore necessary to control the SGD noise in order to recover the generalization ability.
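
To make the underlying step explicit, here is a sketch under the common modeling assumption (consistent with the update rule below) that the stochastic gradient noise is approximately Gaussian:

$$\tilde{g}(\theta^{u}_{t}) = \nabla_{\theta^{u}} L(\theta^{u}_{t}) + \xi_{t}, \qquad \xi_{t} \sim \mathcal{N}\!\big(0, \Sigma^{sgd}(\theta^{u}_{t})\big),$$

$$\operatorname{Cov}\!\big(k^{u}_{t}\,\xi_{t}\big) = (k^{u}_{t})^{2}\,\Sigma^{sgd}(\theta^{u}_{t}).$$

Since $0 < k^{u}_{t} \le 1$, modulating the dominant modality's gradient shrinks its noise covariance quadratically, and by the theorem above this weakened noise is precisely what hurts generalization.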
To compensate, the authors add randomly sampled Gaussian noise $h(\theta^{u}_{t})$ to the gradient, and the update rule becomes:
$$\theta_{t+1}^{u} = \theta_{t}^{u} - \eta \nabla_{\theta^{u}} L'(\theta_{t}^{u}) + \eta \xi_{t}', \qquad \xi_{t}' \sim \mathcal{N}\!\left(0, (k_{t}^{u})^{2} \cdot \Sigma^{sgd}(\theta_{t}^{u})\right).$$
The overall algorithm flow is as follows:

The key part of the released code is as follows (lightly annotated; the three helper callables are defined here so the snippet is self-contained):

import torch
import torch.nn as nn

softmax = nn.Softmax(dim=1)
relu = nn.ReLU(inplace=True)
tanh = nn.Tanh()

# Forward pass: audio features a, visual features v, and fused logits out.
a, v, out = model(spec, image)

# Per-modality logits from the shared fusion classifier: each 512-column half
# of fc_out.weight acts on one modality's features, and the bias is split
# evenly between the modalities (the b/2 term in the formulas above).
out_v = (torch.mm(v, torch.transpose(model.fusion_module.fc_out.weight[:, :512], 0, 1)) +
         model.fusion_module.fc_out.bias / 2)
out_a = (torch.mm(a, torch.transpose(model.fusion_module.fc_out.weight[:, 512:], 0, 1)) +
         model.fusion_module.fc_out.bias / 2)

# s_i^u: softmax probability of the ground-truth class, summed over the batch.
score_v = sum([softmax(out_v)[i][label[i]] for i in range(out_v.size(0))])
score_a = sum([softmax(out_a)[i][label[i]] for i in range(out_a.size(0))])

ratio_v = score_v / score_a  # discrepancy ratio rho^v_t
ratio_a = 1 / ratio_v        # rho^a_t, its reciprocal

# OGM: slow down only the currently dominant modality.
if ratio_v > 1:
    coeff_v = 1 - tanh(args.alpha * relu(ratio_v))
    coeff_a = 1
else:
    coeff_a = 1 - tanh(args.alpha * relu(ratio_a))
    coeff_v = 1

# Scale the gradients of the 4-D (convolutional) weight tensors of each
# encoder, then add zero-mean Gaussian noise whose std follows the gradient's
# own std (GE), restoring the stochastic-gradient noise intensity.
for name, parms in model.named_parameters():
    layer = str(name).split('.')[1]
    if 'audio' in layer and len(parms.grad.size()) == 4:
        parms.grad *= coeff_a
        parms.grad += torch.zeros_like(parms.grad).normal_(0, parms.grad.std().item() + 1e-8)
    if 'visual' in layer and len(parms.grad.size()) == 4:
        parms.grad *= coeff_v
        parms.grad += torch.zeros_like(parms.grad).normal_(0, parms.grad.std().item() + 1e-8)
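
Note that the snippet assumes a preceding backward pass, so that parms.grad is already populated. For context, a hedged sketch of one training iteration (criterion and optimizer are illustrative names, not taken from the repo):

optimizer.zero_grad()
a, v, out = model(spec, image)   # forward pass
loss = criterion(out, label)
loss.backward()                  # populates parms.grad for all parameters
# ... compute coeff_a / coeff_v and modulate the gradients as above ...
optimizer.step()                 # applies the modulated, noise-enhanced gradients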
3. Experimental results
The authors first apply the OGM-GE method to several common fusion approaches (baseline, concatenation, and summation) as well as a specially designed fusion method, FiLM. The results are shown in the table below. As the table shows, the performance of the two modalities is imbalanced: the audio modality clearly outperforms the visual modality. After applying OGM-GE, the model's performance improves significantly.

Comparison with other modulation strategies. The authors compare OGM-GE with modality-dropout and gradient-blending (as shown in Table 2); all of the modulation methods achieve performance improvements.

The ablation experiments are quite interesting. On the VGGSound dataset, training with OGM-GE initially performs worse than training without it, but OGM-GE eventually achieves a clear performance gain: by restraining the dominant modality early on, the method mines the information of the other modality more thoroughly, which improves performance in the later stage. Figure 3 plots the change of $\rho^{a}$: with OGM-GE (the blue line), the imbalance ratio between modalities is noticeably smaller than for the directly trained model (the yellow line). However, because different modalities naturally carry different amounts of information, this ratio may never approach 1; it can only be reduced to a reasonable range.

