Balanced Multimodal Learning via On-the-fly Gradient Modulation(CVPR2022 oral)

2022-07-06 22:37:00 【Rainylt】

paper: https://arxiv.org/pdf/2203.15332.pdf

One-sentence summary: the paper addresses the problem that, during multimodal training, the dominant modality is optimized too quickly, which leaves the auxiliary modality under-trained.
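Cross-entropy loss function:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{f(x_i)_{y_i}}}{\sum_{k=1}^{M} e^{f(x_i)_k}}$$

where $N$ is the number of samples, $M$ is the number of classes, and $f(x_i)$ is the logits that the two modalities output jointly, before the softmax:

$$f(x_i) = W\,[\varphi^a(\theta^a, x_i^a);\ \varphi^v(\theta^v, x_i^v)] + b$$

Decoupling $W$ into one block per modality:

$$f(x_i) = W^a \cdot \varphi^a(\theta^a, x_i^a) + W^v \cdot \varphi^v(\theta^v, x_i^v) + b$$

where $a$ denotes the audio modality, $v$ denotes the visual modality, and $\varphi^a$, $\varphi^v$ are the two encoders with parameters $\theta^a$, $\theta^v$. In this task $a$ is the dominant modality: for the ground-truth class, the logits contributed by modality $a$ are larger.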
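The decoupling step can be checked numerically. The following is a minimal sketch with made-up feature values and weights (all numbers and the `matvec` helper are hypothetical, not taken from the paper): a linear classifier applied to the concatenated audio and visual features gives the same logits as the sum of the two per-modality terms plus the bias.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical encoder outputs for one sample (2-d audio, 3-d visual features).
phi_a = [1.0, 2.0]
phi_v = [0.5, -1.0, 0.25]

# Hypothetical 2-class classifier over the 5-d concatenated feature.
W = [[0.2, -0.1, 0.4, 0.3, -0.5],
     [0.1, 0.6, -0.2, 0.0, 0.7]]
b = [0.05, -0.05]

# Joint logits: f(x) = W [phi_a; phi_v] + b
joint = [z + bi for z, bi in zip(matvec(W, phi_a + phi_v), b)]

# Decoupled logits: split W column-wise into W^a and W^v.
W_a = [row[:len(phi_a)] for row in W]
W_v = [row[len(phi_a):] for row in W]
audio_part = matvec(W_a, phi_a)    # W^a . phi^a
visual_part = matvec(W_v, phi_v)   # W^v . phi^v
decoupled = [za + zv + bi for za, zv, bi in zip(audio_part, visual_part, b)]

print(joint, decoupled)  # the two lists agree up to floating-point rounding
```

Because the two forms are the same function, `audio_part` and `visual_part` can be read as each modality's separate contribution to the final logits.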
Taking $W^a$ as an example, the derivative of $L$ with respect to $W^a$ is:

$$\frac{\partial L}{\partial W^a} = \sum_{i=1}^{N} \frac{\partial L}{\partial f(x_i)} \, \varphi^a(\theta^a, x_i^a)^\top$$

By the chain rule, $\varphi^a$ (the output of the audio encoder) is the only factor that depends on modality $a$, while $\frac{\partial{L}}{\partial{f(x_i)}}$ is the same for both modalities. The difference between the gradients of the two modalities therefore comes from the second factor, the value of $\varphi$. Since the dominant modality generally outputs higher logits, i.e. its $\varphi$ and $W$ values are larger, its backpropagated gradient is also larger and it converges faster.
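To make the chain-rule argument concrete, here is a small sketch for a single sample. It assumes the standard softmax cross-entropy identity $\partial L / \partial f = \mathrm{softmax}(f) - \mathrm{onehot}(y)$; the logits, features, and helper names are hypothetical. The shared factor is computed once, and each modality's weight gradient is its outer product with that modality's features, so the ratio of gradient norms equals the ratio of feature norms.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical single sample: joint logits f(x), label y, per-modality features.
logits = [1.2, -0.3, 0.4]
y = 0
phi_a = [2.0, -1.5, 3.0]   # larger-magnitude features (dominant modality)
phi_v = [0.2, 0.1, -0.3]   # smaller-magnitude features (auxiliary modality)

# Shared factor: dL/df = softmax(f) - onehot(y).
p = softmax(logits)
dL_df = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]

# Modality-specific factor: dL/dW^u = dL/df (outer product) phi^u.
grad_W_a = outer(dL_df, phi_a)
grad_W_v = outer(dL_df, phi_v)

ratio_grad = frobenius(grad_W_a) / frobenius(grad_W_v)
ratio_feat = norm(phi_a) / norm(phi_v)
print(ratio_grad, ratio_feat)  # equal up to floating-point rounding
```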

As a result, the dominant modality may be trained first and drive the loss down while the auxiliary modality is still poorly trained. Exactly why the auxiliary modality then fails to train well remains to be explored.

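To slow down training of the dominant modality, the paper multiplies its gradient by an attenuation coefficient. This shrinks the gradient backpropagated into the dominant modality, which is equivalent to lowering the learning rate of that modality alone.

The coefficient is determined by the ratio of the softmax scores of the two modalities' separate logits. Let $s_i^a$ and $s_i^v$ be the softmax probability that the audio-only and visual-only logits assign to the ground-truth class $y_i$. Over a mini-batch $B_t$:

$$\rho_t^v = \frac{\sum_{i \in B_t} s_i^v}{\sum_{i \in B_t} s_i^a}, \qquad \rho_t^a = \frac{1}{\rho_t^v}$$

The modality whose ratio is greater than 1 (the dominant modality) gets an attenuation coefficient $k \in (0, 1)$; the auxiliary modality gets 1 (unchanged). In the paper's notation:

$$k_t^u = \begin{cases} 1 - \tanh(\alpha \cdot \rho_t^u) & \rho_t^u > 1 \\ 1 & \text{otherwise} \end{cases}$$

where $u \in \{a, v\}$ and $\alpha$ is a hyperparameter that controls the degree of modulation.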
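A minimal sketch of how the attenuation coefficients could be computed for one mini-batch. It assumes the tanh-based form $k = 1 - \tanh(\alpha \cdot \rho)$ for the dominant modality; the function name `modulation_coefficients`, the default `alpha`, and the example logits are hypothetical and chosen only for illustration.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def modulation_coefficients(logits_a, logits_v, labels, alpha=0.5):
    """Return (k_a, k_v) for one mini-batch.

    logits_a / logits_v: per-sample logits from the audio / visual branch alone.
    alpha: assumed hyperparameter controlling how strongly to attenuate.
    """
    # Sum of softmax scores on the ground-truth class, per modality.
    score_a = sum(softmax(z)[y] for z, y in zip(logits_a, labels))
    score_v = sum(softmax(z)[y] for z, y in zip(logits_v, labels))
    rho_a = score_a / score_v
    rho_v = score_v / score_a
    # Only the modality with ratio > 1 (the dominant one) is attenuated.
    k_a = 1.0 - math.tanh(alpha * rho_a) if rho_a > 1.0 else 1.0
    k_v = 1.0 - math.tanh(alpha * rho_v) if rho_v > 1.0 else 1.0
    return k_a, k_v

# Hypothetical batch of 2 samples and 3 classes; audio is the more confident branch.
logits_a = [[3.0, 0.0, 0.0], [0.0, 2.5, 0.0]]
logits_v = [[0.5, 0.0, 0.0], [0.0, 0.2, 0.0]]
labels = [0, 1]

k_a, k_v = modulation_coefficients(logits_a, logits_v, labels)
print(k_a, k_v)  # k_a lies in (0, 1); k_v stays at 1.0
```

In a training loop, each modality's encoder gradients would be multiplied by its coefficient after the backward pass and before the optimizer step.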

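The coefficient is multiplied into the update, which is equivalent to scaling the learning rate $\eta$:

$$\theta_{t+1}^u = \theta_t^u - \eta \cdot k_t^u \, \tilde{g}(\theta_t^u)$$

where $\theta^u$ are the encoder parameters of modality $u \in \{a, v\}$, $k_t^u$ is the attenuation coefficient, and $\tilde{g}$ is the mini-batch gradient estimate.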
In addition, in SGD the mini-batch gradient $\tilde{g}(\theta_t)$ can be written as the full gradient plus Gaussian noise, so the update becomes:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) + \eta \xi_t, \qquad \xi_t \sim \mathcal{N}(0, \Sigma^{sgd}(\theta_t))$$

A higher learning rate means a larger covariance of the Gaussian noise, which in turn means stronger generalization. Lowering the effective learning rate here therefore weakens the generalization ability of the dominant modality. After the attenuation coefficient $k$ is applied, the noise covariance shrinks to $k^2$ times its original value:

$$\theta_{t+1} = \theta_t - \eta k \nabla L(\theta_t) + \eta \xi'_t, \qquad \xi'_t \sim \mathcal{N}(0, k^2\, \Sigma^{sgd}(\theta_t))$$
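The $k^2$ scaling is a direct consequence of $\mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$, and can be illustrated with a short simulation. This is only a sketch: the scalar "gradient", the noise level `sigma`, and the coefficient `k` are hypothetical values, and the mini-batch noise is modeled as exactly Gaussian.

```python
import random

random.seed(0)

def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

true_grad = 2.0   # hypothetical full-batch gradient (scalar for simplicity)
sigma = 0.5       # hypothetical std of the mini-batch gradient noise
k = 0.3           # attenuation coefficient applied to the dominant modality

# Simulated mini-batch gradient estimates: full gradient + Gaussian noise.
noisy_grads = [true_grad + random.gauss(0.0, sigma) for _ in range(20_000)]
# Modulated gradients: every estimate is scaled by k.
modulated = [k * g for g in noisy_grads]

var_plain = variance(noisy_grads)       # close to sigma ** 2
var_modulated = variance(modulated)
ratio = var_modulated / var_plain       # k ** 2 up to floating-point rounding
print(var_plain, var_modulated, ratio)
```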

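Therefore, the paper explicitly adds extra Gaussian noise $h(\theta_t)$ whose variance matches the variance of the gradients of the samples within the current batch:

$$\theta_{t+1} = \theta_t - \eta \left( k\, \tilde{g}(\theta_t) + h(\theta_t) \right), \qquad h(\theta_t) \sim \mathcal{N}(0, \Sigma^{sgd}(\theta_t))$$

Since the added noise is independent of the mini-batch noise, the two covariances add up, and the resulting noise covariance is larger than before:

$$\theta_{t+1} = \theta_t - \eta k \nabla L(\theta_t) + \eta \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, (k^2 + 1)\, \Sigma^{sgd}(\theta_t))$$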
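The effect of the added noise can also be simulated. The sketch below assumes a scalar gradient and a fixed, hypothetical noise level `sigma` (in the method described above the variance would be taken from the gradients within the batch). Adding independent zero-mean Gaussian noise with the same variance to the attenuated gradient yields a total noise variance of about $(k^2 + 1)\sigma^2$, which exceeds the unmodulated $\sigma^2$.

```python
import random

random.seed(0)

def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

true_grad = 2.0   # hypothetical full-batch gradient (scalar for simplicity)
sigma = 0.5       # hypothetical std of the mini-batch gradient noise
k = 0.3           # attenuation coefficient applied to the dominant modality
n = 50_000

updates = []
for _ in range(n):
    g = true_grad + random.gauss(0.0, sigma)   # simulated mini-batch gradient
    h = random.gauss(0.0, sigma)               # extra noise with the same std
    updates.append(k * g + h)                  # attenuated gradient plus noise

var_with_noise = variance(updates)
expected = (k ** 2 + 1) * sigma ** 2           # about 0.2725
print(var_with_noise, expected)
```

The mean of `updates` stays near `k * true_grad`, so the dominant modality is still slowed down on average while its noise level is restored.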

Copyright notice
This article was written by [Rainylt]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/187/202207061453416326.html
