Balanced Multimodal Learning via On-the-fly Gradient Modulation(CVPR2022 oral)
2022-07-06 22:37:00 【Rainylt】
paper: https://arxiv.org/pdf/2203.15332.pdf
One-sentence summary: address the problem that in multimodal training the dominant modality is optimized too quickly, leaving the auxiliary modality insufficiently trained.
Cross-entropy loss function:
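Reconstructed in the paper's notation (an assumption about the exact form; $N$ is the number of training samples and $y_i$ the ground-truth label of sample $x_i$):

$$ L = -\frac{1}{N}\sum_{i=1}^{N}\log\big(\mathrm{softmax}(f(x_i))\big)_{y_i} $$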
where $f(x_i)$ is the fused logits output of the model for sample $x_i$.
Decoupling:
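For concatenation fusion, the fused logits split by modality roughly as follows (a reconstruction following the paper's notation; $\varphi^a, \varphi^v$ are the encoder outputs with parameters $\theta^a, \theta^v$, and $W = [W^a, W^v]$):

$$ f(x_i) = W\big[\varphi^a(\theta^a, x_i^a);\ \varphi^v(\theta^v, x_i^v)\big] + b = W^a\varphi^a(\theta^a, x_i^a) + W^v\varphi^v(\theta^v, x_i^v) + b $$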
where $a$ denotes the audio modality, $v$ denotes the visual modality, and $f(x)$ is the joint logits output of the two modalities before softmax. In this task $a$ is the dominant modality, i.e., for the ground-truth class the logits contributed by modality $a$ are larger.
Taking $W^a$ as an example, differentiate $L$ with respect to $W^a$:
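The derivative has roughly this shape (reconstructed from the decoupled form above):

$$ \frac{\partial L}{\partial W^a} = \sum_{i=1}^{N}\frac{\partial L}{\partial f(x_i)}\,\varphi^a(\theta^a, x_i^a)^{\top} $$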
By the chain rule, $\varphi^a$ is the feature output that depends on modality $a$, while $\frac{\partial L}{\partial f(x_i)}$ takes the same value for both modalities. The gradient difference between the modalities therefore comes from the other factor, i.e., the value of $\varphi$. Since the dominant modality generally outputs higher logits, its $\varphi$ and $W$ take larger values, so its back-propagated gradient is larger and it converges faster.
As a result, the dominant modality may be trained to a lower loss first, while the auxiliary modality is not yet well trained. Exactly why the auxiliary modality then fails to train well remains to be explored.
To slow down the training of the dominant modality, this paper multiplies its gradient by an attenuation coefficient during back-propagation, which amounts to reducing the learning rate of the dominant modality alone (a code sketch follows after these steps):
The coefficient is determined by the ratio of the two modalities' softmax scores computed from their respective logits.
The modality whose ratio is greater than 1 (the dominant modality) is given an attenuation coefficient k in (0, 1); the auxiliary modality keeps a coefficient of 1 (unchanged).
Multiplying this coefficient into the gradient update is equivalent to reducing that modality's learning rate.
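A minimal PyTorch-style sketch of these steps (the function name `ogm_modulate`, the hyper-parameter `alpha`, and the `1 - tanh(...)` mapping into (0, 1) are illustrative assumptions, not necessarily the paper's exact implementation):

```python
import math
import torch

def ogm_modulate(audio_logits, visual_logits, labels,
                 audio_params, visual_params, alpha=1.0):
    """Scale each modality's gradients by an attenuation coefficient derived
    from the ratio of the two modalities' softmax scores.
    Call after loss.backward() and before optimizer.step()."""
    with torch.no_grad():
        # Softmax score of the ground-truth class, summed over the batch
        s_a = torch.softmax(audio_logits, dim=1).gather(1, labels.view(-1, 1)).sum()
        s_v = torch.softmax(visual_logits, dim=1).gather(1, labels.view(-1, 1)).sum()

        ratio_a = (s_a / s_v).item()   # > 1 means audio is the dominant modality
        ratio_v = 1.0 / ratio_a

        # Dominant modality gets k in (0, 1); the auxiliary modality keeps 1
        k_a = 1.0 - math.tanh(alpha * (ratio_a - 1.0)) if ratio_a > 1 else 1.0
        k_v = 1.0 - math.tanh(alpha * (ratio_v - 1.0)) if ratio_v > 1 else 1.0

        # Equivalent to shrinking the dominant modality's learning rate alone
        for p in audio_params:
            if p.grad is not None:
                p.grad.mul_(k_a)
        for p in visual_params:
            if p.grad is not None:
                p.grad.mul_(k_v)
    return k_a, k_v
```

Typical use would be: forward pass, `loss.backward()`, `ogm_modulate(a_logits, v_logits, y, audio_net.parameters(), visual_net.parameters())`, then `optimizer.step()`.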
In addition, from the SGD update process, the mini-batch gradient can be written as the full gradient plus Gaussian noise:
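Written out (a sketch; $\nabla L(\theta)$ is the full-batch gradient and $\Sigma$ the covariance of the stochastic noise):

$$ \tilde{g}(\theta) \approx \nabla L(\theta) + \epsilon,\qquad \epsilon \sim \mathcal{N}(0, \Sigma) $$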
The higher the learning rate => the larger the Gaussian-noise covariance => the stronger the generalization. Reducing the learning rate here is therefore equivalent to weakening the generalization ability of the dominant modality. After multiplying the gradient by the attenuation coefficient, the noise variance shrinks to $k^2$ times its original value:
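Concretely, scaling the stochastic gradient by $k$ scales the noise term as well:

$$ \mathrm{Cov}\big(k\,\epsilon\big) = k^2\,\Sigma $$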
Therefore, the paper deliberately adds an extra Gaussian noise term whose covariance equals the covariance of the sample gradients within the batch, so that the equivalent noise covariance becomes larger than before:
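Putting both parts together (a sketch of the idea; $h$ is the added noise and $\hat{\Sigma}$ the covariance estimated from the sample gradients within the batch):

$$ \hat{g}(\theta) = k\,\tilde{g}(\theta) + h,\qquad h \sim \mathcal{N}(0, \hat{\Sigma}),\qquad \text{noise covariance} \approx k^2\Sigma + \hat{\Sigma} > k^2\Sigma $$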