Balanced Multimodal Learning via On-the-fly Gradient Modulation(CVPR2022 oral)
2022-07-06 22:37:00 【Rainylt】
paper: https://arxiv.org/pdf/2203.15332.pdf
One-sentence summary: address the problem that, during multimodal training, the dominant modality trains too fast, leaving the auxiliary modality insufficiently trained.
Cross-entropy loss function:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f(x_i)_{y_i}}}{\sum_{k=1}^{C} e^{f(x_i)_k}}$$

where $f(x_i)$ is the logits vector the multimodal model outputs for sample $x_i$, and $y_i$ is its ground-truth class.
Decoupling:

$$f(x_i) = W^a\,\varphi^a(\theta^a, x_i^a) + W^v\,\varphi^v(\theta^v, x_i^v) + b$$

where $a$ denotes the audio modality, $v$ the visual modality, and $f(x_i)$ is the joint logits of the two modalities before the softmax. In this task $a$ is the dominant modality, i.e., on the ground-truth class the audio branch outputs the larger logits.
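As an illustration, here is a minimal PyTorch sketch (my own, with made-up dimensions, not the authors' code) showing that concatenation fusion with a joint classifier $W$ is exactly this decoupled sum of per-modality logits:

```python
import torch

N, Da, Dv, C = 4, 64, 64, 10          # batch, audio dim, visual dim, classes
phi_a = torch.randn(N, Da)            # audio encoder output phi^a(theta^a, x^a)
phi_v = torch.randn(N, Dv)            # visual encoder output phi^v(theta^v, x^v)

W = torch.randn(C, Da + Dv)           # joint classifier weight of concat fusion
b = torch.randn(C)

# Concatenation fusion: f(x) = W [phi^a; phi^v] + b
f_joint = torch.cat([phi_a, phi_v], dim=1) @ W.T + b

# Decoupling: W = [W^a, W^v]  =>  f(x) = W^a phi^a + W^v phi^v + b
W_a, W_v = W[:, :Da], W[:, Da:]
f_decoupled = phi_a @ W_a.T + phi_v @ W_v.T + b

assert torch.allclose(f_joint, f_decoupled, atol=1e-5)
```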
Taking $W^a$ as an example, differentiating $L$ with respect to $W^a$ gives:

$$\frac{\partial L}{\partial W^a} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial L}{\partial f(x_i)}\;\varphi^a(\theta^a, x_i^a)^{\top}$$
By the chain rule, $\varphi^a$ is the only factor that depends on modality $a$, while $\frac{\partial L}{\partial f(x_i)} = \mathrm{softmax}(f(x_i)) - \mathbf{1}_{y_i}$ takes the same value for both modalities. The gradient difference between the modalities therefore comes from the other factor, i.e., the value of $\varphi$. Since the dominant modality generally outputs larger logits, i.e., its $\varphi$ and $W$ are larger, its back-propagated gradient is also larger and it converges faster.
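To make this concrete, here is a small autograd check (my own sketch, not from the paper): the manually computed $\frac{1}{N}(\mathrm{softmax}(f)-\mathbf{1}_{y})^{\top}\varphi^a$ matches the gradient PyTorch back-propagates to $W^a$:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, Da, Dv, C = 8, 16, 16, 5
phi_a, phi_v = torch.randn(N, Da), torch.randn(N, Dv)
W_a = torch.randn(C, Da, requires_grad=True)
W_v = torch.randn(C, Dv, requires_grad=True)
y = torch.randint(0, C, (N,))

f = phi_a @ W_a.T + phi_v @ W_v.T      # joint logits
F.cross_entropy(f, y).backward()        # mean-reduced CE loss

# Manual gradient: (1/N) sum_i (softmax(f_i) - onehot(y_i)) phi_a_i^T
dL_df = (F.softmax(f, dim=1) - F.one_hot(y, C).float()) / N
manual = dL_df.T @ phi_a
assert torch.allclose(W_a.grad, manual, atol=1e-5)
```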
As a result, the dominant modality tends to finish training first and drive the loss down, while the auxiliary modality has not been trained well. (Exactly why the auxiliary modality then cannot be trained well remains to be explored.)
To slow down the training of the dominant modality, this paper multiplies its gradient by an attenuation coefficient during back-propagation, which amounts to separately lowering the learning rate of the dominant modality:
The dominant modality is identified by the ratio of the two modalities' softmax scores, computed from each modality's own logits.
The modality whose ratio is greater than 1 (the dominant one) gets an attenuation coefficient $k \in (0, 1)$; the auxiliary modality gets 1 (unchanged).
Since this coefficient multiplies the gradient together with the learning rate, it is equivalent to reducing the dominant modality's learning rate, as sketched below.
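Below is a minimal PyTorch sketch of this modulation step (my reconstruction, not the authors' released code; `audio_encoder`/`visual_encoder` and the per-modality logits `f_a`, `f_v` are assumed names). The paper's OGM rule sets $k = 1 - \tanh(\alpha\,\rho)$ for the dominant modality, where $\rho$ is the ratio of the two modalities' summed ground-truth softmax scores and $\alpha$ is a hyper-parameter:

```python
import torch
import torch.nn.functional as F

def ogm_coefficients(f_a, f_v, y, alpha=0.1):
    """On-the-fly gradient modulation coefficients (sketch).

    f_a, f_v: per-modality logits W^a phi^a, W^v phi^v, shape (N, C)
    Returns (k_a, k_v): multiply each modality's gradients by its k.
    """
    # Softmax score of the ground-truth class for each modality, summed over the batch
    s_a = F.softmax(f_a, dim=1).gather(1, y.unsqueeze(1)).sum()
    s_v = F.softmax(f_v, dim=1).gather(1, y.unsqueeze(1)).sum()

    rho_a = s_a / s_v              # discrepancy ratio of audio over visual
    if rho_a > 1:                  # audio dominates: attenuate it
        return 1.0 - torch.tanh(alpha * rho_a), 1.0
    else:                          # visual dominates: attenuate it
        return 1.0, 1.0 - torch.tanh(alpha / rho_a)

# During training, after loss.backward(), scale each encoder's gradients:
#   k_a, k_v = ogm_coefficients(f_a, f_v, y)
#   for p in audio_encoder.parameters():  p.grad *= k_a
#   for p in visual_encoder.parameters(): p.grad *= k_v
```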
Moreover, from the analysis of SGD back-propagation, the stochastic gradient can be decomposed into the full-batch gradient plus (approximately) Gaussian noise:

$$\tilde{g}(\theta_t) = g(\theta_t) + \varepsilon,\qquad \varepsilon \sim \mathcal{N}\big(0, \Sigma(\theta_t)\big)$$
A higher learning rate means a larger covariance of this Gaussian noise, which in turn means stronger generalization. Reducing the learning rate here therefore also weakens the generalization ability of the dominant modality: after the attenuation coefficient is applied, the noise variance shrinks to $k^2$ times the original:

$$\mathrm{Cov}\big(k\,\tilde{g}(\theta_t)\big) = k^2\,\Sigma(\theta_t)$$
Hence the paper deliberately adds back a Gaussian noise term whose variance equals the variance of the samples within the batch:

$$\tilde{g}'(\theta_t) = k\,\tilde{g}(\theta_t) + \varepsilon',\qquad \varepsilon' \sim \mathcal{N}\big(0, \Sigma(\theta_t)\big)$$
The covariance of the equivalent noise is then larger than before:

$$(k^2 + 1)\,\Sigma(\theta_t) > \Sigma(\theta_t)$$

which compensates for the generalization ability lost by attenuating the gradient.
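Under the same assumptions, a sketch of this generalization-enhancement step: scale the gradient by $k$, then add zero-mean Gaussian noise whose element-wise variance is estimated from the per-sample gradients in the batch (the diagonal estimate of $\Sigma(\theta_t)$ here is my simplification; the exact estimator is an implementation detail):

```python
import torch

def ge_gradient(per_sample_grads: torch.Tensor, k: float) -> torch.Tensor:
    """OGM-GE gradient sketch for one (flattened) parameter tensor.

    per_sample_grads: (N, P) gradient of each sample in the batch
    k:                modulation coefficient in (0, 1] for this modality
    """
    g = per_sample_grads.mean(dim=0)        # averaged SGD gradient
    sigma = per_sample_grads.std(dim=0)     # batch estimate of the noise std
    eps = torch.randn_like(g) * sigma       # eps ~ N(0, diag(sigma^2))
    return k * g + eps                      # total noise cov ~ (k^2 + 1) Sigma

# Toy usage: 32 per-sample gradients for a parameter with 100 entries
g_tilde = ge_gradient(torch.randn(32, 100), k=0.6)
```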