Balanced Multimodal Learning via On-the-fly Gradient Modulation(CVPR2022 oral)
2022-07-06 22:37:00 【Rainylt】
paper: https://arxiv.org/pdf/2203.15332.pdf
One-sentence summary: the paper tackles the problem that, in multimodal training, the dominant modality is optimized too quickly, which leaves the auxiliary modality under-trained.
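Cross-entropy loss (over $N$ samples and $M$ classes, with $y_i$ the label of sample $x_i$):
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f(x_i)_{y_i}}}{\sum_{k=1}^{M} e^{f(x_i)_k}}$$
where $f(x_i)$ is
$$f(x_i) = W\,[\varphi^a(\theta^a, x_i^a);\ \varphi^v(\theta^v, x_i^v)] + b$$
Decoupling $W = [W^a, W^v]$:
$$f(x_i) = W^a \varphi^a(\theta^a, x_i^a) + W^v \varphi^v(\theta^v, x_i^v) + b$$
Here $a$ denotes the audio modality, $v$ the visual modality, and $f(x_i)$ is the logits that the two modalities output jointly before the softmax. In this task $a$ is the dominant modality: for the ground-truth class, the logits contributed by modality $a$ are larger.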
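The decoupling step can be checked numerically. The sketch below uses plain Python with made-up dimensions and random weights (every name in it is illustrative, not taken from the paper). It computes the logits once from the concatenated feature vector and once as the sum of the two per-modality products, and the two results should agree up to floating-point error.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fused_logits(W, b, phi_a, phi_v):
    """Concatenation fusion: f(x) = W [phi_a; phi_v] + b."""
    z = matvec(W, phi_a + phi_v)  # list '+' concatenates the two features
    return [zi + bi for zi, bi in zip(z, b)]

def decoupled_logits(W, b, phi_a, phi_v):
    """Same logits written as W^a phi_a + W^v phi_v + b."""
    d_a = len(phi_a)
    W_a = [row[:d_a] for row in W]  # columns that multiply the audio feature
    W_v = [row[d_a:] for row in W]  # columns that multiply the visual feature
    z_a = matvec(W_a, phi_a)
    z_v = matvec(W_v, phi_v)
    return [za + zv + bi for za, zv, bi in zip(z_a, z_v, b)]

# Toy sizes chosen only for illustration.
rng = random.Random(0)
num_classes, d_a, d_v = 3, 4, 5
W = [[rng.gauss(0, 1) for _ in range(d_a + d_v)] for _ in range(num_classes)]
b = [rng.gauss(0, 1) for _ in range(num_classes)]
phi_a = [rng.gauss(0, 1) for _ in range(d_a)]
phi_v = [rng.gauss(0, 1) for _ in range(d_v)]

f_joint = fused_logits(W, b, phi_a, phi_v)
f_split = decoupled_logits(W, b, phi_a, phi_v)
```

Splitting the columns of `W` this way is what lets the contribution of each modality to the logits be inspected separately.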
Taking $W^a$ as an example, differentiating $L$ with respect to $W^a$ gives the gradient-descent update
$$W^a_{t+1} = W^a_t - \eta\,\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L}{\partial f(x_i)}\,\varphi^a(\theta^a, x_i^a)$$
where $\eta$ is the learning rate and $\varphi^a(\theta^a, x_i^a)$ is the feature produced by the audio encoder with parameters $\theta^a$.
By the chain rule, $\varphi^a$ is the output that depends on modality $a$, while the value of $\frac{\partial L}{\partial f(x_i)}$ is the same for both modalities. The difference between the gradients of the two modalities therefore comes from the second factor, that is, the value of $\varphi$. Because the dominant modality generally outputs higher logits, i.e. its $\varphi$ and $W$ values are larger, its back-propagated gradient is also larger and it converges faster.
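As a small worked check of this argument, the sketch below computes the gradients of the cross-entropy loss with respect to the two halves of the classifier for a single sample. All numbers and names are made up for illustration; the audio feature is simply chosen to be larger than the visual one. Both gradients share the same $\partial L / \partial f(x_i)$ and differ only through the feature they are multiplied by.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss_and_logits(W_a, W_v, b, phi_a, phi_v, y):
    """Cross entropy of the decoupled logits W_a phi_a + W_v phi_v + b."""
    f = [dot(ra, phi_a) + dot(rv, phi_v) + bc
         for ra, rv, bc in zip(W_a, W_v, b)]
    return -math.log(softmax(f)[y]), f

def head_gradients(W_a, W_v, b, phi_a, phi_v, y):
    """Return dL/df and the gradients w.r.t. W_a and W_v for one sample."""
    _, f = loss_and_logits(W_a, W_v, b, phi_a, phi_v, y)
    p = softmax(f)
    # dL/df_c = softmax(f)_c - 1[c == y]; shared by both modalities.
    dL_df = [pc - (1.0 if c == y else 0.0) for c, pc in enumerate(p)]
    # Each gradient is the outer product of dL/df with that modality's feature.
    grad_W_a = [[d * x for x in phi_a] for d in dL_df]
    grad_W_v = [[d * x for x in phi_v] for d in dL_df]
    return dL_df, grad_W_a, grad_W_v

# Hypothetical toy values: two classes, 2-d audio feature, 1-d visual feature.
W_a = [[0.2, -0.1], [0.0, 0.3]]
W_v = [[0.1], [-0.2]]
b = [0.0, 0.0]
phi_a, phi_v, y = [3.0, 2.0], [0.5], 0

dL_df, grad_W_a, grad_W_v = head_gradients(W_a, W_v, b, phi_a, phi_v, y)
# Ratio of matching gradient entries equals the ratio of the features.
ratio = grad_W_a[0][0] / grad_W_v[0][0]
```

In this toy setting `ratio` equals `phi_a[0] / phi_v[0]`, which is the sense in which the modality with larger features receives the larger update.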
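As a result, the dominant modality may finish training first and drive the loss down while the auxiliary modality is still under-trained. Exactly why the auxiliary modality then fails to train well is left open here; one reading of the shared factor $\frac{\partial L}{\partial f(x_i)}$ is that it shrinks once the loss is low, so the auxiliary modality also receives only small gradients.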
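To slow down training of the dominant modality, the paper multiplies the gradient by an attenuation coefficient, which reduces the gradient back-propagated through the dominant modality and is equivalent to lowering the learning rate of that modality alone:
$$\theta^u_{t+1} = \theta^u_t - \eta\, k^u_t\, \tilde{g}(\theta^u_t), \quad u \in \{a, v\}$$
The coefficient is determined from the ratio of the softmax scores that each modality's own logits assign to the ground-truth class, summed over the mini-batch $B_t$:
$$s_i^a = \sum_{k=1}^{M} 1_{k=y_i}\,\mathrm{softmax}\big(W^a_t \varphi^a_t(\theta^a, x_i^a) + \tfrac{b}{2}\big)_k, \qquad s_i^v = \sum_{k=1}^{M} 1_{k=y_i}\,\mathrm{softmax}\big(W^v_t \varphi^v_t(\theta^v, x_i^v) + \tfrac{b}{2}\big)_k$$
$$\rho^v_t = \frac{\sum_{i \in B_t} s_i^v}{\sum_{i \in B_t} s_i^a}, \qquad \rho^a_t = \frac{1}{\rho^v_t}$$
The modality whose ratio is greater than 1 (the dominant modality) gets an attenuation coefficient $k$ in $(0, 1)$; the auxiliary modality gets 1 (unchanged):
$$k^u_t = \begin{cases} 1 - \tanh(\alpha \cdot \rho^u_t), & \rho^u_t > 1 \\ 1, & \text{otherwise} \end{cases}$$
where $\alpha$ is a hyperparameter that controls the degree of modulation. Since $k$ multiplies the learning rate, this is equivalent to reducing the learning rate.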
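The coefficient computation can be sketched in a few lines. The version below is a minimal illustration in plain Python, not the authors' implementation: it assumes the attenuation takes the form k = 1 - tanh(alpha * rho) when rho > 1 (with `alpha` a hyperparameter), that each modality's own logits are already available, and all function and variable names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ogm_coefficients(logits_a, logits_v, labels, alpha=0.5):
    """Return (k_a, k_v) for one mini-batch.

    logits_a / logits_v: per-sample logits computed from each modality alone.
    labels: ground-truth class index of each sample.
    """
    # Sum of the softmax scores assigned to the ground-truth class.
    s_a = sum(softmax(z)[y] for z, y in zip(logits_a, labels))
    s_v = sum(softmax(z)[y] for z, y in zip(logits_v, labels))
    rho_a = s_a / s_v
    rho_v = s_v / s_a
    # Only the modality with ratio > 1 (the dominant one) is slowed down.
    k_a = 1.0 - math.tanh(alpha * rho_a) if rho_a > 1.0 else 1.0
    k_v = 1.0 - math.tanh(alpha * rho_v) if rho_v > 1.0 else 1.0
    return k_a, k_v

def sgd_step(theta, grad, lr, k):
    """Plain SGD update with the gradient scaled by the coefficient k."""
    return [t - lr * k * g for t, g in zip(theta, grad)]

# Toy batch in which the audio logits are far more confident than the visual ones.
logits_a = [[4.0, 0.0], [3.0, 1.0]]
logits_v = [[0.5, 0.0], [0.2, 0.4]]
labels = [0, 0]

k_a, k_v = ogm_coefficients(logits_a, logits_v, labels, alpha=0.5)
theta_a = sgd_step([1.0, -1.0], [0.2, 0.4], lr=0.1, k=k_a)
```

Here the audio modality ends up with a coefficient strictly between 0 and 1 while the visual modality keeps a coefficient of 1, so only the audio encoder's step is shortened.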
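In addition, following the usual analysis of SGD, the mini-batch gradient can be written as the full gradient plus Gaussian noise:
$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t) + \eta\,\xi_t, \qquad \xi_t \sim \mathcal{N}\big(0, \Sigma^{sgd}(\theta_t)\big)$$
A higher learning rate means a larger covariance of this noise, which is associated with stronger generalization. Lowering the effective learning rate here therefore weakens the generalization ability of the dominant modality. With the attenuation coefficient applied, the noise covariance shrinks to $k^2$ times the original:
$$\theta_{t+1} = \theta_t - \eta\,k_t\,\nabla_\theta L(\theta_t) + \eta\,\xi'_t, \qquad \xi'_t \sim \mathcal{N}\big(0, k_t^2\,\Sigma^{sgd}(\theta_t)\big)$$
The paper therefore adds extra Gaussian noise whose covariance equals that of the gradients of the samples in the current batch:
$$\theta_{t+1} = \theta_t - \eta\,\big(k_t\,\tilde{g}(\theta_t) + h(\theta_t)\big), \qquad h(\theta_t) \sim \mathcal{N}\big(0, \Sigma^{sgd}(\theta_t)\big)$$
The resulting noise covariance is larger than before:
$$\theta_{t+1} = \theta_t - \eta\,k_t\,\nabla_\theta L(\theta_t) + \eta\,\xi''_t, \qquad \xi''_t \sim \mathcal{N}\big(0, (k_t^2 + 1)\,\Sigma^{sgd}(\theta_t)\big)$$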
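The variance argument can be checked with a quick simulation. The sketch below is a scalar toy model, not the paper's procedure: the "mini-batch gradient" is just a Gaussian sample around an assumed true gradient, and the added noise is independent of it with the same standard deviation. All names and constants are made up for illustration.

```python
import random

def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def modulated_gradient(g, k, sigma, rng, enhance):
    """Scale a noisy gradient by k; optionally add zero-mean Gaussian noise."""
    h = rng.gauss(0.0, sigma) if enhance else 0.0
    return k * g + h

rng = random.Random(0)
true_grad, sigma, k = 1.0, 0.5, 0.3  # hypothetical values
n = 50000

# Stand-in for mini-batch gradients: true gradient plus Gaussian noise.
samples = [rng.gauss(true_grad, sigma) for _ in range(n)]

ogm_only = [modulated_gradient(g, k, sigma, rng, enhance=False) for g in samples]
ogm_ge = [modulated_gradient(g, k, sigma, rng, enhance=True) for g in samples]

var_plain = variance(samples)   # close to sigma^2
var_ogm = variance(ogm_only)    # close to k^2 * sigma^2
var_ge = variance(ogm_ge)       # close to (k^2 + 1) * sigma^2
```

Up to sampling error, attenuation alone shrinks the variance by a factor of k^2, while attenuation plus the added noise brings it slightly above the unmodulated level.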