【CVPR2022 oral】Balanced Multimodal Learning via On-the-fly Gradient Modulation
2022-07-28 05:01:00 [AI frontier theory group @ouc]

Paper: https://arxiv.org/abs/2203.15332
Code: https://github.com/GeWu-Lab/OGM-GE_CVPR2022
This work comes from students of the GeWu-Lab at Renmin University of China. It was accepted by CVPR 2022 as an Oral presentation, and the code has been open-sourced.
1、 Research motivation
Using multimodal data for classification generally improves performance. In practice, however, existing methods do not fully exploit the potential of each modality. (As shown in the figure below, the modality-specific encoders inside a multimodal model perform worse than their unimodal counterparts trained alone, which indicates that existing multimodal models under-optimize the unimodal feature representations.)

The authors argue that the main cause of this phenomenon is that one of the two modalities plays a dominant role while the other plays an auxiliary one: the dominant modality suppresses the optimization of the other. To address this problem, the authors propose OGM-GE (On-the-fly Gradient Modulation with Generalization Enhancement).
The authors give a vivid analogy. In a team-pursuit speed-skating race, the team's result is recorded when the last member crosses the line. Normally the dominant skater (the dominant modality) lacks contact and cooperation with the slower, weaker skater (the weak modality); even if he reaches the finish first (training convergence), he still has to stop and wait for the weaker teammate (upper half of Figure 2). With the OGM-GE regulation, the dominant skater slows down and pulls the weaker one along; helped in this way (for example by drafting behind the leader), the weaker skater speeds up, and the overall team performance improves (lower half of Figure 2).

2、 The main method
To address the imbalanced optimization problem in multimodal classification, the authors adaptively modulate the gradients according to the performance gap between the modalities and, combined with Gaussian noise for generalization enhancement, propose the broadly applicable OGM-GE optimization method. The overall framework is shown in the figure below.

First, the gradient of each modality is adjusted adaptively by monitoring how much each modality contributes to the learning objective. To effectively measure the gap in unimodal representation ability within a multimodal model, the authors design a modality discrepancy ratio. During training, different gradient scaling coefficients are dynamically assigned to the modalities according to this ratio, so that the optimization process of the model can be adaptively controlled throughout training.
Second, the authors' analysis shows that a gradient scaling coefficient smaller than 1 weakens the intensity of the stochastic gradient noise during optimization, which can in turn hurt the generalization ability of the model. Therefore, on top of gradient modulation, a noise-enhancement strategy is added: Gaussian noise is injected into the gradient to restore (or even strengthen) the stochastic gradient noise, thereby improving generalization.
(1) On-the-fly gradient modulation
There are two modality encoders, denoted $\varphi^u$ with $u \in \{a, v\}$, corresponding to the audio and video modalities; the parameters of encoder $\varphi^u$ are $\theta^u$. Plain gradient descent updates them as:
$$\theta_{t+1}^{u} = \theta_{t}^{u} - \eta \nabla_{\theta^{u}} L(\theta_{t}^{u}).$$
The authors' idea is to adaptively adjust the optimization speed of each modality. To this end they define a discrepancy ratio $\rho_t^u$ from the per-sample correct-class confidences:
$$s_{i}^{a} = \sum_{k=1}^{M} \mathbf{1}_{k=y_i} \cdot \mathrm{softmax}\left(W_{t}^{a} \cdot \varphi_{t}^{a}(\theta^{a}, x_{i}^{a}) + \frac{b}{2}\right)_{k},$$
$$s_{i}^{v} = \sum_{k=1}^{M} \mathbf{1}_{k=y_i} \cdot \mathrm{softmax}\left(W_{t}^{v} \cdot \varphi_{t}^{v}(\theta^{v}, x_{i}^{v}) + \frac{b}{2}\right)_{k},$$
$$\rho_{t}^{v} = \frac{\sum_{i \in B_{t}} s_{i}^{v}}{\sum_{i \in B_{t}} s_{i}^{a}}.$$
Since there are only the audio and video modalities, $\rho_t^a$ is defined as the reciprocal of $\rho_t^v$. The authors use $\rho_t^u$ to dynamically monitor the contribution gap between the modalities, and derive the adaptive gradient scaling coefficient from:
$$k_{t}^{u} = \begin{cases} 1 - \tanh(\alpha \cdot \rho_{t}^{u}) & \rho_{t}^{u} > 1, \\ 1 & \text{otherwise}, \end{cases}$$
where $\alpha$ is a hyperparameter. The authors apply $k_t^u$ inside the SGD optimizer; at each iteration the network parameters are updated as:
$$\theta_{t+1}^{u} = \theta_{t}^{u} - \eta \cdot k_{t}^{u}\, \tilde{g}(\theta_{t}^{u}).$$
Thus $k_t^u$ slows down the optimization of the dominant modality while leaving the other modality unaffected, which alleviates the modality-imbalance problem.
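The discrepancy ratio and modulation coefficient above can be sketched in plain Python (a minimal numeric illustration of the formulas, not the authors' implementation; the toy confidence values are invented):

```python
import math

def discrepancy_ratio(scores_v, scores_a):
    """rho_t^v: batch-summed correct-class confidence of video over audio."""
    return sum(scores_v) / sum(scores_a)

def modulation_coeff(rho, alpha=0.1):
    """k_t^u = 1 - tanh(alpha * rho) when rho > 1, else 1."""
    return 1.0 - math.tanh(alpha * rho) if rho > 1 else 1.0

# Toy batch: audio is the dominant modality (higher correct-class confidences).
scores_a = [0.9, 0.8, 0.85]
scores_v = [0.3, 0.2, 0.25]

rho_v = discrepancy_ratio(scores_v, scores_a)  # < 1: video is the weak modality
rho_a = 1.0 / rho_v                            # > 1: audio is dominant

k_a = modulation_coeff(rho_a)  # < 1: audio gradients get scaled down
k_v = modulation_coeff(rho_v)  # = 1: video gradients are untouched
```

Only the dominant modality (here audio, with $\rho > 1$) receives a coefficient below 1; the weak modality keeps its full gradient.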
(2) Generalization enhancement
Theorem: the noise in SGD is closely tied to its generalization ability; the stronger the SGD noise, the better the generalization. The covariance of the SGD noise is proportional to the ratio of the learning rate to the batch size.
According to this theorem, a larger gradient-noise covariance usually brings better generalization. However, the authors' derivation shows that the OGM modulation reduces the SGD noise and thereby degrades generalization, so it becomes necessary to control the SGD noise in order to recover generalization ability.
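The degradation can be seen with a one-line check (a sketch, writing the stochastic gradient as the true gradient plus zero-mean SGD noise $\xi_t \sim \mathcal{N}(0, \Sigma^{sgd}(\theta_t^u))$, the standard SGD noise model):

$$\mathrm{Cov}\left[k_{t}^{u}\, \tilde{g}(\theta_{t}^{u})\right] = (k_{t}^{u})^{2}\, \Sigma^{sgd}(\theta_{t}^{u}),$$

so whenever $k_t^u < 1$ the noise covariance shrinks quadratically, and by the theorem above the generalization ability suffers.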
The authors therefore inject a randomly sampled Gaussian noise term $h(\theta_t^u)$ into the gradient, and the update becomes:
$$\theta_{t+1}^{u} = \theta_{t}^{u} - \eta \nabla_{\theta^{u}} L'(\theta_{t}^{u}) + \eta \xi_{t}', \qquad \xi_{t}' \sim \mathcal{N}\left(0, (k_{t}^{u})^{2} \cdot \Sigma^{sgd}(\theta_{t}^{u})\right),$$
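The noise-enhancement step can be sketched as follows (a minimal stand-alone illustration mirroring the released code's strategy of sampling zero-mean Gaussian noise whose standard deviation matches the current gradient's; the list-based "gradient" is a toy stand-in for a tensor):

```python
import random
import statistics

def modulate_and_enhance(grad, coeff, rng=random.Random(0)):
    """OGM: scale the gradient by the modulation coefficient.
    GE: add zero-mean Gaussian noise whose std matches the scaled gradient's."""
    scaled = [g * coeff for g in grad]
    std = statistics.pstdev(scaled) + 1e-8  # epsilon guards against a zero std
    return [g + rng.gauss(0.0, std) for g in scaled]

grad = [0.5, -0.3, 0.8, -0.1]
new_grad = modulate_and_enhance(grad, coeff=0.7)
```

Scaling alone would shrink the gradient noise; the added Gaussian term restores its intensity.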
The overall algorithm flow is as follows:

The core code is as follows (softmax, tanh and relu denote the usual torch functions):

a, v, out = model(spec, image)

# Recover per-modality logits from the concatenation-fusion head: here the
# first 512 columns of the output weight multiply the visual feature and the
# remaining columns the audio feature; the bias is split evenly between them.
out_v = (torch.mm(v, torch.transpose(model.fusion_module.fc_out.weight[:, :512], 0, 1)) +
         model.fusion_module.fc_out.bias / 2)
out_a = (torch.mm(a, torch.transpose(model.fusion_module.fc_out.weight[:, 512:], 0, 1)) +
         model.fusion_module.fc_out.bias / 2)

# Batch-summed correct-class confidences s^v and s^a.
score_v = sum([softmax(out_v)[i][label[i]] for i in range(out_v.size(0))])
score_a = sum([softmax(out_a)[i][label[i]] for i in range(out_a.size(0))])

# Discrepancy ratios; rho^a is the reciprocal of rho^v.
ratio_v = score_v / score_a
ratio_a = 1 / ratio_v

# Modulation coefficients: only the currently dominant modality is suppressed.
if ratio_v > 1:
    coeff_v = 1 - tanh(args.alpha * relu(ratio_v))
    coeff_a = 1
else:
    coeff_a = 1 - tanh(args.alpha * relu(ratio_a))
    coeff_v = 1

# After loss.backward(): OGM scales each encoder's convolutional gradients,
# and GE adds zero-mean Gaussian noise whose std matches the gradient's.
for name, parms in model.named_parameters():
    layer = str(name).split('.')[1]
    if 'audio' in layer and len(parms.grad.size()) == 4:
        parms.grad *= coeff_a
        parms.grad += torch.zeros_like(parms.grad).normal_(0, parms.grad.std().item() + 1e-8)
    if 'visual' in layer and len(parms.grad.size()) == 4:
        parms.grad *= coeff_v
        parms.grad += torch.zeros_like(parms.grad).normal_(0, parms.grad.std().item() + 1e-8)
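For context, the modulation runs between the backward pass and the optimizer step. A minimal toy loop illustrating that placement (the tiny two-branch linear model, random data, and fixed coefficients here are invented for illustration and are not the paper's architecture):

```python
import torch

# Toy two-branch "encoders" standing in for the audio/visual backbones.
audio_enc = torch.nn.Linear(8, 4)
visual_enc = torch.nn.Linear(8, 4)
head = torch.nn.Linear(8, 3)
params = list(audio_enc.parameters()) + list(visual_enc.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

xa, xv = torch.randn(5, 8), torch.randn(5, 8)
y = torch.randint(0, 3, (5,))

opt.zero_grad()
logits = head(torch.cat([audio_enc(xa), visual_enc(xv)], dim=1))
loss = torch.nn.functional.cross_entropy(logits, y)
loss.backward()

coeff_a, coeff_v = 0.6, 1.0  # would come from the discrepancy ratio in practice
with torch.no_grad():
    for p in audio_enc.parameters():
        p.grad *= coeff_a
        p.grad += torch.zeros_like(p.grad).normal_(0, p.grad.std().item() + 1e-8)
    for p in visual_enc.parameters():
        p.grad *= coeff_v
        p.grad += torch.zeros_like(p.grad).normal_(0, p.grad.std().item() + 1e-8)

opt.step()  # modulation happens between backward() and step()
```

Only the encoder gradients are touched; the fusion head is updated normally.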
3、 Experimental results
The authors first apply OGM-GE to several common fusion methods (baseline, concatenation, and summation) as well as the specially designed fusion method FiLM. The results are shown in the table below. The table shows that the performance of the two modalities is imbalanced: the audio modality clearly outperforms the visual modality. After adding OGM-GE, the model's performance improves significantly.

Comparison with other modulation strategies. The authors compare against modality-dropout and gradient-blending (Table 2); all modulation methods achieve performance improvements.

The ablation study is interesting: on the VGGSound dataset, OGM-GE initially performs worse than training without it, yet it eventually achieves a clear improvement. The reason is that suppressing the dominant modality early on lets the method mine the information in the other modality better in the later stage, which ultimately improves performance. Figure 3 plots the change of $\rho^a$: with OGM-GE training (blue line), the imbalance ratio between the modalities is significantly smaller than in the directly trained model (yellow line). However, because different modalities naturally carry different amounts of information, this ratio may not approach 1; it can only be reduced to within a reasonable range.

