【CVPR2022 oral】Balanced Multimodal Learning via On-the-fly Gradient Modulation
2022-07-28 05:01:00 【AI frontier theory group @ouc】

Paper: https://arxiv.org/abs/2203.15332
Code: https://github.com/GeWu-Lab/OGM-GE_CVPR2022
This work comes from the GeWu-Lab at Renmin University of China. It was accepted to CVPR 2022 as an Oral presentation, and the code has been open-sourced.
1. Research motivation
Using multimodal data for classification helps improve performance. In practice, however, existing methods fail to fully exploit the potential of multimodal data. (As shown in the figure below, the modality-specific encoders inside a multimodal model perform worse than their unimodal counterparts, indicating that existing models under-exploit unimodal feature representations.)

The authors argue that the main cause of this phenomenon is that one of the two modalities plays a dominant role while the other plays only an auxiliary one, and the dominant modality suppresses the optimization of the other. To address this problem, the authors propose OGM-GE (On-the-fly Gradient Modulation with Generalization Enhancement).
The authors give a vivid analogy: in a team-pursuit speed-skating race, the team's result is recorded when its last member crosses the line. Ordinarily, the skater representing the dominant modality neither communicates with nor assists the slower skater of the weak modality; even after reaching the finish first (training convergence), it still has to stop and wait for the weaker teammate (upper half of Figure 2). Under the regulation of OGM-GE, the dominant skater slows down to pull the weak skater along; the weak skater, helped along (for example, by drafting behind the leader), speeds up, and the overall performance of the team improves (lower half of Figure 2).

2. Main method
To address the imbalanced-optimization problem in multimodal classification, the authors adaptively modulate the gradients according to the performance gap between modalities and combine this with Gaussian noise to enhance generalization, yielding the widely applicable OGM-GE optimization method. The overall framework is shown in the figure below.

First, the gradient of each modality is adjusted adaptively by monitoring the difference between their contributions to the learning objective. To effectively measure the gap in unimodal representation ability within a multimodal model, the authors design a modality discrepancy ratio. During training, different gradient scaling coefficients are dynamically assigned to the modalities according to this ratio, so the optimization process of the model can be adaptively controlled throughout training.
Second, the authors' analysis shows that gradient scaling coefficients smaller than 1 weaken the intensity of the stochastic gradient noise during optimization, which may in turn hurt the model's generalization ability. A noise-enhancement strategy is therefore added on top of the gradient modulation: Gaussian noise is added to the gradient to restore (or even enhance) the stochastic gradient noise intensity, thereby improving the generalization performance of the model.
(1) On-the-fly gradient modulation
There are two modality encoders, denoted $\varphi^{u}$ with $u \in \{a, v\}$, corresponding to the audio and video modalities; the parameters of encoder $\varphi^{u}$ are $\theta^{u}$. Optimizing with plain gradient descent gives:
$$\theta_{t+1}^{u} = \theta_{t}^{u} - \eta \nabla_{\theta^{u}} L(\theta_{t}^{u}).$$
The authors' idea is to adaptively adjust the optimization speed of each modality. To this end, they define a discrepancy ratio $\rho^{u}_{t}$ based on per-sample scores:
$$s_{i}^{a} = \sum_{k=1}^{M} 1_{k=y_{i}} \cdot \mathrm{softmax}\!\left(W^{a}_{t} \cdot \varphi^{a}_{t}(\theta^{a}, x_{i}^{a}) + \frac{b}{2}\right)_{k},$$
$$s_{i}^{v} = \sum_{k=1}^{M} 1_{k=y_{i}} \cdot \mathrm{softmax}\!\left(W^{v}_{t} \cdot \varphi^{v}_{t}(\theta^{v}, x_{i}^{v}) + \frac{b}{2}\right)_{k},$$
$$\rho^{v}_{t} = \frac{\sum_{i \in B_{t}} s_{i}^{v}}{\sum_{i \in B_{t}} s_{i}^{a}}.$$
Since there are exactly two modalities, audio and video, $\rho^{a}_{t}$ is defined as the reciprocal of $\rho^{v}_{t}$. The authors use $\rho^{u}_{t}$ to dynamically monitor the contribution gap between the modalities, and derive the adaptive gradient scaling coefficient as follows:
$$k^{u}_{t} = \begin{cases} 1 - \tanh(\alpha \cdot \rho^{u}_{t}), & \rho^{u}_{t} > 1 \\ 1, & \text{otherwise,} \end{cases}$$
where $\alpha$ is a hyperparameter. The authors apply $k^{u}_{t}$ in the SGD optimizer; in each iteration, the network parameters are updated as:
$$\theta^{u}_{t+1} = \theta_{t}^{u} - \eta \cdot k_{t}^{u} \, \tilde{g}(\theta_{t}^{u}).$$
As can be seen, $k^{u}_{t}$ slows down the optimization of the dominant modality while leaving the other modality unaffected, thereby alleviating the modality-imbalance problem.
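
As a minimal sketch of the two formulas above (illustrative code, not the authors' release, which appears below; the per-modality logits logits_a / logits_v, the label tensor label, and the hyperparameter alpha are assumed inputs):

import torch

def ogm_coefficients(logits_a, logits_v, label, alpha=0.1):
    # s_i^u: softmax probability assigned to the ground-truth class,
    # summed over the current mini-batch B_t.
    idx = torch.arange(label.size(0), device=label.device)
    score_a = torch.softmax(logits_a, dim=1)[idx, label].sum()
    score_v = torch.softmax(logits_v, dim=1)[idx, label].sum()

    rho_v = score_v / score_a  # discrepancy ratio rho^v_t
    rho_a = 1.0 / rho_v        # rho^a_t is defined as its reciprocal

    # k^u_t = 1 - tanh(alpha * rho^u_t) when rho^u_t > 1, else 1:
    # only the currently dominant modality is slowed down.
    coeff_a = 1 - torch.tanh(alpha * rho_a) if rho_a > 1 else 1.0
    coeff_v = 1 - torch.tanh(alpha * rho_v) if rho_v > 1 else 1.0
    return coeff_a, coeff_v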
(2) Generalization enhancement
Theorem (informal): the noise in SGD is closely tied to its generalization ability: the larger the SGD noise, the better the generalization. The covariance of the SGD noise is proportional to the ratio of the learning rate to the batch size.
According to this theorem, a larger gradient-noise covariance usually brings better generalization. However, the authors' derivation shows that the OGM modulation weakens the SGD noise and thus degrades generalization. It is therefore necessary to control the SGD noise in order to recover the generalization ability.
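
To make the underlying step explicit, here is a sketch under the common modeling assumption (consistent with the update rule below) that the stochastic gradient noise is approximately Gaussian:

$$\tilde{g}(\theta^{u}_{t}) = \nabla_{\theta^{u}} L(\theta^{u}_{t}) + \xi_{t}, \qquad \xi_{t} \sim \mathcal{N}\!\big(0, \Sigma^{sgd}(\theta^{u}_{t})\big),$$

$$\operatorname{Cov}\!\big(k^{u}_{t}\,\xi_{t}\big) = (k^{u}_{t})^{2}\,\Sigma^{sgd}(\theta^{u}_{t}).$$

Since $0 < k^{u}_{t} \le 1$, modulating the dominant modality's gradient shrinks its noise covariance quadratically, and by the theorem above this weakened noise is precisely what hurts generalization.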
To compensate, the authors add randomly sampled Gaussian noise $h(\theta^{u}_{t})$ to the gradient, and the update rule becomes:
$$\theta_{t+1}^{u} = \theta_{t}^{u} - \eta \nabla_{\theta^{u}} L'(\theta_{t}^{u}) + \eta \xi_{t}', \qquad \xi_{t}' \sim \mathcal{N}\!\left(0, (k_{t}^{u})^{2} \cdot \Sigma^{sgd}(\theta_{t}^{u})\right).$$
The overall algorithm flow is as follows:

The key part of the released code is as follows (lightly annotated; the three helper callables are defined here so the snippet is self-contained):

import torch
import torch.nn as nn

softmax = nn.Softmax(dim=1)
relu = nn.ReLU(inplace=True)
tanh = nn.Tanh()

# Forward pass: audio features a, visual features v, and fused logits out.
a, v, out = model(spec, image)

# Per-modality logits from the shared fusion classifier: each 512-column half
# of fc_out.weight acts on one modality's features, and the bias is split
# evenly between the modalities (the b/2 term in the formulas above).
out_v = (torch.mm(v, torch.transpose(model.fusion_module.fc_out.weight[:, :512], 0, 1)) +
         model.fusion_module.fc_out.bias / 2)
out_a = (torch.mm(a, torch.transpose(model.fusion_module.fc_out.weight[:, 512:], 0, 1)) +
         model.fusion_module.fc_out.bias / 2)

# s_i^u: softmax probability of the ground-truth class, summed over the batch.
score_v = sum([softmax(out_v)[i][label[i]] for i in range(out_v.size(0))])
score_a = sum([softmax(out_a)[i][label[i]] for i in range(out_a.size(0))])

ratio_v = score_v / score_a  # discrepancy ratio rho^v_t
ratio_a = 1 / ratio_v        # rho^a_t, its reciprocal

# OGM: slow down only the currently dominant modality.
if ratio_v > 1:
    coeff_v = 1 - tanh(args.alpha * relu(ratio_v))
    coeff_a = 1
else:
    coeff_a = 1 - tanh(args.alpha * relu(ratio_a))
    coeff_v = 1

# Scale the gradients of the 4-D (convolutional) weight tensors of each
# encoder, then add zero-mean Gaussian noise whose std follows the gradient's
# own std (GE), restoring the stochastic-gradient noise intensity.
for name, parms in model.named_parameters():
    layer = str(name).split('.')[1]
    if 'audio' in layer and len(parms.grad.size()) == 4:
        parms.grad *= coeff_a
        parms.grad += torch.zeros_like(parms.grad).normal_(0, parms.grad.std().item() + 1e-8)
    if 'visual' in layer and len(parms.grad.size()) == 4:
        parms.grad *= coeff_v
        parms.grad += torch.zeros_like(parms.grad).normal_(0, parms.grad.std().item() + 1e-8)
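
Note that the snippet assumes a preceding backward pass, so that parms.grad is already populated. For context, a hedged sketch of one training iteration (criterion and optimizer are illustrative names, not taken from the repo):

optimizer.zero_grad()
a, v, out = model(spec, image)   # forward pass
loss = criterion(out, label)
loss.backward()                  # populates parms.grad for all parameters
# ... compute coeff_a / coeff_v and modulate the gradients as above ...
optimizer.step()                 # applies the modulated, noise-enhanced gradients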
3. Experimental results
The authors first apply the OGM-GE method to several common fusion approaches (baseline, concatenation, and summation) as well as a specially designed fusion method, FiLM. The results are shown in the table below. As the table shows, the performance of the two modalities is imbalanced: the audio modality clearly outperforms the visual modality. After applying OGM-GE, the model's performance improves significantly.

Comparison with other modulation strategies. The authors compare OGM-GE with modality-dropout and gradient-blending (as shown in Table 2); all of the modulation methods achieve performance improvements.

The ablation experiments are quite interesting. On the VGGSound dataset, training with OGM-GE initially performs worse than training without it, but OGM-GE eventually achieves a clear performance gain: by restraining the dominant modality early on, the method mines the information of the other modality more thoroughly, which improves performance in the later stage. Figure 3 plots the change of $\rho^{a}$: with OGM-GE (the blue line), the imbalance ratio between modalities is noticeably smaller than for the directly trained model (the yellow line). However, because different modalities naturally carry different amounts of information, this ratio may never approach 1; it can only be reduced to a reasonable range.

