Improving Multimodal Accuracy Through Modality Pre-training and Attention
2022-07-06 22:37:00 【Rainylt】
paper:
The authors find that the different modalities of a multimodal model converge at inconsistent speeds, so they pre-train each modality separately. They then use attention (not self-attention) to obtain a weight for each modality, multiply each modality's feature by its weight, and pass the result through concat -> FC -> logits.
First, the attention. It is not the Q*K mechanism of self-attention. Instead, the features of the three modalities are concatenated directly and passed through an FC layer to obtain the weights:
H holds the features of the three modalities (v, a, t) and has shape (3, m). The output is three weights, one per modality.
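The weighting step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the softmax normalization, the random initial parameters, the helper names (`linear`, `modality_weights`, `fuse`), the feature size `m = 4`, and the class count `num_classes = 6` are all hypothetical choices made for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weight, bias):
    """Fully connected layer; `weight` has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def modality_weights(H, weight, bias):
    """H is a list of 3 feature vectors (v, a, t), each of length m."""
    flat = [x for feat in H for x in feat]   # concat: length 3 * m
    scores = linear(flat, weight, bias)      # FC: one score per modality
    return softmax(scores)                   # assumption: softmax normalization


def fuse(H, alphas):
    """Scale each modality feature by its weight, then concatenate."""
    return [a * x for a, feat in zip(alphas, H) for x in feat]


random.seed(0)
m = 4            # hypothetical feature size
num_classes = 6  # hypothetical number of emotion classes

# Stand-ins for the features from the pre-trained modality encoders.
H = [[random.gauss(0, 1) for _ in range(m)] for _ in range(3)]

# Attention FC: 3 * m inputs -> 3 modality weights.
W_att = [[random.gauss(0, 0.1) for _ in range(3 * m)] for _ in range(3)]
b_att = [0.0] * 3

# Classifier FC: 3 * m inputs -> logits.
W_cls = [[random.gauss(0, 0.1) for _ in range(3 * m)] for _ in range(num_classes)]
b_cls = [0.0] * num_classes

alphas = modality_weights(H, W_att, b_att)
fused = fuse(H, alphas)
logits = linear(fused, W_cls, b_cls)
print("modality weights (v, a, t):", alphas)
```

Note that, unlike self-attention, no query-key product appears anywhere: the weights come from a single FC layer applied to the concatenated features.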
According to the authors' observation, when a multimodal model is trained directly, the loss of each modality falls at a different speed (that is, the convergence rates differ):
The three figures correspond to different datasets. The second and third datasets fare slightly better; on the first dataset, text converges too fast.
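One rough way to compare convergence speed across modalities is to count how many epochs each loss curve needs to achieve most of its total drop. The sketch below is only an illustration: the helper `epochs_to_reach`, the 90% threshold, and the decay rates are assumptions, and the curves are synthetic exponentials, not results from the paper.

```python
import math


def epochs_to_reach(losses, fraction=0.9):
    """First epoch at which `fraction` of the total loss drop has been achieved."""
    start, end = losses[0], min(losses)
    target = start - fraction * (start - end)
    for epoch, loss in enumerate(losses):
        if loss <= target:
            return epoch
    return len(losses) - 1


# Toy curves only: exponential decay with hypothetical rates per modality.
rates = {"text": 0.8, "audio": 0.3, "video": 0.15}
curves = {name: [math.exp(-rate * epoch) for epoch in range(30)]
          for name, rate in rates.items()}

speed = {name: epochs_to_reach(curve) for name, curve in curves.items()}
for name, epochs in speed.items():
    print(f"{name}: reaches 90% of its loss drop at epoch {epochs}")
```

With real training logs, the per-modality loss lists would replace the synthetic `curves`; a large gap between modalities is the imbalance that separate pre-training is meant to reduce.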
Now look at the weights of the different modalities:
On the first dataset, text takes up most of the weight before pre-training, perhaps because it is more important, or perhaps because its features are of better quality. After pre-training, video catches up, which suggests that the video features were simply not well trained before.
Which of the three modalities matters most depends on the case:
Here, all three modalities can show fear.
Here, text and audio can show surprise, but the image cannot; the facial expression is relatively flat (it is hard to say).
This example is better: although the speaker is apologizing, they are actually laughing, so the emotion should be happy and audio should be dominant.
The attention weights proposed in this paper correspond to these importances: