Paper reading [MM21 Pre-training for Video Understanding Challenge:Video Captioning with Pretraining Techniqu]
2022-07-06 23:35:00 【hei_hei_hei_】
MM21 Pre-training for Video Understanding Challenge:Video Captioning with Pretraining Technique
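Overview
- Published: ACM MM 2021
- Idea: Building on X-Linear Attention and borrowing from XLAN, the paper fuses multi-modality features. It proposes a multi-path XLAN model that fuses several single-modality features into a better fused representation. In addition, the team took first place in the Pre-training for Video Understanding Challenge by using data augmentation and by ensembling the multi-path XLAN (early fusion) with a fine-tuned pretrained OPT (late fusion).
Detailed design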
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Features from almost every modality in the video are considered, including:
(1) appearance feature ($30\ \text{frames} \times 2048\ \text{dims}$): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature ($30\ \text{frames} \times 2048\ \text{dims}$): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature ($50\ \text{frames} \times 2048\ \text{dims}$): VinVL model pretrained on the Visual Genome dataset
(4) audio feature ($30\ \text{frames} \times 2048\ \text{dims}$): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
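This feels like essentially OPT + XLAN, with almost no changes.
$F_x$ denotes the input feature of modality $x$, $E_x$ embeds each modality's features into the same semantic hidden space, and $\mathrm{Encoder}_x$ is the XLAN encoder.
Here $AGG_{in}$ and $AGG_{ctx}$ denote the aggregation method, which can be one of the following: average pooling, concatenation, or additional attention.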
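The three aggregation options can be illustrated with a minimal sketch. It assumes each modality has already been encoded into one vector of the same length; the toy values, the function names, and the use of dot-product attention with an external query are illustrative assumptions, not the paper's implementation.

```python
import math


def average_pooling(vectors):
    """Element-wise mean of K same-length vectors."""
    k = len(vectors)
    return [sum(vals) / k for vals in zip(*vectors)]


def concatenation(vectors):
    """Join K vectors end to end; output length is the sum of the input lengths."""
    return [x for vec in vectors for x in vec]


def attention(vectors, query):
    """Weighted sum of the vectors, with weights = softmax(dot(query, vector)).

    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    fused = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
    return fused, weights


# Hypothetical per-modality vectors (toy 4-dim values, not real features).
modal_vectors = {
    "appearance": [1.0, 0.0, 2.0, 0.0],
    "motion": [0.0, 1.0, 0.0, 2.0],
    "region": [1.0, 1.0, 1.0, 1.0],
    "audio": [2.0, 2.0, 0.0, 0.0],
}
vectors = list(modal_vectors.values())

avg = average_pooling(vectors)
cat = concatenation(vectors)
# An all-zero query scores every modality equally, so the weights are uniform.
att, weights = attention(vectors, query=[0.0, 0.0, 0.0, 0.0])

print(avg)       # [1.0, 1.0, 0.75, 0.75]
print(len(cat))  # 16
print(weights)   # [0.25, 0.25, 0.25, 0.25]
```

Average pooling and attention keep the output at the per-modality width, while concatenation grows it with the number of modalities, so a model using concatenation would need a wider downstream layer.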
2. Multi-Modality Pretrained Model Finetuning
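The pretrained Omni-Perception Pre-Trainer model (OPT) is fine-tuned.
- OPT
Three separate encoders encode text, images, and audio, and project the features into the same latent space. A transformer then fuses the three kinds of features (inter- and intra-modality interactions), followed by a text decoder and a visual decoder that generate text and images respectively. Token-level, modality-level, and sample-level tasks are also designed so that the model gains cross-modal understanding and generation ability. The authors fine-tune this model on the MSR-VTT dataset.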
Experiments
- Ablation Studies
$SP$ denotes directly concatenating the multi-modality features, reducing the dimension to 1024, and feeding the result into an encoder-decoder XLAN/Transformer model.
- Comparison to State-of-the-art
$+RL$ indicates that reinforcement learning was used during fine-tuning.
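The $SP$ baseline from the ablation studies can be sketched roughly as follows. The notes do not say along which axis the features are concatenated; because the region feature has a different length from the others, this sketch assumes concatenation along the sequence axis. The toy sizes (8 and 4 standing in for 2048 and 1024, and shortened sequences), the random weights, and the helper names are all hypothetical.

```python
import random


def make_features(n_frames, dim, rng):
    """Random stand-in for one modality's (n_frames x dim) feature matrix."""
    return [[rng.random() for _ in range(dim)] for _ in range(n_frames)]


def linear(rows, weight):
    """Apply y = x @ W to every row; weight has shape (in_dim, out_dim)."""
    out_dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) for j in range(out_dim)]
        for row in rows
    ]


rng = random.Random(0)
IN_DIM, OUT_DIM = 8, 4  # toy stand-ins for 2048 and 1024

# Toy sequence lengths standing in for the 30/30/50/30 lengths listed earlier.
features = {
    "appearance": make_features(3, IN_DIM, rng),
    "motion": make_features(3, IN_DIM, rng),
    "region": make_features(5, IN_DIM, rng),
    "audio": make_features(3, IN_DIM, rng),
}

# Assumption: concatenate along the sequence axis, giving one long sequence.
sequence = [row for feats in features.values() for row in feats]

# Dimension reduction with a randomly initialised (untrained) projection.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(OUT_DIM)] for _ in range(IN_DIM)]
reduced = linear(sequence, weight)

print(len(reduced), len(reduced[0]))  # 14 4
```

The reduced sequence is what would be passed to a single encoder-decoder, in contrast to the multi-path design where each modality has its own encoder.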