Paper reading 【MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniqu】
2022-07-06 23:35:00 【hei_hei_hei_】
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique
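Overview
- Published: ACM MM 2021
- Idea: uses X-Linear Attention and follows XLAN to fuse multi-modality features. The paper proposes a multi-path XLAN model that fuses several single-modality features into a better fused representation. In addition, in the Pre-training for Video Understanding Challenge it took first place by combining data augmentation with an ensemble of multi-path XLAN (early fuse) and a fine-tuned pretrained OPT (late fuse).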
Detailed design
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Features from almost every modality in the video are considered, including:
(1) appearance feature ($30\ \text{frames} \times 2048\ \text{dims}$): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature ($30\ \text{frames} \times 2048\ \text{dims}$): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature ($50\ \text{frames} \times 2048\ \text{dims}$): VinVL model pretrained on the Visual Genome dataset
(4) audio feature ($30\ \text{frames} \times 2048\ \text{dims}$): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
This feels like essentially OPT + XLAN, with almost no changes.
$F_x$ denotes the input features, $E_x$ embeds the features of each modality into the same semantic hidden space, and $Encoder_x$ is the XLAN encoder.
Here $AGG_{in}$ and $AGG_{ctx}$ denote aggregation operators, with the following options: average pooling, concatenation, additional attention.
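The three aggregation options can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the hidden size `D`, the linear embeddings standing in for $E_x$, the omission of a real XLAN encoder, the mean-pooling of each modality to a single vector, and the scoring vector `w` used for the attention variant are all assumptions made for illustration, and the features are random stand-ins with the shapes listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # hypothetical shared hidden size, not taken from the paper

# (frames, dims) per modality as listed in the notes; values are random stand-ins
shapes = {"appearance": (30, 2048), "motion": (30, 2048),
          "region": (50, 2048), "audio": (30, 2048)}
features = {name: rng.standard_normal(shape) for name, shape in shapes.items()}

# E_x: one hypothetical linear embedding per modality into the shared space
embed = {name: rng.standard_normal((2048, D)) / np.sqrt(2048) for name in shapes}


def encode(name):
    """Stand-in for E_x followed by Encoder_x (the real encoder is an XLAN encoder)."""
    return features[name] @ embed[name]  # (frames, D)


# One pooled vector per modality: mean over the frame axis, stacked to (4, D)
pooled = np.stack([encode(name).mean(axis=0) for name in shapes])


def agg_average(x):
    """Average pooling across modalities: (M, D) -> (D,)."""
    return x.mean(axis=0)


def agg_concat(x):
    """Concatenation across modalities: (M, D) -> (M * D,)."""
    return x.reshape(-1)


def attention_weights(x, w):
    """Softmax over one scalar score per modality: (M, D), (D,) -> (M,)."""
    scores = x @ w
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def agg_attention(x, w):
    """Attention-weighted sum across modalities: (M, D) -> (D,)."""
    return attention_weights(x, w) @ x


w = rng.standard_normal(D) / np.sqrt(D)  # hypothetical scoring vector (learned in practice)
print(agg_average(pooled).shape, agg_concat(pooled).shape, agg_attention(pooled, w).shape)
```

Average pooling and attention keep the output size at `D` regardless of the number of modalities, while concatenation grows with it, so a concatenation variant would normally need a projection back to the hidden size before the decoder.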
2. Multi-Modality Pretrained Model Finetuning
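Fine-tune the pretrained Omni-Perception Pre-Trainer (OPT) model.
- OPT
Three encoders separately encode text, images, and audio and map the features into the same latent space; a transformer then fuses the three kinds of features (inter- and intra-modality interactions), followed by a text decoder and a visual decoder that generate text and images respectively. Token-level, modality-level, and sample-level tasks are also designed so that the model gains cross-modal understanding and generation abilities. The authors fine-tune this model on the MSR-VTT dataset.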
Experiments
- Ablation Studies
$SP$ means directly concatenating the multi-modality features, reducing the dimension to 1024, and feeding the result into the encoder-decoder XLAN/Transformer model.
- Comparison to State-of-the-art
$+RL$ indicates that reinforcement learning was used during fine-tuning.
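The SP baseline from the ablation study (concatenate the multi-modality features, then reduce the dimension to 1024) could look roughly like the sketch below. The notes do not say along which axis the features are concatenated; because the region feature has 50 frames while the others have 30, this sketch assumes concatenation along the frame axis. The projection matrix `W` and bias `b` are hypothetical, and the features are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins with the (frames, dims) shapes listed in the notes
shapes = {"appearance": (30, 2048), "motion": (30, 2048),
          "region": (50, 2048), "audio": (30, 2048)}
feats = [rng.standard_normal(shape) for shape in shapes.values()]

# Assumption: concatenate along the frame axis, giving one long token sequence
tokens = np.concatenate(feats, axis=0)  # (140, 2048)

# Hypothetical linear projection that reduces each token to 1024 dims
W = rng.standard_normal((2048, 1024)) / np.sqrt(2048)
b = np.zeros(1024)
sp_input = tokens @ W + b  # (140, 1024), the sequence fed to the encoder-decoder

print(tokens.shape, sp_input.shape)
```

Unlike the multi-path design, this baseline has a single shared projection and a single encoder, so every modality is treated identically before fusion.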