Paper Reading: MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique
2022-07-06 23:35:00 【hei_hei_hei_】
MM21 Pre-training for Video Understanding Challenge:Video Captioning with Pretraining Technique
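Overview
- Published: ACM MM 2021
- Idea: Using X-Linear Attention and borrowing from XLAN, the paper fuses multi-modality features with a proposed multi-path XLAN model, which combines several single-modality features into a stronger fused representation. In addition, in the Pre-training for Video Understanding Challenge, the authors took first place by applying data augmentation and ensembling multi-path XLAN (early fuse) with a fine-tuned pretrained OPT (late fuse).
Detailed Design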
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Features from nearly every modality in the video are considered, including:
(1) appearance feature ($30\ \text{frames} \times 2048\ \text{dims}$): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature ($30\ \text{frames} \times 2048\ \text{dims}$): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature ($50\ \text{frames} \times 2048\ \text{dims}$): VinVL model pretrained on the Visual Genome dataset
(4) audio feature ($30\ \text{frames} \times 2048\ \text{dims}$): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
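My impression is that this is essentially OPT + XLAN with almost no changes.
$F_x$ denotes the input features of modality $x$, $E_x$ embeds each modality's features into a shared semantic hidden space, and $Encoder_x$ is the XLAN encoder.
Here $AGG_{in}$ and $AGG_{ctx}$ denote the aggregation methods, with the following options: average pooling, concatenation, additional attention.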
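The three aggregation options can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: the function names, the toy 3-dimensional vectors, and the dot-product scoring with a fixed `query` vector are all assumptions made for illustration (the paper's attention variant would use learned parameters).

```python
import math

def agg_average(vectors):
    """Average pooling: element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]

def agg_concat(vectors):
    """Concatenation: join all vectors end to end."""
    return [v for vec in vectors for v in vec]

def agg_attention(vectors, query):
    """Attention: softmax over dot-product scores, then a weighted sum.

    `query` is a hypothetical stand-in for learned attention parameters.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]

# Toy per-modality vectors (real features would be far larger).
appearance = [1.0, 0.0, 2.0]
motion = [0.0, 1.0, 2.0]
audio = [2.0, 2.0, 2.0]
modalities = [appearance, motion, audio]

avg = agg_average(modalities)
cat = agg_concat(modalities)
# A zero query gives uniform weights, so attention reduces to averaging.
att = agg_attention(modalities, query=[0.0, 0.0, 0.0])
```

Average pooling and attention keep the output dimension fixed, while concatenation grows it with the number of modalities, so a concatenated result typically needs a projection before it is passed on.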
2. Multi-Modality Pretrained Model Finetuning
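The pretrained Omni-Perception Pre-Trainer (OPT) model is fine-tuned.
- OPT
Three separate encoders encode text, images, and audio and map their features into the same latent space. A transformer then fuses the three kinds of features (inter- and intra-modality interactions), followed by a text decoder and a visual decoder that generate text and images respectively. Token-level, modality-level, and sample-level tasks are designed so that the model acquires cross-modal understanding and generation abilities. The authors fine-tune this model on the MSR-VTT dataset.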
Experiments
- Ablation Studies
$SP$ denotes directly concatenating the multi-modality features, reducing the dimension to 1024, and feeding the result into an encoder-decoder XLAN/Transformer model.
- Comparison to State-of-the-art
$+RL$ indicates that reinforcement learning was used during fine-tuning.
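The $SP$ baseline from the ablation study (concatenate the features, then reduce the dimension) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the dimensions are shrunk (8 per modality projected to 4, standing in for 2048 and 1024), the weights are random rather than learned, and only modalities with matching frame counts are concatenated. All names are hypothetical.

```python
import random

def concat_features(*modalities):
    """Concatenate per-frame vectors from several modalities along the feature axis."""
    n_frames = len(modalities[0])
    if any(len(m) != n_frames for m in modalities):
        raise ValueError("all modalities must have the same number of frames")
    return [[x for m in modalities for x in m[t]] for t in range(n_frames)]

def linear_reduce(frames, weight):
    """Project each frame vector through an (in_dim x out_dim) weight matrix."""
    return [
        [sum(x * w for x, w in zip(frame, column)) for column in zip(*weight)]
        for frame in frames
    ]

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random values standing in for extracted features or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

N_FRAMES, IN_DIM, OUT_DIM = 30, 8, 4  # toy sizes, assumed for illustration

appearance = random_matrix(N_FRAMES, IN_DIM)
motion = random_matrix(N_FRAMES, IN_DIM)
audio = random_matrix(N_FRAMES, IN_DIM)

fused = concat_features(appearance, motion, audio)  # 30 x 24
weight = random_matrix(3 * IN_DIM, OUT_DIM)         # 24 x 4
reduced = linear_reduce(fused, weight)              # 30 x 4
```

Unlike the multi-path design, this baseline gives every modality the same single encoder input, which is what the ablation compares against.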