Paper reading [MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique]
2022-07-07 05:34:00 【hei_ hei_ hei_】
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique
Summary
- Published: ACM MM 2021
- Idea: Building on X-Linear Attention (XLAN), the authors propose a multi-path XLAN model that fuses multiple single-modality features into a better multi-modality feature. In the Pre-training for Video Understanding Challenge they took first place by combining data augmentation with an ensemble of the multi-path XLAN (early fusion) and a fine-tuned pretrained OPT (late fusion).
Detailed design
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Nearly every modality available in the video is considered, including:
(1) appearance feature (30 frames × 2048 dims): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature (30 frames × 2048 dims): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature (50 frames × 2048 dims): VinVL model pretrained on the Visual Genome dataset
(4) audio feature (30 frames × 2048 dims): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
To me this reads as OPT + XLAN with almost nothing changed.

$F_x$ denotes the input feature. $E_x$ embeds the features of each modality into the same semantic hidden space, and $Encoder_x$ is the XLAN encoder.
Here $AGG_{in}$ and $AGG_{ctx}$ denote the aggregation methods. There are several options: average pooling, concatenation, and additional attention.
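To make the aggregation options concrete, here is a minimal sketch in plain Python. It is not the authors' code: it assumes each modality has already been embedded into one vector of the same length, and the function names, the toy vectors, and the dot-product scoring used in the attention variant are my own illustrative choices (the paper's attention form may differ).

```python
import math


def average_pool(features):
    """Element-wise mean over per-modality vectors of equal length."""
    n = len(features)
    return [sum(vals) / n for vals in zip(*features)]


def concatenate(features):
    """Join the per-modality vectors end to end."""
    return [v for vec in features for v in vec]


def attention_pool(features, query):
    """Softmax-weighted sum of the vectors.

    Scores are dot products with a query vector; this scoring rule is an
    assumption made for illustration only.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * vec[i] for w, vec in zip(weights, features)) for i in range(dim)]


# Toy inputs: three modalities, each already embedded to 4 dims (hypothetical values).
appearance = [1.0, 0.0, 2.0, 0.0]
motion = [0.0, 1.0, 0.0, 2.0]
audio = [1.0, 1.0, 1.0, 1.0]
feats = [appearance, motion, audio]

avg = average_pool(feats)   # same length as each input
cat = concatenate(feats)    # length grows with the number of modalities
att = attention_pool(feats, query=[0.0, 0.0, 0.0, 0.0])  # zero query gives uniform weights
```

Average pooling and attention keep the hidden size fixed, while concatenation grows it with the number of modalities, so a projection back to the model dimension would normally follow it.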
2. Multi-Modality Pretrained Model Finetuning
The pretrained Omni-Perception Pre-Trainer (OPT) model is fine-tuned.
- OPT

OPT uses three separate encoders to encode text, images, and audio, and projects their features into the same latent space. A transformer then fuses the three features (inter- and intra-modality interactions), followed by a text decoder and a visual decoder that generate text and images respectively. Training operates at the token level, modality level, and sample level so that the model acquires cross-modal understanding and generation abilities. The authors fine-tune it on the MSR-VTT dataset.
Experiments
- Ablation Studies

$SP$ means the multi-modality features are directly concatenated, reduced to 1024 dimensions, and then fed into an encoder-decoder XLAN/Transformer model.
- Comparison to State-of-the-art

$+RL$ indicates that reinforcement learning is used during fine-tuning.