Paper reading [MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique]
2022-07-07 05:34:00 【hei_ hei_ hei_】
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique
Summary
- Venue: ACM MM 2021
- Idea: Following XLAN, use X-Linear Attention to fuse multi-modality features. The authors propose a multi-path XLAN model that fuses multiple single-modality features into a better fused representation. In the Pre-training for Video Understanding Challenge, they took first place by combining data augmentation with an ensemble of the multi-path XLAN (early fusion) and a fine-tuned pretrained OPT (late fusion).
Detailed design
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Almost every modality available in the video is used, including:
(1) appearance feature ($30\ \text{frames} \times 2048\ \text{dims}$): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature ($30\ \text{frames} \times 2048\ \text{dims}$): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature ($50\ \text{frames} \times 2048\ \text{dims}$): VinVL model pretrained on the Visual Genome dataset
(4) audio feature ($30\ \text{frames} \times 2048\ \text{dims}$): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
My impression is that this is essentially OPT + XLAN, with almost nothing changed.

$F_x$ denotes the input feature of modality $x$; $E_x$ embeds each modality's features into a shared semantic hidden space; $Encoder_x$ is an XLAN encoder.
Here $AGG_{in}$ and $AGG_{ctx}$ denote the aggregation method, for which there are several options: average pooling, concatenation, and additional attention.
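The aggregation options listed above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the two-dimensional context vectors, and the use of plain dot-product attention as a stand-in for the attention option are all assumptions made for illustration. Each modality is assumed to have already been reduced to one context vector in the shared space.

```python
import math

def average_pooling(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concatenation(vectors):
    """Join vectors end to end; the output size is the sum of the input sizes."""
    return [x for vec in vectors for x in vec]

def attention_pooling(vectors, query):
    """Softmax-weighted sum; each vector is scored by its dot product with `query`.

    Dot-product scoring is an assumption here, not taken from the paper.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]

# Hypothetical per-modality context vectors, already in the shared space.
contexts = [
    [1.0, 0.0],  # appearance
    [0.0, 1.0],  # motion
    [1.0, 1.0],  # audio
]

avg = average_pooling(contexts)
cat = concatenation(contexts)
att = attention_pooling(contexts, query=[0.0, 0.0])
```

Average pooling and attention keep the hidden size fixed, while concatenation grows it with the number of modalities, so a concatenated vector would normally be followed by a projection back to the model dimension. With an all-zero query the attention weights are uniform, so `att` coincides with `avg`.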
2. Multi-Modality Pretrained Model Finetuning
Fine-tune the pretrained Omni-Perception Pre-Trainer (OPT) model.
- OPT

Three separate encoders encode text, image, and audio and map the features into the same latent space. A transformer then fuses the three features (inter- and intra-modality interactions), and a text decoder and a visual decoder generate text and images respectively. Pretraining is carried out at the token level, modality level, and sample level so that the model gains cross-modal understanding and generation ability. The authors fine-tune on the MSR-VTT dataset.
Experiments
- Ablation Studies

$SP$: the multi-modality features are concatenated directly, reduced to 1024 dimensions, and then fed into the encoder-decoder XLAN/Transformer model.
- Comparison to State-of-the-art

$+RL$ indicates that reinforcement learning is used during fine-tuning.
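For reference, the $SP$ baseline from the ablation study (concatenate the multi-modality features, then reduce the dimension) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, toy sizes (8 dims per modality, reduced to 4) stand in for the real 2048-dim features and 1024-dim target, and the projection weights are random rather than learned. Only streams with matching frame counts are concatenated here; how the 50-frame region features are aligned is not covered in these notes.

```python
import random

def concat_per_frame(modalities):
    """Concatenate the per-frame feature vectors of several modalities."""
    n_frames = len(modalities[0])
    assert all(len(m) == n_frames for m in modalities), "frame counts must match"
    return [[x for m in modalities for x in m[t]] for t in range(n_frames)]

def linear(frames, weight):
    """Project every frame with a weight matrix of shape (out_dim, in_dim)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]

random.seed(0)
N_FRAMES, DIM, OUT_DIM = 30, 8, 4  # toy stand-ins for 30 x 2048 -> 1024

def random_feature():
    """Hypothetical placeholder for one extracted modality feature."""
    return [[random.random() for _ in range(DIM)] for _ in range(N_FRAMES)]

appearance, motion, audio = random_feature(), random_feature(), random_feature()

fused = concat_per_frame([appearance, motion, audio])  # 30 x 24
weight = [[random.uniform(-0.1, 0.1) for _ in range(3 * DIM)]
          for _ in range(OUT_DIM)]                     # untrained projection
reduced = linear(fused, weight)                        # 30 x 4
```

The `reduced` sequence is what would be passed to the encoder-decoder model in this baseline; in a real system the projection would be trained jointly with the captioning model.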