Paper reading [MM21 pre training for video understanding challenge:video captioning with pre training techniqu]

2022-07-07 05:34:00 【hei_ hei_ hei_】

Published: ACM MM 2021
Idea: Following XLAN, use X-Linear Attention to fuse multi-modality features. The authors propose a multi-path XLAN model that fuses multiple single-modality features to obtain a better fused feature. In addition, they took first place in the Pre-training for Video Understanding Challenge by combining data augmentation with an ensemble of multi-path XLAN (early fusion) and a fine-tuned pretrained OPT (late fusion).
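To make the ensemble idea concrete, here is a minimal sketch of combining two captioning models at a single decoding step. The notes do not say how the ensemble is computed, so the rule used here (a weighted average of next-word log-probabilities) is an assumption, and the names `ensemble_step`, `xlan_probs` and `opt_probs`, along with the toy numbers, are hypothetical.

```python
import math

def ensemble_step(distributions, weights=None):
    """Combine next-word distributions from several models for one decoding step.

    ASSUMPTION: the combination rule is a weighted average of log-probabilities;
    the notes above do not state which rule the authors actually used.
    """
    if weights is None:
        weights = [1.0 / len(distributions)] * len(distributions)
    vocab = set(distributions[0])
    if any(set(d) != vocab for d in distributions):
        raise ValueError("all models must share the same vocabulary")
    scores = {
        word: sum(wt * math.log(d[word]) for wt, d in zip(weights, distributions))
        for word in vocab
    }
    best_word = max(scores, key=scores.get)
    return best_word, scores

# Toy next-word distributions from two hypothetical models.
xlan_probs = {"a": 0.5, "man": 0.3, "dog": 0.2}
opt_probs = {"a": 0.2, "man": 0.7, "dog": 0.1}

best_word, scores = ensemble_step([xlan_probs, opt_probs])
print(best_word)
```

In a real decoder this step would run inside beam search, once per generated word.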

Fine-tune the pretrained Omni-Perception Pre-Trainer (OPT) model.

OPT

Three encoders encode text, images, and audio separately and map the features into the same latent space. A transformer then fuses the three features (inter- and intra-modality interactions), followed by a text decoder and a visual decoder that generate text and images respectively. The model is trained at the token level, modality level, and sample level so that it acquires cross-modal understanding and generation ability. The authors fine-tune it on the MSR-VTT dataset.
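The "same latent space" step can be pictured with a small sketch: each modality gets its own projection to a shared size, and the resulting tokens are joined into one sequence that a fusion transformer could attend over. This is only an illustration under stated assumptions. The random matrices stand in for trained encoders, the dimensions are toy values, and joining along the sequence axis is an assumed input layout for the fusion step, not something the notes confirm.

```python
import random

LATENT_DIM = 8  # hypothetical shared latent size, chosen only for the demo

def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix standing in for a trained encoder's output layer."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(tokens, weights):
    """Map every token vector into the shared latent space."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weights] for tok in tokens]

rng = random.Random(0)

# Toy inputs: token count and feature size differ per modality.
inputs = {
    "text": [[rng.random() for _ in range(6)] for _ in range(5)],
    "image": [[rng.random() for _ in range(10)] for _ in range(4)],
    "audio": [[rng.random() for _ in range(3)] for _ in range(7)],
}

joint_sequence, modality_ids = [], []
for name, tokens in inputs.items():
    weights = make_linear(len(tokens[0]), LATENT_DIM, rng)
    latent = project(tokens, weights)
    joint_sequence.extend(latent)                # ASSUMPTION: join along the sequence axis
    modality_ids.extend([name] * len(latent))    # remember which modality each token came from

print(len(joint_sequence), len(joint_sequence[0]))
```

Because every token now has the same width, self-attention over `joint_sequence` can mix tokens within a modality and across modalities.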

Ablation Studies

$SP$: the multi-modality features are directly concatenated, reduced to 1024 dimensions, and then fed into the encoder-decoder XLAN/Transformer model.

Comparison to State-of-the-art
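The $SP$ setting described above (concatenate, then reduce to 1024 dimensions) can be sketched as follows. This is a minimal illustration, not the authors' code: the input feature sizes are toy values, the randomly initialised matrix stands in for a learned linear layer, and all function names are hypothetical.

```python
import random

def concat_features(*modalities):
    """Concatenate per-frame feature vectors from several modalities along the feature axis."""
    n_frames = len(modalities[0])
    if any(len(m) != n_frames for m in modalities):
        raise ValueError("all modalities must have the same number of frames")
    return [sum((list(m[t]) for m in modalities), []) for t in range(n_frames)]

def make_projection(in_dim, out_dim=1024, seed=0):
    """Random (out_dim x in_dim) matrix; a stand-in for a learned linear layer."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def reduce_dim(frames, weights):
    """Apply the linear projection to every frame."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights] for frame in frames]

# Toy input: 3 frames with a hypothetical 8-d appearance feature and 4-d audio feature.
appearance = [[0.1 * (t + i) for i in range(8)] for t in range(3)]
audio = [[0.01 * (t - i) for i in range(4)] for t in range(3)]

fused = concat_features(appearance, audio)        # 3 frames x 12 dims
weights = make_projection(in_dim=len(fused[0]))   # 1024 x 12
encoder_input = reduce_dim(fused, weights)        # 3 frames x 1024 dims
print(len(encoder_input), len(encoder_input[0]))  # prints "3 1024"
```

`encoder_input` is what would be passed on to the encoder-decoder model in this setting.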

$+RL$ indicates that reinforcement learning is used during fine-tuning.

Copyright notice
This article was written by [hei_ hei_ hei_]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/188/202207062335134406.html