Paper reading [MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique]
2022-07-07 05:34:00 【hei_ hei_ hei_】
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique
Summary
- Published: ACM MM 2021
- Idea: Using X-Linear Attention, and following XLAN's approach to fusing multi-modality features, the authors propose a multi-path XLAN model that fuses multiple single-modality features into a better fused feature. In addition, in the Pre-training for Video Understanding Challenge, they took first place by combining data augmentation with an ensemble of the multi-path XLAN (early fusion) and a fine-tuned pretrained OPT (late fusion).
Detailed design
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Almost all modalities of the video are considered, including:
(1) appearance feature ($30\ \text{frames} \times 2048\ \text{dims}$): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature ($30\ \text{frames} \times 2048\ \text{dims}$): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature ($50\ \text{frames} \times 2048\ \text{dims}$): VinVL model pretrained on the Visual Genome dataset
(4) audio feature ($30\ \text{frames} \times 2048\ \text{dims}$): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
My impression is that this is essentially OPT + XLAN, with almost nothing changed.

$F_x$ denotes the input feature; $E_x$ embeds the features of each modality into the same semantic hidden space; $\text{Encoder}_x$ is the XLAN encoder.
Here $\text{AGG}_{in}$ and $\text{AGG}_{ctx}$ denote the aggregation methods, for which there are several options: average pooling, concatenation, and additional attention.
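To make the aggregation options concrete, here is a minimal, dependency-free sketch of fusing per-modality vectors by average pooling, concatenation, and attention. It is an illustration under assumptions, not the paper's implementation: the sizes `IN_DIM` and `HIDDEN`, all function names, and the random projection standing in for the learned embedding $E_x$ are hypothetical; the XLAN encoder is replaced by a simple mean over time steps; and the attention variant uses plain dot-product scoring as a stand-in, since the note does not spell out the attention form. The sketch also does not distinguish $\text{AGG}_{in}$ from $\text{AGG}_{ctx}$.

```python
import math
import random

random.seed(0)

IN_DIM = 16   # hypothetical stand-in for the 2048-dim pretrained features
HIDDEN = 8    # hypothetical size of the shared hidden space


def make_linear(in_dim, out_dim):
    """Random projection matrix standing in for a learned embedding E_x."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[random.uniform(-scale, scale) for _ in range(out_dim)]
            for _ in range(in_dim)]


def embed(features, weight):
    """Project every frame/region vector into the shared hidden space."""
    columns = list(zip(*weight))
    return [[sum(v * w for v, w in zip(vec, col)) for col in columns]
            for vec in features]


def mean_over_time(seq):
    """Stand-in for Encoder_x: collapse (steps x dim) into one vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def agg_average(vectors):
    """Average pooling across modalities."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def agg_concat(vectors):
    """Concatenation across modalities."""
    return [x for vec in vectors for x in vec]


def agg_attention(vectors, query):
    """Softmax-weighted sum across modalities (dot-product scoring)."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * vec[i] for w, vec in zip(weights, vectors))
             for i in range(len(query))]
    return fused, weights


# Step counts follow the feature list above; the values are random placeholders.
steps = {"appearance": 30, "motion": 30, "region": 50, "audio": 30}
pooled = []
for name, n_steps in steps.items():
    feats = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(n_steps)]
    pooled.append(mean_over_time(embed(feats, make_linear(IN_DIM, HIDDEN))))

avg = agg_average(pooled)
cat = agg_concat(pooled)
att, weights = agg_attention(pooled, query=avg)
print(len(avg), len(cat), len(att))
```

Average pooling and attention keep the hidden size fixed regardless of how many modalities are fused, while concatenation grows linearly with the number of modalities, so a downstream layer would need to account for the larger input.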
2. Multi-Modality Pretrained Model Finetuning
The pretrained Omni-Perception Pre-Trainer (OPT) model is fine-tuned.
- OPT

Three separate encoders encode text, images, and audio, and convert the features into the same latent space. A transformer then fuses the three features (inter- and intra-modality interactions), followed by a text decoder and a visual decoder that generate text and images, respectively. The model is trained at the token level, modality level, and sample level so that it acquires cross-modal understanding and generation abilities. The authors fine-tune it on the MSR-VTT dataset.
Experiments
- Ablation Studies

$SP$: the multi-modality features are directly concatenated, reduced to 1024 dimensions, and then fed into the XLAN/Transformer encoder-decoder model.
- Comparison to State-of-the-art

$+RL$ indicates that reinforcement learning is used in fine-tuning.