Paper reading [MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique]
2022-07-07 05:34:00 【hei_ hei_ hei_】
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique
Summary
- Published: ACM MM 2021
- Idea: Following XLAN, X-Linear Attention is used to fuse multi-modality features. The paper proposes a multi-path XLAN model that fuses several single-modality features into a better fused feature. In addition, the authors took first place in the Pre-training for Video Understanding Challenge by applying data augmentation and ensembling the multi-path XLAN (early fusion) with a fine-tuned pretrained OPT (late fusion).
Detailed design
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Almost all modality features in the video are considered, including:
(1) appearance feature (30 frames × 2048 dims): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature (30 frames × 2048 dims): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature (50 frames × 2048 dims): VinVL model pretrained on the Visual Genome dataset
(4) audio feature (30 frames × 2048 dims): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
My impression is that this is essentially OPT + XLAN, with almost nothing changed.
$F_x$ denotes the input feature of modality $x$. $E_x$ embeds the features of each modality into the same semantic hidden space, and $Encoder_x$ is an XLAN encoder.
Here $AGG_{in}$ and $AGG_{ctx}$ denote the aggregation methods, for which there are several options: average pooling, concatenation, and additional attention.
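The aggregation options listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy vectors, and the dot-product scoring used for the attention variant are all assumptions, since the notes do not say how the attention weights are computed.

```python
import math
from typing import List

Vector = List[float]


def average_pooling(paths: List[Vector]) -> Vector:
    """Element-wise mean over the per-modality vectors (same dimension each)."""
    n = len(paths)
    return [sum(values) / n for values in zip(*paths)]


def concatenation(paths: List[Vector]) -> Vector:
    """Join the per-modality vectors end to end."""
    return [value for path in paths for value in path]


def attention(paths: List[Vector], query: Vector) -> Vector:
    """Softmax-weighted sum of the per-modality vectors.

    Scores are dot products with a query vector (an assumed scoring rule).
    """
    scores = [sum(q * p for q, p in zip(query, path)) for path in paths]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(paths[0])
    return [sum(w * path[i] for w, path in zip(weights, paths)) for i in range(dim)]


# Two hypothetical modality outputs, already embedded into the same space.
paths = [[1.0, 2.0], [3.0, 4.0]]
pooled = average_pooling(paths)
joined = concatenation(paths)
attended = attention(paths, query=[0.0, 0.0])  # zero query gives uniform weights
```

Average pooling and attention keep the feature dimension fixed, while concatenation grows it with the number of modality paths, so a model using concatenation needs a wider layer after the aggregation step.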
2. Multi-Modality Pretrained Model Finetuning
The pretrained Omni-Perception Pre-Trainer (OPT) model is fine-tuned.
- OPT
Three separate encoders encode text, images, and audio and map their features into the same latent space. A transformer then fuses the three features (inter- and intra-modality interactions), and a text decoder and a visual decoder generate text and images respectively. Learning takes place at the token level, modality level, and sample level so that the model acquires cross-modal understanding and generation ability. The authors fine-tune on the MSR-VTT dataset.
Experiments
- Ablation Studies
$SP$: the multi-modality features are directly concatenated, reduced to 1024 dimensions, and then fed into an encoder-decoder XLAN/Transformer model.
- Comparison to State-of-the-art
$+RL$ indicates that reinforcement learning is used during fine-tuning.
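The $SP$ baseline from the ablation studies (concatenate, then reduce the dimension) can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the reduction is assumed to be a bias-free linear projection, the concatenation is assumed to run along the feature axis of each frame, and toy sizes stand in for the 2048-dim inputs and the 1024-dim output. The notes do not say how modalities with different sequence lengths (for example the 50-entry region feature) are aligned, so the sketch requires equal frame counts.

```python
from typing import List

Matrix = List[List[float]]  # rows = frames, columns = feature dimensions


def concat_per_frame(modalities: List[Matrix]) -> Matrix:
    """Concatenate the features of several modalities frame by frame."""
    num_frames = len(modalities[0])
    if any(len(m) != num_frames for m in modalities):
        raise ValueError("all modalities must have the same number of frames")
    return [
        [value for m in modalities for value in m[t]]
        for t in range(num_frames)
    ]


def reduce_dimension(features: Matrix, weight: Matrix) -> Matrix:
    """Project each frame with a (d_in x d_out) weight matrix, without bias."""
    d_in = len(weight)
    if any(len(row) != d_in for row in features):
        raise ValueError("feature width must match the weight's input size")
    d_out = len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for row in features
    ]


# Toy data: 2 frames x 2 dims per modality, projected from 4 dims down to 2.
appearance = [[1.0, 2.0], [3.0, 4.0]]
motion = [[5.0, 6.0], [7.0, 8.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

fused = reduce_dimension(concat_per_frame([appearance, motion]), weight)
```

In a real model the weight matrix would be learned, and the projected sequence would be passed to the XLAN or Transformer encoder-decoder.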