Paper reading [MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique]
2022-07-07 05:34:00 【hei_ hei_ hei_】
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Technique
Summary
- Published in: ACM MM 2021
- Idea: Following the way XLAN fuses multi-modality features with X-Linear Attention, the authors propose a multi-path XLAN model that fuses multiple single-modality features into a better fused feature. In the Pre-training for Video Understanding Challenge, they took first place by combining data augmentation with an ensemble of the multi-path XLAN (early fusion) and a fine-tuned pretrained OPT (late fusion).
Detailed design
1. Single-Modality Pretrained Feature Fusion
Multi-Modality Feature Extraction
Almost all modality features in the video are considered, including:
(1) appearance feature ($30\ \text{frames} \times 2048\ \text{dims}$): FixResNeXt-101 network pretrained on the ImageNet-1k dataset
(2) motion feature ($30\ \text{frames} \times 2048\ \text{dims}$): irCSN-152 network pretrained on the Kinetics-400 dataset
(3) region feature ($50\ \text{frames} \times 2048\ \text{dims}$): VinVL model pretrained on the Visual Genome dataset
(4) audio feature ($30\ \text{frames} \times 2048\ \text{dims}$): CNN14 network pretrained on the AudioSet dataset
Multi-Modality Feature Fusion
My impression is that this is essentially OPT + XLAN, with almost nothing changed.
$F_x$ denotes the input feature of modality $x$; $E_x$ embeds the features of each modality into the same semantic hidden space; $\mathrm{Encoder}_x$ is the XLAN encoder.
Here $AGG_{in}$ and $AGG_{ctx}$ denote the aggregation methods, with several options: average pooling, concatenation, and additional attention.
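To make two of the aggregation options concrete, here is a minimal sketch of average pooling and concatenation over per-modality vectors. It is not the authors' implementation: the function names `agg_average` and `agg_concat`, the toy 4-dimensional vectors, and the choice to aggregate one vector per modality are all assumptions made for illustration, and the attention-based option is left out.

```python
from typing import List

Vector = List[float]


def agg_average(features: List[Vector]) -> Vector:
    """Average pooling: element-wise mean over modalities of equal dimension."""
    if not features:
        raise ValueError("need at least one modality feature")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("average pooling needs equal dimensions")
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def agg_concat(features: List[Vector]) -> Vector:
    """Concatenation: join the modality vectors end to end."""
    out: Vector = []
    for f in features:
        out.extend(f)
    return out


# Toy 4-dim embeddings standing in for the embedded features of three
# modalities (hypothetical values, not taken from the paper).
appearance = [1.0, 2.0, 3.0, 4.0]
motion = [3.0, 2.0, 1.0, 0.0]
audio = [2.0, 2.0, 2.0, 2.0]

pooled = agg_average([appearance, motion, audio])  # keeps the dimension at 4
joined = agg_concat([appearance, motion, audio])   # grows the dimension to 12
```

Average pooling keeps the feature dimension fixed no matter how many modalities are fused, while concatenation preserves every modality's values at the cost of a dimension that grows with the number of modalities.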
2. Multi-Modality Pretrained Model Finetuning
The pretrained Omni-Perception Pre-Trainer (OPT) model is fine-tuned.
- OPT
OPT uses three separate encoders to encode text, images, and audio and maps the resulting features into the same latent space. A transformer then fuses the three features (inter- and intra-modality interactions), followed by a text decoder and a visual decoder that generate text and images respectively. Learning at the token level, modality level, and sample level gives the model the ability to understand and generate across modalities. The authors fine-tune it on the MSR-VTT dataset.
Experiments
- Ablation Studies
$SP$ denotes directly concatenating the multi-modality features, reducing the dimension to 1024, and then feeding the result into an encoder-decoder XLAN/Transformer model.
- Comparison to State-of-the-art
$+RL$ indicates that reinforcement learning is used during fine-tuning.
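The $SP$ baseline from the ablation study (concatenate, then reduce the dimension) can be sketched as follows. This is only an illustration under stated assumptions: the modalities are assumed to be frame-aligned, which would not hold as-is for a feature with a different number of frames; the helper names `concat_per_frame` and `project` are hypothetical; the reduction is modeled as a plain linear projection; and the sizes are toy values rather than the 1024 dimensions mentioned above.

```python
from typing import List

Matrix = List[List[float]]  # rows = frames, columns = feature dims


def concat_per_frame(modalities: List[Matrix]) -> Matrix:
    """Concatenate frame-aligned modality features along the feature axis."""
    n_frames = len(modalities[0])
    if any(len(m) != n_frames for m in modalities):
        raise ValueError("modalities must have the same number of frames")
    return [[v for m in modalities for v in m[t]] for t in range(n_frames)]


def project(features: Matrix, weight: Matrix) -> Matrix:
    """Linear projection: (frames x d_in) times (d_in x d_out)."""
    d_in = len(weight)
    if any(len(row) != d_in for row in features):
        raise ValueError("feature dim must match the number of weight rows")
    d_out = len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for row in features
    ]


# Toy sizes: 2 frames, two modalities of 2 dims each, reduced from 4 to 2 dims.
# The weight matrix is hypothetical; a real model would learn it.
appearance = [[1.0, 0.0], [0.0, 1.0]]
motion = [[2.0, 2.0], [3.0, 3.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

fused = concat_per_frame([appearance, motion])  # shape: 2 frames x 4 dims
reduced = project(fused, weight)                # shape: 2 frames x 2 dims
```

The reduced sequence is what would then be passed to the encoder-decoder model; unlike the multi-path design, every modality shares a single encoder in this baseline.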