Paper reading [Semantic Tag Augmented XlanV Model for Video Captioning]
2022-07-07 05:34:00 【hei_ hei_ hei_】
Semantic Tag Augmented XlanV Model for Video Captioning
- Published: ACM MM 2021
- Code: ST-XlanV
- Idea: semantic tags generated by a pre-trained model are used to reduce the gap between modalities and strengthen the XlanV model. Cross-modal attention captures the interactions between dynamic and static features and between visual and semantic features. Three pre-training tasks are designed for tag alignment.
Detailed design
My impression is that the ideas in these ACM MM papers are very similar: they all closely resemble the original X-Linear paper and simply extend it to multiple modalities.
1. Semantic Tag Augmented XlanV Model
The overall framework is similar to earlier work: each multi-modal feature passes through its own XLAN encoder to extract high-order features, the encoded features are concatenated and fed into a cross encoder that extracts features containing cross-modal interactions, and finally an LSTM decodes them to generate captions.
2. Cross-modal Attention
Each feature is encoded by its own encoder and then has position information added. The features are then concatenated and fed into a single XLAN encoder, whose output is the cross-modal feature. This feature is average-pooled and fed into the LSTM. In that calculation, $\widetilde{C}$ denotes the feature after average pooling, and $E_{y_{t-1}}$ denotes the embedding of the word output at the previous time step.
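The fusion steps described above (add position information, concatenate, encode, average-pool) can be sketched in plain Python. This is only an illustration under stated assumptions: `add_position`, `cross_encoder` and `fuse_modalities` are hypothetical names, the positional signal is a toy one, and the XLAN cross encoder is replaced by an identity stub. How the pooled feature and $E_{y_{t-1}}$ are combined inside the LSTM is not specified in this note, so the sketch stops at the pooled vector.

```python
from typing import List

Vector = List[float]


def add_position(features: List[Vector]) -> List[Vector]:
    """Add a toy positional signal (index / length) to every dimension.

    Assumption: the real model uses learned or sinusoidal position encodings.
    """
    n = len(features)
    return [[x + i / n for x in vec] for i, vec in enumerate(features)]


def cross_encoder(sequence: List[Vector]) -> List[Vector]:
    """Placeholder for the cross-modal XLAN encoder (identity in this sketch)."""
    return sequence


def fuse_modalities(modalities: List[List[Vector]]) -> Vector:
    """Concatenate per-modality sequences, encode them, and average-pool."""
    joint: List[Vector] = []
    for feats in modalities:
        # Each modality gets its own position information before concatenation.
        joint.extend(add_position(feats))
    encoded = cross_encoder(joint)
    dim = len(encoded[0])
    # Average pooling over the joint sequence gives one vector (C-tilde).
    return [sum(vec[d] for vec in encoded) / len(encoded) for d in range(dim)]


# Toy inputs: two dynamic-feature steps, one static feature, one tag embedding.
dynamic = [[1.0, 2.0], [3.0, 4.0]]
static = [[5.0, 6.0]]
tags = [[7.0, 8.0]]
pooled = fuse_modalities([dynamic, static, tags])
```

Concatenating along the sequence axis lets the single cross encoder attend across modalities, which is the point of the design. The stub would have to be replaced by an actual attention block to model those interactions.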
3. Pre-training Tasks
- Tag Alignment Prediction (TAP): with 50% probability, the semantic tags of the current video are replaced with other tags, and the model predicts whether the tags have been replaced
- Mask Language Modeling (MLM): similar to BERT, 15% of the words in the input sentence are randomly masked
- Video Captioning (VCAP): caption generation
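The sampling behind the first two tasks can be sketched as follows. This is a minimal illustration, not the authors' code: `make_tap_example`, `mask_tokens` and the `[MASK]` token are hypothetical names, the negative tags are assumed to come from a pool of other videos' tags, and every selected word is simply replaced by the mask token (BERT's 80/10/10 replacement scheme is omitted).

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask token


def make_tap_example(
    tags: List[str],
    negative_pool: List[List[str]],
    rng: random.Random,
    replace_prob: float = 0.5,
) -> Tuple[List[str], int]:
    """Return (tags, label) for TAP; label 1 means the tags were replaced."""
    if rng.random() < replace_prob:
        # Swap in the tags of some other video (assumed to differ from `tags`).
        return list(rng.choice(negative_pool)), 1
    return list(tags), 0


def mask_tokens(
    tokens: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask each token independently; targets hold the original word or None."""
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)  # the model must recover this word
        else:
            masked.append(tok)
            targets.append(None)  # position not scored
    return masked, targets


rng = random.Random(0)
video_tags = ["dog", "run", "grass"]
other_tags = [["man", "cook", "kitchen"], ["car", "drive", "road"]]
tap_tags, tap_label = make_tap_example(video_tags, other_tags, rng)
masked, targets = mask_tokens("a dog runs on the grass".split(), rng)
```

The TAP label gives the model a binary signal about whether the tags match the video, while the MLM targets are scored only at masked positions.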
Experimental results
Ablation Studies
Summary: semantic tags act as a bridge between vision and language; the pre-training tasks help the model make full use of multi-modal interactions; reinforcement learning strategies can improve the performance of the model.
Performance Comparison
$P$ indicates that the model uses the pre-training tasks; $RL$ indicates the use of reinforcement learning strategies.