Paper reading [Semantic Tag Augmented XlanV Model for Video Captioning]
2022-07-07 05:34:00 【hei_ hei_ hei_】
Semantic Tag Augmented XlanV Model for Video Captioning
- Published: ACM MM 2021
- Code: ST-XlanV
- Idea: Semantic tags generated by a pre-trained model are used to reduce the gap between modalities and strengthen the XlanV model. Cross-modal attention captures the interactions between dynamic and static features and between visual and semantic features. Three pre-training tasks are designed for tag alignment.
Detailed design
My impression is that the ideas in these ACM MM papers are very similar to one another, and all closely resemble the original X-Linear paper, simply extended to multiple modalities.
1. Semantic Tag Augmented XlanV Model
The overall framework is similar to the previous ones: each modality's features pass through its own XLAN encoder to extract higher-order features, the results are concatenated and fed into a cross encoder that extracts features containing cross-modal interactions, and finally an LSTM decodes them to generate captions.
2. Cross-modal Attention
Each feature is encoded by its own encoder and then has position information added. The encoded features are concatenated and fed into a single XLAN encoder, whose output is the cross-modal feature. This output is average-pooled and fed into the LSTM.
Here $\widetilde{C}$ denotes the features after average pooling, and $E_{y_{t-1}}$ denotes the embedding of the word output at the previous time step.
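The fusion steps described in this section can be sketched roughly as follows. This is only a minimal illustration under several assumptions, not the paper's implementation: plain scaled dot-product self-attention stands in for the XLAN (X-Linear attention) encoder, the position encoding is assumed to be sinusoidal, all feature sizes and names (`D`, `fuse`, `encode_modality`) are hypothetical, and concatenating $\widetilde{C}$ with $E_{y_{t-1}}$ as the LSTM input is an assumed choice.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared feature width (hypothetical value)


def sinusoidal_positions(length, dim):
    """Sinusoidal position encoding (assumed; the notes only say position info is added)."""
    pos = np.arange(length)[:, None]
    idx = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (idx // 2)) / dim)
    return np.where(idx % 2 == 0, np.sin(angle), np.cos(angle))


def self_attention(x):
    """Plain scaled dot-product self-attention, a stand-in for the XLAN encoder."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def encode_modality(feats, proj):
    """Per-modality encoder stand-in: project, attend, then add position information."""
    h = self_attention(feats @ proj)
    return h + sinusoidal_positions(h.shape[0], h.shape[1])


def fuse(modalities, projections):
    """Concatenate encoded modalities, run the joint encoder, and average-pool."""
    encoded = [encode_modality(f, p) for f, p in zip(modalities, projections)]
    joint = np.concatenate(encoded, axis=0)  # concatenate along the sequence axis
    cross = self_attention(joint)            # cross-modal feature
    c_tilde = cross.mean(axis=0)             # average pooling
    return cross, c_tilde


# Toy inputs: static features, dynamic features, and semantic-tag embeddings.
static = rng.normal(size=(8, 32))
dynamic = rng.normal(size=(8, 24))
tags = rng.normal(size=(5, 12))
inputs = [static, dynamic, tags]
projs = [rng.normal(size=(f.shape[1], D)) * 0.1 for f in inputs]

cross, c_tilde = fuse(inputs, projs)

# Assumed decoder input at step t: pooled feature plus previous word embedding.
prev_word_emb = rng.normal(size=D)
lstm_input = np.concatenate([c_tilde, prev_word_emb])
```

Because the three modalities are concatenated along the sequence axis before the joint encoder, every position can attend to every other position regardless of modality, which is where the cross-modal interaction comes from in this sketch.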
3. Pre-training Tasks
- Tag Alignment Prediction (TAP): with a probability of 50%, the semantic tags of the current video are randomly replaced with other tags, and the model predicts whether the tags have been replaced.
- Mask Language Modeling (MLM): similar to BERT, 15% of the words in the input sentence are randomly masked.
- Video Captioning (VCAP): caption generation.
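How training examples for the first two tasks might be built can be sketched as below. This is a hedged illustration rather than the authors' code: the function names, the `[MASK]` token, and the label convention (1 means replaced) are assumptions, the sketch replaces the whole tag set at once, and it omits BERT's additional random-token and keep-original variants.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def make_tap_example(video_tags, other_tags, rng, replace_prob=0.5):
    """Tag Alignment Prediction: swap in other tags with probability 0.5.

    Returns (tags, label), where label is 1 if the tags were replaced, else 0.
    """
    if rng.random() < replace_prob:
        return list(other_tags), 1
    return list(video_tags), 0


def make_mlm_example(tokens, rng, mask_prob=0.15):
    """Mask Language Modeling: mask each word independently with probability 0.15.

    Returns (masked_tokens, targets); targets holds the original word at
    masked positions and None elsewhere.
    """
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


rng = random.Random(0)
caption = "a man is playing a guitar on stage".split()
tap_tags, tap_label = make_tap_example(["man", "guitar"], ["car", "road"], rng)
mlm_tokens, mlm_targets = make_mlm_example(caption, rng)
```

Masking each word independently gives 15% masked words in expectation only; sampling an exact 15% of positions per sentence would be an equally plausible reading of the description.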
Experimental results
Ablation Studies
Summary: semantic tags act as a bridge between vision and language; the pre-training tasks help the model make full use of multi-modal interactions; and the reinforcement learning strategy improves the model's performance.
Performance Comparison
$P$ indicates that the model uses the pre-training tasks; $RL$ indicates that the reinforcement learning strategy is used.