Paper reading: Semantic Tag Augmented XlanV Model for Video Captioning
2022-07-07 05:34:00 【hei_ hei_ hei_】
Semantic Tag Augmented XlanV Model for Video Captioning
- Published: ACM MM 2021
- Code: ST-XlanV
- Idea: Semantic tags generated by a pre-trained model reduce the gap between modalities and strengthen the XlanV model. Cross-modal attention captures the interactions between dynamic and static features and between visual and semantic features. Three pre-training tasks are designed, among them tag alignment prediction.
Detailed design
My impression is that these ACM MM papers share very similar ideas: they all closely follow the original X-Linear paper and simply extend it to multiple modalities.
1. Semantic Tag Augmented XlanV Model
The overall framework is similar to the earlier ones: each multi-modal feature is passed through its own XLAN encoder to extract high-order features, the results are concatenated and fed into a cross encoder to extract features that contain cross-modal interactions, and finally an LSTM decodes them to generate captions.
2. Cross-modal Attention
Each feature is encoded by its own encoder and then has position information added. The features are then concatenated and fed into a single XLAN encoder, whose output is the cross-modal feature. After average pooling, this feature is fed into the LSTM. In the calculation, $\widetilde{C}$ denotes the feature after average pooling, and $E_{y_{t-1}}$ denotes the embedding of the word output at the previous time step.
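The pooling step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the cross encoder is omitted, the function names `concat_modalities`, `mean_pool`, and `decoder_input` are hypothetical, and forming the LSTM input by concatenating $\widetilde{C}$ with $E_{y_{t-1}}$ is an assumption, since the notes do not spell out the exact equation.

```python
from typing import List

Vector = List[float]


def concat_modalities(features: List[List[Vector]]) -> List[Vector]:
    """Concatenate per-modality feature sequences along the sequence axis."""
    merged: List[Vector] = []
    for seq in features:
        merged.extend(seq)
    return merged


def mean_pool(seq: List[Vector]) -> Vector:
    """Average a sequence of equal-length vectors into a single vector."""
    if not seq:
        raise ValueError("cannot pool an empty sequence")
    dim = len(seq[0])
    return [sum(vec[i] for vec in seq) / len(seq) for i in range(dim)]


def decoder_input(pooled: Vector, prev_word_embedding: Vector) -> Vector:
    """Join the pooled feature with the previous word embedding (assumed form)."""
    return pooled + prev_word_embedding


# Toy encoded features for two modalities, feature dimension 2.
dynamic_feats = [[1.0, 2.0], [3.0, 4.0]]
static_feats = [[5.0, 6.0]]

# In the model a cross encoder would transform this sequence; it is skipped here.
cross_modal = concat_modalities([dynamic_feats, static_feats])
c_tilde = mean_pool(cross_modal)            # stands in for the pooled feature
x_t = decoder_input(c_tilde, [0.5, 0.5])    # assumed LSTM input at step t
```

Concatenating along the sequence axis, rather than the feature axis, lets modalities with different numbers of frames or tags share one encoder, provided their feature dimensions match.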
3. Pre-training Tasks
- Tag Alignment Prediction (TAP): with a probability of 50%, the semantic tags of the current video are randomly replaced with other tags, and the model predicts whether the tags have been replaced
- Masked Language Modeling (MLM): similar to BERT, 15% of the words in the input sentence are randomly masked
- Video Captioning (VCAP): caption generation
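How training samples for the first two tasks might be built can be sketched as follows. This is a minimal sketch, not the authors' implementation: the helper names, the `[MASK]` token, and the choice to swap the whole tag set at once are assumptions, and BERT's additional keep/random-replace variants for masked positions are left out for brevity.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder token


def make_tap_sample(
    tags: List[str],
    other_tags: List[str],
    rng: random.Random,
    replace_prob: float = 0.5,
) -> Tuple[List[str], int]:
    """Return (tags, label); label is 1 if the tags were replaced, else 0."""
    if rng.random() < replace_prob:
        return list(other_tags), 1
    return list(tags), 0


def mask_tokens(
    tokens: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask each token independently; targets hold the original masked words."""
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)      # the model must predict this word
        else:
            masked.append(tok)
            targets.append(None)     # no prediction loss at this position
    return masked, targets


rng = random.Random(0)
video_tags = ["dog", "run", "park"]            # toy tags of the current video
distractor_tags = ["cook", "kitchen", "pan"]   # toy tags from another video
tap_tags, tap_label = make_tap_sample(video_tags, distractor_tags, rng)
caption = "a dog is running in the park".split()
masked_caption, mlm_targets = mask_tokens(caption, rng)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which makes it easier to check that roughly half of the TAP samples are replaced and roughly 15% of the words are masked.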
Experimental results
Ablation Studies
Summary: semantic tags act as a bridge between vision and language; the pre-training tasks help the model make full use of multi-modal interactions; reinforcement learning strategies improve the model's performance.
Performance Comparison
$P$ indicates that the model uses the pre-training tasks; $RL$ indicates that a reinforcement learning strategy is used.