
Paper reading [Semantic Tag Augmented XlanV Model for Video Captioning]

2022-07-07 05:34:00 【hei_ hei_ hei_】

Semantic Tag Augmented XlanV Model for Video Captioning

Published: ACM MM 2021
Code: ST-XlanV
Idea: Semantic tags generated by a pre-trained model are used to reduce the gap between modalities and strengthen the XlanV model. Cross-modal attention captures the interactions between dynamic and static features and between visual and semantic features. Three pre-training tasks are designed for tag alignment.

Detailed design

My impression is that the ideas in these ACM MM papers are very similar: they all closely follow the original X-Linear paper and simply extend it to multiple modalities.

1. Semantic Tag Augmented XlanV Model

The overall framework is similar to the previous one: each multi-modal feature is passed through its own XLAN encoder to extract high-order features, the results are concatenated and fed into a cross encoder to extract features that contain cross-modal interactions, and finally an LSTM decodes them to generate captions.

2. Cross-modal Attention

Each feature is encoded by its own encoder, after which position information is added. The encoded features are then concatenated and fed into an XLAN encoder, whose output is the cross-modal feature. This feature is average-pooled and fed into the LSTM. The specific calculation is as follows:
[Equation image not recovered]
$\widetilde C$ denotes the feature after average pooling, and $E_{y_{t-1}}$ denotes the embedding of the word output at the previous time step.
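The pooling step can be illustrated with a small sketch. It is not the authors' implementation. It assumes the per-modality features are joined along the sequence axis, it skips the XLAN cross encoder entirely (the joined sequence stands in for its output), and it assumes the LSTM input is the pooled feature concatenated with the previous word embedding. The function names and toy values are hypothetical.

```python
def concat_modalities(*feature_seqs):
    """Join per-modality feature sequences along the sequence axis (assumed)."""
    joined = []
    for seq in feature_seqs:
        joined.extend(seq)
    return joined


def mean_pool(seq):
    """Average a sequence of equal-length vectors into a single vector."""
    n = len(seq)
    dim = len(seq[0])
    return [sum(vec[d] for vec in seq) / n for d in range(dim)]


def lstm_input(pooled, prev_word_embedding):
    """Assumed LSTM input: pooled feature followed by the previous word embedding."""
    return pooled + prev_word_embedding


# Toy 2-dimensional features for three modalities (hypothetical values).
static_feats = [[1.0, 2.0], [3.0, 4.0]]
dynamic_feats = [[5.0, 6.0]]
tag_feats = [[7.0, 8.0]]

# Stand-in for the cross encoder output: the encoder itself is omitted here.
cross_modal = concat_modalities(static_feats, dynamic_feats, tag_feats)
c_tilde = mean_pool(cross_modal)          # plays the role of \widetilde C
x_t = lstm_input(c_tilde, [0.5, -0.5])    # [0.5, -0.5] plays the role of E_{y_{t-1}}
```

In a real model the vectors would be tensors and the cross encoder would transform the joined sequence before pooling. The sketch only shows how one pooled vector summarizes all modalities at each decoding step.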

3. Pre-training Tasks

Tag Alignment Prediction (TAP): with a probability of 50%, the semantic tags of the current video are randomly replaced with other tags, and the model predicts whether the tags have been replaced.
Mask Language Modeling (MLM): similar to BERT, 15% of the words in the input sentence are randomly masked.
Video Captioning (VCAP): caption generation.
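The way TAP and MLM training examples might be built can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code. It assumes the whole tag set is replaced at once, that replacement tags come from a pool of tags belonging to other videos, and that each word is masked independently with a `[MASK]` placeholder. The function names `make_tap_example` and `mask_tokens` are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder token, in the style of BERT


def make_tap_example(tags, tag_pool, rng, replace_prob=0.5):
    """Build one Tag Alignment Prediction example.

    With probability `replace_prob`, the video's tags are swapped for tags
    drawn from `tag_pool` that do not belong to the video. The label is 1
    if the tags were replaced and 0 if they are the original tags.
    """
    if rng.random() < replace_prob:
        candidates = [t for t in tag_pool if t not in tags]
        replaced = rng.sample(candidates, k=min(len(tags), len(candidates)))
        return replaced, 1
    return list(tags), 0


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Mask each token independently with probability `mask_prob`.

    Returns the masked sequence and a dict {position: original token}
    holding the words the model would have to predict.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


rng = random.Random(0)  # seeded for repeatability
tags = ["dog", "run", "grass"]
tag_pool = ["dog", "run", "grass", "car", "road", "cook", "kitchen", "sing"]
tap_tags, tap_label = make_tap_example(tags, tag_pool, rng)

caption = "a dog is running on the grass".split()
masked_caption, mlm_targets = mask_tokens(caption, rng)
```

Masking each word independently gives 15% masked words only on average. An implementation that needs exactly 15% per sentence would sample a fixed number of positions instead.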

Experimental results

Ablation Studies

Summary: semantic tags act as a bridge between vision and language; the pre-training tasks help the model make full use of multi-modal interactions; a reinforcement learning strategy can improve the model's performance.
Performance Comparison

$P$ indicates that the model uses the pre-training tasks; $RL$ indicates that a reinforcement learning strategy is used.