
Paper Reading: Semantic Tag Augmented XlanV Model for Video Captioning

2022-07-07 05:34:00 hei_ hei_ hei_

Semantic Tag Augmented XlanV Model for Video Captioning

  • Published at: ACM MM 2021
  • Code: ST-XlanV
  • Idea: pre-train the model to generate semantic tags, reducing the gap between modalities and strengthening the XlanV model. Cross-modal attention captures the interactions between dynamic & static features and between visual & semantic features. Three pre-training tasks are designed for tag alignment.

Detailed design

[Figure: overall architecture of the ST-XlanV model]
The ACM MM papers in this line feel very similar: they all closely follow the original X-Linear attention work, simply extending it to the multimodal setting.

1. Semantic Tag Augmented XlanV Model

The overall framework resembles the previous work: each multi-modal feature is passed through its own XLAN encoder to extract high-order features; the outputs are then concatenated and fed into a cross encoder to extract features containing cross-modal interactions, and finally an LSTM decodes them to generate captions.
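The data flow described above can be sketched minimally in NumPy. The stand-in `encoder` (a single projection with ReLU) is an assumption, not the actual XLAN block, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    """Stand-in for an XLAN/cross encoder: one projection + ReLU (assumption)."""
    return np.maximum(x @ w, 0.0)

d = 8                               # hidden size (illustrative)
motion = rng.normal(size=(10, d))   # dynamic (motion) features, 10 clips
appear = rng.normal(size=(10, d))   # static (appearance) features
tags   = rng.normal(size=(5, d))    # semantic tag embeddings

# 1) each modality goes through its own encoder
w_m, w_a, w_t = (rng.normal(size=(d, d)) for _ in range(3))
h = [encoder(motion, w_m), encoder(appear, w_a), encoder(tags, w_t)]

# 2) concatenate along the sequence axis and run a cross encoder
fused = np.concatenate(h, axis=0)              # (25, d)
cross = encoder(fused, rng.normal(size=(d, d)))

# 3) average pooling yields the global context fed to the LSTM decoder
c_tilde = cross.mean(axis=0)                   # (d,)
print(c_tilde.shape)                           # (8,)
```

The real model would replace `encoder` with the X-Linear attention block and the final step with an LSTM decoding loop.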

2. Cross-modal Attention

Each feature is first encoded by its own encoder and augmented with positional information; the features are then concatenated and fed into an XLAN encoder, whose output is the cross-modal feature. After average pooling, it is fed into the LSTM. The computation is as follows:
[Figure: cross-modal attention equations]
$\widetilde{C}$ denotes the feature after average pooling, and $E_{y_{t-1}}$ denotes the embedding of the word output at the previous time step.
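Since the equation figure did not survive extraction, here is a hedged sketch of the decoder-side step: plain dot-product attention stands in for the X-Linear block, and the exact way $\widetilde{C}$ and $E_{y_{t-1}}$ are combined into the LSTM input is an assumption:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 8
cross  = rng.normal(size=(25, d))  # cross-modal features from the encoder
h_prev = rng.normal(size=(d,))     # previous LSTM hidden state
E_prev = rng.normal(size=(d,))     # embedding E_{y_{t-1}} of the last word

# attention of the decoder state over the cross-modal features
# (assumption: dot-product attention in place of the X-Linear block)
alpha = softmax(cross @ h_prev)    # (25,) attention weights, sums to 1
attended = alpha @ cross           # (d,)  attended cross-modal context

c_tilde = cross.mean(axis=0)       # average-pooled context \widetilde{C}
x_t = np.concatenate([c_tilde, E_prev, attended])  # LSTM input at step t
print(x_t.shape)                   # (24,)
```

One LSTM step would then consume `x_t` to produce the next word distribution.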

3. Pre-training Tasks

  • Tag Alignment Prediction (TAP): with probability 50%, randomly replace the semantic tags of the current video with tags from another video, and predict whether the tags have been replaced.
  • Masked Language Modeling (MLM): similar to BERT, randomly mask 15% of the words in the input sentence and predict them.
  • Video Captioning (VCAP): standard caption generation.
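The data preparation for the TAP and MLM tasks above can be sketched in pure Python. Function names and the `None` label convention for unmasked positions are illustrative assumptions:

```python
import random

def make_tap_example(video_tags, tag_pool, rng):
    """Tag Alignment Prediction: with probability 0.5, swap in the tags of
    another video; the binary label records whether a swap happened."""
    if rng.random() < 0.5:
        return rng.choice(tag_pool), 1   # tags were replaced
    return video_tags, 0                 # original tags kept

def mask_tokens(tokens, rng, mask_rate=0.15, mask_token="[MASK]"):
    """Masked Language Modeling: mask ~15% of caption tokens, BERT-style."""
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            out.append(mask_token)
            labels.append(tok)           # model must recover the original
        else:
            out.append(tok)
            labels.append(None)          # no loss on unmasked positions
    return out, labels

rng = random.Random(0)
tags, label = make_tap_example(["dog", "run"], [["cat", "sit"]], rng)
masked, labels = mask_tokens("a dog runs across the park".split(), rng)
print(label, masked)
```

VCAP needs no special data preparation: it is ordinary caption generation on the same video-caption pairs.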

Experimental results

  • Ablative Studies
     [Table: ablation study results]
    Summary: semantic tags serve as a bridge between vision and language; the pre-training tasks help the model fully exploit multi-modal interactions; and the reinforcement learning strategy further improves performance.

  • Performance Comparison
     [Table: performance comparison with prior methods]
    $P$ indicates the model uses the pre-training tasks; $RL$ indicates the reinforcement learning strategy is used.

Copyright notice

This article was written by [hei_ hei_ hei_]. Please include the original link when reposting:
https://yzsam.com/2022/188/202207062335134335.html