
Reading the paper: Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention

2022-07-07 05:34:00 hei_hei_hei_

Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention

Summary

  • Published at: ACM MM 2021
  • Code: MMAC
  • Idea: This paper proposes a new video captioning task, egocentric video captioning (first-person view, as opposed to the usual third-person view), which lends itself to close-range visual description. At the same time, to alleviate motion blur, occlusion, and similar problems caused by the wearable device, sensor data is used as an auxiliary signal for captioning.
    The network has two main modules: the AMMT module merges the visual feature $h_V$ and the sensor feature $h_S$ into a merged feature $h_{V+S}$; the three features ($h_V$, $h_S$, $h_{V+S}$) are then fed into the DMA module, which learns to attend to them selectively. The selected feature is finally fed into a GRU to generate words (a rough sketch of this flow follows the list).
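To make the data path concrete, here is a rough PyTorch-style sketch of one decoding step under the flow described above. The `ammt` and `dma` modules are placeholders (sketched further below), and details not covered in this note (word embeddings as GRU input, feature dimensions, beam search, etc.) are omitted; this is not the authors' released code.

```python
import torch.nn as nn

class SensorAugmentedCaptioner(nn.Module):
    """Flow from the summary: AMMT fuses h_V and h_S, DMA selects among the
    three features, and a GRU decoder produces the next word."""
    def __init__(self, ammt, dma, hidden_dim, vocab_size):
        super().__init__()
        self.ammt = ammt                          # (h_V, h_S) -> h_{V+S}
        self.dma = dma                            # ([h_V, h_S, h_{V+S}], state) -> context
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.vocab_out = nn.Linear(hidden_dim, vocab_size)

    def step(self, h_v, h_s, state):
        h_vs = self.ammt(h_v, h_s)                # merged feature h_{V+S}
        ctx = self.dma([h_v, h_s, h_vs], state)   # dynamically selected feature
        state = self.gru(ctx, state)              # one GRU decoding step
        return self.vocab_out(state), state       # logits over the vocabulary
```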

Detailed design

1. Feature extraction

  • Visual feature $h_V$: VGG16
  • Sensor feature $h_S$: LSTM (over the sequential sensor data); a small sketch of both extractors follows this list
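
A minimal sketch of the two extractors, assuming ImageNet-pretrained VGG16 features averaged over frames and the last hidden state of an LSTM over the raw sensor sequence; the dimensions and pooling here are my assumptions, not taken from the paper.

```python
import torch.nn as nn
from torchvision.models import vgg16

class FeatureExtractors(nn.Module):
    """h_V from video frames via VGG16, h_S from the sensor time series via an LSTM."""
    def __init__(self, sensor_dim=6, feat_dim=512):
        super().__init__()
        backbone = vgg16(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(sensor_dim, feat_dim, batch_first=True)

    def forward(self, frames, sensors):
        # frames: (B, T, 3, H, W); sensors: (B, T_s, sensor_dim), e.g. accelerometer/gyro readings
        b, t = frames.shape[:2]
        h_v = self.cnn(frames.flatten(0, 1)).view(b, t, -1).mean(dim=1)  # (B, 512), averaged over frames
        _, (h_n, _) = self.lstm(sensors)
        h_s = h_n[-1]                                                    # (B, feat_dim), last hidden state
        return h_v, h_s
```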

2. Asymmetric Multi-modal Transformation (AMMT)

In essence, this is feature merging.
Source: FiLM: Visual Reasoning with a General Conditioning Layer; see feature-wise linear modulation (FiLM) for the underlying idea.
Note: the transformation is initialized with $W_c = I$, $b_c = 0$, which makes it equivalent to plain concatenation at the start; as training proceeds, it learns how to merge the two modalities.

Note that there are three output features here:
(1) the visual feature $h_V$
(2) the sensor feature $h_S$
(3) the merged feature $h_{V+S}$

  • Why the transformation is asymmetric
    On the one hand, it alleviates overfitting caused by data redundancy; on the other hand, sensor data sometimes contains unwanted noise and therefore needs to be modulated (a sketch of the AMMT module follows below).
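
A minimal sketch of how I read the asymmetric merge: only the sensor feature is modulated (FiLM-style, conditioned on the visual feature), and the merge layer is initialized with $W_c = I$, $b_c = 0$ so that it starts as plain concatenation. The exact formulation in the paper may differ; the layer shapes and the direction of modulation here are assumptions.

```python
import torch
import torch.nn as nn

class AMMT(nn.Module):
    """Asymmetric Multi-modal Transformation: FiLM-modulate h_S with h_V, then merge."""
    def __init__(self, dim=512):
        super().__init__()
        self.gamma = nn.Linear(dim, dim)      # FiLM scale, conditioned on h_V
        self.beta = nn.Linear(dim, dim)       # FiLM shift, conditioned on h_V
        self.merge = nn.Linear(2 * dim, 2 * dim)
        # W_c = I, b_c = 0: the merge is initialized as the identity over [h_V; h_S'],
        # i.e. plain concatenation; training then learns a richer fusion.
        nn.init.eye_(self.merge.weight)
        nn.init.zeros_(self.merge.bias)

    def forward(self, h_v, h_s):
        h_s_mod = self.gamma(h_v) * h_s + self.beta(h_v)      # only the sensor side is adjusted
        return self.merge(torch.cat([h_v, h_s_mod], dim=-1))  # merged feature h_{V+S}
```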

3. Dynamic Modal Attention (DMA)

Dynamically selects which of the three features to attend to.
Gumbel-Softmax is used here, so the discrete choice of modality remains differentiable during training.
Note: the reason for keeping three candidate features is that in many cases it is desirable to use only a single modality (for example, when the sensor data contains unwanted noise). A sketch of the DMA module follows.
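
A minimal sketch of the dynamic selection with Gumbel-Softmax, assuming the selection logits are computed from the decoder's hidden state and that hard (one-hot) selection is used; both choices are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMA(nn.Module):
    """Dynamic Modal Attention: choose among {h_V, h_S, h_{V+S}} at each decoding step."""
    def __init__(self, feat_dims, hidden_dim=512):
        super().__init__()
        # project each candidate feature to a shared size
        self.proj = nn.ModuleList([nn.Linear(d, hidden_dim) for d in feat_dims])
        self.scorer = nn.Linear(hidden_dim, len(feat_dims))  # selection logits from the decoder state

    def forward(self, feats, state, tau=1.0):
        # feats: list of (B, d_i) tensors; state: (B, hidden_dim) decoder hidden state
        cands = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, 3, H)
        logits = self.scorer(state)                                           # (B, 3)
        # Gumbel-Softmax keeps the discrete modality choice differentiable
        weights = F.gumbel_softmax(logits, tau=tau, hard=True)                # (B, 3), ~one-hot
        return (weights.unsqueeze(-1) * cands).sum(dim=1)                     # selected feature
```

With `hard=True`, the forward pass uses a one-hot choice (a single modality) while gradients flow through the soft probabilities, which matches the point above that a single modality is sometimes preferable.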
