Reading the paper [Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention]

2022-07-07 05:34:00 【hei_ hei_ hei_】

Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention

Summary

Published: ACM MM 2021
Code: MMAC
Idea: This paper proposes a new video captioning task, egocentric (first-person) video captioning, whose close-up view allows finer-grained descriptions than a third-person view. To mitigate motion blur, occlusion, and similar problems caused by the recording device, sensor data is used as auxiliary input for captioning.
The network has two main modules. The AMMT module merges the visual features $h_V$ and the sensor features $h_S$ into merged features $h_{V+S}$. All three features ($h_V$, $h_S$, $h_{V+S}$) are then fed into the DMA module, which learns to attend to them selectively. The result is passed to a GRU for word generation.

Detailed design

1. Feature extraction

Visual features $h_V$: VGG16
Sensor features $h_S$: LSTM (sequential data)

2. Asymmetric Multi-modal Transformation（AMMT）

In essence, this module performs feature fusion.
Source: FiLM: Visual Reasoning with a General Conditioning Layer. The key concept to look up is feature-wise linear modulation.
Note: the parameters are initialized as $W_c=I, b_c=0$, so the module starts out as plain concatenation; as training proceeds, it learns how to merge the two features.

Note that this module outputs three kinds of features:
(1) Visual features $h_V$
(2) Sensor features $h_S$
(3) Merged features $h_{V+S}$

Why the transformation is asymmetric
On the one hand, it alleviates the overfitting caused by redundant data; on the other hand, sensor data sometimes contains unwanted noise and therefore needs to be modulated.
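
The merging step in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the exact formula was in a figure that is not reproduced in these notes, so the sketch assumes that only the sensor features are modulated FiLM-style (scale `gamma`, shift `beta`) and that the result is concatenated with the visual features and passed through an affine map `W_c`, `b_c`. The function names (`film`, `affine`, `ammt`) are hypothetical, and `gamma` and `beta` are passed in as plain vectors because how they are produced is not stated here.

```python
def identity(n):
    """Return an n x n identity matrix as a list of lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def affine(W, b, x):
    """Compute W @ x + b for a list-of-lists matrix W and list vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def film(h, gamma, beta):
    """Feature-wise linear modulation: gamma * h + beta, element by element."""
    return [g * x + b for g, x, b in zip(gamma, h, beta)]


def ammt(h_v, h_s, gamma, beta, W_c, b_c):
    """Hypothetical AMMT-style merge.

    Assumption: the transformation is asymmetric in that only the sensor
    features h_s are modulated; the visual features h_v pass through as-is.
    Returns the three features (h_V, h_S, h_{V+S}).
    """
    h_s_mod = film(h_s, gamma, beta)          # modulate the sensor branch only
    h_vs = affine(W_c, b_c, h_v + h_s_mod)    # list "+" is concatenation here
    return h_v, h_s, h_vs


if __name__ == "__main__":
    h_v, h_s = [0.5, -1.0], [2.0, 3.0]
    # With gamma = 1, beta = 0, W_c = I, b_c = 0 the merge is plain concatenation.
    _, _, h_vs = ammt(h_v, h_s, [1.0, 1.0], [0.0, 0.0], identity(4), [0.0] * 4)
    print(h_vs)
```

With `W_c` set to the identity and `b_c` to zero (and neutral `gamma`, `beta`), the merged feature is just the concatenation of the two inputs, which matches the initialization note above; training would then move the parameters away from that starting point.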

3. Dynamic Modal Attention (DMA)

Dynamically selects which of the three features to attend to.

Gumbel-Softmax is used here.
Note: the reason for keeping all three features is that in many cases it is preferable to use only a single modality (for example, when the sensor data contains unwanted noise).
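
To make the selection step more concrete, here is a minimal sketch of Gumbel-Softmax weighting over three modality features, written in plain Python. It rests on several assumptions that are not in these notes: the three features are taken to be already projected to the same dimension, the per-modality scores (`logits`) are given as inputs because the way the paper computes them was in a figure not reproduced here, and the function names (`sample_gumbel`, `gumbel_softmax`, `dynamic_modal_attention`) are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via -log(-log(u)), u ~ Uniform(0, 1)."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # clamp to keep both logs finite
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Return soft, nearly one-hot weights: softmax((logits + noise) / tau)."""
    if rng is None:
        rng = random.Random()
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(noisy)                        # subtract the max for stability
    exps = [math.exp(z - m) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_modal_attention(features, logits, tau=1.0, rng=None):
    """Weight the modality features (h_V, h_S, h_{V+S}) and sum them.

    Assumption: all features share one dimension; logits has one score
    per modality.
    """
    weights = gumbel_softmax(logits, tau, rng)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, fused


if __name__ == "__main__":
    rng = random.Random(0)
    feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy h_V, h_S, h_{V+S}
    weights, fused = dynamic_modal_attention(feats, [2.0, 0.1, 1.0], tau=0.5, rng=rng)
    print(weights, fused)
```

A lower temperature `tau` pushes the weights closer to a one-hot choice, which is how a single modality can end up dominating when, for example, the sensor stream is noisy.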