
Paper reading: Learning to Discretely Compose Reasoning Module Networks for Video Captioning

2022-07-01 19:24:00 【hei_ hei_ hei_】

Learning to Discretely Compose Reasoning Module Networks for Video Captioning

1. Summary

Published: IJCAI 2020
Code: https://github.com/tgc1997/RMN
Idea: The authors argue that a video description is generated step by step. To produce a sentence, the model first locates and describes the subject, then reasons about the action, and then locates and describes the object. The authors argue that this process requires complex spatio-temporal reasoning. For reasoning, they design three modules, locate, relate, and func, used respectively to locate targets (2D), reason about relations (3D), and generate function words (such as "a", "the", "and"). For selection, they design a Module Selector that picks one of these modules when generating the next word.

2. Detailed design

2.1 Encoder

Feature extraction: A 2D-CNN, a 3D-CNN, and an R-CNN extract the video's appearance features $V_a$, motion features $V_m$, and object features $V_o$, respectively. Note that $V_o$ carries location information (this shows up in the code).
Feature processing: $V_a$ and $V_m$ are passed through Bi-LSTMs to incorporate temporal information into the features.
Guidance for the whole network, $h_t^{en}$: the hidden-state output of an LSTM. Its inputs are the global visual feature $\bar v$, the embedding of the word generated at the previous step, and the previous step's hidden state.

2.2 Reasoning Modules

All reasoning modules are built on the attention computation from "Neural machine translation by jointly learning to align and translate" (ICLR 2015).
Attention defined this way can be applied along a chosen dimension. To better model the spatial and temporal directions, the authors define attention over the spatial dimension and over the temporal dimension separately: $AoS(\cdot)$ and $AoT(\cdot)$.
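
As a rough illustration of attention along a chosen dimension, the sketch below implements additive attention in NumPy and applies it first over the spatial axis (AoS) and then over the temporal axis (AoT). The weight names `W`, `U`, `w`, the tensor shapes, and the helper names are assumptions made for this example, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(values, query, W, U, w, axis):
    """Additive attention over one axis of `values` (hypothetical helper).

    values: (..., d_v) features; query: (d_q,) vector such as h_t^{en}.
    `axis` is a non-negative index into values.shape[:-1].
    """
    # score = w^T tanh(W v + U q), one scalar per feature vector
    scores = np.tanh(values @ W + query @ U) @ w
    alpha = softmax(scores, axis=axis)           # weights sum to 1 along `axis`
    return (alpha[..., None] * values).sum(axis=axis)

rng = np.random.default_rng(0)
T, N, d_v, d_q, d_k = 4, 3, 5, 6, 7              # frames, regions, sizes (made up)
V_o = rng.normal(size=(T, N, d_v))               # toy object features
h_en = rng.normal(size=d_q)                      # toy query
W = rng.normal(size=(d_v, d_k))
U = rng.normal(size=(d_q, d_k))
w = rng.normal(size=d_k)

per_frame = additive_attention(V_o, h_en, W, U, w, axis=1)     # AoS -> (T, d_v)
pooled = additive_attention(per_frame, h_en, W, U, w, axis=0)  # AoT -> (d_v,)
print(per_frame.shape, pooled.shape)
```

The only difference between the two calls is the axis over which the softmax normalizes, which is the sense in which the same attention can be pointed at space or at time.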

Locate Module
Mainly used to generate object words such as "man" and "basketball". The module needs to attend to region information in both space and time, so the authors first feed $V_o$ into $AoS(\cdot)$ and then feed the result, together with $V_a$, into $AoT(\cdot)$.

Here $\bigoplus$ denotes the concatenation operation used to combine the two.
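
Reading the description of the Locate Module as a formula gives roughly the following. The symbol $v_t^{loc}$ for the module output and the exact argument order are assumptions of this sketch, not copied from the paper.

$$
v_t^{loc} = AoT\big(V_a \bigoplus AoS(V_o, h_t^{en}),\ h_t^{en}\big)
$$
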
Relate Module
Mainly used to generate verbs such as "shooting" and "riding". To generate the verb "shooting", for example, the model needs to notice how the objects' states change across scenes. The Relate Module therefore takes the $V_o$ features after spatial attention, pairs up any two of them, and then applies temporal attention.
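
The pairing step can be sketched as follows: given one spatially attended object feature per frame, build every ordered pair by concatenation before temporal attention is applied. Whether the original code keeps self-pairs or both orderings is not stated here, so that choice, the function name `pair_features`, and the shapes are assumptions.

```python
import numpy as np

def pair_features(feats):
    """Concatenate every ordered pair of rows of `feats` (hypothetical helper).

    feats: (T, d) array, one attended object feature per frame.
    Returns a (T * T, 2 * d) array whose row i * T + j is [feats[i], feats[j]].
    """
    T, d = feats.shape
    left = np.broadcast_to(feats[:, None, :], (T, T, d))   # left[i, j] = feats[i]
    right = np.broadcast_to(feats[None, :, :], (T, T, d))  # right[i, j] = feats[j]
    return np.concatenate([left, right], axis=-1).reshape(T * T, 2 * d)

feats = np.arange(6, dtype=float).reshape(3, 2)   # toy features for 3 frames
pairs = pair_features(feats)
print(pairs.shape)   # (9, 4)
print(pairs[1])      # frame 0 paired with frame 1
```

Temporal attention would then run over the rows of `pairs`, so each candidate it weighs describes two moments at once.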
Func Module
Mainly used to generate function words that make the sentence coherent, such as "of" and "and". This needs no visual information, only language information, so AoT is applied to the history of the decoder LSTM's cell states.

All three modules are built around the attention operation introduced at the start of this section, with $h_t^{en}$ serving as the attention query Q.

Module Selector

During generation, the word produced at each step can come from only one of the three modules above, so a selection module is needed to choose among them. Concretely, it scores each module and picks the one with the highest score.
Because the max operation is non-differentiable, the authors use an approximation that relaxes the one-hot vector $z_t$ into a continuous vector $\tilde {z_t}$.
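
The paper describes the selector as trained with a Gumbel approximation. A minimal NumPy sketch of the Gumbel-Softmax relaxation is shown below; the temperature `tau`, the example scores, and the function name are assumptions, and the authors' implementation may differ in its details.

```python
import numpy as np

def gumbel_softmax(scores, tau=1.0, rng=None):
    """Relax a discrete choice over `scores` into a soft probability vector.

    scores: (K,) unnormalized module scores (treated as logits, an assumption).
    tau: temperature; smaller values push the output toward one-hot.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-10, high=1.0, size=scores.shape)
    g = -np.log(-np.log(u))              # Gumbel(0, 1) noise
    y = (scores + g) / tau
    y = y - y.max()                      # stabilize the exponentials
    e = np.exp(y)
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])      # made-up scores for locate, relate, func
z_soft = gumbel_softmax(scores, tau=1.0, rng=np.random.default_rng(0))
z_hard = np.eye(3)[z_soft.argmax()]      # one-hot version of the same choice
print(z_soft, z_hard)
```

In an autodiff framework, the soft vector would carry the gradient while the one-hot vector makes the discrete choice.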

The final result of visual reasoning is the inner product, written $\bigotimes$, of $\tilde {z_t}$ with the outputs of the three modules.
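
Written out, and assuming the three module outputs are named $v_t^{loc}$, $v_t^{rel}$, and $v_t^{func}$ (names chosen for this sketch), the combination would look like:

$$
v_t = \tilde{z_t} \bigotimes \left[ v_t^{loc},\ v_t^{rel},\ v_t^{func} \right] = \sum_{k \in \{loc,\, rel,\, func\}} \tilde{z}_{t,k}\, v_t^{k}
$$

When $\tilde{z_t}$ is close to one-hot, $v_t$ is close to the output of the selected module.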

Decoder

A single LSTM is used for decoding. Its inputs are the visual reasoning result $v_t$ and the encoder's hidden state.
The visual information and the hidden state are then passed through an MLP, which outputs a probability distribution over the vocabulary, from which the generated word is obtained.

Training

Caption Loss: cross-entropy loss
Measures the accuracy of the generated sentence.

$T$ denotes the length of the sentence.
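
For reference, a standard form of this cross-entropy loss is sketched below. The symbols $w_t$ for the ground-truth word at step $t$ and $\mathcal{V}$ for the input video are notation assumed here.

$$
L_{cap} = -\sum_{t=1}^{T} \log p\left(w_t \mid w_{<t}, \mathcal{V}\right)
$$
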
POS Loss: KLD loss
Measures the accuracy of the selection module. Specifically, the sentence's POS (part-of-speech) tags are converted to one-hot codes, and a KLD (Kullback-Leibler divergence) loss measures the similarity between the two distributions. In the actual code this is also implemented with a cross-entropy loss.
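
The reason cross-entropy can stand in for KLD here follows from the one-hot target. With a target distribution $p$ that is one-hot at the true POS class $k$ and a selector distribution $q$ (notation assumed for this sketch), and using the convention $0 \log 0 = 0$:

$$
KL(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} = 1 \cdot \log \frac{1}{q_k} = -\log q_k = CE(p, q)
$$

The two losses therefore coincide for one-hot targets, and minimizing either gives the same gradients.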
The final loss combines the caption loss and the POS loss.