Paper reading: Open-book Video Captioning with Retrieve-Copy-Generate Network

2022-07-07 05:34:00 【hei_ hei_ hei_】

Open-book Video Captioning with Retrieve-Copy-Generate Network

Summary

Published: CVPR 2021
Idea: The authors argue that earlier methods produce monotonous captions because generation lacks guidance, and that because the training dataset is fixed, the knowledge a trained model acquires cannot be extended. Their answer is a video-to-text retrieval task: sentences are retrieved from a corpus and used to guide caption generation, much like an open-book exam (an open-domain mechanism).

Detailed design

1. Effective Video-to-Text Retriever

Every sentence in the corpus is mapped to a $d$-dimensional vector by a textual encoder, and each video is mapped to $d$ dimensions by a visual encoder. The similarity between the two is the selection criterion.
Textual Encoder: bi-LSTM

Note: $L$ is the sentence length, $W_s$ is a learnable embedding matrix, and $\eta_s$ denotes the LSTM parameters.

The length-$L$ sentence is then aggregated into a single $d$-dimensional vector:
$v_s$ is the aggregation parameter.
Visual Encoder: appearance features and motion features

$v_a, v_m$ are the aggregation parameters.
Video-to-text similarity:
The $k$ most similar sentences are retrieved and used as the guiding sentences.
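The retrieval step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each video and each sentence has already been encoded into a single nonzero $d$-dimensional vector, and it assumes cosine similarity as the score. The function names `cosine` and `retrieve_top_k` and the toy vectors are made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_top_k(video_vec, sentence_vecs, k):
    """Return the indices of the k sentences most similar to the video, best first."""
    scores = [cosine(video_vec, s) for s in sentence_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy d=3 embeddings (made-up numbers, not taken from the paper).
video = [1.0, 0.0, 1.0]
corpus = [
    [1.0, 0.1, 0.9],  # nearly parallel to the video vector
    [0.0, 1.0, 0.0],  # orthogonal to the video vector
    [0.5, 0.5, 0.5],  # somewhere in between
]
top = retrieve_top_k(video, corpus, k=2)
print(top)
```

Because the sentence vectors do not depend on the query video, they could be computed once for the whole corpus and reused for every video.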

2. Copy-mechanism Caption Generator

A Hierarchical Caption Decoder generates the caption. At every step, a Dynamic Multi-pointers Module decides whether to copy a word from the guiding sentences.

2.1 Hierarchical Caption Decoder

The decoder consists of an attention-LSTM and a language-LSTM. The attention-LSTM attends to the visual features; the language-LSTM aggregates the current state and the visual context to produce a probability distribution over the vocabulary, $p_{voc}$.

attention-LSTM

$x = [x^m;x^a]$, and $y_{t-1}$ is the word generated at the previous step.
language-LSTM

$W_{voc}, b_{voc}$ are learnable parameters.
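As a rough sketch of the last step of the decoder, the vocabulary distribution $p_{voc}$ can be computed by applying a learnable linear layer to the language-LSTM hidden state and then a softmax. The linear-plus-softmax form is an assumption here (the original equation is not reproduced in these notes), and the names `vocab_distribution`, `W_voc`, `b_voc` and the toy numbers are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vocab_distribution(h, W_voc, b_voc):
    """Assumed form: p_voc = softmax(W_voc @ h + b_voc)."""
    logits = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(W_voc, b_voc)]
    return softmax(logits)

# Toy setup: hidden size 2, vocabulary of 3 words (made-up numbers).
h = [0.5, -1.0]
W_voc = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]
b_voc = [0.0, 0.0, 0.1]
p_voc = vocab_distribution(h, W_voc, b_voc)
print(p_voc)
```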

2.2 Dynamic Multi-pointers Module

Premise: $K$ candidate sentences have already been retrieved, and each sentence has $L$ words.

Each sentence is processed separately. The decoder hidden state $h^l_t$ is used as the query for attention over the $L$ words of the sentence, which yields an attention probability distribution over those $L$ words.

$p_{ret,i}$ is the attention distribution over the words of the $i$-th sentence; $c_{i,t}^r$ is the resulting weighted sum.
Decide whether to copy the selected word:
Obtain the final probability distribution over all words ($p_{ret}$ is expanded and $p_{copy}$ is broadcast):
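The three steps of the module (attend over each retrieved sentence, decide how much to copy, mix with the generated distribution) can be sketched as below. This is only one plausible reading under stated assumptions, not the paper's exact formulation: attention is plain dot-product attention with word vectors of the same size as the hidden state, the copy gate is a sigmoid over the concatenated context and hidden state, each gate is scaled by $1/K$ so the total copy weight stays below 1, and the remaining weight goes to $p_{voc}$. All names (`attend`, `final_distribution`, `w_copy`, `word_ids`) and numbers are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend(h, word_vecs):
    """Dot-product attention of query h over one sentence: returns (p_ret, context)."""
    p_ret = softmax([dot(h, w) for w in word_vecs])
    context = [sum(p * w[j] for p, w in zip(p_ret, word_vecs))
               for j in range(len(h))]
    return p_ret, context

def final_distribution(p_voc, h, sentences, word_ids, w_copy):
    """ASSUMED mixture: (1 - sum_i p_copy_i) * p_voc + sum_i p_copy_i * p_ret_i."""
    k = len(sentences)
    p_final = [0.0] * len(p_voc)
    copy_mass = 0.0
    for word_vecs, ids in zip(sentences, word_ids):
        p_ret, context = attend(h, word_vecs)
        # Gate in (0, 1/k); the 1/k scaling is an assumption of this sketch.
        p_copy = sigmoid(dot(w_copy, context + h)) / k
        # "Expand" p_ret to vocabulary size by scattering onto word indices.
        for p, idx in zip(p_ret, ids):
            p_final[idx] += p_copy * p
        copy_mass += p_copy
    for idx, p in enumerate(p_voc):
        p_final[idx] += (1.0 - copy_mass) * p
    return p_final

# Toy setup: K=2 sentences, L=2 words each, hidden size 2, vocabulary of 4 words.
h = [1.0, 0.0]
sentences = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
]
word_ids = [[0, 1], [2, 0]]  # vocabulary index of every word in each sentence
p_voc = [0.4, 0.3, 0.2, 0.1]
w_copy = [0.5, 0.5, 0.5, 0.5]
p_final = final_distribution(p_voc, h, sentences, word_ids, w_copy)
print(p_final)
```

Under these assumptions the result is still a valid probability distribution, because each $p_{ret,i}$ sums to 1 and the copy and generation weights sum to 1.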

3. Training

Strategy 1: So that the corpus can be expanded, the retriever can be kept fixed while the generator is fine-tuned.
Strategy 2: The two can also be trained jointly. However, updating the retriever directly can cause the generator to train poorly from the start, so a constraint is added to the loss.