
Paper reading: Open-book Video Captioning with Retrieve-Copy-Generate Network

2022-07-07 05:34:00 hei_ hei_ hei_

Open-book Video Captioning with Retrieve-Copy-Generate Network

Summary

  • Published at: CVPR 2021
  • Idea: The authors argue that previous methods produce monotonous captions because generation lacks guidance, and that, since the training set is fixed, the knowledge learned by the model cannot be extended. They therefore introduce a video-to-text retrieval task that retrieves sentences from a corpus to guide caption generation, analogous to an open-book exam (an open-book mechanism).

Detailed design

 [Figure: overall architecture of the Retrieve-Copy-Generate network]

1. Effective Video-to-Text Retriever

  • All sentences in the corpus are mapped to $d$-dimensional vectors by a textual encoder, and videos are mapped to $d$-dimensional vectors by a visual encoder; the similarity between them is used as the selection criterion.
     [equation figure]

  • Textual Encoder: bi-LSTM
     [equation figure]
    Note: $L$ is the sentence length, $W_s$ is a learnable embedding matrix, and $\eta_s$ denotes the LSTM parameters.
     [equation figure]
    The length-$L$ sentence is aggregated into a single $d$-dimensional vector: [equation figure]
    $v_s$ is the aggregation parameter.

  • Visual Encoder: appearance features and motion features
     [equation figure]
     [equation figure]
    $v_a, v_m$ are the aggregation parameters.

  • video-to-text similarity: [equation figure]
    The top-$k$ most similar sentences are retrieved as the guiding sentences (see the sketch after this list).
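
To make the retriever concrete, here is a minimal PyTorch sketch. The attention-pooling form, the cosine-similarity scoring, and all layer sizes are assumptions read off the description above, not the paper's exact implementation; `retrieve_top_k` is a hypothetical helper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualEncoder(nn.Module):
    """Bi-LSTM sentence encoder with attention pooling (sizes are assumptions)."""
    def __init__(self, vocab_size, emb_dim=300, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # learnable W_s
        self.lstm = nn.LSTM(emb_dim, d // 2, bidirectional=True, batch_first=True)
        self.v_s = nn.Linear(d, 1, bias=False)           # aggregation parameter v_s

    def forward(self, tokens):                           # tokens: (B, L)
        h, _ = self.lstm(self.embed(tokens))             # word states: (B, L, d)
        alpha = F.softmax(self.v_s(h), dim=1)            # attention over the L words
        return (alpha * h).sum(dim=1)                    # sentence vector: (B, d)

class VisualEncoder(nn.Module):
    """Pools appearance and motion features into one d-dim video vector."""
    def __init__(self, app_dim=2048, mot_dim=1024, d=512):
        super().__init__()
        self.proj_a, self.proj_m = nn.Linear(app_dim, d), nn.Linear(mot_dim, d)
        self.v_a = nn.Linear(d, 1, bias=False)           # aggregation parameter v_a
        self.v_m = nn.Linear(d, 1, bias=False)           # aggregation parameter v_m

    def forward(self, app, mot):                         # (B, T, app_dim), (B, T, mot_dim)
        ha, hm = self.proj_a(app), self.proj_m(mot)
        a = (F.softmax(self.v_a(ha), dim=1) * ha).sum(1)
        m = (F.softmax(self.v_m(hm), dim=1) * hm).sum(1)
        return a + m                                     # video vector: (B, d)

def retrieve_top_k(video_vec, sent_vecs, k=5):
    """Score one video against all corpus sentences; return the top-k indices."""
    sim = F.cosine_similarity(video_vec.unsqueeze(0), sent_vecs, dim=-1)
    return sim.topk(k).indices
```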

2. Copy-mechanism Caption Generator

The caption is generated by a Hierarchical Caption Decoder; at every step, a Dynamic Multi-pointers Module decides whether to copy a word from the retrieved guiding sentences.

2.1 Hierarchical Caption Decoder

The decoder consists of an attention-LSTM and a language-LSTM. The attention-LSTM attends over the visual features; the language-LSTM aggregates the current state and the visual context to produce a probability distribution $p_{voc}$ over the vocabulary.

  • attention-LSTM
     [equation figure]
    $x = [x^m; x^a]$; $y_{t-1}$ denotes the word generated at the previous step.
  • language-LSTM
     [equation figure]
    $W_{voc}, b_{voc}$ are learnable parameters (a sketch of the two LSTMs follows this list).
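
Below is a minimal sketch of the two-layer decoder in the same PyTorch style as above, following the common attention-LSTM/language-LSTM pattern; the exact inputs and attention form in the paper may differ, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDecoder(nn.Module):
    """attention-LSTM + language-LSTM decoder (a sketch; sizes are assumptions)."""
    def __init__(self, vocab_size, d=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # attention-LSTM sees the previous word, a global video feature, and
        # the language-LSTM state from the previous step
        self.att_lstm = nn.LSTMCell(emb_dim + d + d, d)
        self.lang_lstm = nn.LSTMCell(d + d, d)           # input: [visual context; h_att]
        self.proj = nn.Linear(d, d, bias=False)          # projects h_att for attention
        self.w_att = nn.Linear(d, 1, bias=False)         # attention scorer
        self.out = nn.Linear(d, vocab_size)              # W_voc, b_voc

    def step(self, y_prev, feats, x_bar, state):
        """One decoding step. feats: (B, T, d) visual features x = [x^m; x^a];
        x_bar: (B, d) their mean; state holds both LSTM cells' (h, c)."""
        (h_att, c_att), (h_lang, c_lang) = state
        h_att, c_att = self.att_lstm(
            torch.cat([self.embed(y_prev), x_bar, h_lang], dim=-1), (h_att, c_att))
        # additive attention of h_att over the visual features
        scores = self.w_att(torch.tanh(feats + self.proj(h_att).unsqueeze(1)))
        ctx = (F.softmax(scores, dim=1) * feats).sum(dim=1)     # visual context: (B, d)
        h_lang, c_lang = self.lang_lstm(
            torch.cat([ctx, h_att], dim=-1), (h_lang, c_lang))
        p_voc = F.softmax(self.out(h_lang), dim=-1)             # p_voc over the vocabulary
        return p_voc, h_lang, ((h_att, c_att), (h_lang, c_lang))
```

`h_lang` is returned alongside `p_voc` because the copy module below uses it as the query $h^l_t$.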
2.2 Dynamic Multi-pointers Module

Premise: $K$ candidate sentences have already been retrieved, each containing $L$ words.

  • Each sentence is processed separately. The decoder hidden state $h^l_t$ serves as the query to attend over the $L$ words of the sentence, yielding an attention distribution over those $L$ words.
     [equation figure]
    $p_{ret,i}$ denotes the attention distribution over the words of the $i$-th sentence; $c_{i,t}^r$ denotes the attention-weighted result.

  • Decide whether to copy the selected word.
     [equation figure]

  • Obtain the final probability distribution over all words ($p_{ret}$ is extended to the full vocabulary, and $p_{copy}$ is broadcast); see the sketch after this list.
     [equation figure]
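
A minimal sketch of one pointer/copy mixture matching the description above: per-sentence attention gives $p_{ret,i}$ and $c^r_{i,t}$, a gate decides how much to copy from each sentence, and the copy mass is scattered back onto the vocabulary. The gating and mixing details are assumptions, not the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPointerModule(nn.Module):
    """Dynamic multi-pointer copy mechanism (a sketch; gating form is assumed)."""
    def __init__(self, d=512):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)           # query projection for h^l_t
        self.w_k = nn.Linear(d, d, bias=False)           # key projection for words
        self.gate = nn.Linear(2 * d, 1)                  # per-sentence copy gate

    def forward(self, h_t, sent_hiddens, sent_ids, p_voc):
        # h_t: (B, d); sent_hiddens: (B, K, L, d); sent_ids: (B, K, L); p_voc: (B, V)
        B, K, L, d = sent_hiddens.shape
        q = self.w_q(h_t).view(B, 1, 1, d)
        p_ret = F.softmax((self.w_k(sent_hiddens) * q).sum(-1), dim=-1)   # (B, K, L)
        ctx = (p_ret.unsqueeze(-1) * sent_hiddens).sum(2)    # c^r_{i,t}: (B, K, d)
        g = torch.sigmoid(self.gate(torch.cat(
            [ctx, h_t.unsqueeze(1).expand(B, K, d)], dim=-1)))           # (B, K, 1)
        p_final = (1 - g.mean(dim=1)) * p_voc                # generation share
        copy = torch.zeros_like(p_voc)
        w = (g.squeeze(-1) / K).unsqueeze(-1) * p_ret        # copy mass per retrieved word
        copy.scatter_add_(1, sent_ids.reshape(B, K * L), w.reshape(B, K * L))
        return p_final + copy                                # still sums to 1 per row
```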

3. Training

  • Strategy 1: to keep the corpus expandable, the retriever can be fixed while fine-tuning the generator (see the sketch after this list).
  • Strategy 2: the two can also be trained jointly, but updating the retriever directly can leave the generator poorly trained in the early stages, so a constraint is added to the loss.
     [equation figure]
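
A minimal sketch of Strategy 1; `retriever_modules` and `generator` are placeholder names for the modules sketched earlier, and the optimizer choice and learning rate are assumptions.

```python
import torch

def build_finetune_optimizer(retriever_modules, generator, lr=1e-4):
    """Strategy 1: freeze the retriever, fine-tune only the generator."""
    for module in retriever_modules:
        for p in module.parameters():
            p.requires_grad = False                      # retriever stays fixed
    return torch.optim.Adam(generator.parameters(), lr=lr)
```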

Experimental results

  • Ablation experiments
    different $K$
     [results table figure]
    different corpus sizes
     [results table figure]
  • Performance comparison
     [results table figure]
    The results are actually middling, not surpassing some experiments from 2020.

Copyright notice
This article was written by [hei_ hei_ hei_]; please include the original link when reposting. Thanks.
https://yzsam.com/2022/188/202207062335134274.html