Paper reading: Open-book Video Captioning with Retrieve-Copy-Generate Network
2022-07-06 23:36:00 【hei_hei_hei_】
Open-book Video Captioning with Retrieve-Copy-Generate Network
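Overview
- Published: CVPR 2021
- Idea: The authors argue that earlier methods produce rather monotonous captions because generation lacks guidance, and that because the training set is fixed, the knowledge a model learns during training cannot be extended. They therefore use a video-to-text retrieval task to fetch sentences from a corpus as guidance for captioning, much like an open-book exam (open-domain mechanism).
Detailed design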
1. Effective Video-to-Text Retriever
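Every sentence in the corpus is mapped to a $d$-dimensional vector by a textual encoder, and every video is mapped to a $d$-dimensional vector by a visual encoder; the similarity between the two is the selection criterion.
Textual Encoder: bi-LSTM
ps: $L$ is the sentence length, $W_s$ is a learnable embedding matrix, and $\eta_s$ denotes the LSTM parameters.
The length-$L$ sentence is aggregated into a single $d$-dimensional vector:
$v_s$ is the aggregation parameter.
Visual Encoder: appearance features && motion features
$v_a, v_m$ are the aggregation parameters.
video-to-text similarity:
Finally, the $k$ retrieved sentences are obtained and used as guidance.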
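The retrieval step above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the pooling is assumed to be softmax attention scored by a dot product with the aggregation parameter, the similarity is assumed to be cosine similarity, and the video is assumed to be already reduced to a single vector. The names `attention_pool`, `cosine`, and `retrieve_top_k` are hypothetical.

```python
import math

def attention_pool(states, v):
    """Aggregate L hidden states (each d-dim) into one d-dim vector.

    Assumption: each state is scored by its dot product with the
    aggregation parameter v, scores are softmax-normalised, and the
    states are averaged with those weights.
    """
    scores = [sum(h_j * v_j for h_j, v_j in zip(h, v)) for h in states]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    d = len(states[0])
    return [sum(w * h[j] for w, h in zip(weights, states)) for j in range(d)]

def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def retrieve_top_k(video_vec, sentence_vecs, k):
    """Return the indices of the k corpus sentences most similar to the video."""
    sims = [cosine(video_vec, s) for s in sentence_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:k]

# Toy example: L = 3 hidden states of dimension d = 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_s = [0.5, 0.5]
sentence_vec = attention_pool(hidden_states, v_s)

# Toy corpus of four already-encoded sentences and one encoded video.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
video = [0.9, 0.8]
top2 = retrieve_top_k(video, corpus, k=2)
```

Because the corpus vectors do not depend on the video, they can be encoded once and cached, which is what makes swapping in a larger corpus cheap.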
2. Copy-mechanism Caption Generator
The caption is generated by a Hierarchical Caption Decoder; at each step, a Dynamic Multi-pointers Module decides whether to copy a word from the guidance sentences.
2.1 Hierarchical Caption Decoder
The decoder consists of an attention-LSTM and a language-LSTM. The attention-LSTM attends to the visual features; the language-LSTM aggregates the current state and the visual context to produce a probability distribution over the vocabulary, $p_{voc}$.
- attention-LSTM
$x = [x^m; x^a]$, and $y_{t-1}$ is the word generated at the previous step.
- language-LSTM
$W_{voc}, b_{voc}$ are learnable parameters.
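As a rough illustration of the last step of the decoder, the sketch below turns a hidden state into a vocabulary distribution with a linear projection followed by a softmax. The exact form used in the paper is not reproduced in these notes, so treat this as an assumption; `vocab_distribution`, `W_voc`, `b_voc`, and the toy values are placeholders for the learnable projection parameters.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def vocab_distribution(h, W, b):
    """Assumed form: p_voc = softmax(W h + b), with W of shape (V, d)."""
    logits = [sum(w_j * h_j for w_j, h_j in zip(row, h)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)

# Toy hidden state (d = 3) and projection for a vocabulary of V = 4 words.
h_t = [0.2, -0.1, 0.4]
W_voc = [
    [0.5, 0.0, 1.0],
    [-0.3, 0.8, 0.1],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, -1.0],
]
b_voc = [0.0, 0.1, 0.0, -0.2]
p_voc = vocab_distribution(h_t, W_voc, b_voc)
```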
2.2 Dynamic Multi-pointers Module
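Precondition: $K$ candidate sentences have already been retrieved, each containing $L$ words.
Each sentence is processed separately. The decoder hidden state $h^l_t$ serves as the query and attends over the $L$ words of the sentence, giving an attention distribution over those $L$ words.
$p_{ret,i}$ is the attention weight distribution over the words of the $i$-th sentence; $c^r_{i,t}$ is the resulting weighted sum.
Next, decide whether to copy the selected word.
The final probability distribution over all words is then obtained ($p_{ret}$ is expanded, $p_{copy}$ is broadcast).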
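The copy step can be sketched as follows. The notes do not give the exact gate or mixing formula, so several choices here are assumptions made only to get a valid distribution: dot-product attention for the pointers, a sigmoid gate over the concatenated hidden state and context, and an average over the $K$ sentences when mixing. All function names and the gate weights `w_gate` are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_attention(h, word_vecs):
    """Attend from decoder state h (the query) over the L words of one sentence.

    Returns p_ret (length L) and the weighted context vector (length d).
    """
    scores = [sum(a * b for a, b in zip(h, w)) for w in word_vecs]
    p_ret = softmax(scores)
    d = len(h)
    context = [sum(p * w[j] for p, w in zip(p_ret, word_vecs)) for j in range(d)]
    return p_ret, context

def copy_gate(h, context, w_gate):
    """Assumed gate: sigmoid of a linear function of [h; context]."""
    z = sum(g * x for g, x in zip(w_gate, h + context))
    return 1.0 / (1.0 + math.exp(-z))

def final_distribution(p_voc, sentences, h, w_gate):
    """Mix p_voc with the K pointer distributions.

    sentences: list of K (word_ids, word_vecs) pairs. Each p_ret is
    expanded to vocabulary size by scattering onto its word ids, and
    each scalar gate is broadcast over that expanded distribution.
    """
    K = len(sentences)
    copied = [0.0] * len(p_voc)
    gate_sum = 0.0
    for word_ids, word_vecs in sentences:
        p_ret, ctx = pointer_attention(h, word_vecs)
        g = copy_gate(h, ctx, w_gate)
        gate_sum += g
        for wid, p in zip(word_ids, p_ret):
            copied[wid] += g * p / K
    keep = 1.0 - gate_sum / K            # mass left for generating from p_voc
    return [keep * pv + c for pv, c in zip(p_voc, copied)]

# Toy setup: vocabulary of 5 words, d = 2, K = 2 sentences of L = 3 words.
p_voc = [0.2] * 5
h = [1.0, 0.0]
sentences = [
    ([1, 3, 4], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),
    ([0, 3, 2], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]),
]
w_gate = [0.5, -0.5, 0.5, 0.5]           # hypothetical gate weights, length 2d
p_final = final_distribution(p_voc, sentences, h, w_gate)
```

In this toy setup word id 3 appears in both retrieved sentences and receives strong attention in the second, so it ends up with the largest final probability even though `p_voc` is uniform.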
3. Training
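- Strategy 1: To keep the corpus extensible, the retriever can be frozen while the generator is fine-tuned.
- Strategy 2: The two can also be trained jointly, but updating the retriever directly causes the generator to be trained poorly from the very start, so a constraint is added to the loss.
Experimental results
- Ablation study
different K
different corpus size
- Comparison Performance
The results are actually fairly ordinary: they do not surpass some of the results published in 2020.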