Paper Reading: Semantic Tag Augmented XlanV Model for Video Captioning
2022-07-06 23:35:00 【hei_hei_hei_】
Semantic Tag Augmented XlanV Model for Video Captioning
- Published: ACM MM 2021
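- Code: ST-XlanV
- Idea: use pre-trained models to generate semantic tags that narrow the gap between modalities and strengthen the XlanV model. Cross-modal attention captures the interactions between dynamic and static features and between visual and semantic features. Three pre-training tasks are designed for tag alignment.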
Detailed Design

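The ACM MM papers in this series all seem to follow a very similar idea: they closely resemble the original X-Linear paper and simply extend it to multiple modalities.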
1. Semantic Tag Augmented XlanV Model
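The overall framework is similar to the previous paper's: each multi-modal feature is passed through its own XLAN encoder to extract higher-order features; the outputs are concatenated and fed into a cross encoder to extract features that contain cross-modal interactions; finally, an LSTM decodes them and generates the captions.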
2. Cross-modal Attention
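Each feature is encoded by its own encoder and augmented with position information; the results are concatenated and fed into an XLAN encoder, whose output is the cross-modal feature. This feature is mean-pooled and then fed into the LSTM. The computation is as follows: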
$\widetilde{C}$ denotes the mean-pooled feature, and $E_{y_{t-1}}$ denotes the embedding of the word output at the previous time step.
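The fusion steps described above (add position information, concatenate, cross-encode, mean-pool, feed the LSTM) can be sketched roughly as follows. This is only an illustrative sketch, not the paper's implementation: the toy positional signal, the `cross_encoder` placeholder (an identity function stands in for the XLAN encoder here), concatenation along the sequence axis, and combining the pooled feature with the previous word embedding by concatenation are all assumptions, since this note does not give those details.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]


def add_position(seq: Sequence) -> Sequence:
    """Add a toy positional signal (index / length) to every dimension.

    Assumption: the actual positional encoding is not specified in this note.
    """
    n = len(seq)
    return [[x + i / n for x in vec] for i, vec in enumerate(seq)]


def mean_pool(seq: Sequence) -> Vector:
    """Average the vectors of a sequence element-wise."""
    n = len(seq)
    return [sum(column) / n for column in zip(*seq)]


def cross_modal_feature(
    modalities: List[Sequence],
    cross_encoder: Callable[[Sequence], Sequence],
) -> Vector:
    """Fuse already-encoded per-modality sequences into one pooled vector."""
    fused: Sequence = []
    for seq in modalities:
        # Add position info per modality, then concatenate along the sequence axis.
        fused.extend(add_position(seq))
    # Placeholder for the XLAN cross encoder, followed by mean pooling.
    return mean_pool(cross_encoder(fused))


def lstm_input(pooled: Vector, prev_word_embedding: Vector) -> Vector:
    """Assumed combination: concatenate pooled feature and previous word embedding."""
    return pooled + prev_word_embedding


# Hypothetical usage with two tiny "modalities" of feature dimension 2.
static_feats = [[0.0, 0.0], [0.0, 0.0]]
tag_feats = [[1.0, 1.0]]
pooled = cross_modal_feature([static_feats, tag_feats], cross_encoder=lambda s: s)
x_t = lstm_input(pooled, prev_word_embedding=[0.1, 0.2, 0.3])
```

In a real model the identity placeholder would be replaced by a trained encoder, and `x_t` would be one of the inputs to the LSTM decoder at step $t$.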
3. Pre-training Tasks
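- Tag Alignment Prediction (TAP): with a probability of 50%, the current video's semantic tags are randomly replaced with other tags, and the model predicts whether the tags have been replaced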

- Mask Language Modeling (MLM): similar to BERT, 15% of the words in the input sentence are randomly masked

- Video Captioning(VCAP):caption generation
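The input corruption used by TAP and MLM can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `[MASK]` token string, and the choice to swap the whole tag list at once (rather than individual tags) are hypothetical, and the BERT-style 80/10/10 replacement detail is deliberately left out.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask token string


def make_tap_example(
    tags: List[str],
    other_tags: List[str],
    rng: random.Random,
    replace_prob: float = 0.5,
) -> Tuple[List[str], int]:
    """Return (tags, label); label 1 means the tags were replaced.

    Assumption: the whole tag list is swapped with tags from another video.
    """
    if rng.random() < replace_prob:
        return list(other_tags), 1
    return list(tags), 0


def make_mlm_example(
    tokens: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask each token independently with probability mask_prob.

    Returns the corrupted tokens and, per position, the original token to
    predict (None where the position was left unmasked).
    """
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(token)
        else:
            masked.append(token)
            targets.append(None)
    return masked, targets


# Hypothetical usage with made-up tags and caption tokens.
rng = random.Random(0)
tap_tags, tap_label = make_tap_example(["dog", "run"], ["car", "road"], rng)
mlm_tokens, mlm_targets = make_mlm_example("a dog runs on the grass".split(), rng)
```

The TAP label would train a binary classifier on top of the cross-modal feature, while the MLM targets would train word prediction at the masked positions.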

Experimental Results
Ablative Studies

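Summary: semantic tags act as a bridge between vision and language; the pre-training tasks help the model make full use of multi-modal interactions; the reinforcement learning strategy improves the model's performance.
Performance Comparison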

$P$ indicates that the model uses the pre-training tasks; $RL$ indicates that the reinforcement learning strategy is used.