Paper reading: Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
2022-07-07 05:34:00 【hei_ hei_ hei_】
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
Summary
- Published: ACM MM 2021
- Code: MMAC
- Idea: This paper proposes a new video captioning task, egocentric video captioning (that is, captioning video shot from a first-person rather than a third-person perspective), which allows closer, more detailed descriptions. To mitigate problems such as motion blur and occlusion caused by the recording device and other factors, sensor data is used as an auxiliary input for captioning.
In the network design there are two main modules. The AMMT module merges the visual features $h_V$ and the sensor features $h_S$ into merged features $h_{V+S}$. All three features ($h_V$, $h_S$, $h_{V+S}$) are then fed into the DMA module for selective attention learning. The result is passed to a GRU, which generates the caption word by word.
Detailed design
1. Feature extraction
- Visual features $h_V$: VGG16
- Sensor features $h_S$: LSTM (sequential data)
2. Asymmetric Multi-modal Transformation (AMMT)
In essence, this module performs feature merging.
Source: FiLM: Visual Reasoning with a General Conditioning Layer; the relevant concept is feature-wise linear modulation.
Note: the parameters are initialized as $W_c = I$, $b_c = 0$, so the module starts out equivalent to plain concatenation; as training progresses, it learns how to merge the two features.
Note that the module outputs three kinds of features:
(1) Visual features $h_V$
(2) Sensor features $h_S$
(3) Merged features $h_{V+S}$
- Why the transformation is asymmetric
On the one hand, it mitigates the overfitting caused by redundant data; on the other hand, sensor data sometimes contains unwanted noise and therefore needs to be adjusted.
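As a rough illustration of the feature-wise linear modulation that AMMT borrows from FiLM, here is a minimal pure-Python sketch. It shows generic FiLM only: a conditioning vector `z` produces a per-feature scale `gamma` and shift `beta` that are applied to a feature vector `x`. The function names, the toy weights, and which modality plays the role of `x` or `z` are all assumptions made for illustration; the exact AMMT equations are not reproduced in these notes.

```python
def linear(W, b, z):
    """Affine map W @ z + b for plain Python lists (W is a list of rows)."""
    return [sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)]


def film(x, z, W_gamma, b_gamma, W_beta, b_beta):
    """Generic FiLM: scale and shift each element of x, conditioned on z.

    gamma and beta are linear projections of z (hypothetical parameter names).
    """
    gamma = linear(W_gamma, b_gamma, z)
    beta = linear(W_beta, b_beta, z)
    return [g * xi + b for g, xi, b in zip(gamma, x, beta)]


# Toy inputs (made-up numbers): x has 2 features, z has 3.
x = [1.0, 2.0]
z = [0.5, -1.0, 2.0]

# Neutral start: gamma = 1 and beta = 0, so x passes through unchanged.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
neutral = film(x, z, zeros, [1.0, 1.0], zeros, [0.0, 0.0])

# Conditioned case: gamma = [z0, z1], beta = [z2, z2].
W_gamma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_beta = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
modulated = film(x, z, W_gamma, [0.0, 0.0], W_beta, [0.0, 0.0])

print(neutral)
print(modulated)
```

The neutral case mirrors the spirit of the identity-style initialization mentioned in the note above: the layer begins by leaving its input untouched and only learns a non-trivial transformation during training.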
3. Dynamic Modal Attention (DMA)
This module dynamically selects which of the three features to attend to.
Gumbel-Softmax is used here.
Note: the reason for keeping all three features is that in many cases it is preferable to use only a single modality (for example, when the sensor data contains unwanted noise).
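The selection step can be illustrated with a small Gumbel-Softmax sketch in pure Python. It is not the paper's implementation: the logits, the temperature `tau`, the function names, the assumption that all three features share one dimension, and the weighted-sum fusion are all hypothetical choices made to show how Gumbel-Softmax yields near-discrete yet differentiable-style weights over the three candidates.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample as -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u away from 0 and 1 so both logs are finite
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Softmax over (logits + Gumbel noise) / tau; lower tau gives sharper weights."""
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_features(weights, features):
    """Weighted sum of equally sized feature vectors."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy stand-ins for h_V, h_S and h_{V+S} (made-up values, same dimension assumed).
h_v = [1.0, 0.0, 0.0, 0.0]
h_s = [0.0, 1.0, 0.0, 0.0]
h_vs = [0.0, 0.0, 1.0, 1.0]

# Hypothetical selection scores; in a real model a learned layer would produce them.
logits = [0.5, -1.0, 2.0]

weights = gumbel_softmax(logits, tau=0.5, rng=rng)
fused = mix_features(weights, [h_v, h_s, h_vs])

print(weights)
print(fused)
```

As `tau` shrinks, the weights approach a one-hot choice, which matches the motivation above of sometimes relying on a single modality.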