Paper reading: [Learning to Discretely Compose Reasoning Module Networks for Video Captioning]
2022-07-01 19:24:00 【hei_ hei_ hei_】
Learning to Discretely Compose Reasoning Module Networks for Video Captioning
1. Summary
- Published: IJCAI 2020
- Code: https://github.com/tgc1997/RMN
- Idea: The authors argue that a video description is generated step by step. To produce a sentence, the model first locates and describes the subject, then reasons about the action, and then locates and describes the object. They argue that this process requires complex spatio-temporal reasoning. For reasoning, the authors design three modules, locate, relate and func, which respectively locate targets (2D), reason about relations (3D), and generate function words (such as "a", "the", "and"). For selection, they design a Module Selector that picks one of these modules when generating the next word.

2. Detailed design
2.1 Encoder
- Feature extraction: a 2D-CNN, a 3D-CNN and an R-CNN extract the video's appearance features $V_a$, motion features $V_m$ and object features $V_o$, respectively. Note that $V_o$ carries location information (this is reflected in the code).
- Feature processing: $V_a$ and $V_m$ are passed through Bi-LSTMs to incorporate temporal information into the features.
- Guidance for the entire network, $h_t^{en}$: the hidden-state output of an LSTM. Its inputs are the global visual information $\bar v$, the embedding of the word generated at the previous step, and the hidden state.

2.2 Reasoning Modules
All reasoning modules build on additive attention (Neural machine translation by jointly learning to align and translate. ICLR 2015).
Attention defined this way can be applied along a specified dimension. To better model the spatial and temporal directions, the authors define attention over space and attention over time separately: $AoS(\cdot)$ and $AoT(\cdot)$.
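A minimal NumPy sketch of attention applied along a chosen dimension may help here. It assumes the standard additive form (a `tanh` of projected features plus a projected query, scored by a vector `w`); the names `additive_attention`, `aos`, `aot` and the parameter shapes are illustrative and not taken from the RMN code.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(V, q, W_v, W_q, w, axis):
    """Pool features V (..., d) along `axis`, guided by query q (d_q,).

    `axis` must be a non-negative index of a non-feature dimension.
    """
    scores = np.tanh(V @ W_v + q @ W_q) @ w      # shape: V.shape[:-1]
    alpha = softmax(scores, axis=axis)           # weights sum to 1 along `axis`
    return (alpha[..., None] * V).sum(axis=axis)

def aos(V, q, params):
    # Attention over space: pool the region axis of a (T, N, d) tensor.
    return additive_attention(V, q, *params, axis=1)

def aot(V, q, params):
    # Attention over time: pool the frame axis of a (T, d) tensor.
    return additive_attention(V, q, *params, axis=0)

rng = np.random.default_rng(0)
T, N, d, d_q, d_h = 4, 5, 8, 6, 16               # toy sizes
params = (rng.normal(size=(d, d_h)),             # W_v
          rng.normal(size=(d_q, d_h)),           # W_q
          rng.normal(size=d_h))                  # w
V_o = rng.normal(size=(T, N, d))                 # T frames, N regions each
q = rng.normal(size=d_q)                         # stands in for h_t^{en}

per_frame = aos(V_o, q, params)                  # shape (T, d)
clip = aot(per_frame, q, params)                 # shape (d,)
```

The only difference between the two variants in this sketch is which axis the softmax normalizes over.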
- Locate Module
Mainly generates object words such as “man” and “basketball”. The module needs to attend to region information in both space and time, so the authors first feed $V_o$ into $AoS(\cdot)$, then feed the result together with $V_a$ into $AoT(\cdot)$.
Here $\bigoplus$ denotes the concatenation operation.
- Relate Module
Mainly generates verbs such as “shooting” and “riding”. To generate the verb “shooting”, for example, the model needs to notice how an object's state changes across scenes. The Relate Module therefore pairs up any two of the $V_o$ features that have been processed by spatial attention, and then applies temporal attention.
- Func Module
Mainly generates function words that keep the whole sentence coherent, such as “of” and “and”. This needs no visual information, only language information, so the module performs $AoT$ over the history of the decoder LSTM's cell states.
All three modules are built around the attention operation introduced at the start of this section, with $h_t^{en}$ as the attention query Q.
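The way the three modules compose attention can be sketched roughly as follows. This is a shape-level illustration under several assumptions: additive attention as the building block, separate randomly initialized parameters per attention call, spatial-attention parameters shared between `locate` and `relate`, and pairing done by concatenating every ordered pair of per-frame object features. All function and variable names are hypothetical, and the actual RMN code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_params(d_in, d_q, d_h=16):
    # Hypothetical parameter set for one additive-attention call.
    return {"W_v": rng.normal(size=(d_in, d_h)) * 0.1,
            "W_q": rng.normal(size=(d_q, d_h)) * 0.1,
            "w": rng.normal(size=d_h) * 0.1}

def attend(V, q, p, axis):
    # Additive attention that pools V (..., d) along a non-feature axis.
    scores = np.tanh(V @ p["W_v"] + q @ p["W_q"]) @ p["w"]
    scores = scores - scores.max(axis=axis, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=axis, keepdims=True)
    return (alpha[..., None] * V).sum(axis=axis)

def locate(V_a, V_o, h, p_space, p_time):
    obj = attend(V_o, h, p_space, axis=1)             # AoS: (T, N, d) -> (T, d)
    joint = np.concatenate([obj, V_a], axis=-1)       # concatenation: (T, 2d)
    return attend(joint, h, p_time, axis=0)           # AoT: (2d,)

def relate(V_o, h, p_space, p_time):
    obj = attend(V_o, h, p_space, axis=1)             # AoS: (T, d)
    T = obj.shape[0]
    left = np.repeat(obj, T, axis=0)                  # frame i, repeated T times
    right = np.tile(obj, (T, 1))                      # frame j, cycling
    pairs = np.concatenate([left, right], axis=-1)    # all pairs: (T*T, 2d)
    return attend(pairs, h, p_time, axis=0)           # AoT over pairs: (2d,)

def func(C, h, p_time):
    # C holds the decoder LSTM cell states of the previous steps: (t, d_c).
    return attend(C, h, p_time, axis=0)               # AoT: (d_c,)

T, N, d, d_q, d_c = 4, 5, 8, 6, 10                    # toy sizes
V_a = rng.normal(size=(T, d))
V_o = rng.normal(size=(T, N, d))
C = rng.normal(size=(3, d_c))
h_en = rng.normal(size=d_q)                           # the query in every module

p_space = make_params(d, d_q)
v_loc = locate(V_a, V_o, h_en, p_space, make_params(2 * d, d_q))
v_rel = relate(V_o, h_en, p_space, make_params(2 * d, d_q))
v_fun = func(C, h_en, make_params(d_c, d_q))
```

In this sketch the three outputs have different sizes; a real model would need to bring them to a common dimension before they can be combined.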
Module Selector
During generation, the word produced at each step can come from only one of the three modules above, so a selection module is needed to choose among them. Concretely, it scores each module and then picks the one with the highest score.
But because the max function is non-differentiable, the authors use an approximation to convert the one-hot vector $z_t$ into a continuous vector $\tilde{z}_t$.
The final result of visual reasoning, $v_t$, combines the selection vector with the outputs of the three modules.
Here $\bigotimes$ denotes the inner product.
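One common way to realize this kind of approximation is the Gumbel-Softmax trick, and the paper's abstract does mention a Gumbel approximation. The sketch below shows only the forward pass under that assumption: the module scores are taken as given, and the temperature `tau`, the hard one-hot step, and all names are illustrative rather than taken from the RMN code.

```python
import numpy as np

def gumbel_softmax(scores, tau, rng):
    # Add Gumbel(0, 1) noise to the scores, then apply a tempered softmax.
    u = rng.uniform(low=1e-10, high=1.0, size=scores.shape)
    gumbel = -np.log(-np.log(u))
    y = (scores + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

def select_and_combine(scores, outputs, tau=1.0, rng=None):
    """scores: (3,) module scores; outputs: (3, d) stacked module outputs."""
    rng = np.random.default_rng() if rng is None else rng
    z_soft = gumbel_softmax(scores, tau, rng)   # continuous relaxation
    z_hard = np.zeros_like(z_soft)              # discrete one-hot choice
    z_hard[z_soft.argmax()] = 1.0
    # With autograd, gradients would flow through z_soft while the forward
    # pass uses z_hard (straight-through); NumPy only shows the forward pass.
    v_t = z_hard @ outputs                      # inner product picks one row
    return z_soft, z_hard, v_t

scores = np.array([2.0, 0.5, -1.0])                  # locate, relate, func
outputs = np.arange(12, dtype=float).reshape(3, 4)   # toy module outputs
z_soft, z_hard, v_t = select_and_combine(
    scores, outputs, rng=np.random.default_rng(0))
```

Because `z_hard` is one-hot, the inner product simply returns the output of the selected module.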
Decoder
A single LSTM is used for decoding. Its inputs are the visual reasoning result $v_t$ and the encoder's hidden state.
The visual information and the hidden state are then passed through an MLP, which outputs a probability distribution over the vocabulary, from which the generated word is obtained.
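As a rough illustration of this last step, the sketch below maps the visual result and a decoder hidden state to a distribution over a toy vocabulary. It assumes the two vectors are concatenated and passed through a single linear layer followed by a softmax, and that the word is chosen greedily; the notes only say "MLP", so the layer structure, the sizes and every name here are hypothetical.

```python
import numpy as np

def word_distribution(v_t, h_dec, W, b):
    # Concatenate visual and language information, then project to the vocabulary.
    x = np.concatenate([v_t, h_dec])
    logits = x @ W + b
    logits = logits - logits.max()      # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
d_v, d_h, vocab_size = 4, 6, 10         # toy sizes
v_t = rng.normal(size=d_v)              # visual reasoning result
h_dec = rng.normal(size=d_h)            # decoder LSTM hidden state
W = rng.normal(size=(d_v + d_h, vocab_size)) * 0.1
b = np.zeros(vocab_size)

probs = word_distribution(v_t, h_dec, W, b)
word_id = int(probs.argmax())           # greedy choice of the next word
```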
Training
- Caption loss: cross-entropy loss
Measures the accuracy of the generated sentence.
$T$ denotes the length of the sentence.
- POS loss: KLD loss
Measures the accuracy of the selection module. Specifically, the sentence's POS (part-of-speech) tags are converted to one-hot codes, and a KLD (Kullback-Leibler divergence) loss measures the similarity of the two distributions. The actual implementation in the code also uses cross-entropy loss.
- Final loss: combines the caption loss and the POS loss.
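The remark that the code uses cross-entropy for the POS loss makes sense because, when the target is one-hot, the KL divergence and the cross-entropy are the same number. The sketch below checks this numerically and then forms a combined loss. The averaging over steps, the weight `lam`, the weighted-sum form, and the assumption that each POS tag is already mapped to a module index are illustrative and not taken from the paper.

```python
import numpy as np

def cross_entropy(probs, targets):
    # probs: (T, K) predicted distributions, targets: (T,) class indices.
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked).mean())

def kl_from_one_hot(probs, targets):
    # KL(p || q) with p one-hot, using the convention 0 * log 0 = 0.
    p = np.eye(probs.shape[1])[targets]
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p / probs), 0.0)
    return float(terms.sum(axis=1).mean())

# Toy predictions for a two-word sentence (T = 2).
word_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])      # over a 3-word vocabulary
words = np.array([0, 1])                      # ground-truth word ids
module_probs = np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.5, 0.3]])    # over locate / relate / func
pos_tags = np.array([0, 1])                   # module index implied by each POS tag

lam = 0.1                                     # hypothetical loss weight
loss_cap = cross_entropy(word_probs, words)
loss_pos = kl_from_one_hot(module_probs, pos_tags)
loss = loss_cap + lam * loss_pos
```

With a one-hot target the entropy term of the KL divergence is zero, which is why `loss_pos` matches `cross_entropy(module_probs, pos_tags)`.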
