Reading the paper [Learning to Discretely Compose Reasoning Module Networks for Video Captioning]
2022-07-01 19:24:00 【hei_ hei_ hei_】
Learning to Discretely Compose Reasoning Module Networks for Video Captioning
1. Summary
- Venue: IJCAI 2020
- Code: https://github.com/tgc1997/RMN
- Idea: The authors argue that a video caption is generated step by step. To produce a sentence, the model first locates and describes the subject, then reasons about the action, and then locates and describes the object. They argue that this process requires complex spatio-temporal reasoning. For reasoning, they design three modules, Locate, Relate, and Func, which respectively locate targets (2D), reason about relations (3D), and generate function words (such as "a", "the", and "and"). For selection, they design a Module Selector that picks one of these modules when generating the next word.
2. Detailed design
2.1 Encoder
- Feature extraction: a 2D-CNN, a 3D-CNN, and an R-CNN are used to extract the video's appearance features $V_a$, motion features $V_m$, and object features $V_o$, respectively. Note that $V_o$ carries location information (this shows up in the code).
- Feature processing: $V_a$ and $V_m$ are passed through Bi-LSTMs to incorporate temporal information into the features.
- Guidance for the whole network, $h_t^{en}$: the hidden-state output of an LSTM. Its inputs are the global visual feature $\bar{v}$, the embedding of the word generated at the previous step, and the hidden state from the previous step.
2.2 Reasoning Modules
All reasoning modules are built on the following attention computation (Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015).
Attention defined this way can be applied along a specified dimension. To better model the spatial and temporal directions, the authors define attention over the spatial and temporal dimensions separately: $AoS(\cdot)$ (attention over space) and $AoT(\cdot)$ (attention over time).
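The attention formula itself is not reproduced in this post. As a rough illustration, the sketch below uses the additive scoring form of the cited ICLR 2015 paper and applies it either across the regions of each frame (AoS) or across frames (AoT). The function names, the parameter tuple `(W, U, b, w)`, and the toy shapes are assumptions made for this example, not the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def additive_attention(items, query, W, U, b, w):
    """Attend over `items` (a list of feature vectors) with one query vector.

    score_i = w . tanh(W v_i + U q + b); the output is the softmax-weighted
    sum of the items. W, U, b, w stand in for learned parameters.
    """
    projected_query = matvec(U, query)
    scores = []
    for item in items:
        projected_item = matvec(W, item)
        hidden = [math.tanh(pi + pq + bi)
                  for pi, pq, bi in zip(projected_item, projected_query, b)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(items[0])
    return [sum(a * item[k] for a, item in zip(weights, items)) for k in range(dim)]

def attend_over_space(V_o, query, params):
    """AoS: V_o[t][n] is the feature of region n in frame t; one vector per frame."""
    return [additive_attention(regions, query, *params) for regions in V_o]

def attend_over_time(V, query, params):
    """AoT: V[t] is the feature of frame t; returns a single vector."""
    return additive_attention(V, query, *params)

# Toy example: 2 frames, 2 regions per frame, 2-d features, 1-d hidden layer.
# W and U are all zeros here, so every score is equal and the weights are uniform.
params = ([[0.0, 0.0]], [[0.0, 0.0]], [0.0], [1.0])
V_o = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
h_en = [0.5, -0.5]
per_frame = attend_over_space(V_o, h_en, params)    # one vector per frame
pooled = attend_over_time(per_frame, h_en, params)  # one vector overall
```

The only difference between the two variants is which axis the softmax runs over: regions within a frame for AoS, frames for AoT.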
- Locate Module
Mainly used to generate object words such as "man" and "basketball". The module needs to attend to region information in both space and time, so the authors first feed $V_o$ into $AoS(\cdot)$ and then feed the result, together with $V_a$, into $AoT(\cdot)$.
In the module's formula, $\oplus$ denotes the concatenation operation.
- Relate Module
Mainly used to generate verbs such as "shooting" and "riding". In the post's example figure, to generate the verb "shooting", the model needs to be aware of how an object's state changes across different scenes. The Relate Module therefore pairs up any two of the spatially attended $V_o$ features and then applies temporal attention.
- Func Module
Mainly used to generate function words that make the sentence coherent, such as "of" and "and". No visual information is needed here, only linguistic information, so AoT is applied to the history of the decoder LSTM's cell states.
All three modules are built around the attention operation introduced at the start of this section, with $h_t^{en}$ serving as the attention query (Q).
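The data flow of the three modules can be sketched as follows. This is a simplified illustration based only on the description above. The generic `attend` helper, the toy scoring function `toy_score`, and the choice to pair every ordered pair of distinct frames in `relate` are all assumptions made for this example; the repository's implementation may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(items, query, score_fn):
    """Softmax-weighted sum of `items`. score_fn(item, query) is a hypothetical
    stand-in for the learned attention scoring."""
    weights = softmax([score_fn(item, query) for item in items])
    dim = len(items[0])
    return [sum(a * item[k] for a, item in zip(weights, items)) for k in range(dim)]

def locate(V_a, V_o, h_en, score_fn):
    """AoS over the regions of each frame, concatenate with V_a, then AoT."""
    per_frame_obj = [attend(regions, h_en, score_fn) for regions in V_o]
    fused = [a + o for a, o in zip(V_a, per_frame_obj)]  # list concatenation
    return attend(fused, h_en, score_fn)

def relate(V_o, h_en, score_fn):
    """AoS per frame, pair up the per-frame results, then AoT over the pairs.
    Needs at least two frames. Using ordered pairs of distinct frames is an
    assumption of this sketch."""
    per_frame_obj = [attend(regions, h_en, score_fn) for regions in V_o]
    pairs = [x + y
             for i, x in enumerate(per_frame_obj)
             for j, y in enumerate(per_frame_obj) if i != j]
    return attend(pairs, h_en, score_fn)

def func(cell_history, h_en, score_fn):
    """AoT over the history of decoder cell states; no visual input."""
    return attend(cell_history, h_en, score_fn)

def toy_score(item, query):
    """Toy scoring used only to make the example run (not a learned function)."""
    return 0.1 * sum(item) * sum(query)

V_a = [[1.0, 0.0], [0.0, 1.0]]               # 2 frames, 2-d appearance features
V_o = [[[1.0], [3.0]], [[2.0], [4.0]]]       # 2 frames, 2 regions, 1-d features
cells = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]   # 2 past decoder cell states
h_en = [1.0, 1.0]                            # query

v_loc = locate(V_a, V_o, h_en, toy_score)    # length 2 + 1 = 3
v_rel = relate(V_o, h_en, toy_score)         # length 1 + 1 = 2
v_func = func(cells, h_en, toy_score)        # length 3
```

In this toy setup the three outputs have different lengths. A real model would have to bring them to a common size before they can be combined by the selector; how that is done is not covered in this post.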
2.3 Module Selector
At each generation step, the word is produced by exactly one of the three modules above, so a selection module is needed. Concretely, each module is scored and the one with the highest score is chosen. The scoring function is designed as follows:
However, because the max function is non-differentiable, the authors use an approximation to convert the one-hot vector $z_t$ into a continuous vector $\tilde{z}_t$.
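The post does not name the approximation. The paper's abstract mentions a Gumbel approximation, so the sketch below assumes a Gumbel-Softmax relaxation. The temperature `tau`, the example scores, and the function name are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax(scores, tau=1.0, rng=random):
    """Return (soft, hard): a relaxed sample and the one-hot vector of its argmax."""
    noisy = []
    for s in scores:
        u = max(rng.random(), 1e-12)       # guard against log(0)
        gumbel = -math.log(-math.log(u))   # Gumbel(0, 1) noise
        noisy.append((s + gumbel) / tau)
    soft = softmax(noisy)
    best = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == best else 0.0 for i in range(len(soft))]
    return soft, hard

rng = random.Random(0)
module_scores = [2.0, 0.5, -1.0]  # hypothetical scores for Locate, Relate, Func
soft, hard = gumbel_softmax(module_scores, tau=0.5, rng=rng)
```

In an autograd framework, a common pattern is the straight-through trick: use `hard` in the forward pass and let gradients flow through `soft`. Plain Python has no gradients, so only the forward computation is shown here.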
The final result of visual reasoning is:
Here $\otimes$ denotes the inner product.
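The formula itself is missing from this post. Given the relaxed selection vector $\tilde{z}_t$ and the inner product mentioned above, one plausible reading is a weighted sum of the three module outputs. The names $v_t^{l}$, $v_t^{r}$, and $v_t^{f}$ for the Locate, Relate, and Func outputs are notation assumed here, not taken from the post:

$$
v_t = \tilde{z}_t \otimes \left[ v_t^{l},\; v_t^{r},\; v_t^{f} \right]
    = \tilde{z}_{t,1}\, v_t^{l} + \tilde{z}_{t,2}\, v_t^{r} + \tilde{z}_{t,3}\, v_t^{f}
$$

When $\tilde{z}_t$ is exactly one-hot, this reduces to picking the output of a single module.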
2.4 Decoder
A single LSTM is used for decoding. Its inputs are the visual reasoning result $v_t$ and the encoder's hidden state.
The visual information and the hidden state are then passed through an MLP, which outputs a probability distribution over the vocabulary, from which the generated word is obtained.
2.5 Training
- Caption loss: cross-entropy loss
Measures the accuracy of the generated sentence.
$T$ denotes the length of the sentence.
- POS loss: KLD loss
Measures the accuracy of the selection module. Specifically, the sentence's POS (part-of-speech) tags are converted to one-hot codes, and a KLD (Kullback-Leibler divergence) loss measures the similarity between the two distributions. The actual implementation in the code also uses a cross-entropy loss.
- The final loss
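The loss formulas are not reproduced in this post, so the sketch below only illustrates the ideas described above. It shows that, for a one-hot target, the KL divergence equals the cross-entropy, which is consistent with the code using cross-entropy for the POS loss. Summing the caption loss over the $T$ steps and combining the two losses as a weighted sum with a hypothetical weight `lam` are assumptions of this example.

```python
import math

def cross_entropy(pred, target_index):
    """Negative log-probability assigned to the target class."""
    return -math.log(pred[target_index])

def kl_divergence(target, pred):
    """KL(target || pred), with the convention 0 * log(0 / q) = 0."""
    return sum(p * math.log(p / q) for p, q in zip(target, pred) if p > 0.0)

def caption_loss(step_probs, word_ids):
    """Sum of per-step cross-entropies over a sentence of length T."""
    return sum(cross_entropy(p, w) for p, w in zip(step_probs, word_ids))

# POS loss for one step: the selector's distribution over the three modules
# against a one-hot POS target.
selector_probs = [0.7, 0.2, 0.1]
pos_one_hot = [1.0, 0.0, 0.0]
kl = kl_divergence(pos_one_hot, selector_probs)
ce = cross_entropy(selector_probs, 0)   # same value as kl for a one-hot target

# Caption loss for a toy 2-word sentence over a 3-word vocabulary.
step_probs = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
word_ids = [0, 1]
cap = caption_loss(step_probs, word_ids)

lam = 0.1                 # hypothetical weight; not taken from the post
total = cap + lam * kl    # assumed form of the combined loss
```

The equality holds because a one-hot target has zero entropy, so the KL divergence reduces to the negative log-probability of the target class.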