Multimodal: Learnable pooling with context gating for video classification
2022-06-29 06:59:00 【Programmers who only know git clone】
Preface
Paper: arxiv
Code: github
This is a video understanding paper. I file it under multimodal learning because the architecture combines video embeddings, audio embeddings and similar inputs, so it can fairly be called multimodal fusion.
Note: the paper won first place in the Youtube 8M Kaggle Large-Scale Video Understanding challenge. Competition address: portal.
Series articles
Updating …
Motivation
Competition papers rarely seem to spell out a motivation. After reading the introduction, my impression is that the authors were mostly inspired by other papers, tried a particular structure in this domain, and adopted it once they found that it worked.
- The competition provides the frame-level image features and the audio features for each video, so this paper makes no contribution to feature extraction.
- Given that, the paper mainly contributes to feature fusion. Earlier methods mostly modeled the features as a time series with an LSTM or GRU. Other methods skip temporal modeling and aggregate directly with a simple sum or mean, or with something more elaborate such as BOW or VLAD. The authors follow this second line of work (BOW, VLAD and so on).
- Inspired by the gating units in LSTM and GRU, the authors design a video classification architecture that combines non-temporal aggregation with a gating mechanism. This is the context gating layer described below.
Structure

The complete structure is quite simple. For ease of understanding, I describe it module by module: first the overall structure, then the implementation of each module.
1. First come the light-blue video features. These are the frame-level image features provided by the competition. For example, suppose each video is sampled at a fixed 10 frames and a typical resnet50 is used to extract image features, which are usually 2048-dimensional vectors. The video features then have shape (10, 2048).
2. The audio features are usually an embedding of the whole audio track. When the audio is long, features may be extracted segment by segment. Suppose the audio is split into 5 segments and each segment yields a 1024-dimensional feature. We then get features of shape (5, 1024).
3. The green learnable pooling part is where, as mentioned under Motivation, the authors tried several aggregation methods such as BOW, VLAD and NetVLAD. The input to this module has shape (N, D) and the output has shape (1, d). The lowercase d signals that the output dimension need not equal the input dimension and can be anything. The point is that N features are merged into 1.
4. The (1, d_v) feature from the image pooling module and the (1, d_a) feature from the audio pooling module are then concatenated along the last dimension, giving a fused video-audio feature of shape (1, d_a+d_v).
5. The concatenated feature is fed to the green FC layer in the figure. After reading the authors' source code, I found that this FC layer is a fully connected layer followed by a BN layer and an activation function (FC+BN+relu6).
6. Next comes the context gating layer, which borrows from the GLU gating layer. From the paper: first, the authors want to introduce nonlinear interactions among the activations of the input representation; second, they want to recalibrate the strengths of the different input activations through a self-gating mechanism. In plain words: a gate computed from the input decides how much of each input feature is retained.
7. The output of context gating goes into the MOE. MOE is easy to understand: the input is used to compute weights, and the outputs of the different experts are combined using those weights.
8. The MOE output is followed by another context gating layer.
9. Finally, the output is used for classification and the loss is computed.
Example code for the structure as I implemented it:
It follows the structure of the paper essentially 1:1.
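As a rough companion to steps 1 to 9, here is a minimal end-to-end sketch of the data flow. It is not the implementation referred to above and not the authors' code: `MeanPool` is a hypothetical stand-in for the learnable pooling block (mean over N followed by a linear projection), the mixture of experts is simplified to one linear expert per slot, and every dimension (`d_v`, `d_a`, `hidden`, `num_classes`, `num_experts`) is an assumed value chosen only to make the shapes concrete.

```python
import torch
import torch.nn as nn


class MeanPool(nn.Module):
    """Hypothetical placeholder for learnable pooling: (B, N, D) -> (B, d)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(x.mean(dim=1))  # average over the N frames/segments


class ContextGating(nn.Module):
    """Element-wise gate: Y = sigmoid(W X + b) * X."""

    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.sigmoid(self.fc(x)) * x


class VideoClassifier(nn.Module):
    # All sizes below are assumptions for illustration only.
    def __init__(self, d_video=2048, d_audio=1024, d_v=256, d_a=128,
                 hidden=512, num_classes=20, num_experts=2):
        super().__init__()
        self.video_pool = MeanPool(d_video, d_v)
        self.audio_pool = MeanPool(d_audio, d_a)
        self.fc = nn.Sequential(
            nn.Linear(d_v + d_a, hidden), nn.BatchNorm1d(hidden), nn.ReLU6())
        self.gate1 = ContextGating(hidden)
        # Simplified mixture of experts: linear experts plus a softmax gate.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, num_classes) for _ in range(num_experts)])
        self.moe_gate = nn.Linear(hidden, num_experts)
        self.gate2 = ContextGating(num_classes)

    def forward(self, video, audio):
        # Steps 1-4: pool each modality, then concatenate on the last dimension.
        fused = torch.cat([self.video_pool(video), self.audio_pool(audio)], dim=-1)
        # Steps 5-6: FC + BN + ReLU6, then context gating.
        h = self.gate1(self.fc(fused))
        # Step 7: weight the expert outputs by the gate scores.
        weights = self.moe_gate(h).softmax(dim=-1)                          # (B, E)
        outs = torch.stack([e(h).sigmoid() for e in self.experts], dim=1)   # (B, E, C)
        probs = (weights.unsqueeze(-1) * outs).sum(dim=1)                   # (B, C)
        # Step 8: second context gating on the class scores.
        return self.gate2(probs)


model = VideoClassifier()
video = torch.randn(4, 10, 2048)  # 4 videos, 10 frames, 2048-d frame features
audio = torch.randn(4, 5, 1024)   # 4 videos, 5 audio segments, 1024-d features
scores = model(video, audio)
print(scores.shape)  # torch.Size([4, 20])
```

Step 9 would then apply a classification loss to `scores`; which loss to use depends on whether the task is single-label or multi-label, so it is left out here.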
Module details
NetVLAD
These two write-ups explain it quite clearly:
1. Zhihu: NetVLAD
2. Paper notes: NetVLAD: CNN architecture for weakly supervised place recognition
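For readers who want something concrete before following those links, here is a minimal sketch of a NetVLAD-style pooling layer that maps (N, D) features to a single vector. It is my own compact version of the standard formulation (soft assignment, weighted residuals to learnable centers, intra-normalization, final L2 normalization), not code from the paper's repository; the class name, the center initialization and `num_clusters` are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLAD(nn.Module):
    """Aggregate N local descriptors of size D into one vector of size K*D."""

    def __init__(self, dim, num_clusters):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits
        # Learnable cluster centers; the 0.1 scale is an arbitrary choice.
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.1)

    def forward(self, x):                             # x: (B, N, D)
        a = self.assign(x).softmax(dim=-1)            # (B, N, K) soft assignments
        # sum_n a[n, k] * (x[n] - c[k]) = (a^T x)[k] - (sum_n a[n, k]) * c[k]
        vlad = torch.einsum("bnk,bnd->bkd", a, x)     # (B, K, D)
        vlad = vlad - a.sum(dim=1).unsqueeze(-1) * self.centers
        vlad = F.normalize(vlad, dim=-1)              # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)   # (B, K*D), L2-normalized


pool = NetVLAD(dim=32, num_clusters=4)
frames = torch.randn(2, 10, 32)   # 2 videos, N=10 frames, D=32
pooled = pool(frames)
print(pooled.shape)  # torch.Size([2, 128])
```

Note that the output size is K*D, which grows quickly with the number of clusters, so a dimension-reducing layer after the pooling is usually needed.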
FC layer
Go straight to the code :
# Inside the module's __init__; assumes `import torch.nn as nn`
self.MLP = nn.Sequential(
    nn.Linear(in_dim, out_dim),
    nn.BatchNorm1d(out_dim),
    nn.ReLU6()
)
Context gating
The formula:
$$Y = f(WX + b) \circ X$$
X is the input feature, WX+b passes X through a fully connected layer, and f is a nonlinear function such as sigmoid or relu (the authors use sigmoid). The symbol $\circ$ denotes element-wise multiplication, so the output Y has the same shape as X.
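To make the gate concrete, here is a small functional sketch of Y = sigmoid(WX + b) * X with two hand-checkable cases. The function name `context_gating` and the toy values are mine, chosen only for illustration.

```python
import torch


def context_gating(x, weight, bias):
    """x: (B, D), weight: (D, D), bias: (D,). Returns a tensor shaped like x."""
    gate = torch.sigmoid(x @ weight.T + bias)  # one gate value in (0, 1) per feature
    return gate * x                            # element-wise rescaling of the input


x = torch.tensor([[1.0, -2.0, 3.0]])
zero_w = torch.zeros(3, 3)

# With W = 0 and b = 0 every gate is sigmoid(0) = 0.5, so the input is halved.
half = context_gating(x, zero_w, torch.zeros(3))
print(half)  # tensor([[ 0.5000, -1.0000,  1.5000]])

# With a large positive bias every gate is close to 1, so the input passes through.
open_gate = context_gating(x, zero_w, torch.full((3,), 20.0))
```

Because the gate only rescales each feature by a factor between 0 and 1, the layer can suppress features but never amplify them or flip their sign.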
MOE

MOE feeds the input into n experts. Every expert has the same structure but different parameters, so we get n outputs. The input also goes through a fully connected layer that outputs n scores, and these scores are used to weight the outputs of the n experts to get the final output.
import torch
import torch.nn as nn


class Expert_model(nn.Module):
    """A single expert: a two-layer MLP that outputs log-probabilities."""

    def __init__(self, input_size, output_size, hidden_size):
        super(Expert_model, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.log_soft = nn.LogSoftmax(1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.log_soft(out)
        return out


class MOE(nn.Module):
    """Mixture of experts: gate-weighted sum of the expert outputs."""

    def __init__(self, input_size, output_size, expert_num, hidden_size=64):
        super().__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.expert_num = expert_num
        self.hidden_size = hidden_size
        self.experts = nn.ModuleList(
            [Expert_model(self.input_size, self.output_size, self.hidden_size)
             for i in range(self.expert_num)])
        self.w_gate = nn.Linear(self.input_size, self.expert_num)

    def forward(self, x):
        # I like to add a sigmoid to squash the gate output into the 0-1 range,
        # which can help prevent vanishing gradients. Side effect: the softmax
        # inputs lie in (0, 1), so no gate weight can exceed another by more
        # than a factor of e.
        gate_weight = self.w_gate(x).sigmoid().softmax(dim=-1)  # bs, expert_num
        expert_outputs = [self.experts[i](x) for i in range(self.expert_num)]
        expert_outputs = torch.stack(expert_outputs)  # expert_num, bs, dim
        gate_weight_expert = torch.einsum("bn,nbd->bd", gate_weight, expert_outputs)
        return gate_weight_expert
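The einsum in `forward` is the least obvious line, so here is a small self-contained check that `"bn,nbd->bd"` really is the per-sample weighted sum over experts. The tensor sizes are arbitrary values picked for the example, and the random tensors merely stand in for real gate scores and expert outputs.

```python
import torch

torch.manual_seed(0)
bs, expert_num, dim = 4, 3, 5

gate_weight = torch.rand(bs, expert_num).softmax(dim=-1)  # (bs, expert_num), rows sum to 1
expert_outputs = torch.randn(expert_num, bs, dim)         # (expert_num, bs, dim)

# Compact form used in the MOE module.
mixed = torch.einsum("bn,nbd->bd", gate_weight, expert_outputs)

# Reference: accumulate weight * output one expert at a time.
reference = torch.zeros(bs, dim)
for n in range(expert_num):
    reference += gate_weight[:, n].unsqueeze(-1) * expert_outputs[n]

print(torch.allclose(mixed, reference, atol=1e-6))  # True
```

Stacking the expert outputs on dimension 0 is what forces the `nbd` index order; stacking on dimension 1 instead would give `bnd` and the equivalent expression `"bn,bnd->bd"`.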
Experiment
That is the complete structure of the paper. In my experiments it gave some improvement, but the training and validation loss curves revealed a serious problem: this structure overfits very easily.
So I tried adding a dropout layer to the FC layers in the MOE, and also after the other FC layers. Both were effective: overfitting was clearly reduced, and the metrics still improved somewhat.
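A minimal sketch of what adding dropout after an FC layer can look like, using the FC+BN+relu6 block from earlier as the example. The helper name `fc_block` and the dropout probability `p_drop=0.5` are assumptions for illustration; the right rate needs tuning on your own data.

```python
import torch
import torch.nn as nn


def fc_block(in_dim, out_dim, p_drop=0.5):
    """FC + BN + ReLU6 followed by dropout (p_drop is an assumed value)."""
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.ReLU6(),
        nn.Dropout(p_drop),
    )


block = fc_block(16, 8)
x = torch.randn(4, 16)

block.train()          # dropout active: random units are zeroed, the rest rescaled
y_train = block(x)

block.eval()           # dropout disabled: the output is deterministic
y_eval = block(x)
print(y_train.shape, y_eval.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

Remember to call `eval()` before validation, otherwise dropout stays active and the validation loss is noisier than it should be.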