Multimodal learning pooling with context gating for video classification

2022-06-29 06:59:00 【Programmers who only know git clone】

Preface

Paper: arXiv
Code: GitHub

This is a paper on video understanding. I file it under multimodal learning because the architecture combines video embeddings, audio embeddings and so on, so it can be regarded as multimodal fusion.

Note: the paper won the YouTube-8M Kaggle Large-Scale Video Understanding challenge. Competition page: Portal.

Series articles

Updating …

Motivation

Competition papers rarely seem to state a motivation? After reading the introduction, my impression is that the authors were inspired by other papers, tried a certain structure in this domain, found that it worked, and used it.

The competition provides per-frame image features and audio features for each video, so the paper makes no contribution to feature extraction.
Given that, the paper's main contribution is feature fusion. Earlier methods mostly model the temporal sequence with an LSTM or GRU. Other methods skip temporal modeling and aggregate the features directly, either with a simple sum or mean or with something more elaborate such as BoW or VLAD. The authors mainly follow the BoW/VLAD line.
Inspired by the gating units in LSTM and GRU, the authors design a video classification architecture that combines non-temporal aggregation with a gating mechanism. This is the context gating layer described below.

Structure

[Figure: overall architecture of the model]
The complete structure is very simple. For ease of understanding, this post presents the paper's structure module by module: first the overall structure, then the implementation of each module.

1. First, the light-blue video features. These are the per-frame image features provided by the competition. For example, if every video is sampled to a fixed 10 frames and the image features come from a typical ResNet-50, each frame gives a 2048-dimensional vector, so the video features have shape (10, 2048).
2. The audio features are usually an embedding of the whole audio track. When the audio is long, features may be extracted segment by segment. Suppose the audio is split into 5 segments and each segment yields a 1024-dimensional feature; we then get features of shape (5, 1024).
3. The green learnable pooling part is where, as mentioned in the motivation, the authors tried several feature aggregation methods such as BoW, VLAD and NetVLAD. The module takes an input of shape (N, D) and produces an output of shape (1, d). The lowercase d indicates that the output dimension need not equal the input dimension and can be chosen freely. In other words, the module merges N features into one.
4. The video pooling module therefore yields a feature of shape (1, d_v) and the audio pooling module a feature of shape (1, d_a). Concatenating them along the last dimension gives the fused video-audio feature of shape (1, d_a + d_v).
5. The concatenated feature is fed into the green FC layer in the figure. According to the authors' source code, this FC layer is a fully connected layer followed by a BN layer and an activation function (FC + BN + ReLU6).
6. Next comes the context gating layer, which is essentially borrowed from the GLU gating layer. In the paper's words: first, the authors want to introduce nonlinear interactions among the activations of the input representation; second, they want to recalibrate the strengths of the different input activations through a self-gating mechanism. In plain words: use an activation function to decide how much of each input feature to keep.
7. The output of context gating goes into the MoE. An MoE is easy to understand: it uses the input to weight the outputs of the different experts.
8. The MoE output goes through another context gating layer.
9. Finally, the output features are used for classification and to compute the loss.

My implementation of the structure:
[Figure: screenshot of my code for the overall structure]
It follows the structure of the paper essentially 1:1.
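The nine steps above can be wired together roughly as follows. This is only a minimal sketch, not the paper's code and not a copy of my implementation from the screenshot: `MeanPool` is a hypothetical stand-in for the learnable pooling module (mean over the N features plus a linear projection), `TinyMoE` is a simplified mixture of experts with linear experts, and all dimensions (`d_v=256`, `d_a=128`, `hidden=512`, 10 classes, 4 experts) are made-up values for illustration.

```python
import torch
import torch.nn as nn


class MeanPool(nn.Module):
    """Stand-in for learnable pooling (assumption): (B, N, D) -> (B, d)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(x.mean(dim=1))  # average the N features, then project


class ContextGating(nn.Module):
    """y = sigmoid(Wx + b) * x, applied element-wise."""

    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.sigmoid(self.fc(x)) * x


class TinyMoE(nn.Module):
    """Simplified mixture of experts with linear experts (assumption)."""

    def __init__(self, in_dim, num_classes, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, num_classes) for _ in range(num_experts)])
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        weights = self.gate(x).softmax(dim=-1)                      # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, C)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)         # (B, C)


class VideoClassifier(nn.Module):
    def __init__(self, d_video=2048, d_audio=1024, d_v=256, d_a=128,
                 hidden=512, num_classes=10, num_experts=4):
        super().__init__()
        self.video_pool = MeanPool(d_video, d_v)           # step 3 (video)
        self.audio_pool = MeanPool(d_audio, d_a)           # step 3 (audio)
        self.fc = nn.Sequential(                           # step 5
            nn.Linear(d_v + d_a, hidden), nn.BatchNorm1d(hidden), nn.ReLU6())
        self.gate1 = ContextGating(hidden)                 # step 6
        self.moe = TinyMoE(hidden, num_classes, num_experts)  # step 7
        self.gate2 = ContextGating(num_classes)            # step 8

    def forward(self, video, audio):
        # step 4: concatenate the pooled features along the last dimension
        fused = torch.cat(
            [self.video_pool(video), self.audio_pool(audio)], dim=-1)
        x = self.gate1(self.fc(fused))
        return self.gate2(self.moe(x))  # step 9 would apply the loss to this


model = VideoClassifier().eval()
video = torch.randn(2, 10, 2048)  # batch of 2 videos, 10 frames each
audio = torch.randn(2, 5, 1024)   # batch of 2 audio tracks, 5 segments each
with torch.no_grad():
    scores = model(video, audio)
print(scores.shape)  # torch.Size([2, 10])
```

Swapping `MeanPool` for a NetVLAD-style module only changes the pooling step; everything after the concatenation stays the same.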

Module details

NetVLAD

These two write-ups explain it quite clearly:
1. Zhihu: NetVLAD
2. Paper notes: NetVLAD: CNN architecture for weakly supervised place recognition
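For readers who want the idea in code rather than in a write-up, here is a minimal NetVLAD-style pooling sketch. It is an illustration of the general technique (soft assignment to K learnable cluster centers, aggregation of residuals, then normalization), not the code from the paper's repository. The class name, the argument names and the choice of 8 clusters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLAD(nn.Module):
    """NetVLAD-style pooling sketch: (B, N, D) -> (B, K * D)."""

    def __init__(self, feature_dim, num_clusters):
        super().__init__()
        # Linear layer that produces the soft-assignment logits per feature.
        self.assign = nn.Linear(feature_dim, num_clusters)
        # Learnable cluster centers c_k, one D-dimensional vector per cluster.
        self.centers = nn.Parameter(
            torch.randn(num_clusters, feature_dim) * 0.01)

    def forward(self, x):
        soft_assign = self.assign(x).softmax(dim=-1)             # (B, N, K)
        # sum_n a_nk * x_n for every cluster k
        weighted_sum = torch.einsum("bnk,bnd->bkd", soft_assign, x)
        # sum_n a_nk * c_k for every cluster k
        assign_mass = soft_assign.sum(dim=1).unsqueeze(-1)       # (B, K, 1)
        vlad = weighted_sum - assign_mass * self.centers         # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)      # L2-normalize each cluster row
        vlad = vlad.flatten(start_dim=1)      # (B, K * D)
        return F.normalize(vlad, dim=-1)      # L2-normalize the whole vector


pool = NetVLAD(feature_dim=2048, num_clusters=8)
frames = torch.randn(2, 10, 2048)  # 2 videos, 10 frames, 2048-dim features
pooled = pool(frames)
print(pooled.shape)  # torch.Size([2, 16384])
```

The output dimension K * D is large, which is why a fully connected layer that reduces it to a smaller d usually follows the pooling.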

FC layer

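Straight to the code:

# Inside the module's __init__ (requires: import torch.nn as nn)
self.MLP = nn.Sequential(
    nn.Linear(in_dim, out_dim),  # fully connected projection
    nn.BatchNorm1d(out_dim),     # batch norm over the feature dimension
    nn.ReLU6(),                  # ReLU clipped to the range [0, 6]
)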

Context gating

The formula:
$$Y = f(WX + b) \circ X$$
X is the input feature, WX + b passes X through a fully connected layer, $\circ$ denotes element-wise multiplication, and f is a nonlinear function such as sigmoid or ReLU. The authors use sigmoid.
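As a minimal sketch of this layer, assuming the sigmoid gate and a square fully connected layer so that the gate has the same shape as the input (the class name `ContextGating` and the size 8 are my own choices for illustration):

```python
import torch
import torch.nn as nn


class ContextGating(nn.Module):
    """Context gating sketch: Y = sigmoid(WX + b) * X (element-wise)."""

    def __init__(self, dim):
        super().__init__()
        # W and b: a square FC layer, so the gate has the same shape as X.
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.fc(x))  # per-feature weights between 0 and 1
        return gate * x                   # rescale each input feature


torch.manual_seed(0)
layer = ContextGating(8)
x = torch.randn(4, 8)
y = layer(x)
print(y.shape)  # torch.Size([4, 8])
```

Because every gate value lies between 0 and 1, the layer can only shrink an input feature, never enlarge it or flip its sign. That matches the plain-words reading: it decides how much of each feature to keep.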

MoE

[Figure: MoE structure]
An MoE feeds the input into n experts. Every expert has the same structure but its own parameters, which gives n outputs. The input also goes through a fully connected layer that outputs n scores, and these scores weight the n expert outputs to produce the final output.
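Written as a formula, the description above amounts to a gate-weighted sum. The symbols $e_i$, $g_i$, $W_g$ and $b_g$ are my own notation, not the paper's, and the softmax is the usual way to turn the n scores into weights; the code that follows additionally applies a sigmoid before the softmax.

```latex
y = \sum_{i=1}^{n} g_i(x)\, e_i(x),
\qquad
g(x) = \operatorname{softmax}(W_g x + b_g) \in \mathbb{R}^{n}
```

Here $e_i(x)$ is the output of the i-th expert and $g_i(x)$ is the weight the gate assigns to it, so the weights of one sample sum to 1.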

import torch
import torch.nn as nn


class Expert_model(nn.Module):
    """A single expert: a two-layer MLP that outputs log-probabilities."""

    def __init__(self, input_size, output_size, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.log_soft = nn.LogSoftmax(dim=1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.log_soft(out)
        return out


class MOE(nn.Module):
    """Mixture of experts: a gate-weighted sum of the expert outputs."""

    def __init__(self, input_size, output_size, expert_num, hidden_size=64):
        super().__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.expert_num = expert_num
        self.hidden_size = hidden_size

        # All experts share one structure but have their own parameters.
        self.experts = nn.ModuleList(
            [Expert_model(input_size, output_size, hidden_size) for _ in range(expert_num)])

        # The gate maps the input to one score per expert.
        self.w_gate = nn.Linear(input_size, expert_num)

    def forward(self, x):
        # I like to add a sigmoid that squashes the gate scores into (0, 1)
        # before the softmax, which can prevent vanishing gradients.
        gate_weight = self.w_gate(x).sigmoid().softmax(dim=-1)  # (bs, expert_num)
        expert_outputs = torch.stack(
            [expert(x) for expert in self.experts])  # (expert_num, bs, dim)

        # Weighted sum over the experts for every sample in the batch.
        return torch.einsum("bn,nbd->bd", gate_weight, expert_outputs)

Experiment

That is the complete structure of the paper. In my experiments it brings some improvement, but the training and validation loss curves reveal a serious problem: this structure overfits very easily.
So I tried adding a dropout layer to the FC layers inside the MoE and after the other FC layers. Both help: overfitting is clearly reduced, and the metrics still improve somewhat.
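A minimal sketch of what adding dropout after an FC layer can look like, reusing the FC + BN + ReLU6 block from above. The helper name `fc_block`, the placement of the dropout after the activation and the rate `p_drop=0.5` are assumptions for illustration; the rate in particular needs tuning on the validation set.

```python
import torch
import torch.nn as nn


def fc_block(in_dim, out_dim, p_drop=0.5):
    """FC + BN + ReLU6 followed by dropout (rate is an assumed value)."""
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.ReLU6(),
        nn.Dropout(p_drop),  # randomly zeroes activations during training
    )


block = fc_block(384, 512)
x = torch.randn(4, 384)

block.train()
train_out = block(x)  # dropout is active: some activations are zeroed

block.eval()
with torch.no_grad():
    eval_out = block(x)  # dropout is a no-op in eval mode
```

Remember to call `eval()` before validation, otherwise dropout stays active and the validation loss is not comparable with the training loss.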

Copyright notice
This article was written by [Programmers who only know git clone]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/180/202206290514530612.html