Inter-Class and Intra-Class Relationships in Video Classification -- Regularization
2022-06-12 11:38:00 【Large roasted wings】
Abstract
This paper proposes a new framework that jointly learns feature relationships and class relationships to improve video classification performance. Specifically, both types of relationships are learned and exploited through regularization rigorously imposed on a deep neural network (DNN).
Contributions:
- We propose applying structural regularization in the fusion layer of a DNN to identify the correlations among multiple features while preserving their diversity. This capability distinguishes the proposed method from most existing work, which usually adopts a shallow fusion process without exploring feature correlations in depth.
- We also propose applying similar structural regularization to the output layer of the DNN to explore inter-class relationships. Inter-feature and inter-class relationships are therefore formulated and explored in a unified framework, which can be easily implemented on GPUs and trained at an affordable time cost.
- We provide extensive empirical evaluations to confirm the effectiveness of the proposed framework in detail, achieving the best performance to date on widely used real-world benchmarks.
Introduction
Given the limitations of existing work, this paper presents a unified deep neural network (DNN) framework that jointly learns feature relationships and class relationships, while using the learned relationships to classify videos within the same framework. Figure 1 gives a conceptual diagram of the proposed method.
First, various video features are extracted, including local visual descriptors and audio descriptors. These features are then used as the DNN input; the first two layers of the network are the input layer and the feature transformation layer. The third layer, called the fusion layer, applies structural regularization to the network weights to identify and exploit feature relationships. Specifically, the regularization term is designed based on two natural properties of inter-feature relationships: correlation and diversity. The former means that different features may share common patterns in the intermediate representations between the raw features and the high-level semantics. The latter emphasizes the uniqueness of each feature, which serves as complementary information for predicting video semantics. These two properties are modeled with a feature correlation matrix, and trace-norm regularization on the fusion weights reveals the hidden correlation and diversity among the features.
For inter-class relationships, we regularize the weights of the final output layer to automatically identify grouping structures of the video classes as well as outlier classes. Semantic classes in the same group share commonalities or correlations that can be exploited through knowledge sharing to improve classification performance, while outliers should be excluded to avoid negative knowledge sharing. We show that by applying a similar trace-norm-based regularization to the output-layer weights, we can effectively explore such complex inter-class relationships and produce better video classification results.
Note that raw video data could be used as input instead of hand-crafted features, as in recent work on image classification with deep learning [22]; in that case, convolutional neural networks (CNNs) can extract features from the raw data. There are two reasons for using hand-crafted features in our proposed framework. First, hand-crafted features have been widely used in video classification and remain core components of video analysis systems that have produced state-of-the-art results in human action recognition and event recognition. Using these features makes it easy to compare fairly with traditional semantic classification methods (e.g., the popular SVM classifier). Second, extracting features with neural networks requires more layers of neurons, which introduce many additional parameters to tune and thus require more training data.
Note that in many video classification tasks, the amount of available training data is far from sufficient to train neural networks with many layers. Therefore, in this paper we propose a regularized DNN for feature fusion and video semantic classification. To our knowledge, this work is the first attempt to capture feature and class relationships with DNNs for video classification.

Figure 1: Overview of the DNN-based video classification framework. We first extract various visual/audio features, which are then used as DNN inputs. Before fusion, a layer of neurons transforms (abstracts) each feature. In the fusion layer, the network parameters are regularized so that different features can share correlated dimensions while preserving their unique characteristics. As shown in the figure, some dimensions of different features may be highly correlated (thick lines pointing to the same neuron). The weights between the fusion layer and the output layer are then regularized to identify class groups. The learned inter-feature and inter-class relationships are used to improve classification performance.
3 Methodology
3.1 Notation and Problem Description
Suppose we are given a training set of N video clips associated with C semantic classes. Each video clip is represented by M different features, e.g., various visual and audio descriptors. We can therefore represent each training sample as an (M+1)-tuple:

$$\big(\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)}, \mathbf{y}_n\big), \quad n = 1, \ldots, N,$$

where $\mathbf{x}_n^{(m)}$ denotes the m-th feature representation of the n-th video sample. If the n-th video sample is associated with the c-th semantic class, $\mathbf{y}_n \in \{0, 1\}^C$ is the corresponding semantic label vector, whose c-th element $y_{nc} = 1$.
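To make this notation concrete, here is a minimal sketch of how such a training set of (M+1)-tuples might be laid out. The dimensions and the round-robin label assignment are made up for illustration; the paper does not fix them.

```python
import numpy as np

# Illustrative sizes (not from the paper): N clips, C classes, M feature types.
N, C, M = 4, 3, 2
feature_dims = [5, 7]            # each feature type may have its own dimensionality
rng = np.random.default_rng(0)

samples = []
for n in range(N):
    feats = [rng.standard_normal(d) for d in feature_dims]  # x_n^(1), ..., x_n^(M)
    y = np.zeros(C)
    y[n % C] = 1.0               # y_nc = 1 for the associated class c
    samples.append((*feats, y))  # the (M+1)-tuple for sample n
```

Each element of `samples` then holds M feature vectors plus one C-dimensional one-hot label vector.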
The goal is to train prediction models that classify new test videos. A simple approach is to train one classifier per semantic class and combine the different features using early- or late-fusion schemes. However, such an independent training strategy explores neither inter-feature nor inter-class relationships. We therefore propose a DNN-based video classification model that realizes feature sharing in the fusion layer by exploring the correlation and diversity of multiple features, as shown in Figure 1. In addition, the prediction layer of our deep neural network is also regularized to enhance knowledge sharing among different classes. Both relationships are thus explicitly explored in a unified learning process. Below, we start with a standard single-feature DNN and then introduce the details of our proposed regularized DNN.
3.2 Single-Feature DNN Learning
Inspired by biological nervous systems, a DNN builds complex computational models from a large number of interconnected neurons. Organizing neurons in multiple layers gives the network strong nonlinear abstraction ability: given enough training data, it can learn an arbitrary mapping from input to output. Below, we briefly review a standard DNN with only one feature as input, i.e., M = 1. For a DNN with L layers in total, let $\mathbf{h}^{(l)}$ denote the output of the l-th layer for the single input feature $\mathbf{x}_n$ (with $\mathbf{h}^{(0)} = \mathbf{x}_n$), and let $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ denote the weight matrix and bias vector of the l-th layer, respectively. The transition from layer l-1 to layer l can then be written as:

$$\mathbf{h}^{(l)} = \sigma\big(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\big), \quad l = 1, \ldots, L, \tag{1}$$
where σ(·) is a nonlinear sigmoid function, usually defined as:

$$\sigma(a) = \frac{1}{1 + e^{-a}}.$$
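The layer transition of Eq. (1) and the sigmoid translate directly into code. A minimal NumPy sketch (toy dimensions, chosen here for illustration):

```python
import numpy as np

def sigmoid(a):
    """Nonlinear sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def layer_forward(W, b, h_prev):
    """Layer transition of Eq. (1): h^(l) = sigma(W^(l) h^(l-1) + b^(l))."""
    return sigmoid(W @ h_prev + b)

# Toy forward pass through a small single-feature network.
rng = np.random.default_rng(0)
dims = [6, 4, 3]                      # input dim, hidden dim, C output units
h = rng.standard_normal(dims[0])
for d_in, d_out in zip(dims[:-1], dims[1:]):
    W = 0.1 * rng.standard_normal((d_out, d_in))
    h = layer_forward(W, np.zeros(d_out), h)
```

Since every layer ends with the sigmoid, the final output lies strictly in (0, 1), which matches its use as a per-class score.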
Figures 2(a) and (b) show two types of four-layer neural networks that use a single feature as input. To derive the optimal weights of each layer, the following optimization problem can be formulated:

$$\min_{\{\mathbf{W}^{(l)}\}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n)\big) + \lambda \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2, \tag{2}$$

where the first term measures the empirical loss on the training data by aggregating the differences between the network outputs $f(\mathbf{x}_n)$ and the ground-truth labels $\mathbf{y}_n$, and the second term is a regularization term that prevents overfitting. For simplicity, we can append an extra constant-valued dimension to the feature vectors to absorb the bias b into the weight coefficients W.
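The two-term objective and the bias-absorption trick can be sketched as follows. The squared error is only one common choice of loss ℓ; the paper does not commit to a specific one here.

```python
import numpy as np

def objective(Ws, preds, labels, lam):
    """Empirical loss (squared error, as one common choice of ell) plus
    Frobenius-norm weight decay, mirroring the two terms of Eq. (2)."""
    emp = sum(np.sum((p - y) ** 2) for p, y in zip(preds, labels))
    reg = lam * sum(np.sum(W ** 2) for W in Ws)   # sum of ||W^(l)||_F^2
    return emp + reg

# Bias absorption: appending a constant 1 to the input lets the single
# matrix [W | b] replace the pair (W, b).
rng = np.random.default_rng(1)
W, b = rng.standard_normal((3, 5)), rng.standard_normal(3)
x = rng.standard_normal(5)
W_aug = np.hstack([W, b[:, None]])
x_aug = np.append(x, 1.0)
assert np.allclose(W @ x + b, W_aug @ x_aug)
```

The absorption step is why later equations can be written in terms of weight matrices alone.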

Figure 2: Illustration of different neural network structures. (b) is the most popular structure for multi-class prediction, and (d) has been used in works such as [42] to combine multiple features, where the features are processed separately in the network and then merged through an intermediate layer. In this paper, we impose regularization on the structure shown in (d) to explore the relationships among features and among classes.
3.3 Regularization for Feature Relationships
In some cases, a DNN based on a single feature can be very powerful. However, it performs semantic prediction using only one aspect of the data. For complex data like video, semantic information can be carried by different feature representations, including visual and audio cues. Note that because the intrinsic relationships among multiple feature representations are ignored, simple fusion strategies (such as early or late fusion) usually bring limited performance gains [4]. Moreover, such simple fusion methods usually require training additional classifiers. We therefore want a compact and meaningful fused representation that fully exploits the complementary cues of the various features. Below, we extend the basic DNN to a regularized variant that accommodates a deep fusion process over multiple features.
We are given a total of M features $\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)}$ for each video sample. Motivated by the multi-sensory integration process of neurons in biological systems, we propose using an additional layer to fuse all the features, as shown in Figure 1. The transition equation of the fusion layer can be written as:

$$\mathbf{h}^{(F)} = \sigma\Big(\sum_{m=1}^{M} \mathbf{W}_m^{(F)} \mathbf{h}_m^{(E)} + \mathbf{b}^{(F)}\Big), \tag{3}$$

where E and F are the indices of the last feature-abstraction layer and the fusion layer, respectively (i.e., F = E + 1). Here, $\mathbf{h}_m^{(E)}$ denotes the intermediate representation of the m-th extracted feature, which is first linearly transformed by the weights $\mathbf{W}_m^{(F)}$ and then mapped by the sigmoid nonlinearity into the new representation $\mathbf{h}^{(F)}$.
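The fusion-layer transition of Eq. (3) amounts to summing M per-feature linear maps before the nonlinearity. A small sketch with illustrative dimensions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fusion_forward(Ws, hs, b):
    """Fusion-layer transition of Eq. (3):
    h^(F) = sigma(sum_m W_m^(F) h_m^(E) + b^(F)),
    merging the M per-feature intermediate representations h_m^(E)."""
    z = sum(W @ h for W, h in zip(Ws, hs)) + b
    return sigmoid(z)

rng = np.random.default_rng(0)
M, d_e, d_f = 3, 4, 5            # M features, equal per-feature dim d_e, fused dim d_f
hs = [rng.standard_normal(d_e) for _ in range(M)]
Ws = [rng.standard_normal((d_f, d_e)) for _ in range(M)]
h_f = fusion_forward(Ws, hs, np.zeros(d_f))
```

Note that each feature keeps its own weight matrix $\mathbf{W}_m^{(F)}$; it is these per-feature weights that the regularization in the next paragraphs ties together.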
Since all the feature representations correspond to the same video data, it is intuitive that the various features can reveal common latent patterns related to the video semantics. In addition, as mentioned earlier, different features can also be complementary because they have distinct characteristics. The fusion process should therefore aim to capture the relationships among features while preserving their unique characteristics. Instead of simply summing the multi-feature information, we formulate an objective function that regularizes the fusion process to explore the correlation and diversity among the multiple features simultaneously. In particular, the weights $\{\mathbf{W}_m^{(F)}\}_{m=1}^{M}$ that transform all the features into the shared representation are first vectorized into P-dimensional vectors, where P is the product of the dimensions of $\mathbf{h}_m^{(E)}$ and $\mathbf{h}^{(F)}$. Here, we assume the extracted features have the same dimension. We then stack these coefficient vectors into a matrix $\widetilde{\mathbf{W}} \in \mathbb{R}^{P \times M}$, where each column corresponds to the weights of a single feature. The matrix is given as:

$$\widetilde{\mathbf{W}} = \big[\operatorname{vec}\big(\mathbf{W}_1^{(F)}\big), \ldots, \operatorname{vec}\big(\mathbf{W}_M^{(F)}\big)\big].$$
We can then design the regularized DNN with the following objective:

$$\min_{\{\mathbf{W}^{(l)}\},\, \boldsymbol{\Psi}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)})\big) + \lambda_1 \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2 + \lambda_2\, \operatorname{tr}\big(\widetilde{\mathbf{W}} \boldsymbol{\Psi}^{-1} \widetilde{\mathbf{W}}^{\top}\big) \quad \text{s.t. } \boldsymbol{\Psi} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Psi}) = 1. \tag{4}$$

Compared with the objective of the standard single-feature neural network in Eq. (2), the above cost function contains an additional regularization term. Note that the matrix $\widetilde{\mathbf{W}}$ collects the coefficients of all the features. Here, the symmetric positive semidefinite matrix $\boldsymbol{\Psi}$ models the correlations among features, and the last regularization term, constructed with the trace norm, helps learn the inter-feature relationships [12, 52]. Note that entries of $\boldsymbol{\Psi}$ with large values indicate strong feature correlation, while entries with smaller values capture the differences among features, since those features are less correlated. The coefficients λ1 and λ2 control the contributions of the different regularization terms. Finally, learning the regularized DNN is cast as a joint optimization over the weight matrices W and the feature correlation matrix $\boldsymbol{\Psi}$.
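The trace-norm term and the update of Ψ can be sketched as follows. This is a sketch under the assumption that, as in the convex MTL formulation the text cites ([52]), Ψ is updated in closed form with the network weights held fixed: the minimizer over Ψ with tr(Ψ) = 1 is $(\widetilde{\mathbf{W}}^{\top}\widetilde{\mathbf{W}})^{1/2} / \operatorname{tr}((\widetilde{\mathbf{W}}^{\top}\widetilde{\mathbf{W}})^{1/2})$.

```python
import numpy as np

def feature_trace_reg(W_tilde, Psi):
    """Trace-norm regularizer tr(W~ Psi^{-1} W~^T) from Eq. (4).
    W_tilde is P x M; its m-th column is vec(W_m^(F))."""
    return np.trace(W_tilde @ np.linalg.inv(Psi) @ W_tilde.T)

def update_Psi(W_tilde, eps=1e-8):
    """Closed-form minimizer over Psi subject to tr(Psi) = 1, Psi PSD,
    assuming the convex MTL solution of [52] applies with W~ fixed."""
    G = W_tilde.T @ W_tilde                 # M x M Gram matrix, symmetric PSD
    vals, vecs = np.linalg.eigh(G)
    S = vecs @ np.diag(np.sqrt(np.clip(vals, eps, None))) @ vecs.T  # G^{1/2}
    return S / np.trace(S)

rng = np.random.default_rng(0)
W_tilde = rng.standard_normal((12, 3))      # P = 12 weight entries, M = 3 features
Psi = update_Psi(W_tilde)
```

As a sanity check, the regularizer evaluated at this Ψ is never larger than at the uninformative choice Ψ = I/M, which treats all features as uncorrelated.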
3.4 Regularization for Class Relationships
To recognize or classify C semantic classes, one can simply train C classifiers independently using a one-vs-all strategy. Figures 2(a) and (c) depict this one-vs-all training scheme with C four-layer neural networks, for the single-feature and multi-feature settings, respectively. Obviously, each of these C neural networks is learned individually, completely ignoring knowledge sharing among different semantic classes. However, it is well known that video semantics share commonalities, which indicates that some semantic classes may be strongly correlated [19, 36]. It is therefore important to explore this commonality by learning multiple video semantics simultaneously, which usually leads to better learning performance. Note that the commonality among multiple classes is usually represented by parameter sharing across the different prediction models [3, 26]. Compared with the currently popular support vector machine approach, a DNN lends itself more naturally to simultaneous multi-class training. As shown in Figure 2(b), by adopting a set of C output units, a single-feature DNN can easily be extended to multi-class problems, and this structure has been widely used. Inspired by the regularization frameworks used in standard MTL methods [3, 26], we propose a regularized DNN that trains multiple classifiers simultaneously while exploring the class relationships. To enhance semantic sharing, we extend the original objective of the standard DNN to the following form:

$$\min_{\{\mathbf{W}^{(l)}\},\, \boldsymbol{\Omega}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n)\big) + \lambda_1 \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2 + \lambda_2\, \operatorname{tr}\big(\mathbf{W}^{(L)} \boldsymbol{\Omega}^{-1} \mathbf{W}^{(L)\top}\big) \quad \text{s.t. } \boldsymbol{\Omega} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Omega}) = 1. \tag{5}$$
Note that some earlier MTL works assume the class relationships are given explicitly and can be used as prior knowledge [26]; our method does not require this. Following the convex MTL formulation of [52], we impose a trace-norm regularization term on the output-layer coefficients $\mathbf{W}^{(L)} = [\mathbf{w}_1, \ldots, \mathbf{w}_C]$, whose columns are the per-class weight vectors, extending the class relationships into the matrix variable $\boldsymbol{\Omega}$. The constraint $\boldsymbol{\Omega} \succeq 0$ states that the class relation matrix is positive semidefinite, since it can be regarded as a similarity measure over the semantic classes. The coefficients λ1 and λ2 are regularization parameters. During learning, the optimal weight matrices $\{\mathbf{W}^{(l)}\}_{l=1}^{L}$ and the class relation matrix $\boldsymbol{\Omega}$ are derived simultaneously.
3.5 Joint Objective
To unify the above objectives into a joint framework, we now propose a new DNN formulation that explores both feature relationships and class relationships. In our framework, a layer of neurons fuses the multiple features, aiming to bridge the gap between the low-level features and the high-level video semantics. At the final prediction layer, we apply trace-norm regularization across the different semantic classes to better learn the multi-class predictions. Mathematically, we merge the feature regularization in Eq. (4) and the class regularization in Eq. (5) into the following objective function:

$$\min_{\{\mathbf{W}^{(l)}\},\, \boldsymbol{\Psi},\, \boldsymbol{\Omega}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)})\big) + \lambda_1 \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2 + \lambda_2\, \operatorname{tr}\big(\widetilde{\mathbf{W}} \boldsymbol{\Psi}^{-1} \widetilde{\mathbf{W}}^{\top}\big) + \lambda_3\, \operatorname{tr}\big(\mathbf{W}^{(L)} \boldsymbol{\Omega}^{-1} \mathbf{W}^{(L)\top}\big) \quad \text{s.t. } \boldsymbol{\Psi} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Psi}) = 1,\ \boldsymbol{\Omega} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Omega}) = 1, \tag{6}$$
where λ1, λ2 and λ3 are regularization parameters. Compared with the original objective in Eq. (2), we have two trace-norm regularization terms, one for fusing the multiple features and one for exploring the inter-class relationships. The two additional constraints tr(Ψ) = 1 and tr(Ω) = 1 limit the complexity, as in [52]. Finally, the above cost function is minimized with respect to the network weights $\{\mathbf{W}^{(l)}\}_{l=1}^{L}$, the inter-feature relationship matrix $\boldsymbol{\Psi}$, and the inter-class correlation matrix $\boldsymbol{\Omega}$.
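As a sanity check on how the pieces fit together, the following sketch just evaluates the joint cost of Eq. (6) for given parameters. Shapes and λ values are illustrative, not from the paper, and the empirical loss is passed in as a precomputed scalar.

```python
import numpy as np

def joint_objective(emp_loss, Ws, W_tilde, W_out, Psi, Omega, lam1, lam2, lam3):
    """Joint cost of Eq. (6): empirical loss + Frobenius decay on all layer
    weights + the feature (Psi) and class (Omega) trace-norm terms."""
    fro = lam1 * sum(np.sum(W ** 2) for W in Ws)
    feat = lam2 * np.trace(W_tilde @ np.linalg.inv(Psi) @ W_tilde.T)
    cls = lam3 * np.trace(W_out @ np.linalg.inv(Omega) @ W_out.T)
    return emp_loss + fro + feat + cls

rng = np.random.default_rng(0)
P, M, d, C = 8, 2, 4, 3
W_tilde = rng.standard_normal((P, M))   # vectorized fusion weights, one column per feature
W_out = rng.standard_normal((d, C))     # output-layer weights, one column per class
Psi, Omega = np.eye(M) / M, np.eye(C) / C   # feasible: PSD with unit trace
J = joint_objective(1.0, [W_tilde, W_out], W_tilde, W_out, Psi, Omega, 0.1, 0.1, 0.1)
```

In an alternating scheme, the network weights would be updated with Ψ and Ω fixed, and Ψ, Ω then refit from the current weights, keeping every term of this cost well defined.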