Inter-Class and Intra-Class Relationships in Video Classification -- Regularization
2022-06-12 11:38:00 【Large roasted wings】
Abstract
This paper proposes a new framework that jointly learns feature relationships and class relationships to improve video classification performance. Specifically, both types of relationships are learned and exploited through regularization rigorously imposed on a deep neural network (DNN).
Contributions:
- We propose applying structural regularization in the fusion layer of a DNN to identify the correlations among multiple features while preserving their diversity. This capability distinguishes the proposed method from most existing work, which usually adopts a shallow fusion process without exploring feature correlations in depth.
- We also propose applying similar structural regularization to the output layer of the DNN to explore inter-class relationships. Inter-feature and inter-class relationships are therefore formulated and explored in a unified framework, which can be easily implemented on GPUs and trained at an affordable time cost.
- We provide extensive empirical evaluations to confirm the effectiveness of the proposed framework in detail, achieving the best performance to date on widely used real-world benchmarks.
Introduction
Given the limitations of existing work, this paper presents a unified deep neural network (DNN) framework that jointly learns feature relationships and class relationships, while using the learned relationships to classify videos within the same framework. Figure 1 gives a conceptual diagram of the proposed method.
First, various video features are extracted, including local visual descriptors and audio descriptors. These features are then used as the DNN input; the first two layers of the network are the input layer and the feature transformation layer. The third layer, called the fusion layer, applies structural regularization to the network weights to identify and exploit feature relationships. Specifically, the regularization term is designed based on two natural properties of inter-feature relationships: correlation and diversity. The former means that different features may share common patterns in the intermediate representations between the raw features and the high-level semantics. The latter emphasizes the uniqueness of each feature, which serves as complementary information for predicting video semantics. These two properties are modeled with a feature correlation matrix, and trace-norm regularization on the fusion weights reveals the hidden correlation and diversity among the features.
For inter-class relationships, we regularize the weights of the final output layer to automatically identify grouping structures of the video classes as well as outlier classes. Semantic classes in the same group share commonalities or correlations that can be exploited through knowledge sharing to improve classification performance, while outliers should be excluded to avoid negative knowledge sharing. We show that by applying a similar trace-norm-based regularization to the output-layer weights, we can effectively explore such complex inter-class relationships and produce better video classification results.
Note that raw video data could be used as input instead of hand-crafted features, as in recent work on image classification with deep learning [22]; in that case, convolutional neural networks (CNNs) can extract features from the raw data. There are two reasons for using hand-crafted features in our proposed framework. First, hand-crafted features have been widely used in video classification and remain core components of video analysis systems that have produced state-of-the-art results in human action recognition and event recognition. Using these features makes it easy to compare fairly with traditional semantic classification methods (e.g., the popular SVM classifier). Second, extracting features with neural networks requires more layers of neurons, which introduce many additional parameters to tune and thus require more training data.
Note that in many video classification tasks, the amount of available training data is far from sufficient to train neural networks with many layers. Therefore, in this paper we propose a regularized DNN for feature fusion and video semantic classification. To our knowledge, this work is the first attempt to capture feature and class relationships with DNNs for video classification.

Figure 1: Overview of the DNN-based video classification framework. We first extract various visual/audio features, which are then used as DNN inputs. Before fusion, a layer of neurons transforms (abstracts) each feature. In the fusion layer, the network parameters are regularized so that different features can share correlated dimensions while preserving their unique characteristics. As shown in the figure, some dimensions of different features may be highly correlated (thick lines pointing to the same neuron). The weights between the fusion layer and the output layer are then regularized to identify class groups. The learned inter-feature and inter-class relationships are used to improve classification performance.
3 Methodology
3.1 Notation and Problem Description
Suppose we are given a training set of N video clips associated with C semantic classes. Each video clip is represented by M different features, e.g., various visual and audio descriptors. We can therefore represent each training sample as an (M+1)-tuple:

$$\big(\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)}, \mathbf{y}_n\big), \quad n = 1, \ldots, N,$$

where $\mathbf{x}_n^{(m)}$ denotes the m-th feature representation of the n-th video sample. If the n-th video sample is associated with the c-th semantic class, $\mathbf{y}_n \in \{0, 1\}^C$ is the corresponding semantic label vector, whose c-th element $y_{nc} = 1$.
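To make this notation concrete, here is a minimal sketch of how such a training set of (M+1)-tuples might be laid out. The dimensions and the round-robin label assignment are made up for illustration; the paper does not fix them.

```python
import numpy as np

# Illustrative sizes (not from the paper): N clips, C classes, M feature types.
N, C, M = 4, 3, 2
feature_dims = [5, 7]            # each feature type may have its own dimensionality
rng = np.random.default_rng(0)

samples = []
for n in range(N):
    feats = [rng.standard_normal(d) for d in feature_dims]  # x_n^(1), ..., x_n^(M)
    y = np.zeros(C)
    y[n % C] = 1.0               # y_nc = 1 for the associated class c
    samples.append((*feats, y))  # the (M+1)-tuple for sample n
```

Each element of `samples` then holds M feature vectors plus one C-dimensional one-hot label vector.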
The goal is to train prediction models that classify new test videos. A simple approach is to train one classifier per semantic class and combine the different features using early- or late-fusion schemes. However, such an independent training strategy explores neither inter-feature nor inter-class relationships. We therefore propose a DNN-based video classification model that realizes feature sharing in the fusion layer by exploring the correlation and diversity of multiple features, as shown in Figure 1. In addition, the prediction layer of our deep neural network is also regularized to enhance knowledge sharing among different classes. Both relationships are thus explicitly explored in a unified learning process. Below, we start with a standard single-feature DNN and then introduce the details of our proposed regularized DNN.
3.2 Single-Feature DNN Learning
Inspired by biological nervous systems, a DNN builds complex computational models from a large number of interconnected neurons. Organizing neurons in multiple layers gives the network strong nonlinear abstraction ability: given enough training data, it can learn an arbitrary mapping from input to output. Below, we briefly review a standard DNN with only one feature as input, i.e., M = 1. For a DNN with L layers in total, let $\mathbf{h}^{(l)}$ denote the output of the l-th layer for the single input feature $\mathbf{x}_n$ (with $\mathbf{h}^{(0)} = \mathbf{x}_n$), and let $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ denote the weight matrix and bias vector of the l-th layer, respectively. The transition from layer l-1 to layer l can then be written as:

$$\mathbf{h}^{(l)} = \sigma\big(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\big), \quad l = 1, \ldots, L, \tag{1}$$
where σ(·) is a nonlinear sigmoid function, usually defined as:

$$\sigma(a) = \frac{1}{1 + e^{-a}}.$$
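The layer transition of Eq. (1) and the sigmoid translate directly into code. A minimal NumPy sketch (toy dimensions, chosen here for illustration):

```python
import numpy as np

def sigmoid(a):
    """Nonlinear sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def layer_forward(W, b, h_prev):
    """Layer transition of Eq. (1): h^(l) = sigma(W^(l) h^(l-1) + b^(l))."""
    return sigmoid(W @ h_prev + b)

# Toy forward pass through a small single-feature network.
rng = np.random.default_rng(0)
dims = [6, 4, 3]                      # input dim, hidden dim, C output units
h = rng.standard_normal(dims[0])
for d_in, d_out in zip(dims[:-1], dims[1:]):
    W = 0.1 * rng.standard_normal((d_out, d_in))
    h = layer_forward(W, np.zeros(d_out), h)
```

Since every layer ends with the sigmoid, the final output lies strictly in (0, 1), which matches its use as a per-class score.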
Figures 2(a) and (b) show two types of four-layer neural networks that use a single feature as input. To derive the optimal weights of each layer, the following optimization problem can be formulated:

$$\min_{\{\mathbf{W}^{(l)}\}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n)\big) + \lambda \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2, \tag{2}$$

where the first term measures the empirical loss on the training data by aggregating the differences between the network outputs $f(\mathbf{x}_n)$ and the ground-truth labels $\mathbf{y}_n$, and the second term is a regularization term that prevents overfitting. For simplicity, we can append an extra constant-valued dimension to the feature vectors to absorb the bias b into the weight coefficients W.
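The two-term objective and the bias-absorption trick can be sketched as follows. The squared error is only one common choice of loss ℓ; the paper does not commit to a specific one here.

```python
import numpy as np

def objective(Ws, preds, labels, lam):
    """Empirical loss (squared error, as one common choice of ell) plus
    Frobenius-norm weight decay, mirroring the two terms of Eq. (2)."""
    emp = sum(np.sum((p - y) ** 2) for p, y in zip(preds, labels))
    reg = lam * sum(np.sum(W ** 2) for W in Ws)   # sum of ||W^(l)||_F^2
    return emp + reg

# Bias absorption: appending a constant 1 to the input lets the single
# matrix [W | b] replace the pair (W, b).
rng = np.random.default_rng(1)
W, b = rng.standard_normal((3, 5)), rng.standard_normal(3)
x = rng.standard_normal(5)
W_aug = np.hstack([W, b[:, None]])
x_aug = np.append(x, 1.0)
assert np.allclose(W @ x + b, W_aug @ x_aug)
```

The absorption step is why later equations can be written in terms of weight matrices alone.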

Figure 2: Illustration of different neural network structures. (b) is the most popular structure for multi-class prediction, and (d) has been used in works such as [42] to combine multiple features, where the features are processed separately in the network and then merged through an intermediate layer. In this paper, we impose regularization on the structure shown in (d) to explore the relationships among features and among classes.
3.3 Regularization for Feature Relationships
In some cases, a DNN based on a single feature can be very powerful. However, it performs semantic prediction using only one aspect of the data. For complex data like video, semantic information can be carried by different feature representations, including visual and audio cues. Note that because the intrinsic relationships among multiple feature representations are ignored, simple fusion strategies (such as early or late fusion) usually bring limited performance gains [4]. Moreover, such simple fusion methods usually require training additional classifiers. We therefore want a compact and meaningful fused representation that fully exploits the complementary cues of the various features. Below, we extend the basic DNN to a regularized variant that accommodates a deep fusion process over multiple features.
We are given a total of M features $\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)}$ for each video sample. Motivated by the multi-sensory integration process of neurons in biological systems, we propose using an additional layer to fuse all the features, as shown in Figure 1. The transition equation of the fusion layer can be written as:

$$\mathbf{h}^{(F)} = \sigma\Big(\sum_{m=1}^{M} \mathbf{W}_m^{(F)} \mathbf{h}_m^{(E)} + \mathbf{b}^{(F)}\Big), \tag{3}$$

where E and F are the indices of the last feature-abstraction layer and the fusion layer, respectively (i.e., F = E + 1). Here, $\mathbf{h}_m^{(E)}$ denotes the intermediate representation of the m-th extracted feature, which is first linearly transformed by the weights $\mathbf{W}_m^{(F)}$ and then mapped by the sigmoid nonlinearity into the new representation $\mathbf{h}^{(F)}$.
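The fusion-layer transition of Eq. (3) amounts to summing M per-feature linear maps before the nonlinearity. A small sketch with illustrative dimensions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fusion_forward(Ws, hs, b):
    """Fusion-layer transition of Eq. (3):
    h^(F) = sigma(sum_m W_m^(F) h_m^(E) + b^(F)),
    merging the M per-feature intermediate representations h_m^(E)."""
    z = sum(W @ h for W, h in zip(Ws, hs)) + b
    return sigmoid(z)

rng = np.random.default_rng(0)
M, d_e, d_f = 3, 4, 5            # M features, equal per-feature dim d_e, fused dim d_f
hs = [rng.standard_normal(d_e) for _ in range(M)]
Ws = [rng.standard_normal((d_f, d_e)) for _ in range(M)]
h_f = fusion_forward(Ws, hs, np.zeros(d_f))
```

Note that each feature keeps its own weight matrix $\mathbf{W}_m^{(F)}$; it is these per-feature weights that the regularization in the next paragraphs ties together.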
Since all the feature representations correspond to the same video data, it is intuitive that the various features can reveal common latent patterns related to the video semantics. In addition, as mentioned earlier, different features can also be complementary because they have distinct characteristics. The fusion process should therefore aim to capture the relationships among features while preserving their unique characteristics. Instead of simply summing the multi-feature information, we formulate an objective function that regularizes the fusion process to explore the correlation and diversity among the multiple features simultaneously. In particular, the weights $\{\mathbf{W}_m^{(F)}\}_{m=1}^{M}$ that transform all the features into the shared representation are first vectorized into P-dimensional vectors, where P is the product of the dimensions of $\mathbf{h}_m^{(E)}$ and $\mathbf{h}^{(F)}$. Here, we assume the extracted features have the same dimension. We then stack these coefficient vectors into a matrix $\widetilde{\mathbf{W}} \in \mathbb{R}^{P \times M}$, where each column corresponds to the weights of a single feature. The matrix is given as:

$$\widetilde{\mathbf{W}} = \big[\operatorname{vec}\big(\mathbf{W}_1^{(F)}\big), \ldots, \operatorname{vec}\big(\mathbf{W}_M^{(F)}\big)\big].$$
We can then design the regularized DNN with the following objective:

$$\min_{\{\mathbf{W}^{(l)}\},\, \boldsymbol{\Psi}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)})\big) + \lambda_1 \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2 + \lambda_2\, \operatorname{tr}\big(\widetilde{\mathbf{W}} \boldsymbol{\Psi}^{-1} \widetilde{\mathbf{W}}^{\top}\big) \quad \text{s.t. } \boldsymbol{\Psi} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Psi}) = 1. \tag{4}$$

Compared with the objective of the standard single-feature neural network in Eq. (2), the above cost function contains an additional regularization term. Note that the matrix $\widetilde{\mathbf{W}}$ collects the coefficients of all the features. Here, the symmetric positive semidefinite matrix $\boldsymbol{\Psi}$ models the correlations among features, and the last regularization term, constructed with the trace norm, helps learn the inter-feature relationships [12, 52]. Note that entries of $\boldsymbol{\Psi}$ with large values indicate strong feature correlation, while entries with smaller values capture the differences among features, since those features are less correlated. The coefficients λ1 and λ2 control the contributions of the different regularization terms. Finally, learning the regularized DNN is cast as a joint optimization over the weight matrices W and the feature correlation matrix $\boldsymbol{\Psi}$.
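The trace-norm term and the update of Ψ can be sketched as follows. This is a sketch under the assumption that, as in the convex MTL formulation the text cites ([52]), Ψ is updated in closed form with the network weights held fixed: the minimizer over Ψ with tr(Ψ) = 1 is $(\widetilde{\mathbf{W}}^{\top}\widetilde{\mathbf{W}})^{1/2} / \operatorname{tr}((\widetilde{\mathbf{W}}^{\top}\widetilde{\mathbf{W}})^{1/2})$.

```python
import numpy as np

def feature_trace_reg(W_tilde, Psi):
    """Trace-norm regularizer tr(W~ Psi^{-1} W~^T) from Eq. (4).
    W_tilde is P x M; its m-th column is vec(W_m^(F))."""
    return np.trace(W_tilde @ np.linalg.inv(Psi) @ W_tilde.T)

def update_Psi(W_tilde, eps=1e-8):
    """Closed-form minimizer over Psi subject to tr(Psi) = 1, Psi PSD,
    assuming the convex MTL solution of [52] applies with W~ fixed."""
    G = W_tilde.T @ W_tilde                 # M x M Gram matrix, symmetric PSD
    vals, vecs = np.linalg.eigh(G)
    S = vecs @ np.diag(np.sqrt(np.clip(vals, eps, None))) @ vecs.T  # G^{1/2}
    return S / np.trace(S)

rng = np.random.default_rng(0)
W_tilde = rng.standard_normal((12, 3))      # P = 12 weight entries, M = 3 features
Psi = update_Psi(W_tilde)
```

As a sanity check, the regularizer evaluated at this Ψ is never larger than at the uninformative choice Ψ = I/M, which treats all features as uncorrelated.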
3.4 Regularization for Class Relationships
To recognize or classify C semantic classes, one can simply train C classifiers independently using a one-vs-all strategy. Figures 2(a) and (c) depict this one-vs-all training scheme with C four-layer neural networks, for the single-feature and multi-feature settings, respectively. Obviously, each of these C neural networks is learned individually, completely ignoring knowledge sharing among different semantic classes. However, it is well known that video semantics share commonalities, which indicates that some semantic classes may be strongly correlated [19, 36]. It is therefore important to explore this commonality by learning multiple video semantics simultaneously, which usually leads to better learning performance. Note that the commonality among multiple classes is usually represented by parameter sharing across the different prediction models [3, 26]. Compared with the currently popular support vector machine approach, a DNN lends itself more naturally to simultaneous multi-class training. As shown in Figure 2(b), by adopting a set of C output units, a single-feature DNN can easily be extended to multi-class problems, and this structure has been widely used. Inspired by the regularization frameworks used in standard MTL methods [3, 26], we propose a regularized DNN that trains multiple classifiers simultaneously while exploring the class relationships. To enhance semantic sharing, we extend the original objective of the standard DNN to the following form:

$$\min_{\{\mathbf{W}^{(l)}\},\, \boldsymbol{\Omega}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n)\big) + \lambda_1 \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2 + \lambda_2\, \operatorname{tr}\big(\mathbf{W}^{(L)} \boldsymbol{\Omega}^{-1} \mathbf{W}^{(L)\top}\big) \quad \text{s.t. } \boldsymbol{\Omega} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Omega}) = 1. \tag{5}$$
Note that some earlier MTL works assume the class relationships are given explicitly and can be used as prior knowledge [26]; our method does not require this. Following the convex MTL formulation of [52], we impose a trace-norm regularization term on the output-layer coefficients $\mathbf{W}^{(L)} = [\mathbf{w}_1, \ldots, \mathbf{w}_C]$, whose columns are the per-class weight vectors, extending the class relationships into the matrix variable $\boldsymbol{\Omega}$. The constraint $\boldsymbol{\Omega} \succeq 0$ states that the class relation matrix is positive semidefinite, since it can be regarded as a similarity measure over the semantic classes. The coefficients λ1 and λ2 are regularization parameters. During learning, the optimal weight matrices $\{\mathbf{W}^{(l)}\}_{l=1}^{L}$ and the class relation matrix $\boldsymbol{\Omega}$ are derived simultaneously.
3.5 Joint Objective
To unify the above objectives into a joint framework, we now propose a new DNN formulation that explores both feature relationships and class relationships. In our framework, a layer of neurons fuses the multiple features, aiming to bridge the gap between the low-level features and the high-level video semantics. At the final prediction layer, we apply trace-norm regularization across the different semantic classes to better learn the multi-class predictions. Mathematically, we merge the feature regularization in Eq. (4) and the class regularization in Eq. (5) into the following objective function:

$$\min_{\{\mathbf{W}^{(l)}\},\, \boldsymbol{\Psi},\, \boldsymbol{\Omega}} \; \sum_{n=1}^{N} \ell\big(\mathbf{y}_n, f(\mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(M)})\big) + \lambda_1 \sum_{l=1}^{L} \big\|\mathbf{W}^{(l)}\big\|_F^2 + \lambda_2\, \operatorname{tr}\big(\widetilde{\mathbf{W}} \boldsymbol{\Psi}^{-1} \widetilde{\mathbf{W}}^{\top}\big) + \lambda_3\, \operatorname{tr}\big(\mathbf{W}^{(L)} \boldsymbol{\Omega}^{-1} \mathbf{W}^{(L)\top}\big) \quad \text{s.t. } \boldsymbol{\Psi} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Psi}) = 1,\ \boldsymbol{\Omega} \succeq 0,\ \operatorname{tr}(\boldsymbol{\Omega}) = 1, \tag{6}$$
where λ1, λ2 and λ3 are regularization parameters. Compared with the original objective in Eq. (2), we have two trace-norm regularization terms, one for fusing the multiple features and one for exploring the inter-class relationships. The two additional constraints tr(Ψ) = 1 and tr(Ω) = 1 limit the complexity, as in [52]. Finally, the above cost function is minimized with respect to the network weights $\{\mathbf{W}^{(l)}\}_{l=1}^{L}$, the inter-feature relationship matrix $\boldsymbol{\Psi}$, and the inter-class correlation matrix $\boldsymbol{\Omega}$.
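As a sanity check on how the pieces fit together, the following sketch just evaluates the joint cost of Eq. (6) for given parameters. Shapes and λ values are illustrative, not from the paper, and the empirical loss is passed in as a precomputed scalar.

```python
import numpy as np

def joint_objective(emp_loss, Ws, W_tilde, W_out, Psi, Omega, lam1, lam2, lam3):
    """Joint cost of Eq. (6): empirical loss + Frobenius decay on all layer
    weights + the feature (Psi) and class (Omega) trace-norm terms."""
    fro = lam1 * sum(np.sum(W ** 2) for W in Ws)
    feat = lam2 * np.trace(W_tilde @ np.linalg.inv(Psi) @ W_tilde.T)
    cls = lam3 * np.trace(W_out @ np.linalg.inv(Omega) @ W_out.T)
    return emp_loss + fro + feat + cls

rng = np.random.default_rng(0)
P, M, d, C = 8, 2, 4, 3
W_tilde = rng.standard_normal((P, M))   # vectorized fusion weights, one column per feature
W_out = rng.standard_normal((d, C))     # output-layer weights, one column per class
Psi, Omega = np.eye(M) / M, np.eye(C) / C   # feasible: PSD with unit trace
J = joint_objective(1.0, [W_tilde, W_out], W_tilde, W_out, Psi, Omega, 0.1, 0.1, 0.1)
```

In an alternating scheme, the network weights would be updated with Ψ and Ω fixed, and Ψ, Ω then refit from the current weights, keeping every term of this cost well defined.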