[Deep Learning] A Survey of Video Classification Techniques
2022-07-27 20:36:00 【Demeanor 78】

I have recently been working on multimodal video classification. This post organizes the main techniques for video classification and shares them here.

In a traditional image classification task, the input is generally a single H×W×C two-dimensional image; after convolutions and other operations, the network outputs class probabilities.

For video tasks, the input is a temporally ordered sequence of two-dimensional frames, i.e. a T×H×W×C four-dimensional tensor.
Such data has the following characteristics:
1. The change between frames often reflects the content of the video;
2. The change between adjacent frames is generally small, so there is a lot of redundancy.

The most intuitive and simplest approach is to reuse static image classification: treat each video frame as an independent two-dimensional image, extract a feature vector per frame with a neural network, average the feature vectors of all frames to obtain the feature vector of the whole video, and then classify; alternatively, produce a prediction for each frame and reach a consensus over all frame-level results.
The advantage of this approach is its very low computational cost, comparable to ordinary two-dimensional image classification, and its very simple implementation. However, it ignores the relationship between frames, and average pooling or simple consensus discards a lot of information. It works well when videos are very short and the change between frames is small.
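The frame-averaging baseline above can be sketched in a few lines. This is a minimal illustration with random stand-ins for the per-frame CNN features; the shapes, the toy linear classifier, and all names are assumptions, not the post's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, num_classes = 16, 128, 10           # frames, feature dim, classes
frame_features = rng.normal(size=(T, D))  # stand-in for per-frame 2D-CNN outputs

# Average pooling over the time axis gives one video-level feature vector.
video_feature = frame_features.mean(axis=0)       # shape (D,)

# Toy linear classifier + softmax on the pooled feature.
W = rng.normal(size=(D, num_classes)) * 0.01
logits = video_feature @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(probs.argmax())
```

The per-frame-prediction variant would instead apply the classifier to each row of `frame_features` and average (or vote over) the T resulting probability vectors.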

Another scheme also relies on static image classification, treating each frame as an independent two-dimensional image, but fuses the frame features with VLAD. VLAD is an algorithm proposed in 2010 for image retrieval in large-scale image databases; it transforms an N×D feature matrix into a K×D matrix (K ≪ N). Applied to video, we cluster the frame features to obtain multiple cluster centers, assign each feature to its cluster center, aggregate the features within each cluster region separately, and finally concatenate (or weighted-sum) the per-cluster vectors as the feature vector of the whole video.

In the original VLAD, the assignment term a_k is non-differentiable, and the clustering step itself is also non-differentiable. NetVLAD improves on this: a softmax over the distances from each feature vector to all cluster centers yields the probability of that vector's nearest center, and the cluster centers themselves become learnable parameters.
Compared with average pooling, NetVLAD can decompose the frame features of a video into multiple shot-like cluster features via the cluster centers, and then combine them into a global feature vector through a learnable weighted sum. However, the feature vector of each frame is still computed independently, so the temporal relationship and inter-frame change information are still not modeled.
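The soft-assignment step can be sketched as follows: the hard argmin above is replaced by a softmax over (negatively scaled) distances, making the aggregation differentiable. In a real NetVLAD layer, `alpha` and the centers are learnable parameters; everything here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, alpha = 32, 64, 4, 10.0

features = rng.normal(size=(N, D))
centers = rng.normal(size=(K, D))

# Soft assignment: softmax over negative scaled squared distances.
dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
logits = -alpha * dists
logits -= logits.max(axis=1, keepdims=True)      # numerical stability
soft = np.exp(logits)
soft /= soft.sum(axis=1, keepdims=True)          # rows sum to 1

# Weighted residual aggregation: V[k] = sum_i soft[i, k] * (x_i - c_k)
residuals = features[:, None, :] - centers[None, :, :]   # (N, K, D)
vlad = (soft[:, :, None] * residuals).sum(axis=0)        # (K, D)
```

As `alpha` grows, the soft assignment approaches the original hard VLAD assignment.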

For sequences with temporal structure, using an RNN is common practice. Concretely, a network (usually a CNN) extracts a feature vector for each video frame, the feature sequence is fed in chronological order into an RNN such as an LSTM, and the RNN's final output is used for classification.
This approach can model the temporal relationship between frames and should in theory perform better. In actual experiments, however, it shows no obvious advantage over the first scheme, possibly because RNNs suffer from forgetting on long sequences, while for short videos the simple static methods are already good enough.
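The CNN-then-RNN pipeline can be sketched with a single hand-rolled LSTM cell: per-frame features (random stand-ins for CNN outputs here) are consumed in time order, and the final hidden state is classified. All weights, shapes, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H, num_classes = 16, 64, 32, 10

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

frame_features = rng.normal(size=(T, D))         # stand-in "CNN" outputs

# One stacked weight matrix for the [input, forget, cell, output] gates.
Wx = rng.normal(size=(D, 4 * H)) * 0.1
Wh = rng.normal(size=(H, 4 * H)) * 0.1
b = np.zeros(4 * H)

h = np.zeros(H)
c = np.zeros(H)
for x in frame_features:                         # frames in chronological order
    gates = x @ Wx + h @ Wh + b
    i, f, g, o = np.split(gates, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)                   # cell-state update
    h = o * np.tanh(c)                           # hidden-state update

W_cls = rng.normal(size=(H, num_classes)) * 0.1
logits = h @ W_cls                               # classify from the final state
pred = int(logits.argmax())
```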

The two-stream method is a stronger approach. It uses two network branches: an image branch that extracts feature vectors from the video frames, and an optical-flow branch that extracts motion features from the optical-flow maps computed between frames. The feature vectors of the two branches are fused for classification and prediction.
The image branch operates on static frames, so its computational cost is relatively low, while the change information between frames can be extracted from the optical flow. The optical-flow computation itself, however, introduces extra overhead.
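One common fusion strategy for the two branches is late fusion of class scores, sketched below with random stand-ins for each branch's logits; averaging is only one of several fusion choices (concatenating features before a classifier is another).

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 10

rgb_logits = rng.normal(size=num_classes)    # from the image (spatial) branch
flow_logits = rng.normal(size=num_classes)   # from the optical-flow branch

# Simple late fusion: average the two branches' class scores.
fused = 0.5 * (rgb_logits + flow_logits)
pred = int(fused.argmax())
```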

We can also directly extend the traditional two-dimensional convolution kernel to a three-dimensional one and treat the input frame sequence as a four-dimensional tensor.
Representative approaches include C3D, I3D, and SlowFast.
This family of methods performs well in experiments, but three-dimensional convolution increases complexity by an order of magnitude. It usually requires a lot of data to achieve good results; with insufficient data the results may be poor, or training may even fail.
For this reason, a series of methods have been proposed to reduce the computation of three-dimensional convolution, such as P3D, which is based on a low-rank factorization.
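A back-of-envelope comparison shows why the factorization helps: a full 3×3×3 convolution can be replaced P3D-style by a 1×3×3 spatial convolution followed by a 3×1×1 temporal one. For C channels in and out (bias terms ignored, channel counts assumed equal for simplicity):

```python
C_in, C_out = 64, 64

full_3d = 3 * 3 * 3 * C_in * C_out                    # 27 * C_in * C_out
factorized = (1 * 3 * 3 + 3 * 1 * 1) * C_in * C_out   # 12 * C_in * C_out

print(full_3d, factorized)   # the factorized form needs ~44% of the parameters
```

The same 12/27 ratio applies to the multiply-accumulate count per output position.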


TimeSformer builds on ViT and proposes five different ways of computing attention, trading off computational complexity against attention coverage in the vision domain:
1. Space attention (S): self-attention only among patches within the same frame;
2. Joint space-time attention (ST): attention among all patches in all frames;
3. Divided space-time attention (T+S): first self-attention among all patches within the same frame, then attention among patches at corresponding positions across different frames;
4. Sparse local-global attention (L+G): first compute local attention over the neighboring H/2 × W/2 patches across all frames, then compute self-attention over the whole sequence using a stride of 2 over patches in space; this can be seen as a faster approximation of global space-time attention;
5. Axial attention (T+W+H): self-attention first along the time dimension, then among patches sharing the same vertical coordinate, and finally among patches sharing the same horizontal coordinate.
Experimentally, the authors found the T+S attention style works best: it greatly reduces the attention computation while performing even better than ST attention computed over all patches.
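The divided (T+S) scheme can be sketched with plain dot-product attention: each patch attends first across frames at its own spatial position, then within its own frame. Single head, no learned projections or residuals; shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P, D = 4, 9, 16                     # frames, patches per frame, feature dim

def self_attention(x):
    """Plain scaled dot-product self-attention over the rows of x."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

tokens = rng.normal(size=(T, P, D))

out = np.empty_like(tokens)
# 1) Temporal attention: patches at the same spatial index, across all frames.
for p in range(P):
    out[:, p, :] = self_attention(tokens[:, p, :])
# 2) Spatial attention: within each frame, across its P patches.
for t in range(T):
    out[t] = self_attention(out[t])
```

The cost intuition: joint ST attention compares all T·P tokens pairwise, i.e. O((TP)²) score entries, while the divided form only needs O(TP·(T+P)), which is what makes T+S so much cheaper at similar or better accuracy.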

Video Swin Transformer is the 3D-extended version of Swin Transformer: it simply expands the window over which attention is computed from two dimensions to three, and achieves quite good results in experiments.
In our experiments, Video Swin Transformer is currently a relatively strong backbone for video understanding, with results significantly better than the methods above.
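The 2D-to-3D window expansion amounts to partitioning the (T, H, W, C) token grid into non-overlapping 3D windows and running attention inside each; the partition itself is a reshape-and-transpose, sketched here with toy sizes (all values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16
wt, wh, ww = 2, 4, 4                     # 3D window size (time, height, width)

x = rng.normal(size=(T, H, W, C))

# Reshape into (num_windows, wt*wh*ww, C): each row group is one 3D window,
# and window attention would run independently over each group of tokens.
windows = (
    x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
     .transpose(0, 2, 4, 1, 3, 5, 6)
     .reshape(-1, wt * wh * ww, C)
)
```

With these sizes the grid splits into 2×2×2 = 8 windows of 32 tokens each; the shifted-window trick alternates this partition with one offset by half a window in each dimension so information can flow across window boundaries.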