
Facebook AI & Oxford Propose a Video Transformer with "Trajectory Attention" That Achieves SOTA on Video Action Recognition Tasks

2022-07-06 06:37:00 I love computer vision


Foreword

In video Transformers, the time dimension is usually treated in the same way as the two spatial dimensions (H and W). However, objects can move in the scene, or the camera itself can move, so the object at a given position in frame t may be completely unrelated to the content found at that same position in frame t+k. These temporal correspondences therefore need to be modeled in order to understand dynamic scenes.

To this end, the authors propose trajectory attention for video Transformers, which aggregates information along implicitly determined motion paths. In addition, they propose a new method to address the quadratic dependence of computation and memory on input size, which is especially important for high-resolution or long videos.

The authors apply this attention mechanism in a Transformer model and achieve SOTA results on video action recognition with the Kinetics, Something-Something V2, and Epic-Kitchens datasets.

1. Paper and code


Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Paper: https://arxiv.org/abs/2106.05392
Code: https://github.com/facebookresearch/Motionformer

 2. Motivation

Transformers have become the mainstream architecture in NLP, vision, and speech, since the self-attention mechanism adapts to different data types and transfers across domains. However, this generality and lack of inductive bias also mean that Transformers typically require very large amounts of data, or domain-specific data augmentation, to train well, and the problem is especially severe for video. Although videos carry rich temporal information, they also contain redundant spatial information from adjacent frames, and the standard self-attention mechanism does nothing to overcome this redundancy.

The authors therefore introduce a variant of self-attention, called trajectory attention, that better captures the temporal information contained in videos. For still images, spatial locality is arguably the most important inductive bias, which motivated the design of convolutional networks and the spatial encodings used in vision Transformers: points belonging to the same 3D object tend to be projected onto pixels that are close to each other in the image.

The same holds for video data, but video has an additional property: 3D points move over time, and are therefore projected onto different parts of the image along specific 2D trajectories. Existing video Transformers ignore these trajectories and simply aggregate information across the time dimension or over the entire 3D space-time volume. The authors argue that pooling along trajectories provides a more natural inductive bias for video data, and lets the network aggregate information from multiple views of the same object or region in order to reason about how it moves.

In this paper, the authors use the attention mechanism itself to find these trajectories. Methods such as RAFT have shown that good optical flow estimates can be obtained by comparing local features across space and time. In this work, the attention mechanism therefore guides the network to aggregate information along motion paths.

[Figure: frames from a "kicking a ball" video, with the ball spanning several patches that mix foreground and background]

The authors also note that vision Transformers operate on image patches, so a patch cannot be assumed to correspond to a single 3D point moving along a simple 1D trajectory. The figure above depicts a "kicking a ball" action; depending on the video frame, the ball can span up to four patches. Moreover, these patches mix the foreground (the ball) with background objects, so they contain at least two distinct motions. The trajectory attention mechanism proposed in this paper therefore lets the model assemble a motion feature from all the relevant "ball regions".

 3. Method

The goal of this paper is to modify the attention mechanism in the Transformer to better capture the information contained in videos. The input video is denoted $\mathbf{X} \in \mathbb{R}^{3 \times T' \times H \times W}$. As in existing video Transformers, the video sequence is first preprocessed into a sequence of $ST$ tokens $\mathbf{x}_{st} \in \mathbb{R}^D$, with spatial resolution $S$ and temporal resolution $T$. For this step the authors use a 3D (cubic) embedding: disjoint space-time cubes are linearly projected to $\mathbb{R}^D$.

Then, learnable positional encodings for the spatial and temporal dimensions are added to the video embeddings, giving $\mathbf{z}_{st} = \mathbf{x}_{st} + \mathbf{e}_s^{space} + \mathbf{e}_t^{time}$. Finally, a learnable classification token $\mathbf{z}_{cls}$ is added to the token sequence to serve as a global summary of the video.
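To make the tokenization step concrete, here is a minimal PyTorch sketch of cubic embedding with separate spatial/temporal positional encodings. This is not the official Motionformer code; the module name, the 2×16×16 cube size, and the default shapes are illustrative assumptions.

```python
# A minimal sketch (not the official code) of cubic tokenization with
# separate spatial/temporal positional encodings.
import torch
import torch.nn as nn

class CubicTokenizer(nn.Module):
    def __init__(self, dim=768, cube=(2, 16, 16), frames=8, img_size=224):
        super().__init__()
        # A 3D convolution with stride == kernel size linearly projects
        # disjoint space-time cubes to D-dimensional tokens.
        self.proj = nn.Conv3d(3, dim, kernel_size=cube, stride=cube)
        T = frames // cube[0]                               # temporal resolution
        S = (img_size // cube[1]) * (img_size // cube[2])   # spatial resolution
        self.pos_space = nn.Parameter(torch.zeros(1, 1, S, dim))
        self.pos_time = nn.Parameter(torch.zeros(1, T, 1, dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, video):                       # video: (B, 3, T_in, H, W)
        x = self.proj(video)                        # (B, D, T, H/16, W/16)
        B, D, T, h, w = x.shape
        x = x.flatten(3)                            # (B, D, T, S) with S = h*w
        x = x.permute(0, 2, 3, 1)                   # (B, T, S, D)
        x = x + self.pos_time + self.pos_space      # separate ST encodings
        x = x.reshape(B, T * h * w, D)              # (B, ST, D)
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, x], dim=1)           # prepend the class token
```

With the defaults above, an 8-frame 224×224 clip yields T = 4, S = 196, so 785 tokens including the class token.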

The video Transformer has the same structure as ViT, consisting of layer normalization (LN), multi-head attention (MHA), residual connections, and a feed-forward network (MLP):

$$\mathbf{y}^{\ell} = \mathrm{MHA}(\mathrm{LN}(\mathbf{z}^{\ell})) + \mathbf{z}^{\ell}, \qquad \mathbf{z}^{\ell+1} = \mathrm{MLP}(\mathrm{LN}(\mathbf{y}^{\ell})) + \mathbf{y}^{\ell}$$
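As a reference, a minimal pre-norm block matching the equations above might look as follows. This is a sketch, not the paper's implementation; `nn.MultiheadAttention` stands in for whatever attention variant is plugged in (e.g. trajectory attention).

```python
# A minimal pre-norm Transformer block: LN -> MHA -> residual, LN -> MLP -> residual.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        # Self-attention: query, key, and value are all the normalized input.
        y = z + self.attn(*[self.norm1(z)] * 3, need_weights=False)[0]
        return y + self.mlp(self.norm2(y))
```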

3.1 Video self-attention

The self-attention operation first requires a set of query-key-value vectors, one for each space-time location $st$ in the video. These vectors are obtained by linearly projecting the input:

$$\mathbf{q}_{st} = \mathbf{W}_q \mathbf{x}_{st}, \qquad \mathbf{k}_{st} = \mathbf{W}_k \mathbf{x}_{st}, \qquad \mathbf{v}_{st} = \mathbf{W}_v \mathbf{x}_{st}$$

with projection matrices $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v \in \mathbb{R}^{D \times D}$. Attention computed jointly across space and time (joint space-time attention) is then:

$$\mathbf{y}_{st} = \sum_{s',t'} \mathbf{v}_{s't'} \cdot \frac{\exp(\langle \mathbf{q}_{st}, \mathbf{k}_{s't'} \rangle / \sqrt{D})}{\sum_{\bar{s},\bar{t}} \exp(\langle \mathbf{q}_{st}, \mathbf{k}_{\bar{s}\bar{t}} \rangle / \sqrt{D})}$$

One problem with this formulation is its quadratic complexity in both space and time, i.e. $O(S^2 T^2)$. An alternative is to restrict attention to space only or time only, known as divided space-time attention:

$$\mathbf{y}_{st}^{space} = \sum_{s'} \mathbf{v}_{s't} \cdot \frac{\exp(\langle \mathbf{q}_{st}, \mathbf{k}_{s't} \rangle / \sqrt{D})}{\sum_{\bar{s}} \exp(\langle \mathbf{q}_{st}, \mathbf{k}_{\bar{s}t} \rangle / \sqrt{D})}, \qquad \mathbf{y}_{st}^{time} = \sum_{t'} \mathbf{v}_{st'} \cdot \frac{\exp(\langle \mathbf{q}_{st}, \mathbf{k}_{st'} \rangle / \sqrt{D})}{\sum_{\bar{t}} \exp(\langle \mathbf{q}_{st}, \mathbf{k}_{s\bar{t}} \rangle / \sqrt{D})}$$

This reduces the complexity to $O(S^2 T)$ and $O(S T^2)$ respectively, but only allows the model to treat the spatial and temporal dimensions independently.
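The following single-head sketch contrasts the two baselines; it is illustrative only. In TimeSformer-style divided attention the two terms are applied as successive sub-layers with their own projections, whereas this sketch simply computes them side by side.

```python
# Joint vs. divided space-time attention (single head, no batch dimension).
# q, k, v have shape (S, T, D).
import torch

def joint_attention(q, k, v):
    S, T, D = q.shape
    qf, kf, vf = (x.reshape(S * T, D) for x in (q, k, v))
    # One (ST, ST) score matrix: quadratic in both S and T.
    a = torch.softmax(qf @ kf.t() / D ** 0.5, dim=-1)
    return (a @ vf).reshape(S, T, D)

def divided_attention(q, k, v):
    scale = q.shape[-1] ** 0.5
    # Spatial term: each frame attends within itself -> T matrices of (S, S).
    a_s = torch.softmax(torch.einsum('std,ztd->tsz', q, k) / scale, dim=-1)
    y_space = torch.einsum('tsz,ztd->std', a_s, v)
    # Temporal term: each location attends across time -> S matrices of (T, T).
    a_t = torch.softmax(torch.einsum('std,szd->stz', q, k) / scale, dim=-1)
    y_time = torch.einsum('stz,szd->std', a_t, v)
    return y_space, y_time

q = k = v = torch.randn(196, 8, 64)   # S=196 patches, T=8 frames, D=64
y = joint_attention(q, k, v)                   # one (1568, 1568) score matrix
y_space, y_time = divided_attention(q, k, v)   # (S,S) and (T,T) matrices instead
print(y.shape, y_space.shape, y_time.shape)
```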

Unlike both of these approaches, the authors attend along trajectories. For each space-time location $st$ (the trajectory "reference point") and its corresponding query $\mathbf{q}_{st}$, a set of trajectory tokens $\tilde{\mathbf{y}}_{stt'}$ is constructed. The trajectory tokens extend over the duration of the video sequence; the token at time $t'$ is computed as:

$$\tilde{\mathbf{y}}_{stt'} = \sum_{s'} \mathbf{v}_{s't'} \cdot \frac{\exp(\langle \mathbf{q}_{st}, \mathbf{k}_{s't'} \rangle / \sqrt{D})}{\sum_{\bar{s}} \exp(\langle \mathbf{q}_{st}, \mathbf{k}_{\bar{s}t'} \rangle / \sqrt{D})}$$

In this formula, attention is applied spatially and independently for each frame $t'$, and the result is then summed (pooled) over the spatial dimension, leaving only the temporal dimension; the sequence $\tilde{\mathbf{y}}_{stt'}$ represents the trajectory of reference point $st$ across time.

Once these trajectories have been computed, they must be further pooled across time. To do so, the trajectory tokens are projected onto a new set of queries, keys, and values:

$$\tilde{\mathbf{q}}_{st} = \tilde{\mathbf{W}}_q \tilde{\mathbf{y}}_{stt}, \qquad \tilde{\mathbf{k}}_{stt'} = \tilde{\mathbf{W}}_k \tilde{\mathbf{y}}_{stt'}, \qquad \tilde{\mathbf{v}}_{stt'} = \tilde{\mathbf{W}}_v \tilde{\mathbf{y}}_{stt'}$$

As before, the updated query $\tilde{\mathbf{q}}_{st}$ corresponds to the trajectory reference point and contains information pooled spatially from all frames. This new query is then pooled over the temporal (trajectory) dimension with one-dimensional attention:

$$\mathbf{y}_{st} = \sum_{t'} \tilde{\mathbf{v}}_{stt'} \cdot \frac{\exp(\langle \tilde{\mathbf{q}}_{st}, \tilde{\mathbf{k}}_{stt'} \rangle / \sqrt{D})}{\sum_{\bar{t}} \exp(\langle \tilde{\mathbf{q}}_{st}, \tilde{\mathbf{k}}_{st\bar{t}} \rangle / \sqrt{D})}$$
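Putting both stages together, a minimal single-head sketch of trajectory attention could look like the following. The function and variable names are illustrative assumptions; the official Motionformer code is multi-head, batched, and heavily optimized.

```python
# Two-stage trajectory attention (single head): spatial attention per frame,
# then temporal attention along the resulting trajectory.
import torch

def trajectory_attention(x, Wq, Wk, Wv, Wq2, Wk2, Wv2):
    S, T, D = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Stage 1: for each reference point (s, t), attend spatially within every
    # frame t', giving one trajectory token per (reference point, frame) pair.
    scores = torch.einsum('std,zud->stuz', q, k) / D ** 0.5     # (S, T, T', S')
    y_traj = torch.einsum('stuz,zud->stud', scores.softmax(dim=-1), v)
    # Stage 2: re-project the trajectory tokens; the query is taken at the
    # reference time t' == t, then attention pools over the trajectory (time).
    ref = torch.diagonal(y_traj, dim1=1, dim2=2).permute(0, 2, 1)  # (S, T, D)
    q2, k2, v2 = ref @ Wq2, y_traj @ Wk2, y_traj @ Wv2
    scores2 = torch.einsum('std,stud->stu', q2, k2) / D ** 0.5
    return torch.einsum('stu,stud->std', scores2.softmax(dim=-1), v2)

S, T, D = 196, 8, 64
x = torch.randn(S, T, D)
Ws = [torch.randn(D, D) / D ** 0.5 for _ in range(6)]
print(trajectory_attention(x, *Ws).shape)   # torch.Size([196, 8, 64])
```

Note that the stage-1 score tensor has shape (S, T, T, S), which makes the quadratic space-time cost discussed below explicit.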

Like joint space-time attention, this method has quadratic complexity in space and time, so it offers no computational advantage and is in fact slower than divided space-time attention. However, it achieves better accuracy than both the joint and divided space-time attention mechanisms. A schematic of trajectory attention is shown below:

[Figure: schematic of trajectory attention]

3.2 Approximating attention

To achieve faster runtimes, an approximation scheme is proposed to speed up the computation. Denote the query, key, and value matrices by $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times D}$, with individual row vectors $\mathbf{q}_i, \mathbf{k}_j, \mathbf{v}_j \in \mathbb{R}^D$.

To obtain an efficient factorization of the attention operator, the authors rewrite it probabilistically. Let $A_i$ be a categorical random variable indicating whether the $j$-th input (key-value pair) is assigned to the $i$-th output (query), where $i, j \in \{1, \dots, N\}$. The attention operator uses a parametric model of the event probabilities based on the multinomial logistic function, i.e. the softmax operator:

$$P(A_i = j \mid \mathbf{Q}, \mathbf{K}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{D}}\right)_{ij}, \qquad \Omega(\mathbf{Q}, \mathbf{K}, \mathbf{V})_{i,:} = \sum_{j} P(A_i = j \mid \mathbf{Q}, \mathbf{K}) \, \mathbf{v}_j$$

where a subscript ":" in a tensor slice denotes the complete dimension. A latent categorical variable $B_i$ is then introduced, indicating whether an input is assigned to one of $R$ prototypes, an auxiliary set of vectors $\mathbf{P} \in \mathbb{R}^{R \times D}$ with $R \ll N$. Using the law of total probability and the definition of conditional probability:

$$P(A_i = j \mid \mathbf{Q}, \mathbf{K}) = \sum_{r=1}^{R} P(A_i = j \mid B_i = r, \mathbf{K}) \, P(B_i = r \mid \mathbf{Q})$$

where the latent variable is independent of the remaining inputs. Even under the parametric model, the corresponding true distribution is intractable. The authors therefore approximate the conditional probability with an approximate parametric model:

$$P(A_i = j \mid B_i = r, \mathbf{K}) \approx \mathrm{softmax}\!\left(\frac{\mathbf{P}\mathbf{K}^\top}{\sqrt{D}}\right)_{rj}, \qquad P(B_i = r \mid \mathbf{Q}) \approx \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{P}^\top}{\sqrt{D}}\right)_{ir}$$

where $\mathbf{Q}$ denotes the matrix formed by stacking all query vectors. Combining the above equations yields:

$$\Omega(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \approx \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{P}^\top}{\sqrt{D}}\right) \left[ \mathrm{softmax}\!\left(\frac{\mathbf{P}\mathbf{K}^\top}{\sqrt{D}}\right) \mathbf{V} \right]$$
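A minimal sketch of this two-step computation is shown below. It assumes the $R$ prototypes have already been selected; the paper selects them by an orthogonality criterion (hence "Orthoformer"), which is omitted here, so the random subsampling is only a placeholder.

```python
# Prototype-based approximation of attention: two small score matrices
# (R x N and N x R) instead of one full (N x N) matrix.
import torch

def prototype_attention(Q, K, V, P):
    D = Q.shape[-1]
    # Step 1: prototype-to-key attention, an (R, N) matrix -- much smaller
    # than the full (N, N) attention matrix because R << N.
    kv = torch.softmax(P @ K.t() / D ** 0.5, dim=-1) @ V      # (R, D)
    # Step 2: query-to-prototype attention, an (N, R) matrix.
    return torch.softmax(Q @ P.t() / D ** 0.5, dim=-1) @ kv   # (N, D)

N, R, D = 1568, 128, 64                 # e.g. N = S*T = 196*8 tokens
Q, K, V = (torch.randn(N, D) for _ in range(3))
P = Q[torch.randperm(N)[:R]]            # placeholder prototypes for illustration
print(prototype_attention(Q, K, V, P).shape)   # torch.Size([1568, 64])
```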

Computational efficiency

An important property of this approximation is that it can be computed in two steps. First, the values $\mathbf{V}$ are multiplied by the prototype-key attention matrix $\mathrm{softmax}(\mathbf{P}\mathbf{K}^\top/\sqrt{D}) \in \mathbb{R}^{R \times N}$, which is much smaller than the full attention matrix because $R \ll N$.

The resulting computation is therefore much cheaper than the original. This avoids full attention's quadratic dependence on the input and output sizes: as long as $R$ stays fixed, the approximation has linear complexity in the number of tokens.
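For the token counts used above (S = 196, T = 8, so N = 1568) and an assumed R = 128 prototypes, the score-matrix sizes work out as follows:

```python
# Back-of-the-envelope score-matrix sizes per head (R = 128 is an assumption).
N, R = 196 * 8, 128
print(N * N)        # full attention:  2_458_624 scores
print(2 * N * R)    # approximation:     401_408 scores (~6x fewer)
```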

3.3 The Motionformer model

[Table: Motionformer design choices]

The video Transformer in this paper builds on previous work, as summarized in the table above. The authors use the ViT model as the backbone, TimeSformer's separate spatial and temporal positional encodings, and ViViT's 3D cubic tokenization strategy.

4. Experiments

4.1 Ablation studies

Input: tokenization

[Table: tokenization ablation]

The table above compares different tokenization methods; cubic tokenization performs better than square (per-frame patch) tokenization.

Input: positional encoding

[Table: positional encoding ablation]

The table above compares different positional encoding methods; separate space-time encoding performs better than joint space-time encoding.

Attention block: comparisons

[Table: attention block comparison]

The table above compares different attention modules; the trajectory attention module proposed in this paper achieves better performance than the other attention modules.

4.2 Orthoformer approximated attention

Approximation comparisons

[Table: approximation comparison]

The table above shows that the Orthoformer approximation of trajectory attention uses less GPU memory than the other methods.

Prototype selection

[Table: prototype selection strategies]

Orthoformer involves a prototype selection procedure; the table above shows the experimental results of different prototype selection strategies.

Number of prototypes

[Table: number of prototypes]

The table above shows the experimental results for different numbers of prototypes.

Temporally-shared prototypes

[Table: temporally-shared prototypes]

The table above shows results with and without sharing prototypes across time; sharing prototypes not only reduces GPU memory usage but also improves performance.

4.3 Comparison to the state-of-the-art

[Table: comparison with the state of the art]

The table above compares this method with other SOTA methods on multiple datasets; the method achieves SOTA performance across several of them.

5. Summary

In this paper, the authors propose a general attention block for video data that aggregates information along implicitly determined trajectories, providing a realistic inductive bias for the model. In addition, they use an attention approximation algorithm to address the quadratic dependence on input size, significantly reducing memory requirements. With these designs, the authors obtain SOTA results on several benchmark datasets.

Nevertheless, the method inherits many of the limitations of Transformer models, including slow training and poor efficiency. The trajectory attention proposed here has higher computational complexity than recent work; although the approximation algorithm mitigates this and significantly reduces memory and compute, one of its steps cannot be parallelized, which makes it run quite slowly in practice.

▊ About the author

Research area: operator of the FightingCV official account, working on multimodal content understanding, with a focus on tasks that combine the visual and language modalities and on applications of Vision-Language models.

Zhihu / Official account: FightingCV


