
"Video version Mae" of hekaiming team, efficient video pre training! The effect is also very good when mask ratio is up to 90

2022-06-11 19:39:00 I love computer vision


This article presents the paper 『Masked Autoencoders As Spatiotemporal Learners』, in which Kaiming He's team proposes a video version of MAE for efficient video pre-training. Even with a mask ratio as high as 90%, the results remain strong!

The details are as follows:


  • Paper link: https://arxiv.org/abs/2205.09113

  • Code: not yet open-sourced

      01      

Abstract


This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from video. The authors randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.


Interestingly, the MAE method can learn strong representations with almost no spatiotemporal inductive bias: spacetime-agnostic random masking performs best. The authors observe that the optimal masking ratio (mask ratio) is as high as 90% (that of images is 75%), supporting the hypothesis that this ratio is related to the information redundancy of the data. A higher masking ratio also yields a larger speedup. Using vanilla Vision Transformers, the authors report competitive results on several challenging video datasets.


The experiments show that MAE pre-training performs much better than supervised pre-training. In addition, the authors report results of training on real-world, uncurated Instagram data. This study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can serve as a unified methodology for representation learning with minimal domain knowledge.



      02      

Motivation

The deep learning community is seeing a trend toward unifying the methodologies used to solve problems in different fields, such as language, vision, and speech. Architecturally, Transformers have been successfully introduced into computer vision and established as a common building block for both language and vision. For self-supervised representation learning, the denoising/masked autoencoding methodology of BERT has also proven effective for learning visual representations from images. Under a unified methodology, only minimal domain knowledge is introduced for a specific problem, pushing the model to learn useful knowledge almost entirely from data.

[Figure: video MAE — random spacetime patches are masked out and an autoencoder reconstructs them in pixels]

Following this philosophy, the authors extend MAE to the problem of spatiotemporal representation learning. The method is very simple: randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them (as pictured above). The method involves minimal domain knowledge: the only spacetime-specific inductive bias lies in embedding the patches and their positions; all other components are agnostic to the spacetime nature of the problem.

In particular, the encoder and decoder are vanilla Vision Transformers with no factorization or hierarchy, and the random mask sampling is agnostic to the spacetime structure. The method predicts pixel values and uses no extra problem-specific tokenizer. In short, it is simply MAE applied to the set of spacetime patches. Despite the minimal inductive bias, the method achieves strong empirical results, suggesting that useful knowledge can be learned from data.

[Figure: optimal masking ratio — video MAE performs best at 90%]

The MAE literature hypothesizes that the masking ratio (the percentage of removed tokens) in masked autoencoding methods is related to the information redundancy of the problem. For example, natural images are more information-redundant than language, so the optimal masking ratio is higher. The observations on video data in this paper support this hypothesis: the authors find that the optimal masking ratio for video MAE is 90% (as shown in the figure above), higher than the 75% of its image counterpart. This can be understood as a consequence of the temporal correlation in natural video data. In the extreme case of a video consisting of T identical static frames, randomly sampling 1/T of all spacetime patches would reveal most of the static frame. Because slow motion is more likely than fast motion in natural video, the masking ratio can be very high, as the experiments confirm.
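To make the static-frame intuition concrete, here is a minimal back-of-the-envelope sketch (our illustration, not from the paper) of the chance that a given spatial location stays visible in at least one of the T identical frames:

```python
# Static-frame thought experiment: T identical frames, keep ratio 1/T.
# Treat each spacetime patch as kept independently with probability 1/T
# (a close approximation to sampling without replacement when patches are many).
T = 8
keep_ratio = 1 / T                      # i.e., a masking ratio of 87.5%
p_revealed = 1 - (1 - keep_ratio) ** T  # visible in at least one frame
print(f"{p_revealed:.1%}")              # ~65.6% of spatial sites are revealed
```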

The higher the masking ratio, the more efficient the practical solution. Since the MAE encoder is applied only to the visible tokens, a 90% masking ratio reduces the encoder's time and memory complexity to less than 1/10. Combined with a small decoder, MAE pre-training can theoretically reduce computation by 7.7× compared with encoding all tokens. In fact, the computational reduction is so large that data loading becomes the new bottleneck; even so, the authors record a 4.1× wall-clock speedup. Such significant acceleration matters greatly for large-scale, time-consuming video research.
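As a rough sanity check on the 7.7× figure, the sketch below estimates Transformer FLOPs for full encoding versus masked encoding plus a small decoder. The token count (a 16-frame 224×224 clip with 2×16×16 patches) and the 4-layer, 512-dim decoder are our assumptions about a typical configuration, not numbers confirmed by the text above:

```python
def block_flops(n, d):
    """Rough FLOPs of one Transformer block over n tokens of width d:
    QKV/output projections + MLP (~12*n*d^2) plus attention (~2*n^2*d)."""
    return 12 * n * d**2 + 2 * n**2 * d

N = 8 * 14 * 14                    # 16 frames / 2 x (224/16)^2 = 1568 tokens
full = 24 * block_flops(N, 1024)   # ViT-L encoding all tokens

visible = int(N * 0.10)            # 90% masking: the encoder sees ~10%
mae = 24 * block_flops(visible, 1024) + 4 * block_flops(N, 512)  # + small decoder

print(f"theoretical reduction: {full / mae:.1f}x")  # ~7.6x, close to the 7.7x claim
```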

The authors report strong results on a variety of video recognition datasets. MAE pre-training greatly improves generalization: on Kinetics-400, it improves the accuracy of ViT-Large by 13% absolute over training from scratch, while overall requiring less wall-clock training time (pre-training plus fine-tuning). MAE pre-training also substantially outperforms its supervised pre-training counterpart. Using vanilla ViT, the method achieves competitive results compared with previous state-of-the-art methods that use more domain knowledge. The authors also report results of MAE pre-training on one million random, uncurated Instagram videos. These results suggest that self-supervised learning on video can be pursued in a unified framework, in a manner similar to language and images.


      03      

Method

The method is a simple extension of MAE to spacetime data. The goal is to pursue it under a general, unified framework with as little domain knowledge as possible.

Patch embedding

Following the original ViT, given a video clip, the authors divide it into a regular grid of non-overlapping spacetime patches. The patches are flattened and embedded by a linear projection, and positional embeddings are added to the embedded patches. Patch and positional embedding is the only process that is aware of the spacetime structure.
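This tubelet embedding is commonly implemented as a 3D convolution whose kernel equals its stride. A minimal sketch, assuming a 2×16×16 patch size and ViT-L's 1024-dim width (illustrative, not the authors' released code):

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Embed a video into non-overlapping spacetime patches (minimal sketch)."""
    def __init__(self, patch=(2, 16, 16), in_chans=3, dim=1024):
        super().__init__()
        # Conv3d with kernel == stride is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv3d(in_chans, dim, kernel_size=patch, stride=patch)

    def forward(self, video):                # video: (B, 3, T, H, W)
        x = self.proj(video)                 # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, N, dim) with N = T'*H'*W'

embed = PatchEmbed3D()
tokens = embed(torch.randn(1, 3, 16, 224, 224))            # N = 8*14*14 = 1568
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], 1024))  # learned positions
tokens = tokens + pos            # the only spacetime-aware step in the model
print(tokens.shape)              # torch.Size([1, 1568, 1024])
```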

Masking

[Figure: mask sampling strategies — (a) spacetime-agnostic random sampling, (b) space-only sampling, (c) time-only sampling]

The authors randomly sample patches from the set of embedded patches. This random sampling is agnostic to the spacetime structure, as in (a) in the figure above. The structure-agnostic sampling strategy is analogous to that of BERT in 1D and MAE in 2D.

MAE hypothesizes that the optimal masking ratio is related to the information redundancy of the data. With unstructured random masking, BERT uses a 15% masking ratio for language while MAE uses 75% for images, suggesting that images are more information-redundant than language. The empirical results on video support this hypothesis: the observed optimal masking ratio for video is 90%. This is consistent with the common assumption that, due to temporal coherence, natural video is more information-redundant than images. The figure below shows MAE reconstruction results on unseen validation data at masking ratios of 90% and 95%.

Spacetime-agnostic sampling can be more effective than structure-aware sampling strategies. As shown in (b) and (c) above, space-only or time-only sampling may retain less information and create an overly difficult pre-training task. For example, time-only sampling of 8 frames at a masking ratio of 87.5% means keeping just one frame, posing the very challenging task of predicting both the future and the past given a single frame. The authors observe that the optimal masking ratio of structure-aware sampling is generally lower. In comparison, spacetime-agnostic sampling makes better use of the limited number of visible patches and thus allows a higher masking ratio.

[Figure: MAE reconstruction results on unseen validation videos at 90% and 95% masking ratios]
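Structure-agnostic random masking is typically implemented by ranking per-token random noise, as in the public image-MAE reference code; a minimal sketch under that assumption:

```python
import torch

def random_masking(x, mask_ratio=0.9):
    """Keep a random subset of tokens, ignoring spacetime structure (sketch)."""
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)            # one random score per token
    shuffle = noise.argsort(dim=1)      # random permutation of token indices
    restore = shuffle.argsort(dim=1)    # inverse permutation, used by the decoder
    keep = shuffle[:, :n_keep]          # indices of the visible tokens
    x_vis = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)             # 1 = masked, 0 = visible
    mask.scatter_(1, keep, 0.0)
    return x_vis, mask, restore

x_vis, mask, restore = random_masking(torch.randn(2, 1568, 1024))
print(x_vis.shape)  # torch.Size([2, 156, 1024]) — only ~10% enters the encoder
```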

Autoencoding

The encoder is a vanilla ViT applied only to the set of visible embedded patches. This design greatly reduces time and memory complexity and yields a more practical solution: a 90% masking ratio reduces the encoder's complexity to less than 1/10. The decoder is another vanilla ViT applied jointly to the set of encoded patches and a set of mask tokens, with decoder-specific positional embeddings added to this set. Because the decoder is designed to be smaller than the encoder, its complexity is lower than the encoder's even though it processes the entire set. In the paper's default setting, the whole autoencoder reduces complexity by 7.7× compared with full encoding.
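The decoder input is assembled by appending one shared, learnable mask token per masked position and then undoing the shuffle, so that positional embeddings line up with spacetime locations. A hedged sketch of this step, following the image-MAE recipe (`restore` is the inverse permutation from the masking sketch above; a random permutation stands in here):

```python
import torch
import torch.nn as nn

B, N, n_vis, enc_dim, dec_dim = 2, 1568, 156, 1024, 512
y_vis = torch.randn(B, n_vis, enc_dim)    # encoder output for the visible tokens
restore = torch.randperm(N).expand(B, N)  # stand-in for the real inverse shuffle

proj = nn.Linear(enc_dim, dec_dim)                  # match the decoder width
mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
dec_pos = nn.Parameter(torch.zeros(1, N, dec_dim))  # decoder-specific positions

y = proj(y_vis)                                     # (B, n_vis, dec_dim)
pad = mask_token.expand(B, N - n_vis, dec_dim)      # one token per masked patch
full = torch.cat([y, pad], dim=1)                   # (B, N, dec_dim), shuffled
full = torch.gather(full, 1, restore.unsqueeze(-1).expand(-1, -1, dec_dim))
full = full + dec_pos   # now ordered by spacetime position
# `full` then passes through the small ViT decoder and a linear head to pixels.
```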

The decoder predicts the masked patches in pixel space. In principle, it could predict a full spacetime patch (e.g., t×16×16); in experiments, the authors find that predicting a single time slice of the patch (16×16) is sufficient and keeps the size of the prediction layer manageable. The target is either the raw pixels or their per-patch normalized values. The training loss is the mean squared error (MSE) between the prediction and its target, averaged over the masked patches. The encoder and decoder are agnostic to the spacetime structure of the problem: unlike state-of-the-art architectures, the model has no hierarchy or spacetime factorization and relies only on global self-attention to learn useful knowledge from data.
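A minimal sketch of this loss, assuming targets are flattened 16×16×3 pixel slices and reusing the 0/1 mask convention from the masking sketch above:

```python
import torch

def mae_loss(pred, target, mask, norm_pix=True):
    """MSE in pixel space, averaged over masked patches only (sketch).

    pred, target: (B, N, P) flattened pixel patches; mask: (B, N), 1 = masked.
    """
    if norm_pix:  # normalize each target patch by its own mean and variance
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (B, N)
    return (per_patch * mask).sum() / mask.sum()      # masked patches only

P = 16 * 16 * 3  # one time slice per patch, as described above
loss = mae_loss(torch.randn(2, 1568, P), torch.randn(2, 1568, P),
                (torch.rand(2, 1568) < 0.9).float())
```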


      04      

Experiments

Performance

[Figure: Kinetics-400 fine-tuning accuracy — MAE pre-training vs. training from scratch with ViT-L]

The figure above compares standard ViT-L with MAE pre-training against no pre-training (i.e., training from scratch). With 800 epochs of MAE pre-training, the same ViT-L reaches 84.4% accuracy, a significant absolute gain of 13.0% over training from scratch. This gap is much larger than the one in image recognition (~3%), indicating that MAE pre-training is especially helpful for video recognition.

Beyond the accuracy gain, MAE pre-training also reduces the overall training cost: 800 epochs of MAE pre-training take only 35.8 hours. Thanks to the pre-training, a short 16.3-hour fine-tuning run suffices for good accuracy, so the overall training time can be shorter than training from scratch. This shows that MAE is a practical solution for video recognition.

Ablation experiments

[Figure: accuracy vs. masking ratio and pre-training length; top right: accuracy vs. total number of encoded tokens]

The figure above shows the joint effect of masking ratio and pre-training schedule. A ratio of 90% works best, and 95% is surprisingly good, catching up when trained long enough. A higher masking ratio means the encoder encodes fewer tokens; for a more complete picture, the authors also plot accuracy against the total number of encoded tokens (top right). Under this measure, the 90% and 95% ratios are close.

[Table: mask sampling strategies]

The table above shows results for different mask sampling strategies; spacetime-agnostic random sampling works best.

[Table: reconstruction targets]

The table above shows results for different reconstruction targets.

[Table: data augmentation]

The table above shows results with different data augmentations.

[Table: repeated sampling]

Because the method computes so quickly, repeated sampling is used to reduce the data loading overhead. The table above reports its effect: reusing each sample 2 or 4 times gives a wall-clock speedup of 1.8× or 3.0×, since each loaded and decompressed file is reused multiple times.
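Repeated sampling amortizes the expensive decode: each loaded video yields several differently augmented (and, later, differently masked) clips before the loader moves on. A hedged sketch of the idea as a dataset wrapper (the class and names are ours, not the paper's code):

```python
import torch
from torch.utils.data import Dataset

class RepeatedSampling(Dataset):
    """Yield several augmented clips per decoded video (illustrative sketch)."""
    def __init__(self, base, repeats=4, augment=None):
        self.base = base                  # any dataset returning a video tensor
        self.repeats = repeats
        self.augment = augment or (lambda v: v)  # e.g., random crop / flip

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        video = self.base[idx]            # load + decode once (the slow part)
        # Reuse the decoded video several times with independent augmentations.
        return torch.stack([self.augment(video) for _ in range(self.repeats)])
```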

[Table: decoder depth and width]

The table above shows the influence of decoder depth and width.

[Table: pre-training datasets and transfer to downstream tasks]

The table above studies pre-training on different datasets and transferring to various downstream tasks.

[Table: MAE pre-training on real-world Instagram data]

The table above shows MAE pre-training on real-world Instagram data. For each set, MAE is pre-trained for 200, 400, and 800 epochs, and the K400 fine-tuning accuracies are compared. The model is ViT-L.


      05      

Summary

The authors explore a simple extension of MAE to video data and make some interesting observations: (i) Strong representations can be learned with minimal domain knowledge or inductive bias, in line with the spirit of ViT; similar to BERT and MAE, self-supervised learning on video can be addressed within a conceptually unified framework. (ii) The experiments show that the masking ratio is an important factor in general masked autoencoding methods, and its optimal value may depend on the nature of the data (language, images, video, etc.). (iii) The authors report encouraging results of pre-training on real-world, uncurated data.

Despite these observations, open problems remain. The data scale studied here is orders of magnitude smaller than that of its language counterparts. And although the method greatly improves the efficiency of self-supervised learning, high-dimensional video data remain a major challenge for scaling up.


END
