Towhee weekly model
2022-07-23 14:32:00 【Zilliz Planet】
Weekly report produced by: the Towhee technical team
This week we share 5 video-related AI models:
MoViNets, a portable and easy-to-use series of video action recognition models; CLIP4Clip, which enables cross-modal search between text and video; DRL, a video retrieval model that outperforms CLIP4Clip; Frozen in Time, which breaks free of the limitations of video data; and MDMMT, another upgrade of the champion model MMT.
If you find what we share useful, please don't be stingy with some free encouragement: give it a thumbs-up, a like, or share it with your friends.
The MoViNets model series: a good helper for real-time video classification on mobile phones
Need video understanding, but the model is too heavy or inference takes too long? Lightweight action recognition models have been upgraded again: the MoViNets series, proposed by Google Research in 2021, runs inference on streaming video more efficiently and supports classifying video streams captured by mobile devices. MoViNets achieves state-of-the-art accuracy and efficiency on the common video action recognition datasets Kinetics, Moments in Time, and Charades, demonstrating both its efficiency and its wide applicability.

MoViNets: Streaming Evaluation vs. Multi-Clip Evaluation
MoViNets is a family of convolutional neural networks that borrows from both 2D and 3D video classifiers, inheriting their key advantages while mitigating their respective limitations. The series obtains rich yet efficient video network architectures through neural architecture search; a stream buffering technique lets the 3D convolutions accept streaming video sequences of arbitrary length; and a simple ensemble of multiple models further improves accuracy. Together these effectively balance computation, memory overhead, and accuracy.
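The stream-buffering idea can be sketched in a minimal 1D form: a buffer caches the last k−1 inputs between chunks so that a causal temporal convolution applied chunk by chunk reproduces the full-sequence result exactly. This is an illustration under simplified assumptions (one feature channel, one convolution), not MoViNets' actual implementation, which buffers activations of 3D convolutions inside the network:

```python
import numpy as np

def causal_conv1d(frames, kernel):
    """Causal temporal convolution over a full sequence:
    output[t] depends only on frames[t-k+1 .. t] (zero-padded at the start)."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), frames])
    return np.array([padded[t:t + k] @ kernel for t in range(len(frames))])

class StreamBuffer:
    """Caches the last k-1 inputs between chunks so that chunk-by-chunk
    processing reproduces the full-sequence result."""
    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        self.buffer = np.zeros(len(kernel) - 1)

    def process(self, chunk):
        k = len(self.kernel)
        padded = np.concatenate([self.buffer, chunk])
        out = np.array([padded[t:t + k] @ self.kernel for t in range(len(chunk))])
        self.buffer = padded[len(chunk):]  # keep the last k-1 inputs for the next chunk
        return out
```

Because the buffer carries the real previous inputs across chunk boundaries, memory use is constant in the stream length, which is what makes on-device streaming inference practical.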
Related information:
Model use case: action-classification/movinet
Paper: MoViNets: Mobile Video Networks for Efficient Video Recognition
More information: MoViNets: Make real-time video understanding a reality
Multimodal model CLIP4Clip: mutual retrieval between text and video
CLIP4Clip builds on the cross-modal image-text model CLIP and successfully tackles text/video retrieval. Whether you are searching for relevant videos by text or automatically matching the most appropriate description to a video, CLIP4Clip can help. Through extensive ablation experiments, CLIP4Clip proved its effectiveness and achieved SOTA results on text-video datasets such as MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.

CLIP4Clip: Main Structure
CLIP4Clip builds on a pre-trained image-text model and completes the video retrieval task through transfer learning and fine-tuning. It uses the pre-trained CLIP model as the backbone network, solves video clip retrieval from frame-level input, and obtains the final result with one of three similarity calculators: parameter-free, sequential, or tight.
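The simplest of the three calculators, the parameter-free type, can be sketched as mean-pooling the per-frame embeddings and taking cosine similarity with the text embedding. This is a minimal illustration with made-up embeddings, not the full CLIP4Clip pipeline; the sequential and tight types add a temporal Transformer/LSTM and a cross-modal Transformer, respectively:

```python
import numpy as np

def l2norm(x):
    """Normalize embeddings to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def parameter_free_similarity(text_emb, frame_embs):
    """Parameter-free similarity: mean-pool the normalized per-frame
    embeddings into one video embedding, then take cosine similarity
    with the text embedding. No learnable weights are involved."""
    video_emb = l2norm(frame_embs).mean(axis=0)
    video_emb = video_emb / np.linalg.norm(video_emb)
    return float(l2norm(text_emb) @ video_emb)
```

In retrieval, this score is computed between a query text and every candidate video (or vice versa), and candidates are ranked by similarity.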
Related information:
Model use case: video-text-embedding/clip4clip
Paper: CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
More information: CLIP4Clip: another win for CLIP, using CLIP for video retrieval
Better text-video interaction: the DRL disentangled framework improves on CLIP4Clip
Although CLIP4Clip achieves cross-modal text/video retrieval, its network structure still has limitations and room for improvement. Hence, at the beginning of 2022, DRL (Disentangled Representation Learning) arrived to match content of different granularities across modalities. On video retrieval tasks, the improved model substantially raises accuracy on the major text-video datasets.

Overview of DRL for Text-Video Retrieval
When computing the similarity between text and video, CLIP4Clip considers only the overall representation of the two modalities and lacks fine-grained interaction. For example, when a text description corresponds to only part of the video frames, extracting a single overall feature for the video may let information from the other frames disturb and mislead the model. DRL proposes two important improvements to CLIP4Clip. The first is Weighted Token-wise Interaction, which predicts similarity densely and uses a max operation to find potentially active tokens. The second is Channel Decorrelation Regularization, which reduces redundancy and competition between channels, measuring channel redundancy with a covariance matrix.
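Both ideas can be sketched compactly. The interaction sketch below is an unweighted simplification of DRL's Weighted Token-wise Interaction (the real method also learns per-token weights), and the regularizer is one plausible reading of "measure redundancy with the covariance matrix", namely penalizing off-diagonal channel correlations:

```python
import numpy as np

def token_wise_interaction(text_tokens, video_tokens):
    """Unweighted sketch of token-wise interaction: dense token-to-token
    cosine similarities; each token picks its best match in the other
    modality via max; the two directions are averaged."""
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    v = video_tokens / np.linalg.norm(video_tokens, axis=1, keepdims=True)
    sim = t @ v.T                  # (n_text_tokens, n_video_tokens)
    t2v = sim.max(axis=1).mean()   # each text token finds its best frame token
    v2t = sim.max(axis=0).mean()   # each frame token finds its best text token
    return float((t2v + v2t) / 2)

def channel_decorrelation(features):
    """Sketch of channel decorrelation: mean absolute off-diagonal entry
    of the channel correlation matrix (0 = decorrelated, 1 = redundant)."""
    x = features - features.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / (len(features) - 1)
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    off = corr - np.diag(np.diag(corr))  # zero out the diagonal
    n = corr.shape[0]
    return float(np.abs(off).sum() / (n * (n - 1)))
```

The max in `token_wise_interaction` is what lets a caption that matches only a few frames score highly, instead of being diluted by a global average over all frames.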
Related information:
Model use case: video-text-embedding/drl
Paper: Disentangled Representation Learning for Text-Video Retrieval
More information: video multimodal pre-training/retrieval models
Frozen in Time: treating images as video snapshots to break free of the data limitations of multimodal video retrieval
Frozen in Time, published by the University of Oxford at ICCV 2021, flexibly uses both text/image and text/video datasets and provides an end-to-end joint video-and-image encoder. The model modifies and extends the recent ViT and TimeSformer architectures, applying attention in both space and time.

Frozen in Time: Joint Image and Video Training
Frozen in Time can be trained on text-image and text-video datasets, either separately or in combination. When training on images, the model treats each image as a frozen snapshot of a video and gradually learns temporal context during training. In addition, the authors provide a new video-text pre-training dataset, WebVid-2M, containing more than 2 million videos. Although it is an order of magnitude smaller than other common datasets, experiments show that pre-training the model on it produces SOTA results on standard downstream video retrieval benchmarks (including MSR-VTT, MSVD, DiDeMo, and LSMDC).
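The "image as a single-frame video" trick can be sketched with a toy encoder: if the video path encodes each frame and pools over time, then a video with T=1 reduces exactly to the image path, so one encoder serves both data types. This stands in for Frozen in Time's space-time Transformer, which is not reproduced here; `frame_encoder` is a hypothetical per-frame feature extractor:

```python
import numpy as np

def encode_video(frames, frame_encoder):
    """Toy joint encoder: encode each frame, then mean-pool over time.
    frames has shape (T, H, W, C)."""
    feats = np.stack([frame_encoder(f) for f in frames])  # (T, D)
    return feats.mean(axis=0)

def encode_image(image, frame_encoder):
    """An image is treated as a video with a single 'frozen' frame,
    so the exact same encoder handles both modalities."""
    return encode_video(image[None, ...], frame_encoder)
```

This shared path is what lets the model mix large text-image corpora with smaller text-video corpora in one training run.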
Related information:
Model use case: video-text-embedding/frozen-in-time
Paper: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
From MMT to MDMMT: comprehensively optimizing text-video retrieval
MDMMT, published in 2021, is an extended study of MMT (published at ECCV 2020), the champion of the CVPR 2020 Video Pentathlon Challenge. The work experiments with and optimizes the training datasets, continuing to lead the text-video retrieval race.

MMT: Cross-modal Framework
MMT extracts and integrates video features, including image features, audio features, and the text features corresponding to the audio. First, pre-trained expert networks extract features for each of the three modalities. Then, for each modality's feature sequence, max-pooling generates an aggregated feature; the aggregated feature is concatenated with that modality's feature sequence, and the feature groups of the different modalities are concatenated together. The model also learns a modality embedding for each modality and corresponding frame embeddings, i.e., it adds modality information and frame-position information to every feature. MDMMT uses the same loss function and a similar architecture as MMT, but optimizes the hyperparameters.
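The input-assembly steps above can be sketched as follows. Shapes and names here are illustrative assumptions, not MMT's actual code; each value in `expert_feats` is a (T, D) sequence from one hypothetical expert network:

```python
import numpy as np

def assemble_mmt_input(expert_feats, modality_embs, frame_embs):
    """Sketch of MMT-style input assembly: per modality, max-pool the expert
    feature sequence into an aggregate token, prepend it to the sequence,
    add that modality's embedding and per-position (frame) embeddings, then
    concatenate all modality groups into one sequence for the Transformer."""
    groups = []
    for name, feats in expert_feats.items():        # feats: (T, D)
        agg = feats.max(axis=0, keepdims=True)      # max-pooled aggregate, (1, D)
        seq = np.concatenate([agg, feats], axis=0)  # (T+1, D)
        seq = seq + modality_embs[name]             # broadcast modality embedding, (D,)
        seq = seq + frame_embs[: len(seq)]          # temporal position embeddings
        groups.append(seq)
    return np.concatenate(groups, axis=0)
```

Adding the modality and frame embeddings is what tells the downstream Transformer which expert and which time step each feature came from, since concatenation alone discards that information.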
Related information:
Model use case: video-text-embedding/mdmmt
Papers: MDMMT: Multidomain Multimodal Transformer for Video Retrieval; Multi-modal Transformer for Video Retrieval
More information: video multimodal pre-training/retrieval models
For more project updates and details, please follow our project (https://github.com/towhee-io/towhee/blob/main/towhee/models/README_CN.md). Your attention is a powerful driving force for us to keep creating with love; star, fork, and slack are all welcome :)