Towhee Weekly Model Report
2022-07-27 04:03:00 【Zilliz】
Weekly report produced by: the Towhee technical team
This week we share 5 video-related AI models: MoViNets, a portable and easy-to-use series of video action recognition models; CLIP4Clip, which enables cross-modal search between text and video; DRL, a video retrieval model that outperforms CLIP4Clip; Frozen in Time, which breaks free of the limitations of video data; and MDMMT, an upgrade of the challenge-winning model MMT.
If you find our content useful, please don't be stingy with a little free encouragement: a thumbs-up, a like, or a share with your friends.
Project link:
https://github.com/towhee-io/towhee/blob/main/towhee/models/README_CN.md
The MoViNets model series: a handy helper for real-time video classification on mobile phones
Need video understanding, but the model is too heavy and inference takes too long? Lightweight action recognition gets another upgrade: the MoViNets series, proposed by Google Research in 2021, runs inference on streaming video more efficiently and supports classifying video streams captured by mobile devices. MoViNets achieves state-of-the-art accuracy and efficiency on the standard video action recognition datasets Kinetics, Moments in Time, and Charades, demonstrating both its efficiency and its broad applicability.

MoViNets¹ is a family of convolutional neural networks that combines the key strengths of 2D and 3D video classifiers while mitigating their respective limitations. The series obtains rich, efficient video network architectures through neural architecture search; a stream buffering technique lets the 3D convolutions accept streaming video sequences of arbitrary length; and a simple ensemble of multiple models further improves accuracy. The result is an effective balance among computation, memory overhead, and accuracy.
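To make the stream buffering idea concrete, here is a minimal sketch (not the official MoViNets code, and in 1D rather than 3D): a causal temporal convolution carries the last k-1 inputs between chunks, so processing an arbitrarily long stream chunk by chunk produces exactly the same output as processing the whole sequence offline.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal temporal convolution: output[t] depends only on x[t-k+1..t]."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # left-pad with zeros
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

class StreamBuffer:
    """Carries the last k-1 inputs between chunks so no temporal context is lost."""
    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        self.buffer = np.zeros(len(kernel) - 1)

    def __call__(self, chunk):
        padded = np.concatenate([self.buffer, chunk])
        k = len(self.kernel)
        out = np.array([padded[t:t + k] @ self.kernel for t in range(len(chunk))])
        self.buffer = padded[-(k - 1):]  # keep the tail as context for the next chunk
        return out

kernel = np.array([0.2, 0.3, 0.5])
stream = np.random.rand(10)                        # stand-in for per-frame features
full = causal_conv1d(stream, kernel)               # offline: whole sequence at once
sb = StreamBuffer(kernel)
chunked = np.concatenate([sb(stream[:4]), sb(stream[4:7]), sb(stream[7:])])
assert np.allclose(full, chunked)                  # streaming matches offline exactly
```

Because only the k-1 buffered frames must be kept in memory per layer, memory cost becomes independent of the video length, which is what makes on-device streaming inference practical.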
Related information:
- Model use case: https://towhee.io/action-classification/movinet
- Paper: MoViNets: Mobile Video Networks for Efficient Video Recognition (https://arxiv.org/pdf/2103.11511.pdf)
- More: MoViNets: making real-time video understanding a reality (https://zhuanlan.zhihu.com/p/495761037)
The multimodal model CLIP4Clip brings mutual retrieval between text and video
CLIP4Clip² builds on the cross-modal image-text model CLIP and successfully applies it to text-video retrieval. Whether you want to find relevant videos from a text query or automatically match the most fitting description to a video, CLIP4Clip can help. Extensive ablation experiments demonstrate its effectiveness, and it achieves SoTA results on text-video datasets such as MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.

CLIP4Clip starts from a pre-trained image-text model and completes the video retrieval task through transfer learning and fine-tuning. It uses the pre-trained CLIP model as its backbone network, solves video clip retrieval from frame-level input, and employs parameter-free, sequential, and tight similarity calculators to obtain the final result.
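As a rough illustration of the simplest of the three, here is a sketch of a parameter-free similarity calculator (shapes are assumed; in CLIP4Clip the frame and text embeddings would come from the CLIP backbone): the video is represented by the mean of its normalized frame embeddings, and similarity is the cosine between that mean and the text embedding.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def parameter_free_similarity(frame_embs, text_emb):
    """frame_embs: (num_frames, dim) per-frame features; text_emb: (dim,)."""
    video_emb = l2norm(l2norm(frame_embs).mean(axis=0))  # mean-pool over frames
    return float(video_emb @ l2norm(text_emb))           # cosine similarity

rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 512))   # 12 sampled frames, 512-d embeddings
text = rng.normal(size=512)
score = parameter_free_similarity(frames, text)
assert -1.0 <= score <= 1.0           # cosine similarity is bounded
```

The sequential and tight variants replace the mean-pool with a learned sequence model (e.g. a small transformer) over the frame embeddings, trading the zero extra parameters of this version for finer temporal modeling.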
Related information:
- Model use case: https://towhee.io/video-text-embedding/clip4clip
- Paper: CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval (https://arxiv.org/pdf/2104.08860.pdf)
- More: CLIP4Clip: another win for CLIP, using CLIP for video retrieval (https://zhuanlan.zhihu.com/p/443165620)
Finer text-video interaction: the DRL disentangled framework improves on CLIP4Clip
Although CLIP4Clip achieves cross-modal text-video retrieval, its network structure still leaves limitations and room for improvement. Hence, in early 2022, DRL³ (Disentangled Representation Learning) arrived to match content of different granularities across modalities. The improved model substantially raises accuracy on the major text-video datasets.

When computing text-video similarity, CLIP4Clip considers only the holistic representation of each modality and lacks fine-grained interaction. For example, when a text description corresponds to only some of the video frames, a holistic video feature lets information from the other frames distract and mislead the model. DRL proposes two key improvements to CLIP4Clip. The first is Weighted Token-wise Interaction, which densely predicts pairwise similarities and finds potentially active tokens via a max operation. The second is Channel Decorrelation Regularization, which reduces redundancy and competition among feature channels, using a covariance matrix to measure redundancy across channels.
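The two ideas can be sketched as follows (shapes are assumed and the learned token weighting is omitted, so this is the unweighted core rather than the authors' code): token-wise interaction takes, for each text token, the max similarity over all video frames, so an active token can match just the frames it describes; channel decorrelation penalizes the off-diagonal entries of the feature covariance matrix.

```python
import numpy as np

def token_wise_similarity(text_tokens, video_frames):
    """text_tokens: (nt, d), video_frames: (nv, d); both rows L2-normalized."""
    sim = text_tokens @ video_frames.T          # (nt, nv) dense pairwise similarities
    return sim.max(axis=1).mean()               # max over frames, averaged over tokens

def channel_decorrelation(features):
    """features: (n, d). Penalize off-diagonal covariance between channels."""
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / len(features)
    off_diag = cov - np.diag(np.diag(cov))      # zero out per-channel variances
    return float((off_diag ** 2).sum())         # redundancy penalty, >= 0

rng = np.random.default_rng(1)
d = 64
text = rng.normal(size=(8, d));   text /= np.linalg.norm(text, axis=1, keepdims=True)
video = rng.normal(size=(16, d)); video /= np.linalg.norm(video, axis=1, keepdims=True)
score = token_wise_similarity(text, video)
penalty = channel_decorrelation(np.vstack([text, video]))
assert -1.0 <= score <= 1.0 and penalty >= 0.0
```

The contrast with CLIP4Clip's holistic similarity is the `max(axis=1)`: a token describing one event in the video is scored against its best-matching frame instead of being diluted by an average over all frames.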
Related information:
- Model use case: https://towhee.io/video-text-embedding/drl
- Paper: Disentangled Representation Learning for Text-Video Retrieval (https://arxiv.org/pdf/2203.07111.pdf)
- More: Video multimodal pre-training / retrieval models (https://zhuanlan.zhihu.com/p/515175476)
Treating images as video snapshots, Frozen in Time breaks the data limitations of multimodal video retrieval
Frozen in Time⁴, published by the University of Oxford at ICCV 2021, flexibly exploits both text-image and text-video datasets and provides an end-to-end joint video and image encoder. The model modifies and extends the recent ViT and TimeSformer architectures and applies attention over both space and time.

Frozen in Time can be trained on text-image and text-video datasets, either separately or jointly. When training on an image, the model treats it as a frozen snapshot of a video, gradually learning temporal context during training. In addition, the authors provide a new video-text pre-training dataset, WebVid-2M, containing more than 2 million videos. Although an order of magnitude smaller than other general datasets, experiments show that pre-training on it yields SOTA results on standard downstream video retrieval benchmarks (including MSR-VTT, MSVD, DiDeMo, and LSMDC).
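The "image as a frozen snapshot" trick amounts to a shape convention, sketched here with illustrative pooling rather than the paper's actual encoder: one joint encoder consumes (frames, height, width, channels), and an image is simply a video with a single frame, so text-image and text-video pairs flow through the same code path.

```python
import numpy as np

def joint_encoder(clip, dim=32):
    """clip: (T, H, W, C) -> (dim,) embedding via spatial then temporal pooling."""
    t, h, w, c = clip.shape
    spatial = clip.reshape(t, -1).mean(axis=1)   # (T,) per-frame spatial summary
    # Temporal pooling: for T == 1 this is exactly the per-frame summary, so an
    # image produces the same representation as its own 1-frame video would.
    pooled = spatial.mean()
    return np.full(dim, pooled)                  # toy stand-in for a projection head

image = np.random.rand(224, 224, 3)              # a text-image training sample
video = np.random.rand(8, 224, 224, 3)           # a text-video training sample
img_as_clip = image[None]                        # add a T=1 frame axis
assert img_as_clip.shape == (1, 224, 224, 3)
assert joint_encoder(img_as_clip).shape == joint_encoder(video).shape == (32,)
```

Because both inputs land in the same embedding space, abundant image-caption data can supervise the same weights that later handle multi-frame clips, which is how the model sidesteps the scarcity of clean video-text pairs.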
Related information:
- Model use case: https://towhee.io/video-text-embedding/frozen-in-time
- Paper: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (https://arxiv.org/pdf/2104.00650.pdf)
- More: ICCV 2021, Frozen in Time: Oxford's new video-text pre-training dataset WebVid-2M and a joint video and image encoder designed for end-to-end retrieval, with open-source code (https://zhuanlan.zhihu.com/p/441807006)
From MMT to MDMMT: comprehensively optimizing text-video retrieval
MDMMT⁵, published in 2021, is an extended study of MMT (published at ECCV 2020), champion of the CVPR 2020 Video Pentathlon Challenge. This work experiments with and optimizes the training datasets, continuing to lead the text-video retrieval race.

MMT extracts and aggregates video features, including visual features, audio features, and the text features corresponding to speech. First, pre-trained expert networks extract features for each of the three modalities. Then, for each modality's feature sequence, a max-pool generates an aggregated feature; this aggregated feature is concatenated with the modality's feature sequence, and the sequences of the different modalities are concatenated together. The model also learns a modality embedding for each modality and temporal embeddings for the different frames; that is, modality and frame-position information is added to every feature. MDMMT uses the same loss function and a similar structure as MMT, but optimizes the hyperparameters.
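The input construction described above can be sketched as follows (dimensions and the random stand-in features are assumptions for illustration): per modality, max-pool the expert feature sequence into one aggregated token, prepend it to the sequence, then add modality and temporal embeddings before concatenating all modalities into one transformer input.

```python
import numpy as np

def build_mmt_input(expert_feats, modality_embs, temporal_embs):
    """expert_feats: list of (T_m, d) arrays, one per modality
    (e.g. visual, audio, speech-transcript text)."""
    tokens = []
    for m, feats in enumerate(expert_feats):
        agg = feats.max(axis=0, keepdims=True)        # (1, d) max-pooled token
        seq = np.concatenate([agg, feats], axis=0)    # aggregated token + sequence
        seq = seq + modality_embs[m]                  # mark which modality
        seq = seq + temporal_embs[: len(seq)]         # mark which frame/position
        tokens.append(seq)
    return np.concatenate(tokens, axis=0)             # joint transformer input

d = 16
feats = [np.random.rand(t, d) for t in (8, 8, 4)]     # visual, audio, speech text
mod = np.random.rand(3, d)                            # one learned vector per modality
pos = np.random.rand(9, d)                            # covers agg token + longest seq
x = build_mmt_input(feats, mod, pos)
assert x.shape == ((8 + 1) + (8 + 1) + (4 + 1), d)    # 23 tokens of width d
```

In the real model these embeddings are learned parameters and the output feeds a multimodal transformer; the point of the sketch is the token layout, where every feature carries both its modality identity and its temporal position.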
Related information:
- Model use case: https://towhee.io/video-text-embedding/mdmmt
- Papers: MDMMT: Multidomain Multimodal Transformer for Video Retrieval (https://arxiv.org/pdf/2103.10699.pdf); Multi-modal Transformer for Video Retrieval (https://arxiv.org/pdf/2007.10639.pdf)
- More: Video multimodal pre-training / retrieval models (https://zhuanlan.zhihu.com/p/515175476)
For more project updates and details, please follow our project (https://github.com/towhee-io/towhee/blob/main/towhee/models/README_CN.md). Your attention is a powerful motivation for us to keep creating with love; we welcome the triple combo of star, fork, and joining our Slack :)

This article comes from the WeChat official account ZILLIZ (Zilliztech).